The Kaptain on … stuff

14 Mar, 2010

Breaking Weak CAPTCHA in… slightly more than 26 Lines of Groovy Code

Posted by: TheKaptain In: Development

I recently read an interesting article about using Python and open source software to defeat a particular CAPTCHA implementation, and I set out to see how hard it would be to do the same in Groovy. Coming from the Java side of the fence, I was particularly impressed by how easy the available Python libraries make loading, mutating and saving images. Admittedly I have limited experience working with image data, but whenever I have done it, it has seemed like a complex (and easy to get wrong) process. Maybe there’s a Java library out there that provides a simple ‘image_resize’ method, but it’s certainly not in the BufferedImage API. Still, when porting the 26 lines of code over to Groovy, I was able to get the result considerably less verbose than the Java equivalent.

The Pretty Pictures

Here are the three images to test against. To put them in a format suitable for the open source tesseract-ocr program, we need to make them bigger, remove the background noise and convert them to ‘tif’ format. The Python program we’re porting uses the PIL library for image handling and the pytesser library for wrapping tesseract; I didn’t look very hard for Java equivalents and just coded the required functions directly.

[singlepic id=58] [singlepic id=60] [singlepic id=59 ]

Reading in the Image

The Python code for this is three lines: one to load the image and a couple more to convert it into a format suitable for directly manipulating pixel color through RGB values. Groovy takes a bit more to do the same, but being able to use a ‘with’ block makes interacting with the Graphics object a lot cleaner than the same Java code.

[groovy]
# Python
from PIL import Image
img = Image.open('input.gif')
img = img.convert("RGBA")
pixdata = img.load()

// Groovy
import java.awt.AlphaComposite
import java.awt.image.BufferedImage
import javax.imageio.ImageIO

BufferedImage image = ImageIO.read(new File(fileName))
// Copy into an ARGB image so pixels can be read and written as plain RGB values
BufferedImage dimg = new BufferedImage(image.width, image.height, BufferedImage.TYPE_INT_ARGB)
dimg.createGraphics().with {
    setComposite(AlphaComposite.Src)
    drawImage(image, null, 0, 0)
    dispose()
}
[/groovy]

Removing the Background Noise

In both cases we’re doing essentially the same thing: finding all non-black pixels and setting them to white. This leaves only the actual embedded text to stand out. Being able to utilize the Java Color constants makes the Groovy version a little more readable, IMO, but otherwise the two pieces of code are generally equivalent.
[groovy]
# Python
for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y] != (0, 0, 0, 255):
            pixdata[x, y] = (255, 255, 255, 255)

// Groovy
import java.awt.Color

(0..<dimg.height).each { i ->
    (0..<dimg.width).each { j ->
        // anything that is not pure black becomes white
        if (dimg.getRGB(j, i) != Color.BLACK.RGB) {
            dimg.setRGB(j, i, Color.WHITE.RGB)
        }
    }
}
[/groovy]

Resizing the Image

Python’s library usage really shines here, making this a one-line call. It’s not quite the same in Java-land, although again there’s probably a better way to do this (I just don’t know it offhand).
[groovy]
# Python
big = im_orig.resize((116, 56), Image.NEAREST)

// Groovy
import java.awt.RenderingHints
import java.awt.image.BufferedImage

// the closure has to be defined before it is called
def resizeImage = { BufferedImage image, int w, int h ->
    BufferedImage resized = new BufferedImage(w, h, image.type)
    resized.createGraphics().with {
        setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BILINEAR)
        drawImage(image, 0, 0, w, h, 0, 0, image.width, image.height, null)
        dispose()
    }
    return resized
}

dimg = resizeImage(dimg, 116, 56)
[/groovy]
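One difference worth noting: the Python call asks for Image.NEAREST, while the Groovy closure uses bilinear interpolation, which blends neighbouring pixels and can reintroduce grey values along the edges of the letters. A minimal sketch of a nearest-neighbour version follows, written in plain Java against the same BufferedImage API. The class and method names are made up for illustration, and whether it changes tesseract’s results on these images is untested.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of the original project.
class NearestResize {

    // Scale src to w x h with nearest-neighbour sampling, so every output
    // pixel is a copy of some input pixel and no blended colours appear.
    static BufferedImage resize(BufferedImage src, int w, int h) {
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_NEAREST_NEIGHBOR);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

Calling NearestResize.resize(dimg, 116, 56) would stand in for the resizeImage closure. Because no new colours are created, the order of resizing and noise removal should matter less with this variant.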

By this point the original images now look like this, and are almost ready for OCR.

[singlepic id=61] [singlepic id=62] [singlepic id=63 ]

Converting to a tif File

This one turns out to be a bit of a PITA in Java, particularly on a Mac, and represents the bulk of the Groovy code. Unfortunately ‘tif’ is also the only format that tesseract appears to accept ‘out of the box’. After googling the fun that is JAI and working with the .tif(f) format on a Mac, I ended up taking the code kindly provided in this blog post and Groovifying it a bit to make a working transformation. Thanks very much to Allan Tan for that. One more time, there’s likely a better/easier way to do this, but honestly it’s more effort than I’m willing to put in on a weekend afternoon just to satisfy my curiosity.
🙂

[groovy]
# Python
ext = ".tif"
big.save("input-NEAREST" + ext)

// Groovy
import com.sun.media.jai.codec.ImageCodec
import com.sun.media.jai.codec.ImageEncoder
import com.sun.media.jai.codec.TIFFEncodeParam
import java.awt.color.ColorSpace
import java.awt.image.BufferedImage
import java.awt.image.ColorConvertOp
import javax.media.jai.JAI
import javax.media.jai.RenderedOp

void convertToTiff(String inputFile, String outputFile)
{
    OutputStream ios
    try
    {
        ios = new BufferedOutputStream(new FileOutputStream(new File(outputFile)))
        ImageEncoder enc = ImageCodec.createImageEncoder("tiff", ios, new TIFFEncodeParam(compression: TIFFEncodeParam.COMPRESSION_NONE, littleEndian: false))
        RenderedOp src = JAI.create("fileload", inputFile)

        // convert the source into an sRGB, 3-byte BGR image
        ColorConvertOp filterObj = new ColorConvertOp(ColorSpace.getInstance(ColorSpace.CS_sRGB), null)
        BufferedImage dst = new BufferedImage(src.width, src.height, BufferedImage.TYPE_3BYTE_BGR)
        filterObj.filter(src.getAsBufferedImage(), dst)

        // save the output file
        enc.encode(dst)
    }
    catch (Exception e)
    {
        println e
    }
    finally
    {
        // ios is still null if opening the output file failed
        ios?.close()
    }
}
[/groovy]
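For what it’s worth, newer JDKs make this step much shorter. Java 9 and later ship a TIFF plugin for ImageIO, so on such a JVM the JAI dependency could probably be dropped. The sketch below is plain Java; TiffWriter and writeTiff are hypothetical names, it uses the writer’s default parameters, and it is untested against the tesseract version used in this post.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the original project.
class TiffWriter {

    // Write image to output as TIFF using the ImageIO plugin bundled with
    // Java 9+. Assumption: on older JVMs no "tiff" writer is registered.
    static void writeTiff(BufferedImage image, File output) {
        // Copy into an opaque 3-byte BGR image, the same pixel layout the
        // JAI-based version encodes.
        BufferedImage bgr = new BufferedImage(
                image.getWidth(), image.getHeight(), BufferedImage.TYPE_3BYTE_BGR);
        Graphics2D g = bgr.createGraphics();
        try {
            g.drawImage(image, 0, 0, null);
        } finally {
            g.dispose();
        }
        try {
            // ImageIO.write returns false when no writer handles the format.
            if (!ImageIO.write(bgr, "tiff", output)) {
                throw new IllegalStateException("No TIFF writer found; Java 9+ is assumed");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since this takes a BufferedImage directly, it would also remove one of the intermediate files: the processed image no longer has to be saved in another format first just so JAI can load it again.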

OCR with Tesseract-OCR

Finally we need to pass the processed image to tesseract so it can ‘read’ it for us. Again, the Python library makes this a breeze, but calling out to a command line program with Groovy is so simple that it ends up being about the same. Tesseract itself is available as a MacPorts port, as well as in downloadable Unix binaries and a Windows executable, so installing the software is easy.
[groovy]
# Python
from pytesser import *
image = Image.open('input-NEAREST.tif')
print image_to_string(image)

// Groovy
// tesseract writes its result to "${tmpTesseract}.txt"
def tesseract = ['/opt/local/bin/tesseract', tmpTif, tmpTesseract].execute()
tesseract.waitFor()
return new File("${tmpTesseract}.txt").readLines()[0]
[/groovy]
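The same call can be sketched in plain Java with ProcessBuilder. The class and method names are hypothetical and the tesseract path is whatever your install uses. The only behaviour assumed of tesseract is what the Groovy code above relies on: it is invoked with the image and an output base name, and writes its text to the output base plus ‘.txt’.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original project.
class TesseractRunner {

    // Build the command line: <tesseract> <image> <outputBase>.
    static List<String> command(String tesseractPath, String imagePath, String outputBase) {
        return Arrays.asList(tesseractPath, imagePath, outputBase);
    }

    // Run tesseract and return the first line of <outputBase>.txt,
    // or an empty string if the file has no lines.
    static String recognize(String tesseractPath, String imagePath, String outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(tesseractPath, imagePath, outputBase))
                .inheritIO() // pass tesseract's own messages through to the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        List<String> lines = Files.readAllLines(Paths.get(outputBase + ".txt"));
        return lines.isEmpty() ? "" : lines.get(0).trim();
    }
}
```

Unlike the Groovy one-liner, this checks the exit code and copes with an empty result file, which makes a failed recognition easier to tell apart from a wrong one.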

Testing it out

To test it out I implemented the code in a Maven project that iterates over the images and writes intermediate results to a temp directory. It only works in two out of the three cases: for some reason tesseract insists on consistently seeing ‘e4ya’ as ‘e4ga’. I tried to get it working by tweaking the image manipulation parameters and the order of operations (resizing before removing the background noise, for instance), but that just caused the other cases to fail as well. Since the ‘y’ seems pretty clear in the final image, it’s more likely that tweaking the tesseract configuration would yield better results.
[groovy]
public void testPrintImage()
{
    def breaker = new CaptchaBreaker()
    /* tesseract interprets "e4ya" as "e4ga" unfortunately */
    ['9koO', 'jxt9'/*, 'e4ya'*/].each { String imageName ->
        def fileName = "src/test/resources/${imageName}.gif"
        assertEquals("Testing $imageName", imageName, breaker.imageToString(fileName))
    }
}
[/groovy]

C’est fini

I had some fun playing with areas of Java that I don’t usually interact with, and gained some appreciation for the diversity and ease of use exposed by just a couple of Python libraries. It’s comforting to note that I was able to implement all of the required functionality from those libraries in < 90 lines of Groovy. With a little more effort I think the final product could be tweaked to avoid the intermediate file system reads/writes as well, but that’s for another day.
Source code is available on github if you’d care to take a look, and thanks for stopping by!

9 Responses to "Breaking Weak CAPTCHA in… slightly more than 26 Lines of Groovy Code"

1 | Christian Ullenboom

March 16th, 2010 at 4:39 am

You can even use http://code.google.com/p/tesjeract/ to get rid of the call to the executable.

3 | wondering if it could be done on objective C

June 1st, 2010 at 12:48 am

Hi kellyrob99,

I am trying to do a project on iPhone and other mobile phones. The project consists of getting bill data for each user from the web server (electricity, water and other bills). I have a problem with the captcha: when you enter the page that gets the data, it asks for the captcha for authentication after the user name, bill number, etc. My question is: is it possible to avoid being asked for the captcha on the login page? And would it be possible to convert your code to Objective-C or Java? Thank you.
Okayra
P.S: I am not a programmer yet.

4 | TheKaptain

June 2nd, 2010 at 4:40 pm

That’s a question to ask the individual website owners, I’m afraid. Their security measures are there explicitly to keep out the kind of ‘unknown’ robotic action you’re trying to implement, after all.
🙂

I can however tell you that there’s definitely nothing in this article that could be ported to Objective-C (or anywhere else) that’s going to actually defeat anything beyond the most trivial captcha anyhow. When the time comes to build the code, I’m afraid you’re probably pretty much stuck either becoming a programmer or hiring one to solve this particular problem for you. Best of luck to you.

5 | Bob

January 13th, 2011 at 12:15 pm

What did you do to properly import the classes? I have been having issues with importing classes in Groovy.

7 | Bob

January 14th, 2011 at 1:05 pm

For some reason it won’t let me reply to your response. Correct, I checked out your class and was having issues with importing the classes it uses. What’s the easiest way of importing all of them? Is there an alternative to adding each one to the classpath?

8 | TheKaptain

January 14th, 2011 at 1:57 pm

If it’s the JAI classes that are missing, you can refer to http://java.sun.com/javase/technologies/desktop/media/jai/ and http://java.sun.com/products/java-media/jai/INSTALL-1_1_2.html
Otherwise the only dependency is on Groovy, which in the sample project is provided by maven.
