The Kaptain on … stuff

14 Mar, 2010

Breaking Weak CAPTCHA in… slightly more than 26 Lines of Groovy Code

Posted by: TheKaptain In: Development

I read an interesting article recently about using Python and open source software to defeat a particular CAPTCHA implementation, and I set out to see how hard it would be to do the same in Groovy. In particular, coming from the Java side of the fence, I was impressed by how the available libraries in Python made loading, mutating and saving images so easy. Admittedly I have limited experience working with image data, but when I have worked with it, it has always seemed like a complex (and easy to get wrong) process. Maybe there’s a Java library out there that provides a simple ‘image_resize’ method, but it’s certainly not in the BufferedImage API. Still, when porting the 26 lines of code over to Groovy, I was able to get it considerably less verbose than the Java equivalent.

The Pretty Pictures

Here are the three images to test against. In order to put them in a suitable format for the open source tesseract-ocr program to process, we need to make them bigger, remove the background noise and convert them to the ‘tif’ format. The Python program we’re porting uses the PIL library for image handling and the pytesser library for wrapping tesseract; I didn’t look very hard for Java equivalents and just coded the required functions directly.

Reading in the Image

The Python code for this is three lines, one to load the image and a couple more to convert it into a format suitable for directly manipulating pixel color through RGB values. Groovy takes a bit more to do the same, but being able to use a ‘with’ block makes interacting with the Graphics object a lot cleaner than the same Java code.

from PIL import Image
img = Image.open('input.gif')
img = img.convert("RGBA")
pixdata = img.load()

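import java.awt.image.BufferedImage
import javax.imageio.ImageIO
// Load the GIF, then redraw it onto an ARGB image so pixels can be edited as RGB values
BufferedImage image = ImageIO.read(new File(fileName))
BufferedImage dimg = new BufferedImage(image.width, image.height, BufferedImage.TYPE_INT_ARGB)
dimg.createGraphics().with {
    drawImage(image, null, 0, 0)
}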

Removing the Background Noise

In both cases we’re doing essentially the same thing: finding all non-black pixels and setting them to white. This leaves only the actual embedded text to stand out. Being able to utilize the Java Color constants makes the Groovy version a little more readable, IMO, but otherwise the two pieces of code are generally equivalent.

for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y] != (0, 0, 0, 255):
            pixdata[x, y] = (255, 255, 255, 255)

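import java.awt.Color
// Walk every pixel (i = row, j = column) and turn anything that is not pure black into white
(0..<dimg.height).each { i ->
    (0..<dimg.width).each { j ->
        if (dimg.getRGB(j, i) != Color.BLACK.RGB)
            dimg.setRGB(j, i, Color.WHITE.RGB)
    }
}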
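
To make the exact-match comparison concrete, here is a minimal pure-Python sketch of the same rule applied to a tiny hand-made pixel grid. The nested list and the `remove_noise` name are stand-ins made up for illustration (they are not part of PIL or of the code being ported). The thing to notice is that a pixel which is merely almost black gets wiped along with the rest of the background.

```python
BLACK = (0, 0, 0, 255)
WHITE = (255, 255, 255, 255)


def remove_noise(pixels):
    """Return a copy of the grid in which every pixel that is not exactly black is white."""
    return [[px if px == BLACK else WHITE for px in row] for row in pixels]


# Hypothetical 3x2 image: real text pixels are pure black, everything else is noise.
grid = [
    [BLACK, (200, 180, 40, 255), BLACK],
    [(1, 1, 1, 255), BLACK, (90, 90, 90, 255)],
]
cleaned = remove_noise(grid)
```

This only works because the text in these particular captchas is drawn in pure black; a captcha with anti-aliased or coloured text would need a looser comparison.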

Resizing the Image

Python’s library usage really shines here, making this a one-line call. Not quite the same in Java-land, although again there’s probably a better way to do this (I just don’t know it offhand).

big = im_orig.resize((116, 56), Image.NEAREST)

import java.awt.RenderingHints

// Scale the image onto a new w x h canvas using bilinear interpolation
def resizeImage = { BufferedImage image, int w, int h ->
    BufferedImage resized = new BufferedImage(w, h, image.type)
    resized.createGraphics().with {
        setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BILINEAR)
        drawImage(image, 0, 0, w, h, 0, 0, image.width, image.height, null)
    }
    return resized
}

dimg = resizeImage(dimg, 116, 56)
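
One difference worth noting: the Python call asks for Image.NEAREST, while the Groovy closure sets a bilinear interpolation hint. Nearest-neighbour scaling only copies existing pixels, so an image that is purely black and white stays that way, whereas bilinear scaling blends neighbouring pixels and can introduce in-between shades along edges. As a rough illustration, here is a minimal pure-Python sketch of nearest-neighbour scaling on a nested list. The `resize_nearest` name and the integer index mapping are assumptions made for this example; it is one common formulation, not necessarily the exact arithmetic PIL uses.

```python
def resize_nearest(pixels, new_w, new_h):
    """Scale a row-major grid to new_w x new_h by copying the nearest source pixel."""
    old_h = len(pixels)
    old_w = len(pixels[0])
    return [
        # Map each target coordinate back to a source coordinate with integer math.
        [pixels[y * old_h // new_h][x * old_w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]


# "B" and "W" stand in for black and white pixels.
small = [
    ["B", "W"],
    ["W", "B"],
]
big = resize_nearest(small, 4, 4)
# big is:
# [["B", "B", "W", "W"],
#  ["B", "B", "W", "W"],
#  ["W", "W", "B", "B"],
#  ["W", "W", "B", "B"]]
```

Every value in the enlarged grid is one that already existed in the source, which is the property that keeps a cleaned-up captcha strictly two-tone.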

By this point the original images now look like this, and are almost ready for OCR.

Converting to a tif File

This one turns out to be a bit of a PITA in Java, particularly on a Mac, and represents the bulk of the Groovy code. Unfortunately tif is also the only format that tesseract appears to accept ‘out of the box’. After googling the fun that is JAI and working with the .tif(f) format on a Mac, I ended up taking the code kindly provided in this blog post and Groovifying it a bit to make a working transformation. Thanks very much to Allan Tan for that. One more time, there’s likely a better/easier way to do this, but honestly it’s more effort than I’m willing to put in on a weekend afternoon just to satisfy my curiosity.

ext = ".tif"
big.save("input-NEAREST" + ext)

import com.sun.media.jai.codec.*
import javax.media.jai.*
import java.awt.color.ColorSpace
import java.awt.image.*

void convertToTiff(String inputFile, String outputFile) {
    OutputStream ios
    try {
        ios = new BufferedOutputStream(new FileOutputStream(new File(outputFile)))
        ImageEncoder enc = ImageCodec.createImageEncoder("tiff", ios, new TIFFEncodeParam(compression: TIFFEncodeParam.COMPRESSION_NONE, littleEndian: false))
        RenderedOp src = JAI.create("fileload", inputFile)

        // Apply the color filter and return the result.
        ColorConvertOp filterObj = new ColorConvertOp(ColorSpace.getInstance(ColorSpace.CS_sRGB), null)
        BufferedImage dst = new BufferedImage(src.width, src.height, BufferedImage.TYPE_3BYTE_BGR)
        filterObj.filter(src.getAsBufferedImage(), dst)

        // save the output file
        enc.encode(dst)
    } catch (Exception e) {
        println e
    } finally {
        ios?.close()  // flush and release the file even if encoding failed
    }
}

OCR with Tesseract-OCR

Finally we need to pass the processed image to tesseract so it can ‘read’ it for us. Again, the Python library makes this a breeze, but calling out to a command line program with Groovy is so simple that it ends up being about the same. Tesseract itself is available through MacPorts, as well as in downloadable Unix binaries and a Windows executable, so installing the software is easy.

from pytesser import *
image = Image.open('input-NEAREST.tif')
print image_to_string(image)

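// Run tesseract on the tif; it writes its result to "${tmpTesseract}.txt"
def tesseract = ['/opt/local/bin/tesseract', tmpTif, tmpTesseract].execute()
tesseract.waitFor()  // block until tesseract has finished writing the output file
return new File("${tmpTesseract}.txt").readLines()[0]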
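
For comparison, shelling out from Python without a wrapper library is similarly short. This is a minimal sketch, assuming a `tesseract` executable on the PATH and the `tesseract <image> <outputbase>` invocation used above, which writes its result to `<outputbase>.txt`. The function names are made up for illustration and are not part of pytesser.

```python
import subprocess
from pathlib import Path


def tesseract_command(tif_path, output_base, executable="tesseract"):
    """Build the argument list: tesseract <image> <outputbase>."""
    return [executable, str(tif_path), str(output_base)]


def read_first_line(output_base):
    """Read the first line of <output_base>.txt, or '' if the file is empty."""
    lines = Path(f"{output_base}.txt").read_text().splitlines()
    return lines[0].strip() if lines else ""


def ocr(tif_path, output_base, executable="tesseract"):
    """Run tesseract, wait for it to finish, then return the first recognised line."""
    # subprocess.run blocks until the process exits; check=True raises on a non-zero exit code.
    subprocess.run(tesseract_command(tif_path, output_base, executable), check=True)
    return read_first_line(output_base)
```

A call would look something like `ocr("input-NEAREST.tif", "/tmp/captcha")`, with both paths being placeholders. As in the Groovy version, the important detail is waiting for the process to exit before reading the output file.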

Testing it out

To test it out I implemented the code in a Maven project that iterates over the images and writes out intermediate results to a temp directory. And it only works on two out of three of the cases. For some reason tesseract insists on consistently seeing ‘e4ya’ as ‘e4ga’. I tried to see if I could get it working by tweaking the image manipulation parameters and the order of operations (resizing before removing the background noise, for instance) but that just caused the other cases to fail as well. Since in the final image the ‘y’ seems pretty clear, it’s more likely that tweaking the tesseract configuration might yield better results.

public void testPrintImage() {
    def breaker = new CaptchaBreaker()
    /* tesseract interprets "e4ya" as "e4ga" unfortunately */
    ['9koO', 'jxt9'/*,'e4ya'*/].each { String imageName ->
        def fileName = "src/test/resources/${imageName}.gif"
        assertEquals("Testing $imageName", imageName, breaker.imageToString(fileName))
    }
}

C’est Finis

I had some fun playing with areas of Java that I don’t usually interact with, and gained some appreciation for the diversity and ease of use exposed by just a couple of Python libraries. It’s comforting to note that I was able to implement all of the required functionality from those libraries in < 90 lines of Groovy. With a little more effort I think the final product could be tweaked to avoid the intermediate file system reads/writes as well, but that’s for another day.
Source code is available on GitHub if you’d care to take a look, and thanks for stopping by!
