My blog on the Internet

My name is Vinh Ngoc Khuc (Khúc Ngọc Vinh in Vietnamese). I was a student in the Computer Science Department of Moscow State University and am now a graduate student in the Computer Science and Engineering Department of The Ohio State University in Columbus. I created this blog to record my memories and improve my English ;)

Location: Columbus, Ohio, United States

Wednesday, June 24, 2009

How did I crack vnexpress.net's Captcha? Google Tesseract in action

Vnexpress.net uses a very simple captcha to prevent spamming on its website. To make things clear, let's analyze one of its images:


Notice that if we remove the red lines and the red background, this image becomes easy to OCR. To do that, I used the Python Imaging Library (PIL). Here is the code:
import Image  # the Python Imaging Library (PIL)

def convertBW(im):
    # The captcha is a palette GIF: collapse it to two tones so that
    # the red lines and background disappear and only the text remains.
    matrix = im.load()
    for x in range(im.size[0]):
        for y in range(im.size[1]):
            if matrix[x, y] > 0:
                matrix[x, y] = 0
            else:
                matrix[x, y] = 255
    return im

im = Image.open('image.gif')
im2 = convertBW(im)
im2.show()
im2.save('image_bw.gif', 'GIF')
After processing, the resulting image looks like this:


Now it is ready for any OCR software. I chose Google Tesseract because it is open source and can be trained. In addition, I used its Python wrapper, pytesser, from http://code.google.com/p/pytesser/ (pytesser_v0.0.1.zip). This version contains default trained data, stored in the folder "tessdata". I then tried to OCR the processed image using the default trained data, but it was not successful:

--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USIVI
--------------------------------

The letter 'M' was OCRed as 'IVI'. That is because the default trained data was created from training images in different fonts. To solve this problem, I had to train Tesseract myself. The official tutorial from Google is available at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract.

I used version 2.01 (tesseract-2.01). However, this version contains some errors which prevented me from training it successfully (I hope these errors have been fixed by now). So I applied the patches from http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html and compiled Tesseract from source.

The training data was obtained by downloading captcha images from vnexpress.net. I used 100 random images and combined them into a single sample (click to see it clearly):



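The exact script I used to stitch the captchas together is not shown here, but a minimal sketch of the idea looks like the following. The grid dimensions and file names are just illustrative, and it uses the modern Pillow import form rather than PIL's old `import Image`:

```python
from PIL import Image  # Pillow; old PIL used a bare "import Image"

def make_sample(paths, cols=10, cell_w=120, cell_h=50):
    """Paste the captcha images into one grid sheet for training."""
    rows = (len(paths) + cols - 1) // cols
    # White grayscale canvas large enough for all cells.
    sheet = Image.new('L', (cols * cell_w, rows * cell_h), 255)
    for i, path in enumerate(paths):
        im = Image.open(path).convert('L')
        x = (i % cols) * cell_w
        y = (i // cols) * cell_h
        sheet.paste(im, (x, y))
    return sheet

# Hypothetical usage with 100 downloaded captchas:
# sheet = make_sample(['captcha%03d.gif' % i for i in range(100)])
# sheet.save('sample.tif')
```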
Now it was just a matter of following the tutorial for training. I had to keep the default 'tessdata' folder for the training process to complete. After obtaining the four files 'inttemp', 'pffmtable', 'unicharset', and 'normproto', I added the 'eng.' prefix to them and created four other empty files: 'eng.DangAmbigs', 'eng.freq-dawg', 'eng.user-words', and 'eng.word-dawg'. All eight files were copied into the 'tessdata' folder to make the new trained data.
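The file shuffling above can be scripted. This is only a sketch of the renaming-and-copying step described in the paragraph, with assumed directory paths; it does not run any part of Tesseract itself:

```python
import os
import shutil

def prepare_tessdata(src_dir, tessdata_dir):
    """Copy the four trained files with the 'eng.' prefix and create
    the four empty companion files Tesseract expects."""
    trained = ['inttemp', 'pffmtable', 'unicharset', 'normproto']
    empty = ['eng.DangAmbigs', 'eng.freq-dawg',
             'eng.user-words', 'eng.word-dawg']
    os.makedirs(tessdata_dir, exist_ok=True)
    for name in trained:
        shutil.copy(os.path.join(src_dir, name),
                    os.path.join(tessdata_dir, 'eng.' + name))
    for name in empty:
        # Empty placeholder files are enough for this setup.
        open(os.path.join(tessdata_dir, name), 'w').close()

# Hypothetical usage, assuming training ran in the current directory:
# prepare_tessdata('.', 'tessdata')
```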

Let's see what I achieved :)

--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USM
--------------------------------

Please note that this entry was written only for fun. I won't take any responsibility for any damage caused by someone who uses the information in this entry for hacking purposes :D

1 Comment:

Blogger Unknown said...

Hi Vinh,
It's Phuc! ;) Nice job! Which image descriptor and which classification algorithm does that OCR lib use? OCR usually uses Shape Context (SC)/NN+SVM. Actually, if you want something more flexible and more accurate, you don't even need this lib. You can use Inner Distance Shape Context (IDSC) with a Naive-Bayes NN and that's good enough. SC is not good enough. I implemented IDSC in C++ and posted it on my site. :)

Whenever you two are free, take a trip over to DC and drop by my place! ;)

Phuc

9:12 AM  
