My blog on the Internet

My name is Vinh Ngoc Khuc (Khúc Ngọc Vinh in Vietnamese). I was a student in the Computer Science Department of Moscow State University and am now a graduate student in the Computer Science and Engineering Department of The Ohio State University in Columbus. I created this blog to record my memories and improve my English ;)

Location: Columbus, Ohio, United States

Wednesday, June 24, 2009

How did I crack vnexpress.net's Captcha? Google Tesseract in action

Vnexpress.net uses a very simple captcha to prevent spamming on its website. To make things clear, let's analyze one of its images:


Notice that if we remove the red lines and the red background, this image becomes easy to OCR. To do that, I used the Python Imaging Library (PIL). Here is the code:
import Image  # the Python Imaging Library (PIL)

def convertBW(im):
    # The captcha is a palette GIF: collapse it to two tones so that
    # the red lines and background disappear and only the text remains.
    matrix = im.load()
    for x in range(im.size[0]):
        for y in range(im.size[1]):
            if matrix[x, y] > 0:
                matrix[x, y] = 0
            else:
                matrix[x, y] = 255
    return im

im = Image.open('image.gif')
im2 = convertBW(im)
im2.show()
im2.save('image_bw.gif', 'GIF')
After processing, the resulting image looks like this:


Now it is ready for any OCR software. I chose Google Tesseract because it is open source and can be trained. In addition, I used its Python wrapper, pytesser, from http://code.google.com/p/pytesser/ (pytesser_v0.0.1.zip). This version contains default trained data, stored in the folder "tessdata". I then tried to OCR the processed image using the default trained data, but it was not successful:

--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USIVI
--------------------------------

The letter 'M' was OCRed as 'IVI'. That is because the default trained data was created from training images in different fonts. To solve this problem, I had to train Tesseract myself. The official tutorial from Google is available at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract.

I used version 2.01 (tesseract-2.01). However, this version contains some errors which prevented me from training it successfully (I hope these errors have been fixed by now). So I applied the patches from http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html and compiled Tesseract from source.

The training data was obtained by downloading captcha images from vnexpress.net. I used 100 random images and combined them into a single sample (click to see it clearly):



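The exact script I used to stitch the captchas together is not shown here, but a minimal sketch of the idea looks like the following. The grid dimensions and file names are just illustrative, and it uses the modern Pillow import form rather than PIL's old `import Image`:

```python
from PIL import Image  # Pillow; old PIL used a bare "import Image"

def make_sample(paths, cols=10, cell_w=120, cell_h=50):
    """Paste the captcha images into one grid sheet for training."""
    rows = (len(paths) + cols - 1) // cols
    # White grayscale canvas large enough for all cells.
    sheet = Image.new('L', (cols * cell_w, rows * cell_h), 255)
    for i, path in enumerate(paths):
        im = Image.open(path).convert('L')
        x = (i % cols) * cell_w
        y = (i // cols) * cell_h
        sheet.paste(im, (x, y))
    return sheet

# Hypothetical usage with 100 downloaded captchas:
# sheet = make_sample(['captcha%03d.gif' % i for i in range(100)])
# sheet.save('sample.tif')
```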
Now it was just a matter of following the tutorial for training. I had to keep the default 'tessdata' folder for the training process to complete. After obtaining the four files 'inttemp', 'pffmtable', 'unicharset', and 'normproto', I added the 'eng.' prefix to them and created four other empty files: 'eng.DangAmbigs', 'eng.freq-dawg', 'eng.user-words', and 'eng.word-dawg'. All eight files were copied into the 'tessdata' folder to make the new trained data.
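The file shuffling above can be scripted. This is only a sketch of the renaming-and-copying step described in the paragraph, with assumed directory paths; it does not run any part of Tesseract itself:

```python
import os
import shutil

def prepare_tessdata(src_dir, tessdata_dir):
    """Copy the four trained files with the 'eng.' prefix and create
    the four empty companion files Tesseract expects."""
    trained = ['inttemp', 'pffmtable', 'unicharset', 'normproto']
    empty = ['eng.DangAmbigs', 'eng.freq-dawg',
             'eng.user-words', 'eng.word-dawg']
    os.makedirs(tessdata_dir, exist_ok=True)
    for name in trained:
        shutil.copy(os.path.join(src_dir, name),
                    os.path.join(tessdata_dir, 'eng.' + name))
    for name in empty:
        # Empty placeholder files are enough for this setup.
        open(os.path.join(tessdata_dir, name), 'w').close()

# Hypothetical usage, assuming training ran in the current directory:
# prepare_tessdata('.', 'tessdata')
```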

Let's see what I achieved :)

--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USM
--------------------------------

Please note that this entry was written only for fun. I won't take any responsibility for any damage caused by someone who uses the information in this entry for hacking purposes :D

1 Comment:

Blogger Unknown said...

Hi Vinh,
It's Phuc! ;) Nice job! Which image descriptor and which classification algorithm does that OCR lib use? OCR usually uses Shape Context (SC)/NN+SVM. Actually, if you want something more flexible and more accurate, you don't even need this lib. You can use Inner Distance Shape Context (IDSC) with a Naive-Bayes NN and that's good enough. SC is not good enough. I implemented IDSC in C++ and posted it on my site. :)

Whenever you two are free, take a trip over to DC and drop by my place! ;)

Phuc

9:12 AM  
