How did I crack vnexpress.net's Captcha? Google Tesseract in action
Vnexpress.net uses a very simple captcha to prevent spam on its website. To see why it is simple, let's analyze one of its images:
Notice that if we remove the red lines and the red background, the image becomes easy to OCR. To do that, I used the Python Imaging Library (PIL). Here is the code:
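from PIL import Image

def convertBW(im):
    # Pixel access object; for a GIF the values are usually palette indices
    matrix = im.load()
    for x in range(im.size[0]):
        for y in range(im.size[1]):
            # Any non-zero value becomes 0; a zero value becomes 255
            if matrix[x, y] > 0:
                matrix[x, y] = 0
            else:
                matrix[x, y] = 255
    return im

im = Image.open('image.gif')
im2 = convertBW(im)  # modifies im in place and returns the same object
im2.show()
im2.save('image_bw.gif', 'GIF')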
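To make the pixel rule in convertBW easier to check without an image file, here is a minimal sketch that applies the same rule to a plain list of rows. The function name invert_to_bw and the sample values are made up for illustration. Note that the grid is indexed [row][column], while PIL's pixel access is [x, y].

```python
def invert_to_bw(grid):
    """Return a new grid: non-zero values become 0, zero values become 255."""
    return [[0 if value > 0 else 255 for value in row] for row in grid]

# Hypothetical 3x4 patch of pixel values (not taken from a real captcha)
patch = [
    [0, 3, 3, 0],
    [0, 0, 7, 0],
    [1, 0, 0, 0],
]

result = invert_to_bw(patch)
for row in result:
    print(row)
```

Unlike convertBW, this sketch leaves its input untouched and returns a new grid.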
After processing, the resulting image looks like this:
Now the image is ready for any OCR software. I chose Google Tesseract because it is open source and can be trained. I used its Python wrapper, PyTesser, from http://code.google.com/p/pytesser/ (pytesser_v0.0.1.zip). This version ships with default trained data, stored in the folder "tessdata". I first tried to OCR the processed image with the default trained data, but it was not successful:
--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USIVI
--------------------------------
The letter 'M' was OCRed as 'IVI'. That's because the default trained data was created using training images with different fonts. To solve this problem, I had to train Tesseract myself. The official tutorial from Google is available at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract.
I used version 2.01 (tesseract-2.01). However, this version contains some errors that prevented me from training it successfully (I hope they have been fixed by now). So I applied the patches from http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html and compiled Tesseract from source.
The training data was obtained by downloading captcha images from vnexpress.net. I used 100 random images and combined them to create the training sample.
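I did not record exactly how the 100 images were combined, so the following is only a sketch of one possible approach: lay the tiles out on a grid and compute where each one goes. The function name grid_layout, the 60x20 tile size, and the 10-column layout are all assumptions for illustration.

```python
def grid_layout(count, tile_w, tile_h, columns):
    """Return (canvas_size, positions) for `count` equal tiles on a grid.

    positions[i] is the (x, y) of the top-left corner of tile i,
    filled row by row, left to right.
    """
    rows = (count + columns - 1) // columns  # round up to whole rows
    positions = [
        ((i % columns) * tile_w, (i // columns) * tile_h)
        for i in range(count)
    ]
    return (columns * tile_w, rows * tile_h), positions

# Hypothetical numbers: 100 captcha tiles of 60x20 pixels, 10 per row
canvas_size, positions = grid_layout(100, 60, 20, 10)
print(canvas_size)
print(positions[:3])
```

With PIL, the canvas could then be created with Image.new and each tile placed with paste at its computed position.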
Now the training tutorial can be followed. I had to keep the default 'tessdata' folder in place for the training process to complete. After obtaining the 4 files 'inttemp', 'pffmtable', 'unicharset', and 'normproto', I added the prefix 'eng.' to each of them and created 4 other empty files: 'eng.DangAmbigs', 'eng.freq-dawg', 'eng.user-words', and 'eng.word-dawg'. All 8 files were then copied to the 'tessdata' folder to form the new trained data.
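The renaming and copying step can be scripted. This is a minimal sketch, not the exact commands I ran: the function name install_trained_data is hypothetical, and the demo uses throwaway temporary folders with placeholder files rather than real Tesseract output. Only the eight file names come from the steps above.

```python
import os
import shutil
import tempfile

# File names produced by training, and the ones that are created empty
TRAINED_FILES = ['inttemp', 'pffmtable', 'unicharset', 'normproto']
EMPTY_FILES = ['DangAmbigs', 'freq-dawg', 'user-words', 'word-dawg']

def install_trained_data(training_dir, tessdata_dir, lang='eng'):
    """Copy trained files into tessdata_dir with a '<lang>.' prefix
    and create the four empty companion files. Existing files with
    the same names are overwritten, so back up tessdata first."""
    installed = []
    for name in TRAINED_FILES:
        target = os.path.join(tessdata_dir, lang + '.' + name)
        shutil.copyfile(os.path.join(training_dir, name), target)
        installed.append(target)
    for name in EMPTY_FILES:
        target = os.path.join(tessdata_dir, lang + '.' + name)
        open(target, 'w').close()  # create (or truncate to) an empty file
        installed.append(target)
    return installed

# Demo on temporary folders with placeholder content
training_dir = tempfile.mkdtemp()
tessdata_dir = tempfile.mkdtemp()
for name in TRAINED_FILES:
    with open(os.path.join(training_dir, name), 'w') as handle:
        handle.write('placeholder')

installed = install_trained_data(training_dir, tessdata_dir)
print(sorted(os.path.basename(path) for path in installed))
```

For real use, training_dir would be the folder holding the training output and tessdata_dir the 'tessdata' folder that PyTesser reads.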
Let's see what I achieved :)
--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USM
--------------------------------
Please note that this entry was written only for fun. I won't take any responsibility for any damage caused by someone who uses information from this entry for hacking purposes :D
1 Comment:
Hi Vinh,
Phuc here! ;) Nice job! Which image descriptor and which classification algorithm does that OCR lib use? OCR usually uses Shape Context (SC)/NN+SVM. Actually, if you want something more flexible and more accurate, you don't need this lib at all. You can use Inner Distance Shape Context (IDSC) and Naive-Bayes NN, and that is good enough. SC is not good enough. I implemented IDSC in C++ and posted it on my site. :)
Whenever the two of you are free and traveling through DC, stop by my place for a visit! ;)
Phuc