Vnexpress.net uses a very simple captcha to prevent spamming on its website. To see why it is simple, let's analyze one of its images:
Notice that if we remove the red lines and the red background, the image becomes easy to OCR. To do that, I used the
Python Imaging Library (PIL). Here is the code:
import Image

def convertBW(im):
    # The captcha is a palette (GIF) image, so each pixel value is a palette index.
    # Collapse the image to two values: every non-zero index becomes 0, index 0
    # becomes 255. For these captchas this separates the characters from the
    # red lines and background.
    matrix = im.load()
    for x in range(im.size[0]):
        for y in range(im.size[1]):
            if matrix[x, y] > 0:
                matrix[x, y] = 0
            else:
                matrix[x, y] = 255
    return im

im = Image.open('image.gif')
im2 = convertBW(im)
im2.show()
im2.save('image_bw.gif', 'GIF')
After processing, the resulting image looks like this:
Now the image is ready for any OCR software. I chose
Google Tesseract because it is open source and can be trained. I also used its Python wrapper, pytesser, from
http://code.google.com/p/pytesser/ (pytesser_v0.0.1.zip). This version ships with default trained data, stored in the folder "tessdata". I tried to OCR the processed image using the default trained data, but it was not successful:
--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USIVI
--------------------------------
I used version 2.01 (tesseract-2.01). However, this version contained some bugs that prevented me from training it successfully (hopefully they have been fixed by now). So I applied the patches from
http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html and compiled Tesseract from source.
The training data was obtained by downloading captcha images from vnexpress.net. I took 100 random images and combined them into a single sample image, as sketched below.
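Combining the downloaded images into one sheet can also be done with PIL. Below is a minimal sketch, assuming the 100 captchas have been saved locally as captcha_0.gif through captcha_99.gif and all share the same size, and using convertBW defined earlier; the file names and the 10x10 grid layout are only illustrative, not the exact script I used.

import Image

COLS, ROWS = 10, 10                                # 100 images in a 10x10 grid
first = Image.open('captcha_0.gif')
w, h = first.size
sheet = Image.new('L', (COLS * w, ROWS * h), 255)  # white canvas

for i in range(COLS * ROWS):
    # clean each captcha first, then paste it into its cell
    im = convertBW(Image.open('captcha_%d.gif' % i)).convert('L')
    x, y = (i % COLS) * w, (i // COLS) * h
    sheet.paste(im, (x, y))

sheet.save('sample.tif')                           # Tesseract training works on TIFF images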
Now it's possible to follow Tesseract's training tutorial. I had to keep the default 'tessdata' folder in place for the training process to complete. After obtaining the four files 'inttemp', 'pffmtable', 'unicharset', and 'normproto', I added the 'eng.' prefix to them and created four other empty files: 'eng.DangAmbigs', 'eng.freq-dawg', 'eng.user-words', and 'eng.word-dawg'. All eight files were copied into the 'tessdata' folder to form the new trained data (see the sketch below).
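The renaming and copying step is mechanical; here is a minimal sketch of it, assuming the training run left the four files in the current directory and 'tessdata' is the target folder (paths are illustrative):

import os
import shutil

trained = ['inttemp', 'pffmtable', 'unicharset', 'normproto']
empty   = ['eng.DangAmbigs', 'eng.freq-dawg', 'eng.user-words', 'eng.word-dawg']

for name in trained:
    # copy e.g. 'inttemp' to 'tessdata/eng.inttemp'
    shutil.copy(name, os.path.join('tessdata', 'eng.' + name))

for name in empty:
    # create the four empty companion files
    open(os.path.join('tessdata', name), 'w').close()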
Let's see what I achieved :)
--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USM
--------------------------------
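Putting everything together, reading a fresh captcha is now a two-step job: clean it with convertBW, then hand it to pytesser. A minimal end-to-end sketch, assuming the retrained data is in place and convertBW is defined as above (the file name is illustrative):

import Image
from pytesser import *

im = Image.open('captcha.gif')       # a freshly downloaded captcha
im = convertBW(im)                   # remove the red lines and background
print image_to_string(im).strip()    # prints the recognized text, e.g. 'USM'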
Please note that this entry was written only for fun. I won't take any responsibility for any damage caused by someone who uses information from this entry for hacking purposes :D