My blog on the Internet

My name is Vinh Ngoc Khuc (Khúc Ngọc Vinh in Vietnamese). I was a student in the Computer Science Department of Moscow State University and am now a graduate student in the Computer Science and Engineering Department of The Ohio State University, Columbus. I created this blog to record my memories and improve my English ;)

Location: Columbus, Ohio, United States

Wednesday, June 24, 2009

How did I crack vnexpress.net's Captcha? Google Tesseract in action

Vnexpress.net uses a very simple captcha to prevent spam on its website. To see why, let's analyze one of its images:


Notice that if we remove the red lines and the red background, the image becomes easy to OCR. To do that, I used the Python Imaging Library (PIL). Here is the code:
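import Image  # classic PIL; with Pillow, use "from PIL import Image"

def convertBW(im):
    # Each pixel of the GIF is read as a single integer value.
    matrix = im.load()
    for x in range(im.size[0]):
        for y in range(im.size[1]):
            if matrix[x, y] > 0:
                matrix[x, y] = 0      # non-zero: lines and background
            else:
                matrix[x, y] = 255    # zero: kept as the characters
    return im

im = Image.open('image.gif')
im2 = convertBW(im)   # modifies im in place and returns it
im2.show()
im2.save('image_bw.gif', 'GIF')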
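The thresholding rule inside convertBW can also be tried on a plain grid of pixel values, which makes it easy to check without an image file. This is only a sketch: the function name binarize and the parameter text_value are made up for illustration, and it assumes (as the code above implies) that the characters are the pixels with value 0.

```python
def binarize(pixels, text_value=0):
    """Return a new grid: 255 where a pixel equals text_value, else 0.

    `pixels` is a list of rows of non-negative integers. With the default
    text_value of 0 this is the same rule as convertBW: non-zero -> 0,
    zero -> 255. The input grid is left unmodified.
    """
    return [[255 if p == text_value else 0 for p in row] for row in pixels]


if __name__ == "__main__":
    # Hypothetical 2x3 grid of pixel values, not taken from a real captcha.
    sample = [[0, 3, 3],
              [7, 0, 0]]
    for row in binarize(sample):
        print(row)
```

Unlike convertBW, this version returns a new grid instead of changing the image in place, so the original values stay available for comparison.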
After processing, the resulting image looks like this:


Now the image is ready for any OCR software. I chose Google Tesseract because it's open source and can be trained. I used it through its Python wrapper, Pytesser, available at http://code.google.com/p/pytesser/ (pytesser_v0.0.1.zip). This package ships with default trained data, stored in the folder "tessdata". I first tried to OCR the processed image using the default trained data, but it was not successful:

--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USIVI
--------------------------------

The letter 'M' was OCRed as 'IVI'. That's because the default trained data was created using training images with different fonts. To solve this problem, I had to train Tesseract myself. The official tutorial from Google is available at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract.

I used version 2.01 (tesseract-2.01). However, this version contains some bugs that prevented me from training it successfully (I hope they have been fixed by now). So I applied the patches from http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html and compiled Tesseract from source.

The training data was obtained by downloading captcha images from vnexpress.net. I used 100 random images and combined them into a single training sample:
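The post doesn't say how the 100 images were combined. One plausible way is to tile them in a grid on one large sheet. The sketch below only computes the sheet size and the paste position of each tile; the function name grid_offsets, the tile size, the column count and the padding are all assumptions for illustration.

```python
def grid_offsets(count, tile_w, tile_h, columns, pad=0):
    """Lay out `count` equal-size tiles row by row in a grid.

    Returns (sheet_w, sheet_h, offsets), where offsets[i] is the
    top-left (x, y) position of tile i on the sheet. `pad` pixels
    are left around the sheet border and between neighbouring tiles.
    """
    rows = (count + columns - 1) // columns  # ceiling division
    sheet_w = columns * tile_w + (columns + 1) * pad
    sheet_h = rows * tile_h + (rows + 1) * pad
    offsets = []
    for i in range(count):
        row, col = divmod(i, columns)
        offsets.append((pad + col * (tile_w + pad),
                        pad + row * (tile_h + pad)))
    return sheet_w, sheet_h, offsets


if __name__ == "__main__":
    # Hypothetical numbers: 100 captchas of 60x20 pixels, 10 per row.
    w, h, offs = grid_offsets(100, 60, 20, columns=10, pad=5)
    print(w, h, offs[0], offs[11])
```

With PIL, each processed captcha could then be pasted onto a blank sheet of that size at its offset using Image.paste. The padding keeps characters from neighbouring captchas apart, so they are not mistaken for one word.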



With the sample ready, I followed the tutorial for training. I had to keep the default 'tessdata' folder in place for the training process to complete. After obtaining the 4 files 'inttemp', 'pffmtable', 'unicharset' and 'normproto', I added the prefix 'eng.' to their names and created 4 more empty files: 'eng.DangAmbigs', 'eng.freq-dawg', 'eng.user-words' and 'eng.word-dawg'. I then copied all 8 files to the 'tessdata' folder to form the new trained data.
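The renaming and copying step can be scripted. This is a minimal sketch, not the exact commands I ran: the function name install_traineddata and both directory arguments are hypothetical, and it overwrites any existing 'eng.*' files in the target folder, so back up the default 'tessdata' first.

```python
import os
import shutil

# Files produced by training, and the files that may be left empty.
TRAINED = ["inttemp", "pffmtable", "unicharset", "normproto"]
EMPTY = ["DangAmbigs", "freq-dawg", "user-words", "word-dawg"]


def install_traineddata(training_dir, tessdata_dir, lang="eng"):
    """Copy trained files into tessdata_dir as '<lang>.<name>' and
    create the remaining files empty. Returns the sorted file names."""
    created = []
    for name in TRAINED:
        target = os.path.join(tessdata_dir, lang + "." + name)
        shutil.copyfile(os.path.join(training_dir, name), target)
        created.append(os.path.basename(target))
    for name in EMPTY:
        target = os.path.join(tessdata_dir, lang + "." + name)
        open(target, "wb").close()  # create (or truncate to) an empty file
        created.append(os.path.basename(target))
    return sorted(created)
```

A call would look like install_traineddata('training_output', 'tessdata'), where both paths are placeholders for wherever the training output and the Pytesser 'tessdata' folder live.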

Let's see what I achieved :)

--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USM
--------------------------------

Please note that this entry was written only for fun. I won't take any responsibility for damage caused by anyone who uses the information in this entry for hacking purposes :D

Tuesday, June 23, 2009

Eurovision

Eurovision is a music contest held every year in Europe. This year Russia hosted Eurovision 2009 in Moscow, because the Russian entry won last year with Dima Bilan's performance of "Believe". The lyrics are also very meaningful:



After going through the songs in the Eurovision 2009 final, I felt that this year's entries were not as good or as fresh as last year's. This year's winner is Norway with "Fairytale", performed by Alexander Rybak:



Eurovision 2008 also had a few other songs that were quite good:

* Hold on be strong, Maria Haukaas Storeng, Norway:



* Oro, Jelena Tomasevic, Serbia (judging by the performance, it seems to be about war)