My blog on the Internet

My name is Vinh Ngoc Khuc (Khúc Ngọc Vinh in Vietnamese). I studied at the Computer Science Department of Moscow State University and am now a graduate student in the Computer Science and Engineering Department of The Ohio State University, Columbus. I created this blog to record my memories and improve my English ;)

Location: Columbus, Ohio, United States

Wednesday, June 24, 2009

How did I crack vnexpress.net's Captcha? Google Tesseract in action

Vnexpress.net uses a very simple captcha to prevent spam on its website. To make this clear, let's analyze one of its images:


Notice that if we remove the red lines and the red background, the image becomes easy to OCR. To do that, I used the Python Imaging Library (PIL). Here is the code:
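from PIL import Image  # on the old PIL, plain "import Image" also works

def convertBW(im):
    # Set every pixel with a non-zero value to 0 and every zero pixel to 255.
    # The image is modified in place through its pixel-access object.
    matrix = im.load()
    for x in range(im.size[0]):
        for y in range(im.size[1]):
            if matrix[x, y] > 0:
                matrix[x, y] = 0
            else:
                matrix[x, y] = 255
    return im

im = Image.open('image.gif')
im2 = convertBW(im)
im2.show()
im2.save('image_bw.gif', 'GIF')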
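The thresholding rule itself can be checked without an image file. This is only a minimal sketch: the function name convert_bw and the small grid of pixel values are made up for illustration and are not taken from a real captcha.

```python
def convert_bw(grid):
    """Return a new grid where non-zero values become 0 and zeros become 255."""
    return [[0 if value > 0 else 255 for value in row] for row in grid]


# Hypothetical 3x4 grid of pixel values (assumption, not real captcha data).
sample = [
    [0, 0, 7, 0],
    [0, 7, 7, 0],
    [3, 0, 7, 0],
]

result = convert_bw(sample)
for row in result:
    print(row)
```

Unlike the PIL version, this sketch returns a new grid instead of changing the input in place.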
After processing, the result image will be like this:


Now the image is ready for any OCR software. I chose Google Tesseract because it's open source and can be trained. I also used its Python wrapper, pytesser, from http://code.google.com/p/pytesser/ (pytesser_v0.0.1.zip). This version ships with default trained data, stored in the "tessdata" folder. I first tried to OCR the processed image using the default trained data, but it was not successful:

--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USIVI
--------------------------------

The letter 'M' was OCRed as 'IVI'. That's because the default trained data was created using training images with different fonts. To solve this problem, I had to train Tesseract myself. The official tutorial from Google is available at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract.

I used version 2.01 (tesseract-2.01). However, this version contains some bugs that prevented me from training it successfully (hopefully they have been fixed by now). So I applied the patches from http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html and compiled Tesseract from source.

The training data was obtained by downloading captcha images from vnexpress.net. I used 100 random images and combined them into a single training sample.
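One way to combine many small images into one sheet is to lay them out on a grid. The sketch below only computes where each image would go. The grid layout, the 10 columns and the 60x20 tile size are all assumptions for illustration; the post does not say how the sample was actually arranged.

```python
def grid_positions(count, columns, tile_width, tile_height):
    """Return the top-left (x, y) offset of each tile in a row-major grid."""
    positions = []
    for index in range(count):
        row, col = divmod(index, columns)
        positions.append((col * tile_width, row * tile_height))
    return positions


def sheet_size(count, columns, tile_width, tile_height):
    """Return the (width, height) of a sheet big enough for all tiles."""
    rows = (count + columns - 1) // columns  # round up to whole rows
    return (columns * tile_width, rows * tile_height)


# Hypothetical numbers: 100 captchas, 10 per row, each 60x20 pixels.
positions = grid_positions(100, 10, 60, 20)
size = sheet_size(100, 10, 60, 20)
print(size, positions[:3])
```

With PIL, the sheet could then be created with Image.new and each captcha pasted at its offset with the paste method.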



Now we can follow the tutorial for training. I had to keep the default 'tessdata' folder in place for the training process to complete. After obtaining the 4 files 'inttemp', 'pffmtable', 'unicharset', and 'normproto', I added the prefix 'eng.' to their names and created 4 other empty files: 'eng.DangAmbigs', 'eng.freq-dawg', 'eng.user-words', and 'eng.word-dawg'. All 8 files were then copied to the 'tessdata' folder to form the new trained data.
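The renaming and copying step can be scripted. This is a rough sketch, not the exact commands I ran: the function install_trained_data and the directory names are hypothetical, and the demo works in a temporary folder with placeholder files so that it never touches a real 'tessdata' folder.

```python
import os
import shutil
import tempfile

# File names as listed in the post.
TRAINED = ["inttemp", "pffmtable", "unicharset", "normproto"]
EMPTY = ["DangAmbigs", "freq-dawg", "user-words", "word-dawg"]


def install_trained_data(training_dir, tessdata_dir, lang="eng"):
    """Copy the four trained files with a language prefix and create
    the four empty companion files. Returns the list of created paths."""
    if not os.path.isdir(tessdata_dir):
        os.makedirs(tessdata_dir)
    installed = []
    for name in TRAINED:
        target = os.path.join(tessdata_dir, "%s.%s" % (lang, name))
        shutil.copyfile(os.path.join(training_dir, name), target)
        installed.append(target)
    for name in EMPTY:
        target = os.path.join(tessdata_dir, "%s.%s" % (lang, name))
        open(target, "w").close()  # create an empty file
        installed.append(target)
    return installed


# Demo with placeholder files in a temporary directory (assumption:
# real training output would be in training_dir instead).
work = tempfile.mkdtemp()
training_dir = os.path.join(work, "training")
tessdata_dir = os.path.join(work, "tessdata")
os.makedirs(training_dir)
for name in TRAINED:
    with open(os.path.join(training_dir, name), "w") as handle:
        handle.write("placeholder")

installed = install_trained_data(training_dir, tessdata_dir)
print(sorted(os.path.basename(path) for path in installed))
```

Note that pointing tessdata_dir at a real installation would overwrite any existing files with the same 'eng.' names, so back them up first.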

Let's see what I achieved :)

--------------------------------
>>> from pytesser import *
>>> image = Image.open('image_bw.gif')
>>> print image_to_string(image)
USM
--------------------------------

Please note that this entry was written only for fun. I won't take any responsibility for any damage caused by someone who uses information from this entry for hacking purposes :D

Tuesday, June 23, 2009

Eurovision

Eurovision is a music contest held every year in Europe. This year Russia hosted Eurovision 2009 in Moscow, because the Russian entry won last year with Dima Bilan's performance of "Believe". The lyrics are also very meaningful:



After going through the songs in the Eurovision 2009 final, it seems this year's entries are not as good or as unusual as last year's. This year's winner is Norway with "Fairytale", performed by Alexander Rybak:



Eurovision 2008 also had some other quite good songs:

* Hold on be strong, Maria Haukaas Storeng, Norway:



* Oro, Jelena Tomasevic, Serbia (judging by the performance, it seems to be about war)

Sunday, May 31, 2009

Test mirror blog - written in Facebook

test

Supported:

Image:
Link: Display Text
Rich Text: Bold Font, Italic Font, Underline Font
Font Color: Text In Color
Font Size: One Size Bigger Font, One Size Smaller Font
Paragraph:

Text


More HTML Tags: Facebook supports most HTML tags. Explore more yourself!

Tuesday, April 24, 2007

Google Talk is now web-based

I am really excited about the new web-based version of Google Talk.

That means we can use Google Talk anytime, anywhere, as long as we have an Internet connection and a web browser.

I hope Google Talk will kick Yahoo Messenger off my computer someday.

Sadly, I currently have to use Yahoo Messenger, an IM program with a closed protocol and pop-up ads, to communicate with my friends, rather than a Jabber client with an open protocol and better security.

The web-based Google Talk is here:

Tuesday, March 13, 2007

How I managed to install the driver for the HP Deskjet 5150 in Vista

After googling around the Internet for information about installing the driver for the HP Deskjet 5150 in Vista, I decided to go to HP's homepage to look for a new Vista driver. According to HP, my Deskjet 5150 can use the Deskjet 5600 driver.

But I couldn't find any download link for the Deskjet 5600 driver. Instead, HP says Vista has a built-in driver for the Deskjet 5600, so there is nothing to download.

So, everything is clear now. Let's go:
Device Manager -> right-click hp deskjet 5150 -> Update driver -> choose to install the driver manually, and click "Let me pick from a list of device drivers on my computer" -> Printer, and click Next -> HP 5600. And boom, it installed successfully.

That is the way I found to install the driver for my HP printer in Vista. I hope HP will soon release a Vista driver for it.

Monday, December 05, 2005

Drag & Drop in web pages

Today I found a technique that lets us include images that can be dragged and dropped around the page.
The technique is described at: http://www.walterzorn.com/dragdrop/dragdrop_e.htm

Tuesday, November 01, 2005

Examples of AJAX

Using the AJAX technique, a website can improve its performance because, as you know, the page doesn't have to reload itself to show updated information.
Some sites using AJAX:
1. http://netvibes.com/
2. http://maps.google.com/
3. http://www.google.com/ig
4. http://reader.google.com/ (RSS reader from Google)
5. http://start.com/ (from Microsoft, built with Atlas)
6. http://osx.portraitofakite.com/ (Mac OS built with AJAX)
7. http://ajaxanywhere.sourceforge.net/demo.html (side-by-side comparison between a site with AJAX and one without)
8. http://www.live.com (Windows Live)
9. http://dhtmlnirvana.com/ajax/ajax_tutorial/#

Some people don't like coding in JavaScript, so they are afraid of using AJAX in their websites. Several projects developing AJAX wrappers have also appeared on the Internet, such as:
1. AjaxAnywhere for Java:
http://ajaxanywhere.sourceforge.net/
2. Ajax.NET for C#:
http://ajax.schwarz-interactive.de/csharpsample/default.aspx
3. Atlas from Microsoft:
http://beta.asp.net/default.aspx?tabindex=7&tabid=47