Decoding CAPTCHA’s Handbook

Some time ago I wrote an article, Decoding CAPTCHA’s, which appears to have become the first resource most people encounter when searching for information about decoding CAPTCHAs.

I continued to write about CAPTCHAs over the years, with posts scattered around the web. A while ago I started to consolidate all of my content on this blog and realised that I had considerably more CAPTCHA-related articles than I thought. Some were in an unfinished or unpublished state. I had considered posting them all online, but instead decided to polish them up into a much better resource and publish it as a book.

The book is now online and available for sale. It’s not what you would call a top seller, but it has produced enough sales to offset some hosting costs for the blog, which was one goal. I also wanted to test the waters of selling an info product. Info products, according to Amy Hoy and Rob Walling, are an excellent platform for learning how to build recurring revenue online. Certainly one thing I have learnt is that something borderline unethical, such as this book, will never become a great seller. Simply put, the audience is unlikely to convert into a high-sales business because of the shady nature of the content.

Anyway, for those interested in purchasing a copy, you can either visit the Decoding CAPTCHA’s page or use the links below.

Decoding CAPTCHA's Book
Looking for a practical guide to CAPTCHA decoding? All About CAPTCHAs. This eBook will teach you how to identify weaknesses in CAPTCHAs and exploit them from beginning to end.

Buy now using LeanPub

Buy now using Gumroad

Decoding Captcha’s Presentation

A few days ago there was a lack of speakers for #SyPy, the Sydney Python meet-up held most months and sponsored by Atlassian. I had previously put my hand up to help out if this situation ever came up, and was mostly ready with a presentation about Decoding Captchas. I did not expect it to be so full that people were standing (the largest crowd I had ever seen there). Thankfully it seemed to go over well, and while I need to get more practice at public speaking, I did enjoy it. A few choice tweets came out of the end of the event.


Anyway, you can get all the code and the slides via Decoding Captchas Bitbucket or Decoding Captchas GitHub.

Collection of Letters for Neural Network OCR Training

I was looking for this on Google the other day and was unable to find it. Essentially what I needed was a collection of images which are all the same size, but in different fonts, so that I could use them for training neural networks and testing other OCR techniques. Since I couldn’t find any, I thought I would upload my own collection.

I used the below images when working on my thesis. From memory, over 20 different fonts and sizes were used to create about 200 examples of each letter. The full data set proved to be pretty accurate when it came to recognizing most examples of text I found on the web.

The attached collection of images was generated using a script. It essentially just generates a number of images, each containing a single letter. Another script then finds the location of the letter in each image, crops to just the letter, resizes it to a specific size and saves it in an appropriate directory. The full training set can be downloaded below.

Collection of letters for CAPTCHA/OCR/Neural Network training.
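The crop-and-resize step described above can be sketched roughly as follows. This is not the original script (which is not shown here); it is a minimal pure-Python illustration that assumes an image is a list of rows of grayscale values with a black (0) background. The function names, the nearest-neighbour resize and the 4x4 target size are all choices made for this example only.

```python
def bounding_box(pixels, background=0):
    """Return (top, left, bottom, right) around the non-background pixels, or None."""
    rows = [y for y, row in enumerate(pixels) if any(v != background for v in row)]
    if not rows:
        return None
    cols = [x for x in range(len(pixels[0]))
            if any(row[x] != background for row in pixels)]
    # bottom and right are exclusive so they can be used directly in slices.
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(pixels, box):
    """Cut the image down to the given bounding box."""
    top, left, bottom, right = box
    return [row[left:right] for row in pixels[top:bottom]]


def resize_nearest(pixels, width, height):
    """Resize with nearest-neighbour sampling (an assumed, simple choice)."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def normalise_letter(pixels, size=(4, 4), background=0):
    """Find the letter, crop to it and scale it to a fixed size."""
    box = bounding_box(pixels, background)
    if box is None:
        raise ValueError("image contains no letter pixels")
    width, height = size
    return resize_nearest(crop(pixels, box), width, height)


# A 7x6 black image with a 2x2 white "letter" somewhere inside it.
image = [[0] * 7 for _ in range(6)]
for y in (2, 3):
    for x in (4, 5):
        image[y][x] = 255

letter = normalise_letter(image, size=(4, 4))
```

Every letter ends up the same size regardless of where it was drawn or how large the font was, which is what makes the images usable as uniform training input.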

The PHP program for generating the images is included below. All you need to do is add some fonts to the referenced fonts directory and it should generate images for you.

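<?php
// List the font files, skipping the "." and ".." directory entries.
$files1 = array_values(array_diff(scandir("./fonts/"), array(".", "..")));
$file1totalcount = count($files1);
$file1count = 0;
$letters = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z";
//$letters = "a b c d e f g h i j k l m n o p q r s t u v w x y z";
$array = explode(" ", $letters);
$number = 200; // images to generate per letter

foreach ($array as $letter) {
 for ($i = 0; $i < $number; $i++) {
  $im = imagecreatetruecolor(500, 300);
  // Create some colors
  $white = imagecolorallocate($im, 255, 255, 255);
  $grey = imagecolorallocate($im, 128, 128, 128);
  $black = imagecolorallocate($im, 0, 0, 0);
  // Fill the whole 500x300 canvas with black.
  imagefilledrectangle($im, 0, 0, 499, 299, $black);

  // Draw the letter in white with a random font, size and position.
  $font = './fonts/'.$files1[rand(0, $file1totalcount - 1)];
  imagettftext($im, rand(15, 30), 0, rand(30, 200), rand(20, 250), $white, $font, $letter);

  // NOTE: the original listing is cut off at this point. The save step and
  // the output path below are assumptions added so the script runs; the
  // ./output/ directory is hypothetical and must already exist.
  imagepng($im, './output/'.$letter.'_'.$i.'.png');
  imagedestroy($im);
 }
}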

List of useful CAPTCHA Decoding Articles

This website ranks quite high in most search engines for the search term “captcha decoding” or some permutation of it. As such, here is a collection of useful links if you are looking into doing such a thing. If any more come up I will be sure to update this post.

Shameless self promotion, but this link is why this page ranks so highly. It’s an article I wrote some time ago about how to go about decoding a simple CAPTCHA. There is full source code and the principles can be applied to 90% of CAPTCHAs out there. For the record, it only came about because a colleague bet me that I couldn’t decode his website’s CAPTCHA, which was the one used in the article. Of course I waited till he changed it before publishing.

Interesting post on how to bypass a CAPTCHA using Python. The CAPTCHA broken in this article is far more complex than most of the others in this list. Full source code is provided, so it’s an excellent source to look at even though the article is missing a lot of details.

Another Python post about breaking CAPTCHAs. I think the number of Python posts might be due to how powerful PIL is. Has full source code. This one is worth looking at because, unlike the two previous ones, it uses an existing OCR engine, Tesseract, to perform the recognition.

This is one of the older CAPTCHA articles around and does not supply source code. It does however go into a good amount of detail about how the author looked for weaknesses in the CAPTCHA and then went about writing an algorithm to defeat it. It really is a pity the code for this one was never released.

A slightly different approach. Rather than try to code around the problem, here is how to get humans to do it for you.

A PHP project that has been around since 2004 for defeating CAPTCHAs. Code is available, so it’s worth taking a look at.

It seems the original content that went with the above posting on Slashdot has disappeared, but I am sure it exists somewhere else on the web. I may have a copy lying around which I will upload if I find it. Goes into detail of how to defeat the reCAPTCHA project’s CAPTCHA.

This article about defeating Digg 2.0’s CAPTCHA is hopelessly out of date; however, it shows how easily a simple CAPTCHA can be defeated if the person creating it has little knowledge of what they are doing. I believe it ties in well with this post.

This is the granddaddy of all the above posts, papers and articles. The full paper is linked in there and has far more detail. It is one of the main sources I used when I started learning about decoding CAPTCHAs.

How reCAPTCHA Works, plus, how to cheat it, and how it contributes to the common good.

How to defeat Snapchat’s CAPTCHA. Fairly light on details but provides the source code (C++) to defeat it.

Breaking the SilkRoad’s CAPTCHA. Its follow-up about breaking the new SilkRoad’s CAPTCHA is worth reading as well.


Why CAPTCHAs Never Use the Numbers 0 1 5 7

Interestingly this sort of question pops up a lot in my referring search term stats.

Why CAPTCHA’s never use the numbers 0 1 5 7

It’s a relatively simple question with a reasonably simple answer: each of the above numbers is easy to confuse with a letter. See below.

CAPTCHA With 0 and O

CAPTCHA With 1 and I

CAPTCHA With 5 and S

CAPTCHA With 7 and J L I

Are you able to tell the difference? For some, yes; for others, certainly not. For those wondering, the first character is the number and the rest are letters, in the format “number dash letter letter”.

They all look fairly similar to a human, especially when they are warped and made fuzzy and all of the other stuff a CAPTCHA does to make OCR (optical character recognition) difficult. Interestingly, when you do this you can end up with the unusual situation that the CAPTCHA is easier to decode for a computer than for a human, since the computer can just churn through thousands of attempts, get a majority right and still successfully spam a website.
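To put rough numbers on that last point, here is a back-of-the-envelope sketch. It assumes each character is recognised independently with the same accuracy, which is a simplification, and the 90% accuracy, 4-character length and 10,000 attempts are illustrative figures, not measurements from any real CAPTCHA.

```python
def solve_probability(char_accuracy, length):
    """Chance of getting every character of one CAPTCHA right,
    assuming characters are recognised independently."""
    return char_accuracy ** length


def expected_successes(attempts, char_accuracy, length):
    """Average number of CAPTCHAs solved over a batch of attempts."""
    return attempts * solve_probability(char_accuracy, length)


# Illustrative figures only: 90% per-character accuracy, 4 characters.
per_captcha = solve_probability(0.9, 4)
solved = expected_successes(10_000, 0.9, 4)
print(f"{per_captcha:.2%} per attempt, roughly {solved:.0f} of 10,000 solved")
```

Under these assumptions the solver gets roughly two out of three CAPTCHAs right, and a bot does not care about the failed attempts because retrying costs it almost nothing.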

The CAPTCHA used to create the images in this post can be found here, which I discovered in a comment by Mario on my own post about why you shouldn’t write your own CAPTCHAs. It’s a pretty good CAPTCHA as far as CAPTCHAs go, and I had to modify it to produce the results above. Out of the box it never displays similar text like this. If you do insist on using a CAPTCHA on your site I highly suggest having a look at it.
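As a rough illustration of how a generator can avoid showing easily confused characters, the sketch below builds its alphabet with the four digits from this post left out. It is not the code of the CAPTCHA linked above; the function names and the default length of 6 are made up for this example, and whether to also drop the look-alike letters (O, I, S, J, L) is a design choice left to the `exclude` parameter.

```python
import random
import string

# The digits this post is about: 0/O, 1/I, 5/S and 7/J/L/I look alike.
AMBIGUOUS_DIGITS = "0157"


def captcha_alphabet(exclude=AMBIGUOUS_DIGITS):
    """Upper-case letters and digits, minus the excluded characters."""
    candidates = string.ascii_uppercase + string.digits
    return "".join(c for c in candidates if c not in exclude)


def random_captcha_text(length=6, rng=None, exclude=AMBIGUOUS_DIGITS):
    """Pick random characters from the filtered alphabet."""
    rng = rng or random.SystemRandom()
    alphabet = captcha_alphabet(exclude)
    return "".join(rng.choice(alphabet) for _ in range(length))


text = random_captcha_text()
```

Passing `exclude="0157OISJL"` would remove both halves of each confusable pair, at the cost of a smaller alphabet.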

Why You Shouldn’t Roll Your Own CAPTCHA

At a TechEd I attended a few years ago I was watching a presentation about security by Rocky Heckman (read his blog, it’s quite good). In it he was talking about security algorithms. The part that really stuck with me went like this:

“Don’t write your own Crypto algorithms unless you have a Doctorate in Cryptography.” Interestingly someone there did have said qualification, and Rocky had to make an exception for that single person.

Nonetheless, I think this sort of advice can be applied to all sorts of situations. In particular, one that really strikes close to my heart is CAPTCHAs. So, following the words of Rocky, I will make a simple statement.

“Don’t write your own CAPTCHAs unless you have a Doctorate in Machine Vision.”


A Difficult CAPTCHA to break

Now you are probably going to ask why? The reason is quite simple really. Unless you know what sort of attacks your CAPTCHA is going to experience then you don’t know how to defend against those attacks.

I’m going to pull a figure out of the air here, but I would say that 90% of the home-grown CAPTCHAs out there on the internet are trivial to crack. Now the owners of these CAPTCHAs will point to a reduction in spam since they implemented it as proof of the success of their CAPTCHA, but frankly that’s a flawed argument. I implemented a simple CAPTCHA on another site of mine where all you have to do is enter the word “human” into a text box. Guess what? 100% spam eradication.

Sample trivial CAPTCHA

A trivial CAPTCHA to break.

See, the thing is, if there is money to be gained by defeating your CAPTCHA then someone out there will. Personally I have written CAPTCHA crackers for people from time to time. Guess what, most of them took less than an hour to break, including time for downloading samples and tweaking to get better results.

Sample Captcha

A trivial CAPTCHA to break.

Another thing to consider is accessibility. About 99% of the home-grown CAPTCHAs out there don’t even consider the fact that there are sight-impaired people around who need text to speech. This becomes a huge issue in countries like England which require that websites be accessible.

Finally, it’s well known that you can pay people to crack a number of CAPTCHAs for you, or even offer them porn or something and have them crack it for you without knowing.

So what’s the conclusion to all of this? If you have a simple blog or website and a problem with automated spam, just add a simple “Enter the word human” text-box. It will be 100% effective, is easy to implement and won’t annoy your users. If you have something to protect and your CAPTCHA is being targeted, use an external service, which will provide a good accessible CAPTCHA that will be updated when it gets broken (which it will!). A custom CAPTCHA might seem like a good idea at the time, but it’s only a roadblock to someone who has any incentive to break in.
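For a sense of how little code the “Enter the word human” check needs, here is a minimal sketch of the server-side comparison. The function name is invented for this example, and ignoring case and surrounding whitespace is an assumption made to avoid annoying real users; wiring it into a form handler depends on whatever framework the site uses.

```python
def is_human(answer, expected="human"):
    """Return True when the submitted text box value matches the expected word.

    Case and surrounding whitespace are ignored (an assumed leniency)."""
    return answer is not None and answer.strip().lower() == expected


# Typical form submissions: a real visitor, an empty bot post, a missing field.
results = [is_human(" Human "), is_human(""), is_human(None)]
```

This only filters generic automated spam. As argued above, it offers no protection once someone has an incentive to target the site specifically.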


A trivial CAPTCHA to break.

If, however, you are the sort of person who looks at reCAPTCHA and thinks “I can break that”, knows when to apply Neural Networks or Support Vector Machines, knows what GIMPY is, and has postgraduate studies in the field of machine vision, by all means create your own CAPTCHA. Just don’t complain when you have to update it every 6 months because someone with something to gain has defeated it.

For those interested, my postgrad honours thesis was on applying CAPTCHA decoding techniques against web images as a method of improving search results. You can find a simple tutorial with code about how it was done here: Decoding CAPTCHA’s