List of useful CAPTCHA Decoding Articles

This website ranks quite high in most search engines for the search term “captcha decoding” or some permutation of it. As such, here is a collection of useful links if you are looking into doing such a thing. If any more come up I will be sure to update this post.

http://www.boyter.org/decoding-captchas/

Shameless self-promotion, but this link is why this page ranks so highly. It’s an article I wrote some time ago about how to go about decoding a simple CAPTCHA. There is full source code and the principles can be applied to 90% of CAPTCHAs out there. For the record, it only came about because a colleague bet me that I couldn’t decode his website’s CAPTCHA, which was the one used in the article. Of course I waited until he changed it before publishing.
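To give a flavour of the kind of principle involved, here is a minimal sketch of one common first step against simple CAPTCHAs: splitting an already-binarised image into individual characters by looking for blank columns. It is not the code from the article; the function name `split_characters` and the tiny sample image are made up for illustration, and it assumes the letters do not touch or overlap.

```python
def split_characters(bitmap):
    """Split a binary image into (start, end) column ranges, one per glyph.

    bitmap is a list of rows; each row is a list of 0 (background) or 1 (ink).
    The end column of each range is exclusive.
    """
    if not bitmap:
        return []
    width = len(bitmap[0])
    # A column "has ink" if any row has a set pixel in it.
    has_ink = [any(row[x] for row in bitmap) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                    # entering a glyph
        elif not ink and start is not None:
            segments.append((start, x))  # leaving a glyph
            start = None
    if start is not None:
        segments.append((start, width))  # glyph touches the right edge
    return segments


# Hypothetical 3x7 image containing two "letters" separated by a blank column.
image = [
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0],
]
print(split_characters(image))  # [(1, 3), (4, 6)]
```

Each range can then be cropped out and compared against known letter images. CAPTCHAs that warp or overlap their characters are designed to break exactly this step.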

http://bokobok.fr/bypassing-a-captcha-with-python/

Interesting post on how to bypass a CAPTCHA using Python. The CAPTCHA broken in this article is far more complex than most of the others in this list. Full source code is provided, so it’s an excellent source to look at even though the article is missing a lot of details.

http://www.debasish.in/2012/01/bypass-captcha-using-python-and.html?m=1

Another Python post about breaking CAPTCHAs. I think the popularity of Python for this might be due to how powerful the PIL is. Has full source code. This one is worth looking at because, unlike the two previous ones, it uses an existing OCR engine, Tesseract, to perform the recognition.

http://www.mperfect.net/aiCaptcha/

This is one of the older CAPTCHA articles around and does not supply source code. It does however go into a good amount of detail about how the author looked for weaknesses in the CAPTCHA and then went about writing an algorithm to defeat it. It really is a pity the code for this one was never released.

http://www.troyhunt.com/2012/01/breaking-captcha-with-automated-humans.html

A slightly different approach. Rather than try to code around the problem, here is how to get humans to do it for you.

http://caca.zoy.org/wiki/PWNtcha

A PHP project that has been around since 2004 for defeating CAPTCHAs. Code is available so it’s worth taking a look at.

http://tech.slashdot.org/story/11/01/11/1411254/google-recaptcha-cracked
http://www.youtube.com/watch?v=dLgvrsAoPeE

It seems the original content that went with the above posting on Slashdot has disappeared, but I am sure it exists somewhere else on the web. I may have a copy lying around which I will upload if I find it. It goes into detail on how to defeat the reCAPTCHA project’s CAPTCHA.

http://bhiv.com/defeating-diggs-captcha/

This article about defeating Digg 2.0’s CAPTCHA is hopelessly out of date. However, it shows how easily a simple CAPTCHA can be defeated if the person creating it has little knowledge of what they are doing. I believe it ties in well with this post: http://www.boyter.org/2010/08/why-you-shouldnt-roll-your-own-captcha/

http://www.cs.sfu.ca/~mori/research/gimpy/

This is the granddaddy of all the above posts, papers and articles. The full paper is linked in there and has far more detail. It is one of the main sources I used when I started learning about decoding CAPTCHAs.

https://medium.com/p/e8f2a748f95f

How reCAPTCHA Works, plus, how to cheat it, and how it contributes to the common good.

http://stevenhickson.blogspot.com.au/2014/01/hacking-snapchats-people-verification.html

How to defeat Snapchat’s CAPTCHA. Fairly light on details but provides the source code (C++) to defeat it.

https://github.com/mieko/sr-captcha/blob/gh-pages/index.md

Breaking the SilkRoad’s CAPTCHA. The follow-up about breaking the new SilkRoad’s CAPTCHA is worth reading as well: https://github.com/mieko/sr-captcha/blob/gh-pages/silk-road-2.md

 

Want to write a search engine? Have some links

A recent comment I left on Hacker News managed to get quite a lot of upvotes, which surprised me since it was effectively just a collection of links about search engines. You can read the full thread at http://news.ycombinator.com/item?id=5129530

Anyway, since it did so well I thought I would flesh it out with some more information. Here is a collection of posts, blogs and discussions which go into the details of how to write a search engine.

http://blog.algolia.com/search-ranking-algorithm-unveiled/

Algolia is a search-as-a-service provider, and this blog post discusses the ranking algorithm they use.

http://www.yioop.com/blog.php

This one is fairly fresh and talks about building and running a general purpose search engine in PHP.

http://www.gigablast.com/rants.html

This has been defunct for a long time now but was written by Matt Wells (Gigablast and Procog) and gives a small amount of insight into the issues and problems he worked through while writing Gigablast.

http://queue.acm.org/detail.cfm?id=988407

This is probably the most famous of all search engine articles, with the exception of the original Google paper. Written by Anna Patterson (Cuil), it really explores the basics of how to get a search engine up and running, from crawler to indexer to serving results.
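As a rough illustration of the indexer and serving steps of that pipeline (the crawler is left out), here is a minimal sketch of an inverted index with AND-only queries. It is not taken from the article; the function names and the sample pages are hypothetical, and there is no ranking, stemming or persistence.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real engines do far more."""
    return text.lower().split()


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every query term (AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return sorted(results)


# Hypothetical "crawled" pages, keyed by document id.
pages = {
    1: "how to write a search engine",
    2: "search engine ranking basics",
    3: "decoding a simple captcha",
}
index = build_index(pages)
print(search(index, "search engine"))  # [1, 2]
```

Everything hard about a real search engine (ranking, scale, freshness, spam) sits on top of a structure like this one.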

http://queue.acm.org/detail.cfm?id=988401

A fairly interesting interview with Matt Wells (Gigablast and Procog) which goes into some details of problems you will encounter running your own search engine.

http://blog.procog.com/

Sadly it appears that this has since been shut down and the content is gone. When I first listed it, this was a new blog written by Matt Wells (Gigablast), and while there wasn’t much content there I had hopes for it. Matt really does know his stuff and was promoting an open ranking algorithm, so it stood to reason there would be more decent content to come.

http://www.thebananatree.org/

This has a few articles written about creating a search engine from scratch. It appears to have been on hold for years but some of the content is worth reading. If nothing else, it’s another view of someone starting down the search engine route.

http://blog.blekko.com/

Blekko’s engineering blog is usually interesting and covers all sorts of material applicable to search engines.

http://www.boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/

This is a shameless plug, but I will even suggest my own small implementation. It’s essentially a walkthrough of a ground-up write of a search engine in PHP. I implemented it and it worked quite well with 1 million pages.

http://infolab.stanford.edu/~backrub/google.html

The granddaddy of search papers. It’s very old but outlines how the original version of Google was designed and written.

https://github.com/gigablast/open-source-search-engine

Gigablast, mentioned above, has since become an open source project hosted on GitHub. Personally I have yet to look through the source code, but you can find out how to run it on the developer page and administration page.

http://highscalability.com/blog/2013/1/28/duckduckgo-architecture-1-million-deep-searches-a-day-and-gr.html

http://highscalability.com/blog/2012/4/25/the-anatomy-of-search-technology-blekkos-nosql-database.html

http://highscalability.com/blog/2008/10/13/challenges-from-large-scale-computing-at-google.html

http://highscalability.com/blog/2010/9/11/googles-colossus-makes-search-real-time-by-dumping-mapreduce.html

http://highscalability.com/blog/2011/8/29/the-three-ages-of-google-batch-warehouse-instant.html

The above are fairly interesting. The blekko one is the most technical. If you only have time to read one go with the blekko one.

http://blog.saush.com/2009/03/17/write-an-internet-search-engine-with-200-lines-of-ruby-code/

Article about using Ruby to write a small scale internet search engine. Covers crawling as well as indexing using a custom indexer in MySQL.

https://blog.twitter.com/2014/building-a-complete-tweet-index

Article from Twitter about indexing the full history of tweets from 2006. Of note is the information about sharding. Due to the linear nature of the data (over time) they need a way to scale across time. Worth a look.
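To make the idea of scaling across time a little more concrete, here is a minimal sketch of routing records to shards by time bucket. This is not Twitter's actual design; the monthly bucket size, the `tweets_YYYY_MM` naming scheme and both function names are assumptions chosen purely for illustration.

```python
from datetime import datetime, timezone


def shard_for(timestamp, shard_prefix="tweets"):
    """Return a hypothetical shard name for the month containing timestamp."""
    return f"{shard_prefix}_{timestamp.year:04d}_{timestamp.month:02d}"


def shards_for_range(start, end, shard_prefix="tweets"):
    """List shard names covering every month from start to end inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{shard_prefix}_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


created = datetime(2006, 3, 21, tzinfo=timezone.utc)
print(shard_for(created))  # tweets_2006_03

# A query limited to a date range only needs to touch the matching shards.
print(shards_for_range(datetime(2013, 11, 1), datetime(2014, 2, 1)))
```

The appeal of bucketing by time is that new data only ever lands in the newest shard, so older shards can be treated as read-only and capacity can be added one time slice at a time.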

http://www.ideaeng.com/write-search-engine-0402

The anti “write a search engine” article. Probably worth reading though, in case you feel doing so is going to be easy.

http://lucene.sourceforge.net/talks/pisa/

A talk about the internals of Lucene. Covers some design decisions and shows the architecture that Lucene uses internally.

http://alexmiller.com/the-students-guide-to-search-engines/

Not as technical as the above can be but a good primer which covers quite a lot of history. Worth a read.

Have another one I have missed here? I would love to read it. Please add a link in the comments below.

Link Love

With the fall from grace of the TWiT podcast (less Dvorak and no Calacanis makes for boring shows) I went looking for new podcasts to keep me entertained over the last couple of months. Here are a few that I highly recommend.

Tech Podcast TechZing Live

The boys from TechZing are full of energy, always come up with new stuff and usually manage to do one thing technical each show that makes me want to scream with frustration. All in all, good stuff. They also interviewed both Calacanis and Dvorak, and those were probably two of the better shows they did.

X3

Leading on from the fact that Dvorak and Calacanis make for a good show, X3 is quite the goods. It’s basically Cranky Geeks reborn.

This Week In Start-ups

The whole “you ripped me off” thing aside, TWIST is worth listening to. Not as tech heavy as I normally prefer in a podcast but quite good.

That’s all I have on my plate for the moment, partly due to time constraints and partly because since buying a Kindle I have spent more of my time reading.