searchcode local

I am going to copy the searchcode pitch itself below quickly before explaining it a bit further.

“searchcode offers powerful code search over billions of lines of open source code. Imagine what it could do with your private repositories.

There have been requests to offer a downloadable version of searchcode. Given enough interest a downloadable hostable version of searchcode will be offered. Register your email below to register your interest.

Note that there would be a free Community version available for all users as well as paid version offering support. Functionality would remain the same across all versions. This would be similar to how Octopus Deploy is offered.”

In short, I am considering writing a hostable version of searchcode. Most likely it would consist of a Java application one could download and use to get results similar to searchcode itself (probably at a smaller scale, however).

Rather than actually commit to several months' worth of work, however, I have put a message on searchcode asking those interested to register their interest. If it sounds like something you would like, please register.

I have no sign-up target numbers in mind or product costs etc… but I suspect given over 100 sign-ups I will actually go forth and implement it.

I should note that this is something I have been highly resistant towards for a long time as I do not really want to get into enterprise sales cycles.

Anyway, in a month's time, if there are enough sign-ups I will push forward and release an initial version to those who have signed up. Anyone who does will get free access on the beta list and discounts on the final version (should they need something more powerful than the community edition).


Go Forth and Search

A very fast update. At the request of the excellent Lars Brinkhoff via GitHub I have added in the language Forth to be one of the supported languages inside searchcode.

An example search which shows this working would be the following

I had to solve a number of interesting problems inside searchcode to support this change. For pragmatic reasons, the way searchcode identifies what language any piece of code is written in is to run it through CLOC (Count Lines Of Code). Written in Perl, it does a reasonably good job of pulling out metadata for any given piece of code. However, since my Perl ability is poor at best, submitting a patch to support Forth was not going to be an option.

Instead I ended up adding a few additional checks at the end of the indexing pipeline to identify code that probably should have been categorised as Forth and, if so, change the classification. It has been designed to be extensible, so if other languages come up that are not currently identified it should be possible to add them as well.
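
As a rough illustration of that kind of post-classification check, here is a minimal Python 3 sketch. The extension list, the colon-definition regex and the names looks_like_forth, POST_CHECKS and reclassify are all assumptions made for this example; they are not the checks searchcode actually runs.

```python
import re

# Assumption: a Forth word definition starts a line with ": name".
COLON_DEFINITION = re.compile(r'^\s*:\s+\S+', re.MULTILINE)

# Assumption: extensions commonly used for Forth (.fs is shared with F#).
FORTH_EXTENSIONS = ('.fth', '.4th', '.forth', '.fs')


def looks_like_forth(filename, content):
    """Hypothetical heuristic: Forth-style extension plus a colon definition."""
    has_extension = filename.lower().endswith(FORTH_EXTENSIONS)
    return has_extension and COLON_DEFINITION.search(content) is not None


# Each entry is (language name, check function); add more languages here.
POST_CHECKS = [('Forth', looks_like_forth)]


def reclassify(filename, content, language):
    """Return a corrected language, or the original one if no check fires."""
    for candidate, check in POST_CHECKS:
        if check(filename, content):
            return candidate
    return language
```

Supporting another language would then mean appending one more (name, check) pair to POST_CHECKS rather than touching the upstream classifier.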

The only other change of note for searchcode is that I fixed the SSL certificate chain and now you can curl the API again. This was an issue caused by Google throwing its weight around and outlawing SHA1 certificates. When updating to fix this I neglected to fix the chain as well. Oddly browsers worked without issue whereas curl and Python requests broke.

Exporting Documents from KnowledgeTree

I was recently tasked with exporting a large collection of documents from KnowledgeTree (KT) for a client. The collection was too large to use the download all functionality and too wide to attempt to export each folder individually.

I had played around with the WebDAV connection that KT provides but it either didn’t work or was deliberately designed to not allow exporting of the documents.

I looked at where the documents were stored on disk but KT stores them as numbered files in numbered directories sans extension or folder information.

Long story short, I spent some time poking through the database to identify the tables containing the metadata needed to rebuild the tree on a proper filesystem. For the record, the tables required are the following:

  • folders – Contains the folder tree. Each entry represents a folder and contains its parent folder id.
  • documents – Contains the documents that each folder contains. Knowing the folder's id you can determine what documents live in that folder.
  • document_content_version – Contains the metadata required to get the actual file from disk. A 1 to 1 mapping between document id and this table is all that is required.

That said, here is a short Python script which can be used to rebuild the folders and documents on disk. All that is required is to ensure that Python MySQLdb is installed and to set the database details. Depending on your KT install you may need to change the document location. The script replicates the folder tree in the directory it is run from, preserving the folder structure, file names and extensions.

Keep in mind this is a fairly ugly script abusing global variables and such. It is also not incredibly efficient, but did manage to extract 20GB of files in my case in a little under 10 minutes.

import MySQLdb
import os
import shutil

# KnowledgeTree default place to store documents
ktdocument = '/var/www/ktdms/Documents/'

conn = MySQLdb.connect(user='', passwd='',db='', charset="utf8", use_unicode=True)
cursor = conn.cursor()

# global variables FTW
cursor.execute('''select id, parent_id, name from folders;''')
allfolders = cursor.fetchall()

cursor.execute('''select id, folder_id from documents;''')
alldocuments = cursor.fetchall()

cursor.execute('''select document_id, filename, storage_path from document_content_version;''')
document_locations = cursor.fetchall()

# create folder tree which matches whatever the database suggests exists
def create_folder_tree(parent_id, path):
    directories = [x for x in allfolders if x[1] == parent_id]
    for directory in directories:
        d = '.%s/%s/' % (path, directory[2])
        print(d)
        # make sure the destination directory exists before copying into it
        if not os.path.isdir(d):
            os.makedirs(d)
        # get all the files that belong in this directory
        for document in [x for x in alldocuments if x[1] == directory[0]]:
            try:
                location = [x for x in document_locations if document[0] == x[0]][0]
                print('copy %s%s %s%s' % (ktdocument, location[2], d, location[1]))
                shutil.copy2('%s%s' % (ktdocument, location[2]), '%s%s' % (d, location[1]))
            except (IndexError, IOError, OSError):
                print('ERROR exporting - Usually due to a linked document.')

        create_folder_tree(parent_id=directory[0], path='%s/%s' % (path, directory[2]))

create_folder_tree(parent_id=1, path='')
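
If the list scans above become a bottleneck on a larger install, the same walk can be done with dictionary lookups. What follows is only a sketch of that idea in Python 3: build_export_plan and its arguments are names invented for this example, it takes rows shaped like the three queries above, and it returns the copy operations rather than performing them, so it can be tried without a database.

```python
from collections import defaultdict


def build_export_plan(folders, documents, locations, root_id=1):
    """Return (storage_path, destination) pairs for documents below root_id.

    folders:   rows of (id, parent_id, name)
    documents: rows of (id, folder_id)
    locations: rows of (document_id, filename, storage_path)
    """
    children = defaultdict(list)        # parent_id -> [(folder_id, name)]
    for folder_id, parent_id, name in folders:
        children[parent_id].append((folder_id, name))

    docs_in_folder = defaultdict(list)  # folder_id -> [document_id]
    for document_id, folder_id in documents:
        docs_in_folder[folder_id].append(document_id)

    # document_id -> (filename, storage_path); a later row replaces an earlier one
    location_of = {}
    for document_id, filename, storage_path in locations:
        location_of[document_id] = (filename, storage_path)

    plan = []
    stack = [(root_id, '.')]
    while stack:
        parent_id, path = stack.pop()
        for folder_id, name in children[parent_id]:
            folder_path = path + '/' + name
            for document_id in docs_in_folder[folder_id]:
                # documents without a content row (e.g. linked ones) are skipped
                if document_id in location_of:
                    filename, storage_path = location_of[document_id]
                    plan.append((storage_path, folder_path + '/' + filename))
            stack.append((folder_id, folder_path))
    return plan
```

Each pair could then be handed to shutil.copy2, after prefixing the source with the KT document directory and creating the destination folder.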

Decoding CAPTCHA’s Handbook

Some time ago I wrote an article about Decoding CAPTCHA’s which has become what appears to be the first resource most people encounter when searching for information in the decoding CAPTCHA space.

I had continued to write about CAPTCHA’s over the years with posts scattered around the web. A while ago I started to consolidate all of my content on this blog and realised that I had considerably more CAPTCHA related articles than I thought. Some were in an unfinished or unpublished state. I had considered posting them all online but instead decided to polish them all up into a much better resource and publish it as a book.

The book is now online and available for sale. It's not what you would call a top seller, but it has produced enough sales to offset some hosting costs for the blog, which was one goal. I also wanted to test the waters when trying to sell an info product. Info products, according to Amy Hoy and Rob Walling, are an excellent platform to start learning how to build recurring revenue online. Certainly one thing I have learnt is that something borderline unethical such as the book will never become a great seller. Simply put, the audience is unlikely to convert into a high-sales business because of the shady nature of the content.

Anyway for those interested in purchasing a copy you can either visit the Decoding CAPTCHA’s page or use the below.

Decoding CAPTCHA's Book
Looking for a practical guide to CAPTCHA decoding? All About CAPTCHA’s. This eBook will teach you how to identify weaknesses and exploit CAPTCHA’s from beginning to end.

Buy now using LeanPub

Buy now using Gumroad

C# as a Language from old Google+ Post

The more I use C# as a language for writing things the more I am convinced that its approach really is the best language approach out there.

The unit test support is excellent which allows development speed to be just as fast as any dynamic language (Python, PHP, Perl).

The static typing catches so many issues before you get to runtime and allows sweeping changes without breaking things.

Unlike Java it has the var keyword (saves time and improves readability) and so many more useful functions which yes you can replicate but are just built in and work correctly.

Then you get to the really good stuff. LINQ is awesome. The lazy loading allows you to implement a repository pattern over your database which is just awesome. Set up the basic select * from then add extension methods allowing you to chain whatever you need, EG

var person = _dbContext.GetPerson().ByUserName(username).ByPassword(password);

100% elegant, easy to test, easy to write, easy to read and understand and generally works exactly as you would expect without any hidden gotchas. And because it's lazy it doesn’t chew resources sucking back everything from the database.

You can use functional programming techniques if you wish, and with the new async and await keywords you can work in a node.js style if you wish, with static typing and all the existing library support.

Or you can continue to work in a C like manner, or mix it up with objects, procedural code and functional.

I switched back to Java not that long ago to write a simple server using Jetty and even with things like Guice (best DI implementation I have used so far) and Guava it was still painful. Less painful, but I really felt that the compiler was fighting me from doing things in an elegant manner most of the time. Even adding the “var” keyword would improve Java in a massive way. Add some functional programming in there and I would be pretty happy.

I just wish C# would run on the JVM as I would use it for pretty much everything in a heartbeat. As it is, the Mono support is missing the stuff I really want and isn’t as seamless as the experience should be. A pity really, as C# is in my experience the nicest production-ready language to work with today.

searchcode the path to profitability

One of the things that has always bothered me about searchcode was that it never generated any money. Not a huge problem in itself for a side project, but the costs to run it are not insignificant due to the server requirements. I had looked into soliciting donations but considered this highly unlikely to produce enough revenue to cover costs, considering that another site was unable to make enough to cover even basic costs through Patreon (although since a recent HN post this has jumped from around $20 a month to over $150).

This had caused me back in the early days to use BuySellAds to attempt to cover some costs. While this certainly helped, there was usually not enough revenue due to the way the ads are sold. The issue with BuySellAds is that you have to pitch your website as a good place to sell ads against. This is not something I had any great desire to do. Simply put, if I am going to spend my time marketing something it's going to be something that is more directly marketable than an advertising-funded website. The other issue is that BuySellAds does not work with HTTPS, which became a deal breaker for me once Google announced that they were going to use it as a signal for ranking.

This led me, just a few months ago, to consider shutting down the site. However, while doing some poking around I noticed a newish advertising platform called Carbon Ads. As it is invite only, I decided to email them with my pitch. Being a developer/designer focused ad platform it seemed like a natural fit. After a bit of back and forth I can now happily say that searchcode is running Carbon Ads. I have no numbers to report at this time but I am very hopeful, based on some estimates made with Carbon Ads, that I should be able to push searchcode to cover costs and hopefully produce some profit, all of which will be used to improve the service.

Filetree Listing

Just a quick update to searchcode. A few small tweaks here and there, but the largest is that there is now a file tree listing option which will show the file tree for any project. An example would be going to this file and then clicking the “View File Tree” button on the top left.

An example screenshot of the result of this is included below.


Updates to searchcode

Just a quick post to list some updates to searchcode. The first is a slight modification to the home page. A while ago I received an email from the excellent Christian Moore who provided some mock-ups of how he felt it should look. I loved the designs, but was busy working on other issues. Thankfully, in the last week or so I found the time to implement his ideas and the result looks far more professional to me.


It certainly is a large change from the old view but one that I really like as it is very clean. The second update was based on some observations I had. I was watching a colleague use searchcode to try finding some logic inside the Thumbor image resizer project. I noticed that once he had the file open he was trying to navigate to other files in the same project. Since the only way to do this was to perform a new search (perhaps with the repo option) I decided to add in a faster way to do this. For some projects there is now an instant search box above the code result which allows you to quickly search over all the code inside that repository. It uses the existing searchcode APIs (which you can use as well!) to do so. Other ways of doing this would include the project tree (another piece of functionality I would like to add) but this was done using already existing code so was very easy to implement. An example would be going to this result in pypy and searching for import.


As always I would love some feedback on this, but as always expecting none (par for the course).


Decoding Captcha’s Presentation

A few days ago there was a lack of speakers for #SyPy which is the Sydney Python meet-up held most months and sponsored by Atlassian. I had previously put my hand up to help out if this situation ever came up and was mostly ready with a presentation about Decoding Captchas. I did not expect it to be so full that people were standing (largest crowd I had ever seen there). Thankfully it seemed to go over well and while I need to get more practice at public speaking I did enjoy it. A few choice tweets that came out of the end of the event,


Anyway you can get all the code and the slides via Decoding Captchas Bitbucket or Decoding Captchas Github.

C# XML Cleaner Regex

One of the most annoying things I deal with is XML documents with invalid characters inside them. Usually caused by copy pasting from MS Word, these are invisible characters that you cannot easily find and that cause XML parsers to choke. I have encountered this problem enough that I thought a quick blog post would be worth the effort.

As such here mostly for my own reference is a regular expression for C# .NET that will clean invalid XML characters from any XML file.

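// Matches lone surrogates plus the control and non-characters that XML 1.0 does not allow.
// .NET's \x escape takes exactly two hex digits, so the wider ranges use \uXXXX.
const string InvalidXmlChars = @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]";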

Simply take the XML document as a string and do a simple regular expression replace over it to remove the characters. You can then import into whatever XML document structure you want and process it as normal.
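
For comparison, the same clean-up can be sketched in Python 3, where the \u and \U escapes cover the whole range directly. The allowed ranges come from the Char production in the XML 1.0 specification; the names INVALID_XML_CHARS and clean_xml are just made up for this example.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def clean_xml(text):
    """Return text with every character XML 1.0 forbids removed."""
    return INVALID_XML_CHARS.sub('', text)


if __name__ == '__main__':
    dirty = '<name>Smith\x0b, John\x00</name>'
    print(clean_xml(dirty))
```

The cleaned string can then be handed to any XML parser as usual.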

For the record, the reason I encountered this was that I was building a series of WSDL web-services over HR data. Since the data had multiple sources merged during an ETL process, and some of them were literally CSV files, I hit this issue a lot. Attempts to sanitize the data at the source failed, as it was overwritten every few hours and changing the source generation involved multiple teams. In the end I just ran the cleaner over every field before it was returned and everything worked perfectly.