searchcode server released

searchcode server, the downloadable self-hosted version of searchcode, is now available. A large amount of work went into the release, with a variety of improvements based on feedback from the general beta releases.

searchcode server has a number of advantages over the hosted version of searchcode that will eventually be back-ported. The full list of things to check out is included below:

  • New Single Page Application UI for a smooth search experience
  • Ability to split on terms so a search for “url signer” will match “UrlSigner”
  • Massively improved performance: 3x in the worst case and 20x in the best
  • Configurable through the UI and configuration
  • Spelling suggestion that learns from your code
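
As a rough illustration of the term splitting mentioned above, here is a minimal Python sketch. It is not how searchcode server implements it (that is a Java and Lucene application); the function names `split_terms` and `matches` are made up for this example, and it simply assumes identifiers are split on case changes, digits and separators.

```python
import re

def split_terms(identifier):
    """Split an identifier into lowercase terms.

    Handles camelCase, runs of capitals such as HTTPServer, digits,
    and separators such as underscores (which are simply skipped).
    """
    parts = re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+', identifier)
    return [part.lower() for part in parts]

def matches(query, identifier):
    """Return True when every term in the query is one of the identifier's terms."""
    terms = set(split_terms(identifier))
    return all(term in terms for term in query.lower().split())

if __name__ == '__main__':
    print(split_terms('UrlSigner'))
    print(matches('url signer', 'UrlSigner'))
```

A real index would store the split terms alongside the original token at indexing time so that both the exact and the split form can be searched.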

A few things of note,

  • Java 8 application built using Lucene and Spark Framework
  • Designed to work on any server. The test bench server is a netbook using an Intel Atom CPU, and searches return in under a second
  • Scales to gigabytes of code and thousands of repositories
  • Works on Linux, OSX and Windows

Be sure to check it out!

searchcode server

A month or so ago I started collecting emails on searchcode to determine if there was enough interest in a downloadable version. The results were overwhelmingly positive. The email list grew far beyond what I would have expected, and this was in the first month. As such I have been working on this downloadable version of searchcode, which will probably be called searchcode server.

Progress has been reasonably straightforward considering that searchcode itself is written mostly in Python and searchcode server is mostly Java. The main reason for choosing Java is that I really wanted searchcode server to be a self-contained application which could be downloaded and run without the configuration and setup of additional services.

At present it is surprisingly workable. You can input repositories (git only at this point) to be indexed and after a short amount of time they will be searchable via the main interface. A few screenshots are included at the end of this post for those curious.

There is still time to sign up and be one of the first to receive access. Being on the sign up list will also give you a discount when it is actually released if you need something greater than the community edition. To register your interest use the form below, or visit the searchcode server product page.

This has been released and you can now get the actual product.

Screen Shot 2015-12-29 at 1.04.54 pm

Screen Shot 2015-12-29 at 1.11.30 pm

Pi-Hole for Ubuntu 14.04

Because I work for an ad-supported company, and because searchcode is currently supported via third-party advertising, I tend to keep an eye on the state of ad blockers on the web.

Most people probably know about adblockplus and other browser extensions, however there are other ways to block ads on one's network. One that I had previously read about was setting up your own Bind9 server and adding custom rules to block them at a DNS level. Over the last week I had been playing around with this, but since I am not a bind expert I was unable to get it working in a satisfactory way.
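
To make the DNS-level idea concrete, here is a small Python sketch of the decision a blocking resolver makes. It is only an illustration: the blocklist entries, the sinkhole address and the `upstream` callable are assumptions for the example, and neither Bind9 nor Pi-Hole is claimed to work exactly this way internally.

```python
# Hypothetical blocklist; real lists contain many thousands of entries
BLOCKLIST = {'ads.example.com', 'tracker.example.net'}

# Address handed back for blocked names so the request goes nowhere
SINKHOLE = '0.0.0.0'

def is_blocked(hostname, blocklist=BLOCKLIST):
    """Return True if the hostname or any of its parent domains is listed."""
    labels = hostname.lower().rstrip('.').split('.')
    candidates = ('.'.join(labels[i:]) for i in range(len(labels)))
    return any(candidate in blocklist for candidate in candidates)

def resolve(hostname, upstream):
    """Answer blocked names with the sinkhole, otherwise ask the upstream resolver."""
    if is_blocked(hostname):
        return SINKHOLE
    return upstream(hostname)
```

Because every device on the network asks the same resolver, the blocking applies to phones, TVs and anything else without installing a browser extension.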

However, an article about blocking all ads using a Raspberry Pi appeared on my radar. I don’t have a Raspberry Pi, but I did have an old netbook (Asus Eee 1000HA) lying around that I was trying to find some use for. I had previously set it up with Ubuntu 14.04 and had it sitting under the house running OwnCloud as a test. I thought it might be a good candidate for this sort of thing.

The install was pretty easy, and as simple as following the Pi-Hole guide. It says that you need to be using Raspbian, but it worked perfectly for me on Ubuntu. Thankfully I have a reasonably good router (a D7000, which I can highly recommend), and once the setup was done I pointed its DNS at the new server and sat back for things to start working. It did. Flawlessly.

I think the advertising industry is in for a rude shock. When these devices are as cheap as this and as simple to install, it’s only a matter of time before they are built in to the router itself or become plug and play.

searchcode local

I am going to copy the searchcode pitch itself below quickly before explaining it a bit further.

“searchcode offers powerful code search over billions of lines of open source code. Imagine what it could do with your private repositories.

There have been requests to offer a downloadable version of searchcode. Given enough interest a downloadable hostable version of searchcode will be offered. Register your email below to register your interest.

Note that there would be a free Community version available for all users as well as a paid version offering support. Functionality would remain the same across all versions. This would be similar to how Octopus Deploy is offered.”

In short I am considering writing a hostable version of searchcode. Most likely it would consist of a Java application one could download and use to get similar results to searchcode itself (probably at a smaller scale however).

Rather than actually commit to several months' worth of work, however, I have put a message on searchcode asking for those interested to register their interest. If it sounds like something you would like, please register.

I have no signup target numbers in mind or product costs etc… but I suspect given over 100 sign-ups I will actually go forth and implement.

I should note that this is something I have been highly resistant towards for a long time as I do not really want to get into enterprise sales cycles.

Anyway, in a month’s time, if there are enough signups I will push forward and release an initial version to those who have signed up. Anyone who does will get free access on the beta list and discounts on the final version (should they need something more powerful than the community edition).

This has been released and you can now get the actual product.

Go Forth and Search

A very fast update. At the request of the excellent Lars Brinkhoff via GitHub I have added in the language Forth to be one of the supported languages inside searchcode.

An example search which shows this working would be the following

I had to solve a number of interesting problems inside searchcode to support this change. For pragmatic reasons the way searchcode identifies what language any piece of code is written in is to run it through CLOC (Count Lines Of Code). Written in Perl, it does a reasonably good job of pulling out metadata for any given piece of code. However, since my Perl ability is poor at best, submitting a patch to support Forth was not going to be an option.

Instead I ended up adding a few additional checks at the end of the indexing pipeline to identify code that probably should have been categorised as Forth and, if so, change the classification. It has been designed to be extensible, so if other languages come up that are not currently identified it should be possible to add them as well.
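
A post-classification check of this kind could look something like the following Python sketch. This is not the actual searchcode code: the file extensions, the regular expression and the names `looks_like_forth`, `RECLASSIFIERS` and `reclassify` are all assumptions made for illustration.

```python
import re

# Assumed heuristic: a Forth word definition looks like ": name ... ;"
FORTH_DEFINITION = re.compile(r'^:\s+\S+\s.*;\s*$', re.MULTILINE)

# Assumed file extensions commonly used for Forth source
FORTH_EXTENSIONS = ('.fth', '.4th', '.forth')

def looks_like_forth(filename, content):
    """Guess whether a file is Forth from its extension and contents."""
    if not filename.lower().endswith(FORTH_EXTENSIONS):
        return False
    return FORTH_DEFINITION.search(content) is not None

# Extensible list of (language, check) pairs run after the main classifier
RECLASSIFIERS = [
    ('Forth', looks_like_forth),
]

def reclassify(filename, content, detected_language):
    """Override the detected language when one of the extra checks matches."""
    for language, check in RECLASSIFIERS:
        if check(filename, content):
            return language
    return detected_language
```

Supporting another language that the main classifier misses would then be a matter of appending one more pair to the list.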

The only other change of note for searchcode is that I fixed the SSL certificate chain and now you can curl the API again. This was an issue caused by Google throwing its weight around and outlawing SHA1 certificates. When updating to fix this I neglected to fix the chain as well. Oddly browsers worked without issue whereas curl and Python requests broke.

Exporting Documents from KnowledgeTree

I was recently tasked with exporting a large collection of documents from KnowledgeTree (KT) for a client. The collection was too large to use the download all functionality and too wide to attempt to export each folder individually.

I had played around with the WebDav connection that KT provides but it either didn’t work or was designed deliberately to not allow exporting of the documents.

I looked at where the documents were stored on disk, but KT stores them as numbered files in numbered directories sans extension or folder information.

Long story short, I spent some time poking through the database to identify the tables containing the metadata which would allow me to rebuild the tree on a proper filesystem. For the record, the tables required are the following:

  • folders – Contains the folder tree. Each entry represents a folder and contains its parent folder id.
  • documents – Contains the documents that each folder contains. Knowing the folder’s id you can determine what documents live in that folder.
  • document_content_version – Contains the metadata required to get the actual file from disk. A 1 to 1 mapping between document id and this table is all that is required.

That said, here is a short Python script which can be used to rebuild the folders and documents on disk. All that is required is to ensure that Python MySQLdb is installed and to set the database details. Depending on your KT install you may need to change the document location. Wherever the script is run it will replicate the folder tree containing the documents, preserving the structures, names and extensions.

Keep in mind this is a fairly ugly script abusing global variables and such. It is also not incredibly efficient, but did manage to extract 20GB of files in my case in a little under 10 minutes.

import MySQLdb
import os
import shutil

# KnowledgeTree default place to store documents
ktdocument = '/var/www/ktdms/Documents/'

# fill in the user, password and database name for your KT install
conn = MySQLdb.connect(user='', passwd='', db='', charset="utf8", use_unicode=True)
cursor = conn.cursor()

# global variables FTW
cursor.execute('''select id, parent_id, name from folders;''')
allfolders = cursor.fetchall()

cursor.execute('''select id, folder_id from documents;''')
alldocuments = cursor.fetchall()

cursor.execute('''select document_id, filename, storage_path from document_content_version;''')
document_locations = cursor.fetchall()

# create folder tree which matches whatever the database suggests exists
def create_folder_tree(parent_id, path):
    directories = [x for x in allfolders if x[1] == parent_id]
    for directory in directories:
        d = '.%s/%s/' % (path, directory[2])
        print(d)
        # make sure the target directory exists before copying into it
        if not os.path.isdir(d):
            os.makedirs(d)
        # get all the files that belong in this directory
        for document in [x for x in alldocuments if x[1] == directory[0]]:
            try:
                location = [x for x in document_locations if document[0] == x[0]][0]
                print('copy %s%s %s%s' % (ktdocument, location[2], d, location[1]))
                shutil.copy2('%s%s' % (ktdocument, location[2]), '%s%s' % (d, location[1]))
            except (IndexError, IOError, OSError):
                print('ERROR exporting - Usually due to a linked document.')

        # recurse into the children of this folder
        create_folder_tree(parent_id=directory[0], path='%s/%s' % (path, directory[2]))

create_folder_tree(parent_id=1, path='')
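
The script rescans the full folder and document lists for every directory, which is part of why it is not very efficient. If that becomes a problem on a larger install, one option is to group the rows by their parent id once up front. The sketch below shows the idea on the same `(id, parent_id, name)` tuples; the helper names `index_rows` and `walk_folders` are hypothetical and it has not been run against a real KT database.

```python
from collections import defaultdict

def index_rows(rows, key_index):
    """Group row tuples by the value found at position key_index."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key_index]].append(row)
    return grouped

def walk_folders(folders_by_parent, parent_id, path=''):
    """Yield (folder_id, relative_path) for every folder below parent_id."""
    for folder in folders_by_parent.get(parent_id, []):
        folder_path = '%s/%s' % (path, folder[2])
        yield folder[0], folder_path
        # recurse into the children of this folder
        for item in walk_folders(folders_by_parent, folder[0], folder_path):
            yield item

if __name__ == '__main__':
    # made-up sample rows in the same shape as the folders query
    sample = [(2, 1, 'contracts'), (3, 2, '2015'), (4, 1, 'invoices')]
    for folder_id, folder_path in walk_folders(index_rows(sample, 1), 1):
        print('%s %s' % (folder_id, folder_path))
```

The same `index_rows` helper could group documents by folder id and content versions by document id, turning each lookup into a dictionary access instead of a list scan.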

Decoding CAPTCHA’s Handbook

Some time ago I wrote an article about Decoding CAPTCHA’s which has become what appears to be the first resource most people encounter when searching for information in the decoding CAPTCHA space.

I had continued to write about CAPTCHAs over the years, with posts scattered around the web. A while ago I started to consolidate all of my content on this blog and realised that I had considerably more CAPTCHA-related articles than I thought. Some were in an unfinished or unpublished state. I had considered posting them all online but instead decided to polish it all up into a much better resource and publish it as a book.

The book is now online and available for sale. It’s not what you would call a top seller, but it has produced enough sales to offset some hosting costs for the blog, which was one goal. I also wanted to test the waters when trying to sell an info product. Info products, according to Amy Hoy and Rob Walling, are an excellent platform to start learning how to build recurring revenue online. Certainly one thing I have learnt is that something borderline unethical such as the book will never become a great seller. Simply put, the audience is unlikely to convert into a high-sales business because of the shady nature of the content.

Anyway, for those interested in purchasing a copy you can either visit the Decoding CAPTCHA’s page or use the links below.

Decoding CAPTCHA's Book
Looking for a practical guide to CAPTCHA decoding? All About CAPTCHA’s. This eBook will teach you how to identify weaknesses and exploit CAPTCHA’s from beginning to end.

Buy now using LeanPub

Buy now using Gumroad

C# as a Language from old Google+ Post

The more I use C# as a language for writing things the more I am convinced that its approach really is the best language approach out there.

The unit test support is excellent which allows development speed to be just as fast as any dynamic language (Python, PHP, Perl).

The static typing catches so many issues before you get to runtime and allows sweeping changes without breaking things.

Unlike Java it has the var keyword (saves time and improves readability) and so many more useful functions which yes you can replicate but are just built in and work correctly.

Then you get to the really good stuff. LINQ is awesome. The lazy loading allows you to implement a repository pattern over your database which is just awesome. Set up the basic select * from, then add extension methods allowing you to chain whatever you need, e.g.

var person = _dbContext.GetPerson().ByUserName(username).ByPassword(password);

100% elegant, easy to test, easy to write, easy to read and understand, and generally works exactly as you would expect without any hidden gotchas. And because it’s lazy it doesn’t chew resources sucking back everything from the database.

You can use functional programming techniques if you wish, and with the new async/await keywords you can work in a node.js style if you wish, with static typing and all the existing library support.

Or you can continue to work in a C like manner, or mix it up with objects, procedural code and functional.

I switched back to Java not that long ago to write a simple server using Jetty and even with things like Guice (best DI implementation I have used so far) and Guava it was still painful. Less painful, but I really felt that the compiler was fighting me from doing things in an elegant manner most of the time. Even adding the “var” keyword would improve Java in a massive way. Add some functional programming in there and I would be pretty happy.

I just wish C# would run on the JVM, as I would use it for pretty much everything in a heartbeat. As it is, the Mono support is missing the stuff I really want and isn’t as seamless as the experience should be. A pity really, as C# is in my experience the nicest production-ready language to work with today.

searchcode the path to profitability

One of the things that has always bothered me about searchcode was that it never generated any money. Not a huge problem in itself for a side project, but the costs to run it are not insignificant due to the server requirements. I had looked into soliciting donations, but I considered this highly unlikely to produce enough revenue to cover costs, considering that at least one other site was unable to make enough to cover even basic costs through Patreon (although since a recent HN post this has jumped from around $20 a month to over $150).

This had caused me back in the early days to use BuySellAds to attempt to cover some costs. While this certainly helped, there was usually not enough revenue due to the way the ads are sold. The issue with BuySellAds is that you have to pitch your website as a good place to sell ads against. This is not something I had any great desire to do. Simply put, if I am going to spend my time marketing something, it’s going to be something that is more directly marketable than an advertising-funded website. The other issue is that BuySellAds does not work with HTTPS, which became a deal breaker for me once Google announced that they were going to use it as a signal for ranking.

This led me, just a few months ago, to consider shutting down the site. However, while doing some poking around I noticed a newish advertising platform called Carbon Ads. It is invite only, so I decided to email them with my pitch. Being a developer/designer focused ad platform, it seemed like a natural fit. After a bit of back and forth I can now happily say that searchcode is running Carbon Ads. I have no numbers to report at this time, but I am very hopeful, based on some estimates made with Carbon Ads, that I should be able to push searchcode to cover costs and hopefully produce some profit, all of which will be used to improve the service.