Filetree Listing

Just a quick update to searchcode. A few small tweaks here and there, but the largest is that there is now a file tree listing option which will show the file tree for any project. An example would be going to this file and then clicking the “View File Tree” button on the top left.

An example screenshot of the result of this is included below.


Updates to

Just a quick post to list some updates to The first is a slight modification to the home page. A while ago I received an email from the excellent Christian Moore who provided some mock-ups of how he felt it should look. I loved the designs, but was busy working on other issues. Thankfully however in the last week or so I found the time to implement his ideas and the result is far more professional to me.


It certainly is a large change from the old view but one that I really like as it is very clean. The second update was based on some observations I had. I was watching a colleague use searchcode to try finding some logic inside the Thumbor image resizer project. I noticed that once he had the file open he was trying to navigate to other files in the same project. Since the only way to do this was to perform a new search (perhaps with the repo option) I decided to add a faster way. For some projects there is now an instant search box above the code result which allows you to quickly search over all the code inside that repository. It uses the existing searchcode APIs (which you can use as well!) to do so. Other ways of doing this would include the project tree (another piece of functionality I would like to add) but this was done using already existing code so was very easy to implement. An example would be going to this result in pypy and searching for import.


As always I would love some feedback on this, but as always expecting none (par for the course).


Decoding Captchas Presentation

A few days ago there was a lack of speakers for #SyPy which is the Sydney Python meet-up held most months and sponsored by Atlassian. I had previously put my hand up to help out if this situation ever came up and was mostly ready with a presentation about Decoding Captchas. I did not expect it to be so full that people were standing (largest crowd I had ever seen there). Thankfully it seemed to go over well and while I need to get more practice at public speaking I did enjoy it. A few choice tweets that came out of the end of the event,


Anyway you can get all the code and the slides via Decoding Captchas Bitbucket or Decoding Captchas GitHub.

C# XML Cleaner Regex

One of the most annoying things I deal with is XML documents with invalid characters inside them. Usually caused by copying and pasting from MS Word, the document ends up with invisible characters that you cannot easily find and that cause XML parsers to choke. I have encountered this problem enough that I thought a quick blog post would be worth the effort.

As such here mostly for my own reference is a regular expression for C# .NET that will clean invalid XML characters from any XML file.

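// In .NET \x takes exactly two hex digits, so \uXXXX is used above 0xFF. Strings are UTF-16, so valid astral characters are surrogate pairs and only unpaired surrogates are matched.
const string InvalidXmlChars = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";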

Simply take the XML document as a string and do a simple regular expression replace over it to remove the characters. You can then import into whatever XML document structure you want and process it as normal.
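For comparison, here is a minimal sketch of the same idea in Python, assuming Python 3 where strings hold full code points. The names INVALID_XML_CHARS and clean_xml and the sample input are made up for illustration. The character class is the negation of the XML 1.0 Char production.

```python
import re
import xml.etree.ElementTree as ET

# Matches any character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_xml(text):
    """Return text with every character that XML 1.0 forbids removed."""
    return INVALID_XML_CHARS.sub("", text)


# Hypothetical input: a vertical tab and a NUL pasted into a field.
dirty = "<name>Jane\x0b Doe\x00</name>"
root = ET.fromstring(clean_xml(dirty))
```

Parsing the dirty string directly would be rejected by the parser, while the cleaned string loads as normal.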

For the record, the reason I encountered this was when I was building a series of WSDL web-services over HR data. Since the data had multiple sources merged during an ETL process and some of them were literally CSV files I hit this issue a lot. Attempts to sanitize the data failed as it was overwritten every few hours and it involved multiple teams to change the source generation. In the end I just ran the cleaner over every field before it was returned and everything worked perfectly.

How searchcode is tested

As a big believer in testing as a method of improving code quality, one of my goals when rewriting searchcode was to ensure that it had a comprehensive test suite. What follows is how I am testing searchcode, the issues I hit and where I think I am getting the most value.

A brief overview of the architecture is required to understand what needs to be tested. At its core searchcode is a MySQL backed Django application using Sphinx as its indexing solution. The Sphinx pipeline is the first thing that I wanted tested when I started the project. Because of the modifications to Sphinx’s indexing pipeline and to what characters it indexes, Sphinx itself became a candidate for testing. Not its code per se but how it is configured. The tests start with the Sphinx config file. Before each run it is copied from source control and configured to be used by Sphinx. A collection of terms which are known to have caused issues in the past is kept. During the test each term is indexed using Sphinx, then a series of queries covering positive and negative matching cases is run. Due to the high amount of integration these tests take a while to run and generally are only run before doing a full deployment or when changing how the pipeline operates.

The next series of tests is achieved using Django’s unit test framework. These are not really pure unit tests as Django will spin up a temp database, which does however ensure that the database connections work along with permissions, which saves testing these separately. Each test is written to be as isolated as possible with multiple assertions. These are reasonably fast to run and are usually run in a test driven development manner as a result. The tests are split by models with all view tests existing in a single file. View tests are very simple and generally integration tests are preferred. The model tests are very detailed and attempt to get 100% branch coverage of all code paths.

There are a number of custom libraries, celery tasks and helper modules. Each of these contains its own separate tests which can be run individually. However since they are actually quite fast they tend to get lumped together and run in batch. A lot of these tests run regressions which are taken from issues found on the main website. Rather than craft individual specific tests I have taken the approach of including the offending source code as the test case. This ensures that any bizarre edge case in the logic itself is caught. Because of this there is a large collection of source code which does not belong to searchcode found with these tests.

A larger integration test suite is also run against a running Django instance. The instance is usually just the Django test webserver used for development, however occasionally the tests are run against a full stack. The following steps are done for this. The first is that the parser used by searchcode to index code is run against searchcode’s own source code. This is then indexed before any integration test is run. The integration suite then, using very simple urllib2 calls, hits various URLs and checks the return values. These tests are favoured for things such as API calls as they mimic as closely as possible how the API would actually be used. The asserts are generally as simple as checking the return code and whether a specific string is in the response. This avoids creating flaky tests that break with any style changes.
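A check of this kind can be sketched roughly as follows. This is an illustration rather than the actual suite: it uses Python 3’s urllib.request in place of urllib2, the URL path and JSON body are made up, and a tiny stub server stands in for the running Django instance so the example is self-contained.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Ignore any system proxy settings so a local test server stays reachable.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class StubHandler(BaseHTTPRequestHandler):
    """Stand-in for the site under test; returns one fixed JSON body."""

    def do_GET(self):
        body = b'{"results": [{"filename": "views.py"}]}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


def check_url(url, expected_text):
    """True when the URL returns HTTP 200 and the body contains expected_text."""
    with OPENER.open(url, timeout=5) as response:
        body = response.read().decode("utf-8")
        return response.status == 200 and expected_text in body


def run_checks():
    """Start the stub server, run two checks against it, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), StubHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    try:
        return [
            check_url(base + "/api/search/?q=views", "views.py"),
            check_url(base + "/api/search/?q=views", "no-such-string"),
        ]
    finally:
        server.shutdown()
        server.server_close()
```

Because the assertion only looks at the status code and one substring, markup or styling changes in the response do not break it.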

Other integration tests run against the small amount of JavaScript running on the site. These are very limited and consist of running a PhantomJS headless browser against several URLs which should produce Ajax requests. The calls are checked to ensure that the requests were made and that there were no JavaScript issues on the page. They act more as a sanity filter because, while the request is checked, the result is not, hence they do not generally catch changes to the API breaking things. Certainly there is room for improvement here.

Finally there is a collection of load tests. These are run very infrequently since the production rollout. They consist of a collection of URLs which are fed into siege. For the test a full production ready instance is spun up with a subset of production data. The siege test is run against this and at the conclusion the error logs and memcached stats are pulled back for analysis. This is mostly a manual process of checking to ensure that everything seems fine. Since the actual production deployment it has only been run when performing major caching changes which had the potential to cause large outages. The other collection of load tests consists of running SQL which increases the database size to about 100,000 items before indexing. This quickly identifies any performance problems when large algorithm changes are made.

With the exception of load tests all tests are grouped into fast and slow categories. The fast consist of the Django, library and full suite of integration tests. Generally they run in 20 seconds or less and are run frequently during development, and always before committing changes. When work is performed on a specific module its tests are run in isolation. Before any production deployment the full suite of tests is run, which takes about two minutes. Each group of tests belongs to a separate Fabric task, which allows for quick checks over most of the codebase to ensure everything is working correctly before pushing changes live.

As would be expected the most value comes from the integration tests running against an active website. These tests cover all of the functionality and exercise most of the code paths. It is also quite lucky that searchcode is able to “self host” and run integration tests against its own code. This makes it a real test case and has proven to be one of the most valuable parts of the whole process.

The whole test suite was designed from the beginning to have no “flaky” tests. That is, every test must run in the same manner every time it is run. Since integration tests are the usual suspect for flaky tests, this meant purging the database and reloading it each time, which slowed down the test suite but saved time tracking down issues with broken tests. There are also no external calls to 3rd parties which really helps ensure that the tests work each time.
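The purge and reload approach can be illustrated with a small sketch. It assumes an in-memory SQLite database as a stand-in for MySQL, and the table, fixture rows and test names are invented for the example.

```python
import sqlite3
import unittest

# Hypothetical fixture rows; a real suite would load known-good data.
FIXTURE_ROWS = [(1, "views.py"), (2, "models.py")]


def reset_database(conn):
    """Drop everything and reload the fixture so each test starts identically."""
    conn.execute("DROP TABLE IF EXISTS code")
    conn.execute("CREATE TABLE code (id INTEGER PRIMARY KEY, filename TEXT)")
    conn.executemany("INSERT INTO code VALUES (?, ?)", FIXTURE_ROWS)
    conn.commit()


class CodeTableTest(unittest.TestCase):
    # One shared connection, as with a long-lived test database.
    conn = sqlite3.connect(":memory:")

    def setUp(self):
        # Runs before every test method, so no test sees another's changes.
        reset_database(self.conn)

    def test_delete_is_not_seen_by_other_tests(self):
        self.conn.execute("DELETE FROM code")
        count = self.conn.execute("SELECT COUNT(*) FROM code").fetchone()[0]
        self.assertEqual(count, 0)

    def test_fixture_is_always_complete(self):
        count = self.conn.execute("SELECT COUNT(*) FROM code").fetchone()[0]
        self.assertEqual(count, len(FIXTURE_ROWS))
```

The reset costs time on every test, which is the slowdown mentioned above, but the second test passes regardless of what the first one deleted.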

I have found that these suites in practice have caught all issues before they went live. In fact the only time that any breaking bug was introduced was due to myself skipping parts of the process and pushing live before running the full spectrum of tests. Generally most issues are due to being unable to have a local copy of production data due to size constraints.

Any questions or comments? Feel free to email me at or hit me up on twitter @boyter

Regular Expressions are Fast, Until they Aren’t

TL;DR: Regular expressions are fast, until they aren’t. How I got a 20x performance improvement by switching to string functions.

With the new version of one of the main things I wanted to address was performance. The previous version had all sorts of performance issues which were not limited to the usual suspects such as the database or search index.

When developing the new version one of the tasks listed in my queue was to profile search operations for anything slowing things down. Sadly I have since lost the profile output but observed that one of the main speed culprits is the format_results function inside the code model. For most queries I tried while it was the slowest operation it wasn’t worth optimising simply because its impact was so low. I did however keep it in the back of my mind that if there were any performance issues it would be the first thing to look at.

The final step in any profiling however is to load some production data and run a load test. My personal preference is to generate a heap of known URLs and tell a tool like Siege to go crazy with them. The results showed that 95% of pages loaded very quickly, but some took over 30 seconds. These instantly prompted further investigation.

A quick look at the profiler showed that the “back of mind function” was now becoming a serious issue. It seemed that for some regular expressions over some data types the worst case time was very bad indeed.

All of a sudden that function had become a major bottleneck. This needed to be resolved by fixing the worst case without selling out the best case, which was very fast indeed. To get an understanding of what the function does you need to understand how searchcode works. When you search for a function or snippet searchcode tries to search for something that matches exactly what you are looking for first, and something containing anything in your query second. This means you end up with two match types, exact matches and fuzzy matches. The results are then processed by firstly trying to match the query exactly, and then going for a looser match.

This was implemented initially through two regular expressions like the below,

import re

exact_match = re.compile(re.escape(search_term), re.IGNORECASE)
loose_match = re.compile('|'.join([re.escape(x) for x in search_term.split(' ')]), re.IGNORECASE)

As you can see they are compiled before being handed off to another function which uses them for the actual matching. These are fairly simple regular expressions with the first just looking for any match and the second a large OR match. Certainly you would not expect them to cause any performance issues. Reading the following on Stack Overflow regarding the differences certainly seems to suggest that unless you are doing thousands of matches the performance difference should be negligible.

At heart searchcode is just a big string matching engine, and it does many thousands or hundreds of thousands of match operations for even a simple search. Since I don’t actually use any of the power of regular expressions the fix is to change the code so that we pass in an array of terms to search for and use a simple Python in operator check.

exact_match = [search_term.lower()]
loose_match = [s.lower().strip()
                for s in search_term.split(' ')]
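To make the in operator check concrete, here is a minimal sketch of how term lists like these might be used against a line of code. The function names build_terms and classify_line are made up for illustration and are not the actual format_results code.

```python
def build_terms(search_term):
    """Lower-case the query once, returning (exact, loose) term lists."""
    exact = [search_term.lower()]
    # split() with no argument also drops empty strings, which would
    # otherwise match every line when tested with `in`.
    loose = [s.lower() for s in search_term.split()]
    return exact, loose


def classify_line(line, exact, loose):
    """Return 'exact', 'loose' or None for a single line of code."""
    lowered = line.lower()
    if any(term in lowered for term in exact):
        return "exact"
    if any(term in lowered for term in loose):
        return "loose"
    return None
```

One detail worth noting: splitting on a single space can produce empty strings when the query contains repeated spaces, and an empty string is "in" every line, so the sketch uses split() without an argument.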

The results? Well remember we want to improve the worst case without selling out the best case, but the end result was pages that were taking nearly a minute to return were coming back in less than a second. All other queries seemed to come back either in the same time or faster.

So it turns out regular expressions are fast most of the time. Certainly I have never experienced any performance issues with them up till now. However at edge cases like the one in searchcode you can hit a wall and it’s at that point you need to profile and really re-think your approach.

Rebound Project

As I mentioned in the previous entry I had started work on a new project I called portfold. Built and released without fanfare I have quietly killed it before even the month is out. Why? I realise now that it was a rebound project similar to a rebound relationship. I had been getting a little down on and wanted to branch out to some new technology. Once done though the itch was scratched and now I am back to working on searchcode again.

There have been a few long standing issues with searchcode that I have finally tracked down and fixed. Quickly I am going to outline what each was and what I did about it.

The first issue was that if you filtered down the results using the right hand controls they would be lost the moment you paged through the results. I cannot remember why I didn’t implement this the first time, though I suspect it was due to issues with the code dealing with nonexistent filters which I have since fixed.

Another issue was that when paging through results sometimes searchcode would report a page where none existed, or a page would have only a few results rather than the 20 that should be there. This one took a long time to track down. The root cause turned out to be a slight difference between how code is indexed vs searched. When searchcode indexes it performs various splitting operations over the code to ensure that a search for will find results with that exact term. It also splits it so a search for api duckduck go will also work. This process works when performing a search as well, however it’s not the same logic due to one being done in Sphinx’s index pipeline and the other in Python logic. The result is that I neglected to implement the split on the single . in the search logic. Very annoying to say the least. I had been getting bug reports for a while about it and had built a very large bug report on it. One of the main issues with it was being unable to replicate it on my local machine. This is due to the size of the data which has outgrown the ability for a single machine to deal with.
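The mismatch can be shown with a simplified sketch. The splitting rules and the os.path example are assumptions for illustration only. The real index side lives in the Sphinx pipeline and also keeps the exact term, which is left out here.

```python
import re


def split_terms(text, split_on_dot=True):
    """Split text into lower-cased terms on whitespace, and on dots if enabled."""
    pattern = r"[\s.]+" if split_on_dot else r"\s+"
    return [t for t in re.split(pattern, text.lower()) if t]


# Index side: splits on the single dot.
indexed = split_terms("os.path.join")
# Search side before the fix: the dot split was missing.
buggy_query = split_terms("os.path", split_on_dot=False)
# Search side after the fix: same rule as the index.
fixed_query = split_terms("os.path")
```

With the dot split missing the query becomes one term that the index never stored, which is the kind of disagreement that produces result counts that do not match what can actually be fetched.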

One other annoying issue was that the parsers for source code had a nasty habit of crashing the instance they were running on, which required a reboot to resolve. I run these on the smallest possible DigitalOcean instances. Turns out that by default these do not have any swap space, which ended up causing the issue. Thankfully adding swap space is relatively easy and they have now been running for days without issue. I have since queued up another million or so projects to index, which should be searchable soon.

The last issue and one I am still addressing is performance. I was a little disturbed by how long it was taking for even cached results to return. A bit of poking around concluded that all of the Ajax requests to get the similar results were flooding the available gunicorn workers I had on the backend. This needed to be rectified. The first step was to set up nginx to read cached results directly from memcached and avoid hitting the backend at all. The second was to increase the number of workers slightly. The result is that the page does load a lot faster for the average request now with less load on the server.

In addition to the above fixes I have also added a new piece of functionality. You can now filter by user/repo name. This works in a similar manner to the existing repo filter however you need to supply the username first and delimit with a forward slash. An example would be a search for select repo:boyter/batf
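As a rough illustration of the filter syntax, here is one way such a query could be pulled apart. This is a hypothetical sketch, not searchcode’s parser, and the parse_query name and returned keys are invented.

```python
def parse_query(query):
    """Split a query into plain search terms and an optional repo filter."""
    terms, username, repo = [], None, None
    for token in query.split():
        if token.startswith("repo:"):
            value = token[len("repo:"):]
            # "user/repo" carries a username; a bare "repo" does not.
            if "/" in value:
                username, repo = value.split("/", 1)
            else:
                repo = value
        else:
            terms.append(token)
    return {"terms": terms, "username": username, "repo": repo}
```

Calling parse_query("select repo:boyter/batf") yields the term select with username boyter and repo batf, while a plain repo:batf leaves the username empty.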

All in all I am happy with progress so far. The current plan is to focus on upgrading the API such that everything is exposed for those who wish to build on it. This will include public APIs for the following: getting a code result, finding related results, advanced filters, and pretty much everything required to in theory create a clone of searchcode. In addition I want to expand the index as much as possible by pulling in Bitbucket and more of GitHub. As always any feedback is greatly appreciated and I do try to implement requests.

Portfold: Topic Research Software

Every few years I have a habit of starting a new project. The goal always being to scratch my own itch and learn some new technology in the process. While I am still working on I really wanted to play with what I had learnt there and apply it to something new.

You can view it at

Recently I have been taking an interest in various topics such as “Oil Gas Pipeline Failure Rates” and “Hydroelectric Dam Environmental Impacts” (both generated using Portfold). My standard workflow was to enter a search term into my favorite search engine and then click through the results looking for the interesting information. This was extremely time consuming, so I was looking to find a better way to do it.

The problem as such is that searching for information presents a collection of related blue links. What do you do then?

The result is a project called Portfold. Portfold is a form of Topic Research Software. The workflow is pretty simple. You enter some search terms, and a list of results is displayed. Wait for a few moments and the results are pulled down, with all of the information extracted and in a print friendly format. Information such as the number of words and if the result is a PDF are displayed inline. It was built using Django with MySQL for the backend and AngularJS for the frontend making it a single page application.

I am not aware of anyone else solving this issue in this way. This may or may not be a good thing but I am gaining a lot of value from it.

You can view it at however you will need an account. Email me at if you would like one, however keep in mind I will be looking to charge for it at some point in the future simply because there are direct costs to running the service.

If you would like to see some examples of what sort of reports Portfold can produce please see the following detailed report outputs.

Prescription Drugs Suicide Link
Dichloro Diphenyl Trichloroethane
Backup And Recovery Approaches Using Aws
Australia Privacy Laws
Oil Gas Pipeline Failure Rates
Hydroelectric Dam Environmental Impacts
Democrat Candidates Republican Candidates Equal Pay
Migration And Maritime Powers Legislation Amendment (Resolving The Asylum Legacy Caseload) Bill 2014
Man Haron Monis

Why isn’t 100% free software

The recent surge in attention to searchcode from the Windows 9/10 naming fiasco resulted in a lot of questions being raised about searchcode’s policy about free software (open source). While the policy is laid out on the about page some have raised the issue of the ethics about using such a website which is not 100% free (as in freedom).

For the purposes of the rest of this post “free software” is going to refer to software defined as not infringing freedom rather than free as in beer.

Personally I believe in the power of free software. I have contributed to projects over the years, either submitting bug reports, patches/pull requests, supplying feedback or releasing my own code under a permissive license. With searchcode my policy has been to contribute back where it makes personal sense (I will get to the personal part later). This includes so far opening all of the documentation parsers into the DuckDuckGo Fathead project, releasing a simple bug tracker I was using and promising to donate 10% of profits to the free software projects which I find most beneficial.

The personal portion is the important one to take note of here. The main reason why searchcode is not 100% free software is the support burden it would create for myself. I am running searchcode 100% on my own time, using my own tools, servers, software and hardware paid for by myself. All of this takes place outside my day job. I really do not have the time to deal with the support overhead that is bound to come from opening such a project.

How do I know it will create such an overhead? Consider the following personal examples. Search for “decoding captchas” in your choice of search engine. Somewhere near the top you will find an article hosted on this website I wrote several years ago. The article was written so that anyone trying to decode a CAPTCHA would have a good foundation of ideas and code to work from. To date, this single article has resulted in nearly an email every day from someone asking for assistance. This would not normally be a problem, except that 99% of the emails consist of either questions that the article already answered, or something to the effect of “I want to decode a captcha, plz supply the codes in VB”. Polite responses to such emails where I state I will not do this even if I were being and that everything required is already available have resulted in abuse and threats.

Another example is from the following collection of posts and the source on github. This small collection of posts also produces a lot of email from people asking questions. To reduce the overhead I ended up writing a follow up post which I can redirect a lot of the questions to. Even with both these resources I still get a lot of questions about how they can just set things up and have it working.

My point here isn’t to complain. I wrote the above knowing I would get requests for help. I usually amend the post in question when asked a few times for details. Generally I enjoy responding to each request. The issue is that searchcode is a lot more complicated than the above projects combined, and the cost of the support requests that are bound to come from opening it outweighs any benefits I am likely to see. I could indeed write documentation for this but since I do not believe in infrastructure as text documents I prefer to keep it all as code.

You might note that I am being purely selfish about this, and that opening is not necessarily for my benefit but the benefit of others and you would be right. However you also need to remember that it shouldn’t be detrimental to me either. Keep in mind searchcode makes no money and is a side project which fills a need I had and which I am happy working on on a day to day basis.

That said, if I ever get bored of searchcode and close it down I promise to release 100% of the source as free software. I also will revisit this current policy if searchcode ever produces income beyond covering hosting expenses.

I hope this clears up some of the questions that keep popping up. If you disagree (and I am sure many do) feel free to email me outlining your reasons. I am not above changing my mind if delivered a well reasoned argument.

Interesting Code Comment

Found the following comment in some code I had modified a few years ago.

Just to set this up, it’s an existing application I had no hand in creating, and is a total atrocity of 180,000 lines of untested (and pretty much un-testable) code which through the abuse of extension methods lives in a single class spread out across multiple files.

This is evil but necessary. For some reason people have put validation rules here rather then in the bloody ValidationHelper. Thanks to their incompetence or genius... we now have no idea if we add the extra validation in the correct place and call it here if it will work. Since this is also 180,000 lines of non tested nor testable code (without refactoring) I have no confidence in making any changes. Sure we have subversion but that dosnt allow us to code fearlessly ripping apart methods and refactoring since we have no test safety net.

I guess the obligatory car analogy would be driving down the highway, carrying nuclear waste, in an open container, in a snow storm, with acid/lsd/ice fueled drugie ninja bikies attacking you, while on fire, while juggling chainsaws, and all of a sudden you need to change the tyre. So much is going on that its you dont want to risk it and then when forced to do so 
you know its going to end up badly.

If you are still reading this then for the love of all things holy, help by refacting stuff so we can test it properly. The DAO layer should be fairly simple but everthing else is a shambles. 

Rant time over. Lets commit sin by adding more validation.