Rebound Project

As I mentioned in the previous entry, I had started work on a new project I called Portfold. Built and released without fanfare, I have quietly killed it before the month is even out. Why? I realise now that it was a rebound project, similar to a rebound relationship. I had been getting a little down on searchcode.com and wanted to branch out to some new technology. Once done, though, the itch was scratched and now I am back to working on searchcode again.

There have been a few long-standing issues with searchcode that I have finally tracked down and fixed. I am going to quickly outline what each one was and what I did about it.

The first issue was that if you filtered down the results using the right-hand controls, the filters would be lost the moment you paged through the results. I cannot remember why I didn't implement this the first time, though I suspect it was due to issues with the code dealing with nonexistent filters, which I have since fixed.
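
As a rough illustration of the fix, the paging links need to carry the active filters along with the search term. This is only a minimal sketch, not the actual searchcode code; the function name and the parameter names (q, p, lan, src) are assumptions made up for the example.

```python
from urllib.parse import urlencode


def page_url(base_path, query, page, filters=None):
    """Build a paging link that keeps the currently selected filters.

    filters maps a (hypothetical) filter name to its selected values,
    for example {"lan": ["Python"]}.
    """
    params = [("q", query), ("p", page)]
    # Sort the filter names so the same selection always yields the same URL.
    for name, values in sorted((filters or {}).items()):
        for value in values:
            params.append((name, value))
    return base_path + "?" + urlencode(params)


print(page_url("/", "select", 2, {"lan": ["Python", "Java"], "src": ["github"]}))
```

Every "next page" link built this way repeats the filter parameters, so the selection survives paging.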

Another issue was that when paging through results, searchcode would sometimes report a page where none existed, or a page would contain only a few results rather than the 20 that should be there. This one took a long time to track down. The root cause turned out to be a slight difference between how code is indexed and how it is searched. When searchcode indexes, it performs various splitting operations over the code to ensure that a search for api.duckduck.com will find results with that exact term. It also splits it so a search for api duckduck go will work. This process runs when performing a search as well; however, it's not the same logic, because one is done in Sphinx's indexing pipeline and the other in Python. The result is that I neglected to implement the split on the single . in the search logic. Very annoying, to say the least. I had been getting bug reports about it for a while and had built up a very large bug report on it. One of the main difficulties was being unable to replicate it on my local machine. This is due to the size of the data, which has outgrown the ability of a single machine to deal with it.
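
To make the mismatch concrete, here is a minimal sketch of the kind of splitting the search side needs to mirror. It is not the real searchcode logic (the index side lives in Sphinx, not Python), and expand_terms is a hypothetical name; the point is only that the query has to be split on the single . in the same way the indexed text was.

```python
def expand_terms(text):
    """Return each whitespace-separated term plus its dot-separated parts.

    Illustrative only: a term such as "api.duckduck.com" is kept whole
    and also broken into "api", "duckduck" and "com".
    """
    terms = []
    for token in text.lower().split():
        terms.append(token)
        # Split on the single '.' and ignore empty pieces such as "a..b".
        parts = [part for part in token.split(".") if part]
        if len(parts) > 1:
            terms.extend(parts)
    return terms


print(expand_terms("api.duckduck.com"))
# ['api.duckduck.com', 'api', 'duckduck', 'com']
```

If the search path skips the dot split while the index applies it, the two sides disagree about which terms a document contains, which is the sort of discrepancy that can produce missing or short pages.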

One other annoying issue was that the parsers for source code had a nasty habit of crashing the instance they were running on, which required a reboot to resolve. I run these on the smallest possible DigitalOcean instances. It turns out that by default these do not have any swap space, which ended up causing the issue. Thankfully adding swap space is relatively easy, and they have now been running for days without issue. I have since queued up another million or so projects to index, and they should be searchable soon.

The last issue, and one I am still addressing, is performance. I was a little disturbed by how long it was taking for even cached results to return. A bit of poking around revealed that all of the ajax requests to get the similar results were flooding the available gunicorn workers I had on the backend. This needed to be rectified. The first step was to set up nginx to read cached results directly from memcached and avoid hitting the backend at all. The second was to increase the number of workers slightly. The result is that the page now loads a lot faster for the average request, with less load on the server.
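
For nginx to serve a cached result without touching the backend, the backend has to store the response under exactly the key nginx will look up. The sketch below assumes nginx is configured with something like "set $memcached_key $request_uri;", which may not match the real setup; cache_key and the example path are hypothetical names used only for illustration.

```python
def cache_key(path, query_string=""):
    """Build the memcached key the backend stores a response under.

    Assumes nginx looks up the raw request URI, so the key is the path
    plus "?query" when a query string is present.
    """
    key = path + "?" + query_string if query_string else path
    # memcached keys are limited to 250 bytes and may not contain
    # whitespace or control characters.
    too_long = len(key.encode("utf-8")) > 250
    bad_char = any(ch.isspace() or ord(ch) < 32 for ch in key)
    if too_long or bad_char:
        raise ValueError("not a valid memcached key: %r" % key)
    return key


print(cache_key("/api/related/", "id=1"))
# /api/related/?id=1
```

When the key is missing from memcached, nginx can fall through to the gunicorn workers, which then populate the cache for the next request.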

In addition to the above fixes I have also added a new piece of functionality. You can now filter by user/repo name. This works in a similar manner to the existing repo filter; however, you need to supply the username first and delimit with a forward slash. An example would be a search for select repo:boyter/batf
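
A minimal sketch of how such a query could be pulled apart follows. It is an assumption about the general approach rather than the code searchcode actually runs, and parse_query is a hypothetical name.

```python
def parse_query(query):
    """Split a query into its search terms and an optional repo filter.

    "repo:user/name" yields both parts; a plain "repo:name" leaves the
    user as None, matching the older repo-only filter.
    """
    terms = []
    repo_filter = None
    for token in query.split():
        if token.startswith("repo:"):
            value = token[len("repo:"):]
            user, slash, repo = value.partition("/")
            if slash:
                repo_filter = {"user": user, "repo": repo}
            else:
                repo_filter = {"user": None, "repo": value}
        else:
            terms.append(token)
    return " ".join(terms), repo_filter


print(parse_query("select repo:boyter/batf"))
# ('select', {'user': 'boyter', 'repo': 'batf'})
```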

All in all I am happy with progress so far. The current plan is to focus on upgrading the API so that everything is exposed for those who wish to build on it. This will include public APIs for getting a code result, finding related results, advanced filters, and pretty much everything required to, in theory, create a clone of searchcode. In addition I want to expand the index as much as possible by pulling in Bitbucket and more of GitHub. As always, any feedback is greatly appreciated and I do try to implement requests.

Portfold: Topic Research Software

Every few years I have a habit of starting a new project. The goal is always to scratch my own itch and learn some new technology in the process. While I am still working on searchcode.com, I really wanted to play with what I had learnt there and apply it to something new.

You can view it at portfold.com

Recently I have been taking an interest in various topics such as “Oil Gas Pipeline Failure Rates” and “Hydroelectric Dam Environmental Impacts” (both generated using Portfold). My standard workflow was to enter a search term into my favorite search engine and then click through the results looking for the interesting information. This was extremely time consuming, so I was looking for a better way to do it.

The problem as such is that searching for information presents a collection of related blue links. What do you do then?

The result is a project called Portfold. Portfold is a form of Topic Research Software. The workflow is pretty simple. You enter some search terms, and a list of results is displayed. Wait a few moments and the results are pulled down, with all of the information extracted and presented in a print-friendly format. Information such as the number of words and whether the result is a PDF is displayed inline. It was built using Django with MySQL for the backend and AngularJS for the frontend, making it a single page application.

I am not aware of anyone else solving this issue in this way. This may or may not be a good thing but I am gaining a lot of value from it.

You can view it at portfold.com; however, you will need an account. Email me at bboyte01@gmail.com if you would like one, but keep in mind I will be looking to charge for it at some point in the future, simply because there are direct costs to running the service.

If you would like to see some examples of what sort of reports Portfold can produce please see the following detailed report outputs.

Prescription Drugs Suicide Link
Risperidone
Dichloro Diphenyl Trichloroethane
Backup And Recovery Approaches Using Aws
Australia Privacy Laws
Oil Gas Pipeline Failure Rates
Hydroelectric Dam Environmental Impacts
Democrat Candidates Republican Candidates Equal Pay
Migration And Maritime Powers Legislation Amendment (Resolving The Asylum Legacy Caseload) Bill 2014
Man Haron Monis