Feedback Loop

About a month ago searchcode.com managed to sit on the front page of Hacker News (HN) for most of a day, which produced a lot of useful feedback for me to act on. You can read the full details in the earlier post, searchcode: A source code search engine.

Between the HN comments, feedback I received via tweets and responses to republished articles, I ended up with a list of things I needed to work on.

The first and main change requested was to the way searchcode was matching results. By default it was looking for exact matches, so if you searched for something like “mongodb find” it would look for that exact text. Quite a few people asked for this to change, with the expectation that matching would work like GitHub’s. This has now taken effect. A sample search that came up is included below with the new logic,

https://searchcode.com/?q=MongoDBObject+find+lang%3AScala
vs
https://github.com/search?q=MongoDBObject+find+language%3Ascala&type=Code&ref=searchresults

I believe the results are more in line with the expectation.
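
To give a feel for the difference in behaviour (this is just an illustrative sketch, not the actual searchcode matching code, which is done inside the index), the old logic treated the query as a single phrase while the new logic only requires every term to appear somewhere in the file:

# Illustrative sketch only; searchcode's real matching happens in Sphinx.
def exact_phrase_match(query, document):
    # Old behaviour: the whole query must appear verbatim.
    return query.lower() in document.lower()

def all_terms_match(query, document):
    # New behaviour: every term must appear, in any order or position.
    doc = document.lower()
    return all(term in doc for term in query.lower().split())

snippet = "MongoDBObject.find(query).map(result => result)"
print(exact_phrase_match("MongoDBObject find", snippet))  # False
print(all_terms_match("MongoDBObject find", snippet))     # True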

The second thing requested was that I point at the new Google endpoints for GWT and Android. This has been done and the code is currently sitting in the queue ready to be indexed. I expect this to take place in the next few days. In addition, I have pulled in a lot of new repositories from GitHub and Bitbucket using their APIs. The number of projects now being indexed is well over 5 million and growing every day.

The last request came from the user chdir on HN. I hope they won’t mind but I have included their request below,

“I use sourcegraph occasionally and mostly rely on Gihub search. I wish the search has all those advanced refinement options that grep & Sublime Text search has. Some examples would be to use regex, search a word within a scope of lines, search within search results etc. Additionally, it’s very useful to be able to sort the search results by stars/forks. Sometimes I just want to see how popular projects have implemented a certain feature. A keyword based search isn’t enough for that.

I guess these features are very expensive & slow to implement but it would be super useful if it can be achieved. Source code search is for geeks so it is probably fair to say that a truly advanced & complex interface won’t turn away users.”

The above is actually one of the more difficult requests, but its suggestions are on my radar of things to do. To start with I have rolled out an experimental feature which displays matching results. One of the issues with code search is that, developers being good developers, there is a lot of duplicate code shared across projects. Since you don’t want to see the same file repeated thousands of times when you search for something like “jquery mobile”, you need to work out which content is duplicated and filter it out.

Sometimes, however, you want to see those results. It’s a piece of functionality that existed in Google Code Search which I had wanted to implement for a long time. Well, it is now here. The duplicates are worked out using a few methods: matching MD5 hashes, file names, and a new hash I developed myself which converges the more similar two files are. Similar to simhash, this new hash however does not require any post-calculation operations to determine if two files are a match. More details of this will come in a later post after I iron out all the kinks.
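
The convergent hash will have to wait for that later post, but as a rough sketch of the simplest part of the approach (the function names and layout here are mine, not searchcode’s), the first pass is just grouping files whose MD5 hashes match:

# Rough sketch of first-pass duplicate grouping by MD5; the similarity
# hash used for near-duplicates is not shown here.
import hashlib
from collections import defaultdict

def md5_of_file(path):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

def group_duplicates(paths):
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(path)
    # Only hashes shared by two or more files count as duplicates.
    return {h: files for h, files in groups.items() if len(files) > 1}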

Anyway, you can now see this functionality. Try searching for “jquery mobile” and look next to the title. You will see something along the lines of “Show 76 matches”.

Clicking the link will expand out the matching files for this result. Each of the matching results shows the filename, the project and the location within the project. All of course are clickable and link to the duplicate file.

Lastly you can also do the same on the code page itself. Just click “Show 5 matches” on the top right of the result page to see a list of the matching files.

There is more to come in the next few weeks which I am excited about but for the moment I would love to get feedback on the above.

What is special about DDG

Since I am still bringing all my content together I thought I would pull in this post from Quora asking what is special about DuckDuckGo.

1. Privacy enabled by default. This certainly helped it gain traction when the NSA revelations came around. DDG is not the only privacy-conscious search engine, but it is certainly one that pushes it as a feature more than others. See https://duckduckgo.com/privacy

2. !bang syntax. Remember back in the early days of Google when they had a “Try this search on” link and a list of other search engines? !bang is that idea on steroids. It makes the cost of switching to DDG much lower than any other search engine, because you are not locked in when its results are lacking.

3. Gabriel Weinberg (the creator) came up with a way to index the web for a fraction of the cost of anyone else, i.e. use someone else’s index through web APIs such as Bing/Yahoo Boss. This means DDG can have an index of billions of pages without buying hundreds of machines and then crawling and indexing. Consider Cuil as an example. BTW I wrote more about this in Building a search engine? The most important feature you can add.

4. Persistence. Quite a few search engines based on Yahoo Boss and other APIs have come and gone, however DDG continues to be worked on. Just being around for 4 years gives it credibility.

5. DuckDuckHack. If you are a developer you can go to DuckDuckHack and add the functionality you want. This may not sound that compelling, but because DDG already has traffic it’s a good incentive for start-ups and others to build on the DDG API to get some traction, which means they want to use DDG and promote it, which fuels growth.

6. People. The people working on DDG are pretty awesome.

7. Uncluttered results. The results are pretty much just some links without too much fancy stuff going on.

Sphinx and searchcode

There is a rather nice blog post on the Sphinx Search blog about how searchcode uses Sphinx. Since I wrote it, I thought I would include a slightly edited (for clarity) version below. You can read the original here.

I make no secret that the indexer which powers searchcode is Sphinx Search, which for those who do not know is a stand-alone indexing and search engine similar to Solr.

Since searchcode’s inception in 2010, Sphinx has powered the search functionality and provides the raw searching and faceting functionality across 19 billion lines of source code. Each document has over 6 facets and there are over 40 million documents in the index at any time. Sphinx serves over 500,000 queries a month from this with the average query returning in less than a second.

searchcode is an unusual beast in that while it doesn’t index as many documents as other large installations, it indexes a lot more data. This is due to the average document size being larger and the way source code is delimited. The result of these requirements is that the index, when built, is approximately 3 to 4 times larger than the data being indexed. The special transformations required are accomplished with a thin wrapper on top of Sphinx which modifies the text processing pipeline. This is applied when Sphinx is indexing and when running queries. The resulting index is over 800 gigabytes in size on disk and when preloaded consumes over 25 gigabytes of RAM.

This is all served by a single i7 quad core server with 32 gigabytes of RAM. The index is distributed and split into 4 parts, allowing all queries to run over network agents and scale out seamlessly. Because of the size of the index and how long indexing takes, each part is only re-indexed every week and a small delta index is used to provide recent updates.

Every query run on searchcode runs multiple times as a method of improving results and avoiding cache rot. The first query run uses the Sphinx ranking mode BM25 and subsequent queries use SPH04. BM25 uses a little less CPU than SPH04, hence new queries use it as return time to the user is important. All subsequent queries run as an offline asynchronous task which does some further processing and updates the cache, so the next time the query is run the results are more accurate. Commonly run queries are added to the asynchronous queue after the indexes have been rotated to provide fresh search results at all times. searchcode is currently very CPU bound and, given the resources, could improve search times 4x with very little effort simply by moving each of the Sphinx indexes to an individual machine.
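
As a rough sketch of that flow, assuming the standard sphinxapi Python client and a made-up index name and cache (this is not the actual searchcode code), it looks something like this:

# Sketch of the two-pass ranking approach; index name and cache wiring
# are illustrative only.
import sphinxapi

def search(query, ranker):
    client = sphinxapi.SphinxClient()
    client.SetServer('localhost', 9312)
    # Extended matching is required for the ranking mode to take effect.
    client.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED2)
    client.SetRankingMode(ranker)
    return client.Query(query, 'codesearch')

def search_fast(query, cache):
    # First pass: BM25 is cheaper on CPU, so the user gets results quickly.
    result = search(query, sphinxapi.SPH_RANK_BM25)
    cache[query] = result
    return result

def refresh_cache(query, cache):
    # Later, asynchronous pass: SPH04 gives better ordering for the cache.
    cache[query] = search(query, sphinxapi.SPH_RANK_SPH04)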

searchcode updates to the latest stable version of Sphinx for every release. This has happened for every version from 0.9.8 all the way to 2.1.8 which is currently being used. There has never been a single issue with each upgrade and each upgrade has overcome an issue that was previously encountered. This stability is one of the main reasons for having chosen Sphinx initially.

The only issues encountered with Sphinx to date were some limits on the number of facets, which have been resolved in the latest versions. Any other issue has been due to configuration mistakes, which were quickly resolved.

In short, Sphinx is an awesome project. It has seamless backwards compatibility, scales up to massive loads and still returns results quickly and accurately. Having since worked with Solr and Xapian, I would still choose Sphinx as searchcode’s indexing solution. I consider Sphinx the Nginx of the indexing world. It may not have every feature possible, but it’s extremely fast and capable, and the features it does have work for 99% of use cases.

Estimating Sphinx Search RAM Requirements

If you run Sphinx Search you may want to estimate the amount of RAM it requires in order to pre-cache the index. This can be done by looking at the size of the .spa and .spi files on disk. On any Linux system you can run the following command, pointing it at the directory where your Sphinx index(es) are located.

ls -la /SPHINXINDEX/|egrep "spa|spi"|awk '{ SUM += $5 } END { print SUM/1024/1024/1024 }'

This will print out the number of gigabytes required to store the sphinx index in RAM and is useful for guessing when you need to either upgrade the machine or scale out. It tends to be accurate to within 200 megabytes or so in my experience.
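
If you prefer to do it from code, a few lines of Python give the same estimate (the path below is just a placeholder):

# Sum the sizes of the .spa and .spi files to estimate Sphinx preload RAM.
import os

def sphinx_ram_estimate_gb(index_dir):
    total = 0
    for name in os.listdir(index_dir):
        if name.endswith('.spa') or name.endswith('.spi'):
            total += os.path.getsize(os.path.join(index_dir, name))
    return total / 1024.0 / 1024.0 / 1024.0

print(sphinx_ram_estimate_gb('/SPHINXINDEX/'))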

searchcode next

There seems to be a general trend of calling the new release of your search engine “next” (see Iconfinder and DuckDuckGo), and so I am happy to announce and write about searchcode next.

As with many projects, searchcode had some very humble beginnings. It started out as an “I need to do something” side project, originally just indexing programming documentation. Time passed and the idea eventually evolved into a search engine for all programming documentation, and then, with Google Code Search being shut down, into a code search engine as well.

searchcode was running on a basic LAMP stack: Ubuntu Linux as the server, with PHP, MySQL and Apache. APC Cache was installed to speed up PHP, with some Memcached calls to take heat off the database. The CodeIgniter PHP framework was used for the front end, with a lot of back-end processes written in Python.

Never one to agree with the advice that you should never rewrite your code, I did exactly that. searchcode is now a Django application. The reasons for this are varied, but essentially it was running on an older server (Ubuntu 10.04) and a now defunct web framework (CodeIgniter). I figured that since I had to rewrite portions anyway, I may as well switch over to a language that I prefer and want to gain more experience in.

As mentioned, searchcode is now a Django application but is still backed by MySQL. Sphinx provides the search index, with a healthy mix of RabbitMQ and Celery for back-end tasks. Deployments and server configuration are automated through the use of Fabric, and Memcached is included for speed. Of course some of the original back-end processes still exist as cron jobs, but they are slowly being moved over to Celery tasks. It still runs on Ubuntu Server since that is the Linux distribution I am most comfortable with.
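
To give a flavour of how the RabbitMQ and Celery side hangs together (a minimal sketch with made-up task and broker settings, not the actual searchcode tasks):

# Minimal Celery sketch; the broker URL and task are illustrative only.
from celery import Celery

app = Celery('searchcode_tasks', broker='amqp://guest@localhost//')

@app.task
def index_repository(repo_url):
    # Work that used to run as a cron job can be queued from the web
    # application and retried on failure as a Celery task.
    print('fetching and indexing %s' % repo_url)

# Queued from anywhere in the Django code base with:
# index_repository.delay('https://github.com/boyter/Phindex')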

Of particular note, searchcode runs on two servers, which could probably be reduced to a single one at its current size but allows for growth. Both are dedicated 4 core i7 boxes provided by Hetzner, with 3 terabytes of disk space each. The only difference between them is that the first has 16 gigabytes of RAM and the second, which holds the index, has 32 gigabytes. The first runs the nginx web server (talking through Gunicorn to Django), the database and Memcached. The second exclusively runs the Sphinx index (more details about Sphinx to come).

Load averages before the move were rather chaotic. I had seen spikes up to 100, which for a 4 core box is pretty horrible. The new version, even under extreme pressure (from a Siege test and GoogleBot), maxes out at about 2, with the search server spiking to 4 for brief periods if a lot of un-cached searches hit all of a sudden. The other advantage is that searches come back much faster with the new setup. Average page responses have dropped considerably.

Heavily unit tested, the application runs through a battery of tests before each deployment, including unit, integration and smoke tests, which do a reasonable job of catching issues before they are deployed. The other benefit, of course, is that the code-base is testable, which is generally a good thing.
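
A smoke test in a Django setup like this can be as simple as the following (an illustrative example, not the actual searchcode test suite; the URLs and parameters are assumptions):

# Example Django smoke tests; paths and expectations are illustrative only.
from django.test import TestCase

class SearchSmokeTest(TestCase):
    def test_homepage_renders(self):
        response = self.client.get('/')
        self.assertEqual(response.status_code, 200)

    def test_search_returns_page(self):
        response = self.client.get('/', {'q': 'jquery mobile'})
        self.assertEqual(response.status_code, 200)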

There is more to come and I am excited about the future of searchcode.

Chatbot Nonsense

There has been a lot of coverage recently about a chat-bot, “Eugene Goostman”, passing a variant of the Turing test by convincing 30% of the testers that it was indeed human, by posing as a 13 year old Ukrainian boy (to make misspellings and grammar mistakes forgivable, I suppose).

Naturally I had to give it a try, and frankly I can’t see how something like this could convince anyone that it’s a living human being. I asked a few questions such as “Whereabouts in the Ukraine do you live?”, the response being “It is a country in SE Europe that’s all I can tell you”, which is not exactly promising.

For fun I scripted it up against a copy of the Eliza chat-bot and had them talk among themselves. A quick excerpt is included below and is rather amusing.

Eliza: Is it important to you that your attitude to I ?
Eugene: It is highly possible. It is highly possible.
Eliza: Lets discuss further why your feelings right now.
Eugene: Just because 2 plus 2 is 5! Wonna ask me something more?
Eliza: Don't any other reasons come to mind ?
Eugene: Is it a kind of manace? Maybe, let's talk about something else? What would you like to discuss?
Eliza: Why the uncertain tone ?
Eugene: Why? Just because! :) Why? Just because! :)
Eliza: Why do you ask ?
Eugene: Oh, but why not? Well, let's go on though.
Eliza: Are such questions much on your mind ?
Eugene: Who knows... Well, let's go on though.
Eliza: Have you asked such questions before ?
Eugene: I would rather not talk about it if you don't mind. Oh, what a fruitful conversation;-)
Eliza: What would you do if I don't mind ?

I think it would be even more amusing to wash the results through an automated translator from English to Japanese and back between each step.

Not so unique GUID

I have been doing a lot of work with the Sitecore CMS recently. One of the things you quickly learn is how it relies on GUIDs for pretty much everything. This means of course that when you start testing and need to supply GUIDs into your tests, you end up with lots of GUIDs that look like the following sprinkled through your code: {11111111-1111-1111-1111-111111111111}

Today I remarked to a colleague that we should be using things like “deadbeef” for the first part of the GUID. He suggested that we should try to actually write something. With a little bit of 1337 speak this is actually possible. Naturally we got back to work, but with a little free time I quickly coded up a simple Python application to generate “phrased” GUIDs. Some examples follow,

silicles-oafs-blob-tael-declassified -> {5111c1e5-0af5-b10b-7ae1-dec1a551f1ed}
deedless-gait-soft-goes-eisteddfodic -> {deed1e55-9a17-50f7-90e5-e157eddf0d1c}
libelist-diel-alls-flit-disaffiliate -> {11be1157-d1e1-a115-f117-d15aff111a7e}
offstage-diel-labs-scat-classifiable -> {0ff57a9e-d1e1-1ab5-5ca7-c1a551f1ab1e}

None of the above make much sense, but by looking at the outputs you can attempt to write something, such as,

 cassette soft gold dice collectibles
{ca55e77e-50f7-901d-d1ce-c011ec71b1e5}

Very zen. Some rough back-of-napkin calculations give my program something like 10,000,000,000,000 combinations of GUIDs based on the word list I supplied. I may just turn it into an online GUID generator like this one http://www.guidgenerator.com/

EDIT – You can now get these guids at https://searchcode.com/guid/
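
The generator itself is not complicated; a simplified sketch of the idea (not the exact code behind the page above, and using a much smaller translation table) looks something like this:

# Simplified sketch of "phrased" GUID generation via 1337 speak; the real
# generator works from a large word list and picks length-matched words.
LEET = {'o': '0', 'i': '1', 'l': '1', 's': '5', 't': '7', 'g': '9'}
HEX = set('0123456789abcdef')

def to_hex_word(word):
    # Translate a word and return it only if every character ends up hex.
    candidate = ''.join(LEET.get(c, c) for c in word.lower())
    return candidate if set(candidate) <= HEX else None

def phrase_to_guid(phrase):
    # Words must translate to 8, 4, 4, 4 and 12 hex characters respectively.
    parts = [to_hex_word(w) for w in phrase.split()]
    if [len(p) if p else 0 for p in parts] != [8, 4, 4, 4, 12]:
        raise ValueError('phrase does not fit the GUID layout')
    return '{%s}' % '-'.join(parts)

print(phrase_to_guid('cassette soft gold dice collectibles'))
# prints {ca55e77e-50f7-901d-d1ce-c011ec71b1e5}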

Implementing C# Linq Distinct on Custom Object List

Ever wanted to implement a Distinct over a custom object list in C# before? You quickly discover that it fails to work as expected. Sadly there is a lack of decent documentation about this and a lot of FUD. Since I lost a bit of time over it, hopefully this blog post can be picked up as the answer.

Thankfully it’s not as difficult as you would imagine. Assuming you have a simple custom object which contains an Id, and you want to use that Id to get a distinct list, all you need to do is add the following to the object.

// Two CustomObject instances are considered equal when their Ids match,
// which is what lets Distinct() collapse duplicates by Id.
public override bool Equals(object obj)
{
	var other = obj as CustomObject;
	return other != null && this.Id == other.Id;
}

// Distinct() buckets objects by hash code first, so GetHashCode must be
// consistent with Equals and should also be based on Id.
public override int GetHashCode()
{
	return this.Id.GetHashCode();
}

You need both due to the way that Linq works. I suspect that under the hood it is using a hash to work out what is the same, hence GetHashCode.

Installing Phindex

This is a follow on piece to my 5 part series about writing a search engine from scratch in PHP which you can read at http://www.boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/

I get a lot of email requests asking how to set up Phindex on a new machine and start indexing the web. Since the articles and code were aimed at someone with a degree of knowledge of PHP, this is somewhat understandable. What follows is how to set things up and start crawling and indexing from scratch.

The first thing to do is set up some way of running PHP and serving pages. The easiest way to do this is to install Apache and PHP. If you are doing this on Windows or OSX then go and install XAMPP (https://www.apachefriends.org/index.html). For Linux, follow whatever guide applies to your distribution. Be sure to follow the directions correctly and verify that you can create a file with phpinfo(); inside it which runs in your browser correctly.

For this I am using Ubuntu Linux and all folder paths will reflect this.

With this setup what you need to do next is create a folder where we can place all of the code we are going to work with. I have created a folder called phindex which I have ensured that I can edit and write files inside.

Inside this folder we need to unpack the code from github https://github.com/boyter/Phindex/archive/master.zip

boyter@ubuntu:/var/www/phindex$ unzip master.zip
Archive:  master.zip
2824d5fa3e9c04db4a3700e60e8d90c477e2c8c8
   creating: Phindex-master/
.......
  inflating: Phindex-master/tests/singlefolderindex_test.php
boyter@ubuntu:/var/www/phindex$

At this point everything should be running, however as nothing is indexed you won’t get any results if you browse to the search page. To resolve this without running the crawler, download the following archive http://wausita.com/documents10000.tar.gz and unpack it into the crawler directory.

boyter@ubuntu:/var/www/phindex/Phindex-master/crawler$ tar zxvf documents10000.tar.gz
......
boyter@ubuntu:/var/www/phindex/Phindex-master/crawler$ ls
crawler.php  documents  documents10000.tar.gz  parse_quantcast.php
boyter@ubuntu:/var/www/phindex/Phindex-master/crawler$

The next step is to create two folders. The first is called “documents” and the second “index”. These are where the processed documents and the index will be stored, respectively. Once these are created we can run the indexer. The folders need to be created in the root folder like so.

boyter@ubuntu:/var/www/phindex/Phindex-master$ ls
add.php  crawler    index       README.md   tests
classes  documents  interfaces  search.php
boyter@ubuntu:/var/www/phindex/Phindex-master$

With that done, let’s run the indexer. If you cannot run PHP from the command line, just browse to the PHP file using your browser and the index will be built.

boyter@ubuntu:/var/www/phindex/Phindex-master/$ php add.php
INDEXING 1
INDEXING 2
.....
INDEXING 10717
INDEXING 10718
Starting Index
boyter@ubuntu:/var/www/phindex/Phindex-master/$

This step is going to take a while, depending on how fast your computer is. What’s happening is that each of the crawled documents is processed, saved to the document store, and then finally each of the documents is indexed.

At this point everything is good. You should be able to perform a search by going to the search page, like so,

Phindex Screenshot

At this point everything is working. I would suggest you now start looking at the code under the hood to see how it all fits together. Start with add.php, which gives a reasonable idea of how to process the crawled documents and how to index them. Then look at search.php to get an idea of how to use the created index. I will be expanding on this guide over time based on feedback, but there should be enough here at this point for you to get started.