Improving the Index

The last couple of weeks I have been spending improving the index based on feedback about duplicate results. When indexing source code you will find a lot of duplicate libraries and copied code all over the place. Usually when searching for something like jQuery you don’t want to see thousands of results of the same file. However when searching for method calls you do want to see results which call the method in a similar way as you may be looking for examples of how to use it.

This makes trimming out duplicate content harder then it seems initially. You can’t just throw out any results that look similar as they may actually be what you are searching for in the latter case.

Initially was stripping out duplicates based on a hash of the files contents. This was particularly effective in stripping out copied files and libraries such as jQuery. Where it did fall down however was any file slightly modified such as with an additional newline, or a new version with a few bug fixes. With bug fixes and slight modifications occurring all over the place you end up with a lot of duplicate content. As the amount of code indexed increased this became more and more of an issue to the point that some searches had pages of useless results as each result was 99% the same file.

The example presented to me was a search for glGetFragDataLocation. The file glop.cpp would be repeated over and over again in multiple variations of what was fundamentally the same thing. At the time I was working on a solution to the issue but it was very nice to get a concrete example.

There are two methods of removing duplicate content in a search. The first is to pre-process it (like the hash that was already in use) and the second is to do it at runtime. The former cuts down on processing time, while the latter is a little more flexible as you can tweak your algorithm on the fly and code around edge cases. The recommendation of Anna Patterson of Cuil fame is to sort out duplicate content at runtime.

My personal experience in this sort of thing however is going for a hybrid approach. Pre-processing has advantages such as less overhead when serving results and since I don’t have infinite resources its useful to cut down the amount of work required at runtime where possible.

The long and short of this is that you should notice search results for queries to return gradually better and better results from this point. The hash checking still remains (cheap and fast) but I have expanded this to include duplicate content checks for results. This is done through a combination of a backend process running checks over content identifying and marking content as being a duplicate and checks when serving up the results to identify which results should be stripped out. Have a spin and let me know how it works out for you.

Growing Pains

When I started the amount of hardware required was fairly low. I did a lot of processing on other machines and pushed the results through allowing the machine that serves the site to just focus on the job of delivering results. The result was everything was running on a nice VPS provided by 2.9 Ghz CPU 512 MB ram and about 30 GB of hard disk space. This was more then enough to support the index I had and do everything required. Then Google dropped Code Search and I decided I would try to fill in the gap.

My inital implementation which is live now has about 1.2 billion lines of code indexed. As you can imagine this pushed my little VPS to the limits. With some compression I got the indexed code itself reduced to about 12 GB on disk but the index itself was another story. The index of the code itself was well over 60 GB in size. I upped the size of the hard disk by 200 GB as a stopgap measure.

However the previous change was just a stopgap. As of last night I have greatly increased the size of the index by a factor of 2-3x (final counts to come). This naturally increases the size of things on disk and adds some additional strain to the CPU and ram. Because I need massive amounts of storage, cpu, ram and network traffic to continue to expand the index I have moved over to a dedicated server with 32x the amount of ram 8x the CPU and 15x the hard disk space. This will also allow me to start with realtime index updates in the coming weeks which should keep things running smoothly.

Anyway take it for a spin. The index is far larger now with more projects indexed and the speed should be much faster across the board.