So I have rolled out quite a few performance improvements. searchcode is MUCH MUCH faster than it was before. I also added various relevance improvements across the board. This included indexing characters like !@#$%^&*()-= etc… So now things like the Perl regex match operator =~ are valid search terms. Of course you can combine these with normal characters to get really complex search terms such as $localdate =~ /([0-9]+):([0-9]+):([0-9]+)/; Pretty awesome stuff I think.
Finally, a few days ago I was watching some show from the UK about brands, and in particular technology brands. One of the bits had Larry Page (of Google fame) talking about how tall the stack would be if you printed all the information Google has, and how they can search it instantly and return the answer you want.
Naturally I had to work this out for myself on my own index. I made a few assumptions and came up with the following.
searchcode has about 2.7 billion lines of code indexed. Assuming the average A4 piece of paper holds about 32 lines, you get 86 million pages to hold the printed index. Note I'm ignoring lines longer than 80 characters; let's assume the paper has infinite width. Assuming a piece of paper is about 0.1mm thick, we can work out that if we stacked our paper in a pile we would have a pile 8,600 m in height. For comparison Mount Everest is about 8,800 meters in height, so our imaginary pile is almost within a stone's throw of the top and certainly well into the death zone!
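The arithmetic above is easy to check with a few lines of Python. This is only a rough sketch using the rounded inputs from the post (2.7 billion lines, 32 lines per page, 0.1mm per sheet); the function and constant names are made up for illustration. With exactly 2.7 billion lines it lands at roughly 84 million pages and about 8,400 m, a little under the figures quoted above, so the real line count was presumably somewhat higher than the rounded number.

```python
# Back-of-the-envelope check of the paper stack estimate.
# All inputs are the rounded figures from the post, not exact index stats.
LINES_INDEXED = 2_700_000_000  # "about 2.7 billion" lines of code
LINES_PER_PAGE = 32            # assumed lines on one A4 page
PAGE_THICKNESS_MM = 0.1        # assumed thickness of one sheet
EVEREST_M = 8_800              # approximate height of Mount Everest


def pages_needed(lines, lines_per_page=LINES_PER_PAGE):
    """Number of pages needed to print the given number of lines."""
    return lines / lines_per_page


def stack_height_m(lines, lines_per_page=LINES_PER_PAGE,
                   thickness_mm=PAGE_THICKNESS_MM):
    """Height in metres of the printed index stacked in a single pile."""
    return pages_needed(lines, lines_per_page) * thickness_mm / 1000


if __name__ == "__main__":
    pages = pages_needed(LINES_INDEXED)
    height = stack_height_m(LINES_INDEXED)
    print(f"{pages / 1e6:.1f} million pages")
    print(f"{height:,.0f} m tall, {EVEREST_M - height:,.0f} m short of Everest")
```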
Apparently not many. I have been monitoring how the search has been used since I rolled out code search and noticed that most people are just typing in search terms and not regex search terms. Of course this means some results are not what people are expecting.
I have thus changed the way searches work. It now does an exact match of whatever it is you are looking for UNLESS you wrap your search term in /, in which case it will be treated as a regex search. Take for example the following:
[cb]at vs /[cb]at/
The first will search for the exact term “[cb]at” whereas /[cb]at/ will expand out to search for terms cat OR bat anywhere in the file.
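To make the rule concrete, here is a minimal sketch of how a search box could tell the two apart. This is not searchcode's actual code; build_matcher and matches are hypothetical names, and it uses Python's re module purely to illustrate the behaviour: plain queries are escaped so they match literally, and queries wrapped in / are compiled as regular expressions.

```python
import re


def build_matcher(query):
    """Compile a search box query into a pattern.

    Hypothetical helper: a query wrapped in slashes is treated as a
    regular expression, anything else is matched literally.
    """
    if len(query) > 2 and query.startswith("/") and query.endswith("/"):
        return re.compile(query[1:-1])   # regex search
    return re.compile(re.escape(query))  # exact match


def matches(query, text):
    """True if the query matches anywhere in the text."""
    return build_matcher(query).search(text) is not None


if __name__ == "__main__":
    print(matches("[cb]at", "the cat sat on the mat"))    # literal search
    print(matches("/[cb]at/", "the cat sat on the mat"))  # regex search
```

The literal search only hits text that actually contains the characters [cb]at, while the slash-wrapped version hits anything containing cat or bat.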
The change is slight, but should make things more accessible for most people, since it is obvious from what I have been seeing that people just expect to type into a box and get results back. The only other change is that I have disabled the “Google” instant inspired search for any code search, i.e. when the checkbox is checked. The reason being it fired off so many requests that my meagre hardware was unable to cope. I think it actually works better now, but I can always turn it back on later should hardware increases permit.
Finally I had a look at the backend and it looks like there are over 2.5 billion lines of code indexed now. I do have plans to pull all sorts of interesting stats out of the code and display it on the front page but that’s a subject for another blog post.
Trawling through the logs of search queries I noticed that some people are using the Google Code Search lang syntax. An example that I spotted was the following: “throw.* runtime_error lang:c++”. Note the lang:c++ portion.
Of course this ended up spitting back no useful results because the lang:c++ was treated as part of the search. Well no longer is this the case. searchcode now supports the lang keyword in addition to the existing ext one (useful for extensions).
The list of known languages is included below. The only issue at the moment is those with a space in them, which aren't picked up by the filter correctly, but I will push a fix for those soonish.
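As a rough illustration of the idea (not the real implementation; split_filters is a made-up name), the filter keywords can be peeled off the query before the rest is handed to the search itself:

```python
def split_filters(query):
    """Separate lang:/ext: filters from the actual search terms.

    Hypothetical sketch: returns the remaining search string and a dict
    of whichever filters were found in the query.
    """
    terms, filters = [], {}
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and key in ("lang", "ext") and value:
            filters[key] = value  # e.g. lang:c++ gives {"lang": "c++"}
        else:
            terms.append(token)   # everything else is part of the search
    return " ".join(terms), filters


if __name__ == "__main__":
    print(split_filters("throw.* runtime_error lang:c++"))
```

Only tokens that start with lang: or ext: are treated as filters, so something like std::string still goes through as a normal search term.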
EDIT: The ones with spaces are no longer an issue. Just write the language minus the spaces, e.g.
Bourne Again Shell
Patran Command Language
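For what it is worth, squashing a language name down to its no-spaces form is a one-liner. The sketch below uses a made-up helper name, and the lowercasing is my assumption for the example; the only thing stated above is that the spaces get dropped.

```python
def lang_keyword(name):
    """Build a lang: filter from a language name by dropping the spaces.

    Hypothetical helper. Lowercasing is an assumption made for this
    example; the post only says to remove the spaces.
    """
    return "lang:" + "".join(name.split()).lower()


if __name__ == "__main__":
    for language in ("Bourne Again Shell", "Patran Command Language"):
        print(language, "becomes", lang_keyword(language))
```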