Running the Numbers

So I have rolled out quite a few performance improvements. searchcode is MUCH MUCH faster than it was before. I also made various relevance improvements across the board, including indexing characters like !@#$%^&*()-= and so on. Things like the Perl regex match operator =~ are now valid search terms. Of course you can combine special and normal characters to build really complex search terms, such as $localdate =~ /([0-9]+):([0-9]+):([0-9]+)/; Pretty awesome stuff, I think.

I also set up a ripoff of blekko’s 3 card monte, which you can view here: searchcode compare to koders. It compares searchcode’s results to those of koders (the current leader). I have already started acting on the results. For example, I have removed HTML, XML and JSON results from the main index. You can still search for them using the lang or extension syntax, e.g. lang:html <html>, but for standard searches such as <html> you will get more code results. The next thing on my list is to remove “compressed” results such as minified JavaScript and the like, which should really clean the results up.

Finally, a few days ago I was watching a show from the UK about brands, and in particular technology brands. One of the bits had Larry Page (of Google fame) talking about how tall the pile would be if you printed all the information Google has, and how they can search it instantly and return the answer you want.

Naturally I had to work this out for myself on my own index. I made a few assumptions and came up with the following.

searchcode has about 2.7 billion lines of code indexed. Assuming the average A4 piece of paper holds about 32 lines, you need about 86 million pages to hold the printed index. Note I’m ignoring lines longer than 80 characters; let’s assume the paper has infinite width. Assuming a piece of paper is about 0.1 mm thick, we can work out that if we stacked our paper in a pile, it would be about 8,600 m high. For comparison, Mount Everest is about 8,800 m high, so our imaginary pile comes within a stone’s throw of the top and is certainly well into the death zone!
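
The arithmetic is easy to check with a quick Python sketch. It uses the same assumptions as above (32 lines per page, 0.1 mm per sheet, one side printed); the function name is made up for illustration. Note that a flat 2.7 billion lines works out to roughly 84 million pages and a pile of roughly 8,400 m, so the 86 million figure above corresponds to a line count closer to 2.75 billion.

```python
def stack_height_m(lines, lines_per_page=32, sheet_thickness_mm=0.1):
    """Return (pages, pile height in metres) for a single-sided printout."""
    # Ceiling division: a partly filled last page still needs a sheet.
    pages = -(-lines // lines_per_page)
    # Convert the total thickness from millimetres to metres.
    height_m = pages * sheet_thickness_mm / 1000
    return pages, height_m


pages, height = stack_height_m(2_700_000_000)
print(f"{pages:,} pages, about {height:,.0f} m tall")
```

Changing the line count or the lines per page shows how sensitive the height is to those guesses.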

Who Knows Regex

Apparently not many. I have been monitoring how the search has been used since I rolled out code search, and noticed that most people just type in plain search terms rather than regex search terms. Of course, this means some results are not what people are expecting.

I have thus changed the way searches work. It now does an exact match of whatever it is you are looking for UNLESS you wrap your search term in forward slashes (/), in which case it is treated as a regex search. Take the following for example:

[cb]at vs /[cb]at/

The first will search for the exact term “[cb]at” whereas /[cb]at/ will expand out to search for terms cat OR bat anywhere in the file.
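
As a rough illustration of that rule, here is a sketch in Python. It is not searchcode's actual implementation, just one way the behaviour could be expressed: the helper name build_pattern is hypothetical, and escaping the term with re.escape is an assumption about how an exact match might be done.

```python
import re


def build_pattern(query):
    """Compile a query: /.../ is a regex, anything else is matched literally."""
    if len(query) >= 2 and query.startswith("/") and query.endswith("/"):
        # Strip the surrounding slashes and treat the rest as a regex.
        return re.compile(query[1:-1])
    # Escape regex metacharacters so the term is matched exactly as typed.
    return re.compile(re.escape(query))


text = "the cat sat near [cb]at and a bat"

print(build_pattern("[cb]at").findall(text))    # ['[cb]at']
print(build_pattern("/[cb]at/").findall(text))  # ['cat', 'bat']
```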

The change is slight, but should make things more accessible for most people, since it is obvious from what I have been seeing that people just expect to type into a box and get results back. The only other change is that I have disabled the “Google Instant” inspired search for any code search, i.e. when the checkbox is checked. The reason is that it fired off so many requests that my meagre hardware was unable to cope. I think it actually works better now, but I can always turn it back on later should hardware increases permit.

Finally I had a look at the backend and it looks like there are over 2.5 billion lines of code indexed now. I do have plans to pull all sorts of interesting stats out of the code and display it on the front page but that’s a subject for another blog post.

Expanded Syntax lang Keyword Now Supported

Trawling through the logs of search queries, I noticed that some people are using the Google Code Search lang syntax. An example that I spotted was the following: “throw.* runtime_error lang:c++”. Note the lang:c++ portion.

Of course this ended up spitting back no useful results, because the lang:c++ was treated as part of the search. Well, this is no longer the case. searchcode now supports the lang keyword in addition to the existing ext one (useful for extensions).

The list of known languages is included below. The only issue at the moment is with those that have a space in them, which aren’t picked up by the filter correctly, but I will push a fix for those soonish.

EDIT: The ones with spaces are no longer an issue. Just write the language minus the spaces, e.g.

test lang:BourneShell
NSString lang:ObjectiveC

ActionScript
Ada
ASP
ASP.Net
Assembly
awk
bc
Bourne Again Shell
Bourne Shell
C
C Shell
C/C++ Header
C#
C++
CMake
COBOL
ColdFusion
CSS
Cython
D
DAL
Dart
DOS Batch
DTD
Erlang
Expect
Fortran 77
Fortran 90
Fortran 95
Go
Groovy
Haskell
HTML
IDL
Java
Javascript
JSP
Kermit
Korn Shell
lex
Lisp
Lua
m4
make
MATLAB
Modula3
MSBuild scripts
MUMPS
MXML
NAnt scripts
Objective C
Objective C++
Ocaml
Octave
Oracle Forms
Oracle Reports
Pascal
Patran Command Language
Perl
PHP
Python
Rexx
Ruby
Ruby HTML
Scala
sed
SKILL
Smarty
Softbridge Basic
SQL
SQL Data
Tcl/Tk
Teamcenter def
Teamcenter met
Teamcenter mth
text
Unknown
VHDL
vim script
Visual Basic
XAML
XML
XSD
XSLT
yacc
YAML
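
To show how keyword filters like lang and ext could be separated from the rest of a query, here is a small sketch in Python. It is an illustration only, not how searchcode does it: the function name split_query, the returned structure, and the case-insensitive comparison with spaces removed are all assumptions, and KNOWN_LANGUAGES holds just a handful of entries from the list above.

```python
# A few entries from the language list above; the real list is longer.
KNOWN_LANGUAGES = ["Bourne Shell", "C++", "Objective C", "Python"]


def normalise(name):
    """Drop spaces and case so 'ObjectiveC' lines up with 'Objective C'."""
    return name.replace(" ", "").lower()


# Map the normalised spelling back to the display name.
LOOKUP = {normalise(name): name for name in KNOWN_LANGUAGES}


def split_query(query):
    """Split a query into (search terms, filters) on lang: and ext: tokens."""
    terms = []
    filters = {"lang": [], "ext": []}
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and key in filters and value:
            if key == "lang":
                # Unknown languages are kept as typed (an assumption).
                value = LOOKUP.get(normalise(value), value)
            filters[key].append(value)
        else:
            # Anything else stays part of the search term.
            terms.append(token)
    # Note: rejoining with single spaces collapses repeated whitespace.
    return " ".join(terms), filters


print(split_query("throw.* runtime_error lang:c++"))
print(split_query("NSString lang:ObjectiveC"))
```

With this approach a token such as std::vector is left alone, because only the exact prefixes lang: and ext: are treated as filters.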