Chatbot Nonsense

There has been a lot of coverage recently about a chat-bot “Eugene Goostman” passing a variant of the Turing test by convincing 30% of the testers that it was indeed a human. It did so by posing as a 13-year-old Ukrainian boy (to make misspellings and grammar mistakes forgivable I suppose).

Naturally I had to give it a try and frankly I can’t see how something like this could convince anyone that it’s a living human being. I asked a few questions such as “Whereabouts in the Ukraine do you live?”, to which the response was “It is a country in SE Europe that’s all I can tell you”, which is not exactly promising.

For fun I scripted it up against a copy of the Eliza chat-bot and had them talk to each other. A quick excerpt is included below and is rather amusing.

Eliza: Is it important to you that your attitude to I ?
Eugene: It is highly possible. It is highly possible.
Eliza: Lets discuss further why your feelings right now.
Eugene: Just because 2 plus 2 is 5! Wonna ask me something more?
Eliza: Don't any other reasons come to mind ?
Eugene: Is it a kind of manace? Maybe, let's talk about something else? What would you like to discuss?
Eliza: Why the uncertain tone ?
Eugene: Why? Just because! :) Why? Just because! :)
Eliza: Why do you ask ?
Eugene: Oh, but why not? Well, let's go on though.
Eliza: Are such questions much on your mind ?
Eugene: Who knows... Well, let's go on though.
Eliza: Have you asked such questions before ?
Eugene: I would rather not talk about it if you don't mind. Oh, what a fruitful conversation;-)
Eliza: What would you do if I don't mind ?
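For anyone curious, the wiring for this sort of thing is trivial. A minimal Python sketch is below, with both bots stubbed out as placeholder functions since the actual Eugene Goostman and Eliza endpoints I scripted against are not shown here:

```python
def eliza(message):
    # Placeholder: a real client would send `message` to an Eliza implementation
    # and return its reply. This canned response is purely illustrative.
    return "Why do you say that " + message.lower().rstrip("?.!") + " ?"

def eugene(message):
    # Placeholder for the Eugene Goostman endpoint; again, illustrative only.
    return "It is highly possible. Wonna ask me something more?"

def converse(turns=3, opener="Hello"):
    """Bounce a message back and forth between the two bots."""
    transcript = []
    message = opener
    for _ in range(turns):
        message = eliza(message)
        transcript.append(("Eliza", message))
        message = eugene(message)
        transcript.append(("Eugene", message))
    return transcript

for speaker, line in converse():
    print(f"{speaker}: {line}")
```

Each bot simply receives the other's last utterance as its input, which is all the "conversation" above amounts to.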

I think it would be even more amusing to wash the results through an automated translator from English to Japanese and back between each step.

Regarding the Zombie Apocalypse

This piece of content is taken from a comment I left on the TechZing podcast blog. I should note I have not even begun to explore issues such as what happens to a zombie in extreme heat or cold. Of course much of the below can be disregarded if the zombie virus is airborne, but this assumes the standard zombie canon of being spread through bites.

My take on the zombie apocalypse was always that it could never happen. The reasons being,

1. The zombie’s primary enemy is also its main food source. This is like having to tackle a lion every time you feel like eating a sandwich. You are going to get mauled.

2. The zombie’s only method of reproducing is also biting its primary enemy. Again, every time you feel randy, go tackle a lion intent on mauling you. Keep in mind that in order to be effective each zombie needs to bite at least 2 humans, which leads us nicely to…

3. Humans are bloody good at killing things. This includes a great number of creatures which have far more effective killing implements than we were given by nature (lions, tigers, bears, oh my!). I don’t know about you, but I am pretty sure I could take out 20 zombies in a car without too many issues. Quite a few people have cars. Certainly more than 1 in 20 people in a first world country do. Even if they only take out 2 zombies each we are ahead.

Add in all the gun nuts looking for something to shoot, people with medieval suits of armor (bite that, zombie!), wannabe ninjas with swords, kung-fu experts, bomb nuts and the fact that a tank or even a lightly armored vehicle is totally impervious to a zombie, and I can’t see them lasting too long. Heck, a mob armed with rocks only has to take out one zombie each to be effective, as each zombie still needs to bite two humans before being stoned to death. You can see the numbers are clearly on our side. Even armed with sticks I can see humans winning this one.
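For what it’s worth, the bite arithmetic above can be turned into a toy model. This is purely my own back-of-envelope sketch: each encounter pits one zombie against one human, the human wins with probability p_kill, and otherwise is bitten and turns. A zombie then expects to land (1 - p_kill) / p_kill bites before dying, so the "bite at least 2 humans" margin corresponds to p_kill staying below 1/3:

```python
import random

def outbreak(humans=1000, zombies=10, p_kill=0.5, seed=42):
    """Run one-on-one encounters until a side is wiped out; return survivors."""
    rng = random.Random(seed)
    while humans > 0 and zombies > 0:
        if rng.random() < p_kill:
            zombies -= 1   # the human wins the encounter
        else:
            humans -= 1    # the human is bitten and turns
            zombies += 1
    return humans, zombies

# A zombie expects (1 - p) / p bites before dying, so needing 2+ bites
# per zombie means the horde only grows comfortably when p_kill <= 1/3.
print(outbreak(p_kill=0.5))
print(outbreak(p_kill=0.25))
```

Even this crude model shows how little killing competence humanity needs per encounter for the numbers to fall our way.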

Anyway that’s my thinking on this. What I find scarier is that there are people prepared for the zombie apocalypse, and even worse, quite a few of them are hoping it will occur.

Who Knows Regex

Apparently not many. I have been monitoring how search has been used since I rolled out code search, and noticed that most people are just typing in plain search terms, not regex search terms. Of course this means some results are not what people are expecting.

I have thus changed the way searches work. It now does an exact match of whatever it is you are looking for UNLESS you wrap your search term in forward slashes, in which case it will default to a regex search. Take for example the following,

[cb]at vs /[cb]at/

The first will search for the exact term “[cb]at”, whereas /[cb]at/ will expand out to search for the terms cat OR bat anywhere in the file.
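Under the hood the dispatch is simple. A rough Python sketch of the convention is below; this is my own illustration, not the actual searchcode code, which isn’t shown in this post:

```python
import re

def build_matcher(term):
    """Bare terms match literally; /wrapped/ terms are treated as regex."""
    if len(term) > 2 and term.startswith("/") and term.endswith("/"):
        return re.compile(term[1:-1])       # regex search
    return re.compile(re.escape(term))      # exact (literal) search

line = "a [cb]at is a cat or a bat"
assert build_matcher("[cb]at").findall(line) == ["[cb]at"]
assert build_matcher("/[cb]at/").findall(line) == ["cat", "bat"]
```

The key design point is that `re.escape` turns the bare term into a literal, so typing regex metacharacters by accident no longer changes what you match.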

The change is slight, but should make things more accessible for most people, since it is obvious from what I have been seeing that people just expect to type into a box and get results back. The only other change is that I have disabled the “Google Instant” inspired search for any code search, i.e. when the checkbox is checked. The reason being it fired off so many requests that my meagre hardware was unable to cope. I think it actually works better now, but I can always turn it back on later should hardware increases permit.

Finally I had a look at the backend and it looks like there are over 2.5 billion lines of code indexed now. I do have plans to pull all sorts of interesting stats out of the code and display them on the front page, but that’s a subject for another blog post.

Expanded Syntax lang Keyword Now Supported

Trawling through the logs of search queries I noticed that some people are using the Google Code Search lang syntax. An example that I spotted was the following: “throw.* runtime_error lang:c++”. Note the lang:c++ portion.

Of course this ended up spitting back no useful results because the lang:c++ was treated as part of the search. Well, no longer is this the case. searchcode now supports the lang keyword in addition to the existing ext one (useful for extensions).

The list of known languages is included below. The only issue at the moment is those with a space in them, which aren’t picked up by the filter correctly, but I will push a fix for those soonish.

EDIT – The ones with spaces are no longer an issue. Just write the language minus the spaces, e.g.

test lang:BourneShell
NSString lang:ObjectiveC
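A rough sketch of how such a lang: filter can be parsed, including the minus-the-spaces rule, is below. The function and table names are my own, and only a handful of languages from the list are included:

```python
# A few entries mirroring the known-languages list; the real table is longer.
KNOWN_LANGUAGES = ["Bourne Shell", "Bourne Again Shell", "Objective C",
                   "Objective C++", "Visual Basic", "Java"]

# Map the space-stripped, lowercased form back to the canonical name,
# so "lang:BourneShell" resolves to "Bourne Shell".
LANG_LOOKUP = {name.replace(" ", "").lower(): name for name in KNOWN_LANGUAGES}

def parse_query(query):
    """Split a raw query into (search terms, language filter or None)."""
    terms, lang = [], None
    for token in query.split():
        if token.lower().startswith("lang:"):
            lang = LANG_LOOKUP.get(token[5:].replace(" ", "").lower())
        else:
            terms.append(token)
    return " ".join(terms), lang

assert parse_query("test lang:BourneShell") == ("test", "Bourne Shell")
assert parse_query("NSString lang:ObjectiveC") == ("NSString", "Objective C")
```

Stripping spaces on both sides of the lookup is what makes the no-spaces form work without needing a separate alias table.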

Bourne Again Shell
Bourne Shell
C Shell
C/C++ Header
DOS Batch
Fortran 77
Fortran 90
Fortran 95
Korn Shell
MSBuild scripts
NAnt scripts
Objective C
Objective C++
Oracle Forms
Oracle Reports
Patran Command Language
Softbridge Basic
SQL Data
Teamcenter def
Teamcenter met
Teamcenter mth
vim script
Visual Basic

Growing Index

I am happy to announce that the code index is growing nicely. Even as you read this blog (assuming anyone does read it), thousands of new files are being added to the index. I suspect in time the code index will be one of the larger ones on the web. At last count it was well over 2 billion lines of code, but I suspect this has grown quite a lot since.

As an example I was searching for the following, “mCopyMiniSpaceObject”, which can be found in the game Antares hosted on Google Code. To test out how deep other indexers had progressed I compared the results among quite a few of the search engines out there. With the exception of Koders (which produces no meaningful results to my eye) not a single other engine had a single result.

Of course I am sure a case can be found which has the opposite effect, but it is nice to have some code which can only be found through searchcode. Hopefully this will spur everyone else on to improve things.

The example searches are just below, firstly for code specific search engines,

Google Code (would expect this to have it)
Giggle Bang

And some non code specific search engines



Just a quick blog post to point out that some speed improvements have been implemented. I have a few more tricks to roll out soon, but on the whole things are much more snappy now. I have also been modifying my indexers and can now update the index ad-hoc rather than in large batches.

I have some more stuff coming soon, including search restricted to languages, e.g. Java / C++, and an updated code view.

Finally I added some CSS3 rounded corners to a few things because it makes the design look much better than before.

Improving the Index

I have spent the last couple of weeks improving the index based on feedback about duplicate results. When indexing source code you will find a lot of duplicate libraries and copied code all over the place. Usually when searching for something like jQuery you don’t want to see thousands of results of the same file. However when searching for method calls you do want to see results which call the method in a similar way, as you may be looking for examples of how to use it.

This makes trimming out duplicate content harder than it seems initially. You can’t just throw out any results that look similar, as they may actually be what you are searching for in the latter case.

Initially I was stripping out duplicates based on a hash of each file’s contents. This was particularly effective in stripping out copied files and libraries such as jQuery. Where it did fall down however was any file slightly modified, such as with an additional newline, or a new version with a few bug fixes. With bug fixes and slight modifications occurring all over the place you end up with a lot of duplicate content. As the amount of code indexed increased this became more and more of an issue, to the point that some searches had pages of useless results as each result was 99% the same file.
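To illustrate the problem, an exact hash treats a file with one extra newline as brand-new content, while hashing a whitespace-normalised form of the file does not. A minimal Python sketch of this idea follows; it is my own illustration, not searchcode’s actual code:

```python
import hashlib

def exact_hash(content):
    """Hash the raw file contents; any byte difference gives a new hash."""
    return hashlib.sha1(content.encode()).hexdigest()

def normalised_hash(content):
    # Collapse whitespace differences before hashing, so a file that differs
    # only by blank lines or indentation hashes to the same value.
    normalised = " ".join(content.split())
    return hashlib.sha1(normalised.encode()).hexdigest()

a = "int main() {\n    return 0;\n}\n"
b = "int main() {\n    return 0;\n}\n\n"   # same file plus a trailing newline

assert exact_hash(a) != exact_hash(b)       # exact hash sees two distinct files
assert normalised_hash(a) == normalised_hash(b)
```

Of course normalisation only catches whitespace-level variants; a version with actual bug fixes still hashes differently, which is why the runtime checks described below are also needed.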

The example presented to me was a search for glGetFragDataLocation. The file glop.cpp would be repeated over and over again in multiple variations of what was fundamentally the same thing. At the time I was working on a solution to the issue but it was very nice to get a concrete example.

There are two methods of removing duplicate content in a search. The first is to pre-process it (like the hash that was already in use) and the second is to do it at runtime. The former cuts down on processing time, while the latter is a little more flexible as you can tweak your algorithm on the fly and code around edge cases. The recommendation of Anna Patterson of Cuil fame is to sort out duplicate content at runtime.

My personal experience in this sort of thing however favours a hybrid approach. Pre-processing has advantages such as less overhead when serving results, and since I don’t have infinite resources it’s useful to cut down the amount of work required at runtime where possible.

The long and short of this is that you should notice queries returning gradually better results from this point. The hash checking still remains (cheap and fast) but I have expanded this to include duplicate content checks over results. This is done through a combination of a backend process running checks over content, identifying and marking content as being a duplicate, and checks when serving up the results to identify which results should be stripped out. Have a spin and let me know how it works out for you.

Growing Pains

When I started, the amount of hardware required was fairly low. I did a lot of processing on other machines and pushed the results through, allowing the machine that serves the site to just focus on the job of delivering results. The result was everything running on a nice VPS with a 2.9 GHz CPU, 512 MB of RAM and about 30 GB of hard disk space. This was more than enough to support the index I had and do everything required. Then Google dropped Code Search and I decided I would try to fill in the gap.

My initial implementation, which is live now, has about 1.2 billion lines of code indexed. As you can imagine this pushed my little VPS to its limits. With some compression I got the indexed code itself reduced to about 12 GB on disk, but the index was another story: it was well over 60 GB in size. I upped the size of the hard disk by 200 GB as a stopgap measure.

However the previous change was just a stopgap. As of last night I have greatly increased the size of the index by a factor of 2-3x (final counts to come). This naturally increases the size of things on disk and adds some additional strain to the CPU and RAM. Because I need massive amounts of storage, CPU, RAM and network traffic to continue to expand the index, I have moved over to a dedicated server with 32x the RAM, 8x the CPU and 15x the hard disk space. This will also allow me to start with real-time index updates in the coming weeks, which should keep things running smoothly.

Anyway take it for a spin. The index is far larger now with more projects indexed and the speed should be much faster across the board.

Updates Coming Soon

A few updates should be rolling out soon, with the first being a big jump in the amount of open source/free software code being indexed. At last check there was about 1.1 billion lines of code indexed. This should more than double in the next release to well over 2 billion. When I have an exact count I will publish it here.

Secondly, some massive speed increases are on the way. This will mostly be due to beefing up the server that it’s running on (the subject of another, yet to be written, blog post). Finally I am working on adding a tree view of projects into the mix so you can navigate between files in the same project.

I do have a backlog of requests from people which are yet to be worked on. The most important as far as I can see is extending the regex search to work over multiple lines, as currently it’s limited to individual lines. I have this working in a branch but need to fix some performance issues before rolling it out to the public at large.

Errors in Search

EDIT – This has now been resolved. All the below searches should work correctly, with the exception of XCompositeGetOverlayWindow. I am adding that to the index to be refreshed sometime in the next month or so.

Well, thanks to some sample searches being thrown against the codesearch index I can finally start tuning the issues that have cropped up. The main issue I have currently is searches returning no results where you would expect some. Some examples are included below.

/com\.google\.gwt.*A/ ext:java
/sql\.append.*se/ ext:java
/com\.google\.gwt.*AsyncCallback/ ext:java

Most of the problems I found occurred during my outage window, which is good to discover. With the exception of the last query above, which returns nothing because the file has not been indexed yet, all issues are due to an undiscovered bug.

Take the following example,

/memcpy.*sizeof/
The problem is the way in which I interpreted the regex. Essentially on the back-end a lot of the regex is expanded out fully for a certain number of cases and a list of matches is generated. This is to ensure that it can run quickly. Think of it as precaching every possible regex against all lines in all the files. The problem in this case is that I had a unit test case missing. I never included a test which matched the above, and because of this there is a bug in the way it is expanded out to match everything. A modified version of the above which does return results is,

/memcpy.* sizeof/

The above query returns results as would be expected. In fact all of the ones above can be rewritten to work correctly. Such as the below,

/com\.google\.gwt.* A/ ext:java
/sql\.append.* se/ ext:java
/memcpy.* sizeof/
/com\.google\.gwt.* AsyncCallback/ ext:java

If you try the above queries you will see they act as expected. I will be updating the code shortly to take this case into consideration, and of course post an update here.
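As an aside, the “expanded out fully” idea can be illustrated for the simple case of character classes, which is where queries like [cb]at come from. Unbounded operators like .* cannot be enumerated, which is why the literal pieces around them (memcpy, sizeof) matter. Below is a toy Python sketch; it is my own illustration, not the actual searchcode expansion code:

```python
from itertools import product
import re

def expand_classes(pattern):
    """Enumerate the literal strings matched by a regex built only from
    plain characters and simple character classes, e.g. [cb]at -> cat, bat."""
    parts = []
    for cls, ch in re.findall(r"\[([^\]]+)\]|(.)", pattern):
        # Each character class contributes its members; each plain
        # character contributes itself.
        parts.append(list(cls) if cls else [ch])
    return ["".join(p) for p in product(*parts)]

assert expand_classes("[cb]at") == ["cat", "bat"]
assert expand_classes("gr[ae]y") == ["gray", "grey"]
```

Once patterns are expanded into literals like this, matching reduces to fast exact lookups, which is the trade-off the precaching approach above relies on.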