searchcode plexus

Plexus: “A combination of interlaced parts; a network.”

For a while I have been neglecting searchcode.com while working on searchcode server. This is of course not an ideal situation and as such I have started working on it again. The following is just a brief list of things I am considering, problems, issues etc…

So back when I first started working on searchcode I wrote it using PHP. For searchcode next (the current version) I rewrote it using Python and Django. When I chose that stack I had intended to continue using it for the foreseeable future; I quite like writing Python code and at the time the performance was acceptable. When I started writing searchcode server however I wanted to make it as easy to install as possible. The reason being that I considered it a competitive advantage to have users install searchcode without any pain and be off and running. At the time I considered writing it in Go or Rust but had little experience with either language. I decided to pull on my sensible pants and chose Java, a language I had worked with before and one with excellent library support. With version 8 Java had become considerably less painful to work with.

This of course means I had two code bases which do roughly the same thing: the Python/Django codebase and the Java one. I was porting code from Python into Java and then improving it where possible without backporting. At this point I can confidently say that the Java codebase is much better, with additional performance (not due to Java itself) and bug fixes. It’s also far more malleable, with the static typing making refactoring a breeze. As such I have started to look at moving searchcode.com over to use the same code that searchcode server uses.

This is not without its faults. There is a lot of code to port over and they both work in very different ways.

Take for example the indexing pipeline.

In searchcode.com the indexing pipeline is a bunch of cobbled together Python scripts that do some analysis on code but mostly call out to external programs such as cloc and sloccount to determine what language is being used and to get other information. The current pointer to a repository is kept in the database and incremented every time the job is run so it picks up the next repository. There are a few instances of the program running and it runs constantly in the background at the lowest nice value.

By contrast searchcode server works around background jobs. There are multiple jobs that perform different tasks. One fires off every 10 minutes adding repositories to be indexed. Another picks up those repositories, checks them out and processes them before putting the resulting files on a queue. Another picks up items from that queue and then indexes them. They run at low thread priority with checks to ensure that they don’t impact search performance.

To move from the former to the latter effectively requires a rewrite. This is because on searchcode.com the code is stored in a central database whereas searchcode server keeps a local copy at all times. Also, searchcode server performs all the code analysis itself using its own database of languages and other tricks.

Recently I went through a large clean up of the searchcode server codebase. In doing so I kept in the back of my mind that I should structure things such that I could move searchcode.com over to use it. As such I decoupled where appropriate, merged where it made sense and effectively laid the groundwork for the work to come. With that release pushed out I am now starting to look at once again rewriting a large chunk of searchcode.com so that it shares the same codebase.

So, as a list, here are the things I want to improve when doing this.

Indexing speed
Currently the process takes literally weeks for a project to be indexed. I want to cut this down to as short a time as possible. The slowness is due to a few reasons. The first is that I could only run a single process on each server I had; while it was possible to run multiple processes I noticed this ran into issues such as overusing resources. With tight integration into the codebase I can have the processing throttle back when more important work, such as a user performing a search, is running. Another issue was that the external programs I was using would occasionally time out or hang. By moving that code internally I can improve searchcode server as a product and have greater control.

Real Time Index
When I first started searchcode.com I was using Sphinx with an index + delta scheme for updates. I moved this over to a periodic full index due to performance issues. This means that anything added to the database can take a week before it is searchable. So even if I had fixed the indexing speed issues I would still be hobbled by how long it takes to update the index. By using Sphinx real-time indexes I can speed this up and have things available to search the moment they are ready. It also means that deletes, which happen when people accidentally publish the wrong thing, can be cleaned up more effectively.
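As a rough illustration of what the real-time approach buys, here is a minimal sketch of keeping a Sphinx RT index up to date over SphinxQL (the MySQL protocol Sphinx listens on, port 9306 by default). The index name and fields are made up for illustration; the real schema is obviously different.

<?php
// Minimal sketch of keeping a Sphinx real time (RT) index up to date over
// SphinxQL. The index name "code_rt" and its fields are made up.
$sphinx = new mysqli('127.0.0.1', '', '', '', 9306);

$id       = 42;
$filename = $sphinx->real_escape_string('Indexer.java');
$content  = $sphinx->real_escape_string('public class Indexer { }');

// REPLACE INTO either inserts the document or overwrites an existing one with
// the same id, so new or updated files become searchable immediately.
$sphinx->query("REPLACE INTO code_rt (id, filename, content) VALUES ($id, '$filename', '$content')");

// Deletes (for example when someone publishes the wrong thing) are a single
// statement rather than waiting for the next full index rebuild.
$sphinx->query("DELETE FROM code_rt WHERE id = $id");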

Language Support
The current version of searchcode.com has some hacks on top of the Perl command line application cloc to identify languages beyond what cloc supports. It works well enough but it’s nice to be able to control this. Since the language identification is built into searchcode server this is a rather nice bonus.

Performance
While the majority of queries and requests are returned fairly quickly, some outliers take multiple seconds and some time out totally. I did some profiling of the queries themselves and it appears that the majority return in under a second. This suggests that most of the slowness comes from the parts outside of Sphinx. One thing I was conscious of when writing searchcode server was to ensure that it was fast to process results, and a lot of optimisations went into the code to help with this. This was actually the bulk of the code that I did not want to backport into the existing Python codebase. There is still performance work to be done, but I feel that by having a unified codebase with large amounts of usage it will be easier to catch these edge cases and resolve them more quickly.

Additional Filters
I want to add additional filter types to the frontend. Things that would be nice include the file owner, i.e. who was the last person to touch the file. Other things such as the cyclomatic complexity of the file would also be useful.

With all the above in mind the work has already started. You can view the progress on https://github.com/boyter/searchcode-server/tree/searchcode and while there is still a long way to go I already have indexing working along with search and a few other things. Still a long journey ahead, but it’s nice to see it coming along. I will be posting more updates in the future, so watch this space!

What is special about DDG

Since I am still bringing all my content together I thought I would pull in this post from Quora asking what is special about DuckDuckGo.

1. Privacy enabled by default. This certainly helped it get traction when the NSA revelations came around. DDG is not the only privacy conscious search engine but it is certainly one that pushes it as a feature more than others. See https://duckduckgo.com/privacy

2. !bang syntax. Remember back in the early days of Google when they had a “Try this search on” link and a list of search engines? !bang is that idea on steroids. It makes the cost of switching to DDG much lower than for any other search engine because you are not locked in when its results are lacking.

3. Gabriel Weinberg (the creator) came up with a way to index the web for a fraction of the cost of anyone else, i.e. use someone else’s index through web APIs such as Bing/Yahoo Boss. This means DDG can have an index of billions of pages without buying hundreds of machines and then crawling and indexing. Consider Cuil as an example. BTW I wrote more about this here: Building a search engine? The most important feature you can add.

4. Persistence. Quite a few search engines based on Yahoo Boss and other API’s have come and gone, however DDG continues to be worked on. Just being around for 4 years gives it credibility.

5. DuckDuckHack. If you are a developer you can go to DuckDuckHack and add functionality you want. This may not sound that compelling, but because DDG already has traffic it’s a good incentive for start-ups and others to build on the DDG API to get some traction, which means they want to use DDG and promote it, which fuels growth.

6. People. The people working on DDG are pretty awesome.

7. Uncluttered results. The results are pretty much just some links without too much fancy stuff going on.

Estimating Sphinx Search RAM Requirements

If you run Sphinx Search you may want to estimate the amount of RAM it requires in order to pre-cache the index. This can be done by looking at the size of the spa and spi files on disk. On any Linux system you can run the following command in the directory where your Sphinx indexes are located.

ls -la /SPHINXINDEX/|egrep "spa|spi"|awk '{ SUM += $5 } END { print SUM/1024/1024/1024 }'

This will print out the number of gigabytes required to hold the Sphinx index in RAM and is useful for working out when you need to either upgrade the machine or scale out. In my experience it tends to be accurate to within 200 megabytes or so.

Installing Phindex

This is a follow on piece to my 5 part series about writing a search engine from scratch in PHP which you can read at http://www.boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/

I get a lot of email requests asking how to set up Phindex on a new machine and start indexing the web. Since the articles and code were aimed at someone with a degree of PHP knowledge this is somewhat understandable. What follows is how to set things up and start crawling and indexing from scratch.

The first thing to do is set up some way of running PHP and serving pages. The easiest way to do this is to install Apache and PHP. If you are doing this on Windows or OSX then go and install XAMPP https://www.apachefriends.org/index.html and for Linux follow whatever guide applies to your distribution. Be sure to follow the directions correctly and verify that you can create a file with phpinfo(); inside it which runs in your browser correctly.

For this I am using Ubuntu Linux and all folder paths will reflect this.

With this set up, the next thing to do is create a folder where we can place all of the code we are going to work with. I have created a folder called phindex which I have made sure I can edit and write files inside.

Inside this folder we need to unpack the code from github https://github.com/boyter/Phindex/archive/master.zip

boyter@ubuntu:/var/www/phindex$ unzip master.zip
Archive:  master.zip
2824d5fa3e9c04db4a3700e60e8d90c477e2c8c8
   creating: Phindex-master/
.......
  inflating: Phindex-master/tests/singlefolderindex_test.php
boyter@ubuntu:/var/www/phindex$

At this point everything should be running, however as nothing is indexed you won’t get any results if you browse to the search page. To resolve this without running the crawler, download the following https://www.dropbox.com/s/vf4uif4yfj8junf/documents.tar.gz?dl=0 and unpack it into the crawler directory.

boyter@ubuntu:/var/www/phindex/Phindex-master/crawler$ tar zxvf documents10000.tar.gz
......
boyter@ubuntu:/var/www/phindex/Phindex-master/crawler$ ls
crawler.php  documents  documents10000.tar.gz  parse_quantcast.php
boyter@ubuntu:/var/www/phindex/Phindex-master/crawler$

The next step is to create two folders. The first is called “document” and the second “index”. These are where the processed documents will be stored and where the index will be stored. Once these are created we can run the indexer. The folders need to be created in the root folder like so.

boyter@ubuntu:/var/www/phindex/Phindex-master$ ls
add.php  crawler    index       README.md   tests
classes  documents  interfaces  search.php
boyter@ubuntu:/var/www/phindex/Phindex-master$

With that done, let’s run the indexer. If you cannot run PHP from the command line, just browse to the PHP file using your browser and the index will be built.

boyter@ubuntu:/var/www/phindex/Phindex-master/$ php add.php
INDEXING 1
INDEXING 2
.....
INDEXING 10717
INDEXING 10718
Starting Index
boyter@ubuntu:/var/www/phindex/Phindex-master/$

This step is going to take a while depending on how fast your computer is. What’s happening is that each of the crawled documents is processed, saved to the document store, and then finally each of the documents is indexed.

At this point everything is good. You should be able to perform a search by going to the search page, like so,

Phindex Screenshot

At this point everything is working. I would suggest you start looking at the code under the hood to see how it all fits together. Start with add.php, which gives a reasonable idea of how to read the crawled documents and how to index them. Then look at search.php to get an idea of how to use the created index. I will be expanding on this guide over time based on feedback but there should be enough here for you to get started.

Quora answer about writing a search engine

The following I posted on Quora in response to the question “I am planning to make a small scale search engine on my local system, but I don’t know from where to start?”. It’s a reasonable answer so like my Mahalo one I thought I would make a copy for myself.

I agree with Wolf Garbe that in your case you are better off starting with existing technologies; have a look at http://yacy.net/ and SphinxSearch as well. However, if you are doing this to learn and not just to deliver a product, I can provide a few links for you.

For your specific questions,

1. How do I use hashing for efficient search operation ?

You are talking about having an inverted index I suspect. Have a look at the articles above which discuss the inverted index. Keep in mind you have options here, such as an inverted index or a full inverted index. The latter is useful if you want to do things like proximity searches and the like. For hashing itself,

Which hashing algorithm is best for uniqueness and speed?

Be careful when using hashes with URL’s. While the square root of the number of possible hashes (see We Worship MD5, the GOD of HASH) is still a lot bigger than the current web size, if you do get a collision you are going to get pages about Britney Spears when you were expecting pages about Bugzilla. Look into using bloom filters to avoid these issues (assuming you get to sufficient scale).
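To make the inverted versus full inverted distinction concrete, here is a tiny sketch with two made-up documents; the full form stores term positions, which is what makes proximity and phrase searches possible.

<?php
// Two made-up documents:
// doc 1: "the quick brown fox"
// doc 2: "the lazy brown dog"

// Inverted index: term => list of document ids containing it.
$inverted = array(
    'brown' => array(1, 2),
    'quick' => array(1),
    'lazy'  => array(2),
);

// Full inverted index: term => document id => positions within that document.
// The extra position information is what lets you answer proximity queries.
$fullInverted = array(
    'brown' => array(1 => array(3), 2 => array(3)),
    'quick' => array(1 => array(2)),
    'lazy'  => array(2 => array(2)),
);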

2. How I will manage the data and ?

Up to you. For small scale I would just use whatever database you are most familiar with. Any SQL database will scale up to hundreds of millions of records without too many issues.
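As an illustration of that point, here is a minimal sketch of a document store using SQLite through PDO. The table layout is made up and any relational database you are comfortable with would do just as well.

<?php
// Minimal document store sketch using SQLite via PDO. The schema is made up;
// swap in MySQL/PostgreSQL by changing the DSN.
$db = new PDO('sqlite:documents.db');
$db->exec('CREATE TABLE IF NOT EXISTS documents (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    url     TEXT,
    title   TEXT,
    content TEXT
)');

$insert = $db->prepare('INSERT INTO documents (url, title, content) VALUES (?, ?, ?)');
$insert->execute(array('http://example.com/', 'Example Domain', 'Example page content goes here'));

// The indexer only ever needs to pull documents back out by id.
$fetch = $db->prepare('SELECT url, title, content FROM documents WHERE id = ?');
$fetch->execute(array(1));
print_r($fetch->fetch(PDO::FETCH_ASSOC));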

3. How my searching algorithm would work ?

This is up to you as well. You are the one in control here. Assuming you want to get something up and running as soon as possible I would do the following (a rough sketch of the indexing and search steps follows after this list).

Write a simple crawler and start crawling. (for url; get url; find urls;) is about all you need. For seeding use Wikipedia’s data dump, the Alexa top lists or DMOZ data.

Build a simple inverted index indexer and index as you go. Limit your index to small portions of text (title, meta tags etc.) for the moment till you get the kinks ironed out. If your indexer is not using 100% of the CPU, rethink your approach as it is wrong.

Build a simple ranker (just rank on the number of matching words in a document for the moment). DO NOT DO PAGE RANK! This step will save you a lot of time while getting everything else working.

Build it by default to be an OR engine (this saves you writing a query parser or
working out how to intersect two 10 million document lists quickly).

Be sure to use a stemmer; see the following Stemming Algorithm. Implement a fairly large list of stop words and ignore anything less than 3 characters in length.
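Here is the rough sketch promised above. The stop word list, the minimum term length and the documents are all made up, there is no stemming, and the ranking is just a raw term count, but it shows the shape of the indexer and the OR engine.

<?php
// Toy indexer and OR engine. Everything here is made up for illustration.
$stopWords = array('the', 'and', 'for', 'over', 'with');
$documents = array(
    1 => 'The quick brown fox jumps over the lazy dog',
    2 => 'A lazy afternoon with a quick coffee and a book',
);

// Build the inverted index: term => document id => number of occurrences.
$index = array();
foreach ($documents as $id => $text) {
    $terms = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($terms as $term) {
        if (strlen($term) < 3 || in_array($term, $stopWords)) {
            continue; // drop stop words and anything under 3 characters
        }
        $index[$term][$id] = isset($index[$term][$id]) ? $index[$term][$id] + 1 : 1;
    }
}

// OR engine: a document matches if it contains any query term; score by count.
function orSearch(array $index, $query) {
    $scores = array();
    $terms = preg_split('/[^a-z0-9]+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($terms as $term) {
        if (!isset($index[$term])) {
            continue;
        }
        foreach ($index[$term] as $id => $count) {
            $scores[$id] = isset($scores[$id]) ? $scores[$id] + $count : $count;
        }
    }
    arsort($scores); // highest scoring documents first
    return $scores;
}

print_r(orSearch($index, 'quick dog')); // doc 1 scores 2, doc 2 scores 1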

The above should be enough to occupy you for several weeks at least.

Here is a link to a collection of articles on how to start building a search engine.

Want to write a search engine? Have some links

I have copied the article below, but the above link I tend to update from time to
time as I find new articles.

PHP Search Engine – Yioop!

This one is fairly fresh and talks about building and running a general purpose
search engine in PHP.

About Us – Gigablast

This has been defunct for a long time now but is written by Matt Wells (Gigablast and Procog) and gives a small amount of insight into the issues and problems he worked through while writing Gigablast.

Why Writing Your Own Search Engine Is Hard

This is probably the most famous of all search engine articles with the exception of the original Google paper. Written by Anna Patterson (Cuil) it really explores the basics of how to get a search engine up and running from crawler to indexer to
serving results.

A Conversation with Matt Wells

A fairly interesting interview with Matt Wells (Gigablast and Procog) which goes into some details of problems you will encounter running your own search engine.

Building a Search Engine

This has a few articles written about creating a search engine from scratch. It
appears to have been on hold for years but some of the content is worth reading. If nothing else it’s another view of someone starting down the search engine route.

blekko | spam free search

Blekko’s engineering blog is usually interesting and covers all sorts of
material applicable to search engines.

http://www.boyter.org/2013/01/co…

This is a shameless plug but I will even suggest my own small implementation. It’s essentially a walkthrough of writing a search engine in PHP. I implemented it and it worked quite well with 1 million pages, serving up reasonable results. It covers everything you want: crawling, indexing, storing and ranking, with articles explaining why I did certain things and full source code here Phindex

The Anatomy of a Search Engine

The granddaddy of search papers. It’s very old but outlines how the original version of Google was designed and written.

open-source-search-engine

Gigablast, mentioned above, has since become an open source project hosted on GitHub. Personally I am yet to look through the source code but you can find how to run it on the developer page and administration page.

High Scalability – DuckDuckGo Architecture – 1 Million Deep Searches a Day and Growing
High Scalability – The Anatomy of Search Technology: blekko’s NoSQL database
High Scalability – Challenges from large scale computing at Google
High Scalability – Google’s Colossus Makes Search Real-time by Dumping MapReduce
High Scalability – The Three Ages of Google – Batch, Warehouse, Instant

The above are fairly interesting. The blekko one is the most technical. If you only have time to read one go with the blekko one.

Another thing you might want to consider is looking through the source of existing indexing engines like Solr and Sphinx. I am personally working through the initial version of the Sphinx engine and will one day write a blog post about how it works.

Here are a few other links (disclaimer: I wrote all of these) showing how to implement the vector space model (a good thing to start with as it does ranking for you).

Vector Space Search Model Explained
Building a Vector Space Indexing Engine in Python
GoLang Vector Space Implementation
C# Vector Space Implementation

and here is a much better article which explains the math behind it,

Page on La2600
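Since the vector space model comes up a few times in those links, here is a minimal sketch of the core idea in plain PHP: turn documents and queries into term count vectors and compare them with cosine similarity, which gives you ranking for free. It is term frequency only, with no idf weighting or stemming, and is only an illustration rather than a copy of any of the linked implementations.

<?php
// Turn a piece of text into a term => count vector.
function vectorise($text) {
    $vector = array();
    foreach (preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) as $term) {
        $vector[$term] = isset($vector[$term]) ? $vector[$term] + 1 : 1;
    }
    return $vector;
}

// Cosine similarity between two term vectors: 1 means identical, 0 means no overlap.
function cosine(array $a, array $b) {
    $dot = 0;
    foreach ($a as $term => $count) {
        if (isset($b[$term])) {
            $dot += $count * $b[$term];
        }
    }
    $magA = sqrt(array_sum(array_map(function ($c) { return $c * $c; }, $a)));
    $magB = sqrt(array_sum(array_map(function ($c) { return $c * $c; }, $b)));
    return ($magA && $magB) ? $dot / ($magA * $magB) : 0;
}

$query = vectorise('search engine ranking');
echo cosine($query, vectorise('how to write a search engine with simple ranking')), PHP_EOL; // ~0.58
echo cosine($query, vectorise('a recipe for banana bread')), PHP_EOL;                         // 0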

For snippet extraction I have another article here,

Building a Search Result Extract Generator in PHP

For crawling here is another one,

Why Writing a Web Crawler isn’t Easy

Lastly, if you do go ahead and write your own search engine please write blog posts or articles about it. It’s quite hard to find this sort of information, especially from the big boys (Google, Bing, Yandex, Yahoo) and I would love to see more articles about it.

Gigablast Acquired and Code Posted

Interestingly it seems that Matt Wells’ search engine Gigablast has been acquired by Yippy.com [1] [2] [3] (demo here http://demo.yippy.com/). Gigablast has always been one of my favorite search engines simply because it is so interesting: started by a single guy, with an interesting blog, and one of the last true new indexes of the web, it was always worth a look. While it’s sad to see it go this way I am happy that Matt has presumably been able to cash out on his creation. Well done to him. I must admit Gigablast had been going downhill for a while and this might explain why ProCog appeared and then vanished so quickly.

The site is still running at this point (no idea how long that will last) but in a totally unexpected move it is also now open source (I was not sure if it counted as free software since no license was posted, however see the comments: Philippe Ombredanne has identified that it’s under the Apache License https://github.com/gigablast/open-source-search-engine/blob/master/LICENSE). You can download the source on GitHub https://github.com/gigablast/open-source-search-engine. I have not had a chance to go through it yet, and the folder structure leaves much to be desired, but the code is there for review.

Building a Search Result Extract Generator in PHP

During some contracting I was doing recently there was a requirement to implement some search logic using only PHP. There are no issues with that, but it turns out I couldn’t find a decent extract generator handy, as usually one would just plug into the search engine’s provided version to do this.

Off the top of my head I could only think of one example I was aware of, which lives in Sphider (for the record it lives in searchfuncs.php from line 529 to 566). Sadly it has a few issues. Firstly the code is rather difficult to understand, and more importantly it usually has accuracy issues. A quick search turned up this link http://stackoverflow.com/questions/1436582/how-to-generate-excerpt-with-most-searched-words-in-php on StackOverflow. The second answer looked promising but it’s even more difficult to understand, and a bit of profiling showed some performance issues with all of the regex going on in there.

Since I couldn’t find a solution I was happy with I naturally decided to write my own. The nice thing about reinventing the wheel is you can get a round one. The algorithm is fairly simple,

1. Identify all the matching word locations.
2. Work out a section of text that best matches the terms.
3. Based on the snip location we trim around the string, ensuring we don’t skip whole words and don’t remove the last or first word if that’s the actual match (a rough sketch of this follows below).
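Here is a stripped down sketch of those three steps. The window size and the amount of leading context are arbitrary numbers, and the real code on GitHub is far more careful about word boundaries than this is.

<?php
function extractSnippet($text, array $terms, $window = 220) {
    // 1. Identify all the matching word locations.
    $locations = array();
    foreach ($terms as $term) {
        $offset = 0;
        while (($pos = stripos($text, $term, $offset)) !== false) {
            $locations[] = $pos;
            $offset = $pos + 1;
        }
    }
    if (empty($locations)) {
        return substr($text, 0, $window); // no matches, fall back to the start
    }
    sort($locations);

    // 2. Work out which location has the most matches within one window of it.
    $best = $locations[0];
    $bestCount = 0;
    foreach ($locations as $i => $start) {
        $count = 0;
        foreach (array_slice($locations, $i) as $other) {
            if ($other - $start <= $window) {
                $count++;
            }
        }
        if ($count > $bestCount) {
            $bestCount = $count;
            $best = $start;
        }
    }

    // 3. Trim around that location, snapping the start to the next space so
    //    the first word is not cut in half.
    $start = max(0, $best - 30);
    if ($start > 0 && ($space = strpos($text, ' ', $start)) !== false && $space < $best) {
        $start = $space + 1;
    }
    return ($start > 0 ? '..' : '') . substr($text, $start, $window);
}

echo extractSnippet('Welcome to Yahoo!, the most visited home page on the web.', array('yahoo'));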

Sounds good in theory, but let’s see the results.

Sample Text

“Welcome to Yahoo!, the world’s most visited home page. Quickly find what you’re searching for, get in touch with friends and stay in-the-know with the latest news and information. CloudSponge provides an interface to easily enable your users to import contacts from a variety of the most popular webmail services including Yahoo, Gmail and Hotmail/MSN as well as popular desktop address books such as Mac Address Book and Outlook.”

Search Term “yahoo and outlook”

Sphider Snippet
“get in touch with friends and stay in-the-know with the latest news and information. CloudSponge provides an interface to easily enable your users to import contacts from a variety of the most popular webmail services including Yahoo, Gmail and Hotmail/MSN as well as popular desktop address books such as Mac Address Book and”

Stackoverflow Snippet
“Welcome to Yahoo!, the world’s most visited home page. Quickly find what you’re searching for, get in touch with friends and stay in-the-know with the latest news and information. CloudSponge provides an interface to easily enable your users to import contacts from a variety of the most…”

My Snippet
“..an interface to easily enable your users to import contacts from a variety of the most popular webmail services including Yahoo, Gmail and Hotmail/MSN as well as popular desktop address books such as Mac Address Book and Outlook.”

I consider the results to be equally good in the worst case and better in most cases I tried. I also tried each over much larger portions of text, and both the Sphider and Stackoverflow versions seemed to produce either nothing relevant or to miss what I thought was the best match.

As always the code is on GitHub.

Why Code Search is Difficult

I was chatting with a colleague the other day and he was asking me why code search is a difficult problem. After all, it’s not quite as dynamic as the web, so it should be easier to index.

In truth it’s a mixed bag. Some things like crawling are easy, others such as indexing are much harder than you would think. I thought I would write down why this is.

Thankfully crawling is one of the things that you really don’t have to put any work into. Once you have a list of sources (repository locations) you can let source control do the downloading for you. Just clone/checkout the code to disk and you are done. The only exception to this is getting the list of projects; you can crawl the websites (not advisable) but emailing the website owners usually gives you a hook into their data. The best thing about crawling this way is that refreshing your data is easy, just update your source or checkout/clone it again.

Indexing is where things get difficult. The first problem is knowing how to split terms. Most index engines are designed to work with words, and when it comes to words it’s fairly easy to determine what is considered a term by looking at spaces. In code this is far more difficult. This is best illustrated with an example. Take the following search term,

i++

and then consider the following code snippets which should all match,

for(i=0; i++; i<100) {
for(i=0;i++;i<100) {
spliti++;

How do you split them into terms? Splitting by spaces results in the following,

for(i=0; 
i++; 
i<100) 
{
for(i=0;i++;i<100) 
{
spliti++;

Using a naive split on spaces you will notice that none of the above actually match. Hence a search for i++ on the above will result in no matches. There are a few ways to get around this, but suffice to say it’s not as straightforward as just applying Lucene/Sphinx/Solr over the data.
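One simple way around it, purely as an illustration and not how searchcode actually does it, is to index each raw space separated token as well as the pieces inside it, keeping characters such as + so that i++ survives as its own term.

<?php
// Index each raw token plus the pieces inside it so that both
// "for(i=0;i++;i<100)" and "i++" become searchable terms. Illustrative only.
function codeTerms($line) {
    $terms = array();
    foreach (preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        $terms[] = $token; // the token exactly as written
        // also split on anything that is not a letter, digit or +
        foreach (preg_split('/[^a-zA-Z0-9+]+/', $token, -1, PREG_SPLIT_NO_EMPTY) as $piece) {
            $terms[] = $piece;
        }
    }
    return array_unique($terms);
}

print_r(codeTerms('for(i=0;i++;i<100) {'));
// Among the terms: "for(i=0;i++;i<100)", "for", "i", "0", "i++", "100"
// so a search for i++ now has something to match against.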

Compare the above search on the three large code search engines: searchcode, GitHub and Ohloh.

Another issue to consider is that most indexers will not index special characters such as !@#$%^&*()_+[]>< and these are desirable searches when looking for source code. Once you realise someone might want to search for List<Object> you realise just how difficult splitting the text and indexing the characters can be.

One other issue that comes to mind is that because of the above problems your index is going to be absolutely massive. searchcode’s index on disk is actually 3x the size of the data it’s searching over. This isn’t a huge problem (disk space is cheap) but it means you hit scaling issues far sooner than you would indexing normal text.

None of the above are insurmountable problems, but certainly something to keep in mind if you think indexing is straight forward.

Want to write a search engine? Have some links

A recent comment I left on Hacker News managed to get quite a lot of up-votes which surprised me since it was effectively just a collection of links about search engines. You can read the full thread at http://news.ycombinator.com/item?id=5129530

Anyway, since it did so well I thought I would flesh it out with some more information. Here is a collection of posts/blogs/discussions which go into the details of how to write a search engine.

http://blog.algolia.com/search-ranking-algorithm-unveiled/

Algolia is a search as a service provider which has this blog post discussing the ranking algorithm they use.

http://www.yioop.com/blog.php

This one is fairly fresh and talks about building and running a general purpose search engine in PHP.

http://www.gigablast.com/rants.html

This has been defunct for a long time now but is written by Matt Wells (Gigablast and Procog) and gives a small amount of insight into the issues and problems he worked through while writing Gigablast.

http://queue.acm.org/detail.cfm?id=988407

This is probably the most famous of all search engine articles with the exception of the original Google paper. Written by Anna Patterson (Cuil) it really explores the basics of how to get a search engine up and running from crawler to indexer to serving results.

http://queue.acm.org/detail.cfm?id=988401

A fairly interesting interview with Matt Wells (Gigablast and Procog) which goes into some details of problems you will encounter running your own search engine.

http://blog.procog.com/

Sadly it appears that this has been shut down and the content is gone. This was a new blog written by Matt Wells (Gigablast) and while there wasn’t much content there I had hopes for it. Matt really does know his stuff and was promoting an open approach to ranking, so it stood to reason there would be more decent content there eventually.

http://www.thebananatree.org/

This has a few articles written about creating a search engine from scratch. It appears to have been on hold for years but some of the content is worth reading. If nothing else it’s another view of someone starting down the search engine route.

http://blog.blekko.com/

Blekko’s engineering blog is usually interesting and covers all sorts of material applicable to search engines.

http://www.boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/

This is a shameless plug but I will even suggest my own small implementation. It’s essentially a walkthrough of a ground-up write of a search engine in PHP. I implemented it and it worked quite well with 1 million pages.

http://infolab.stanford.edu/~backrub/google.html

The granddaddy of search papers. It’s very old but outlines how the original version of Google was designed and written.

https://github.com/gigablast/open-source-search-engine

Gigablast, mentioned above, has since become an open source project hosted on GitHub. Personally I am yet to look through the source code but you can find how to run it on the developer page and administration page.

http://highscalability.com/blog/2013/1/28/duckduckgo-architecture-1-million-deep-searches-a-day-and-gr.html

http://highscalability.com/blog/2012/4/25/the-anatomy-of-search-technology-blekkos-nosql-database.html

http://highscalability.com/blog/2008/10/13/challenges-from-large-scale-computing-at-google.html

http://highscalability.com/blog/2010/9/11/googles-colossus-makes-search-real-time-by-dumping-mapreduce.html

http://highscalability.com/blog/2011/8/29/the-three-ages-of-google-batch-warehouse-instant.html

The above are fairly interesting. The blekko one is the most technical. If you only have time to read one go with the blekko one.

http://blog.saush.com/2009/03/17/write-an-internet-search-engine-with-200-lines-of-ruby-code/

Article about using Ruby to write a small scale internet search engine. Covers crawling as well as indexing using a custom indexer in MySQL.

https://blog.twitter.com/2014/building-a-complete-tweet-index

Article from Twitter about indexing the full history of tweets from 2006. Of note is the information about sharding. Due to the linear nature of the data (over time) they need a way to scale across time. Worth a look.

http://www.ideaeng.com/write-search-engine-0402

The anti “write a search engine” article. Probably worth reading though, in case you feel doing so is going to be easy.

http://lucene.sourceforge.net/talks/pisa/

A talk about the internals of Lucene. Covers some design decisions and shows the architecture that Lucene uses internally.

http://alexmiller.com/the-students-guide-to-search-engines/

Not as technical as the above can be but a good primer which covers quite a lot of history. Worth a read.

Have another one I have missed here? I would love to read it. Please add a link in the comments below.

Code a Search Engine in PHP Part 5

Are you really interested in learning how to Write a Search Engine from Scratch? Click this link and register your interest for my book about how to do so.

This is part 5 of a 5 part series.

Part 1 – Part 2 – Part 3 – Part 4 – Part 5 – Downloads/Code

So we need to convert the indexer to a method that won’t consume as much memory. Looking at how it works now we can determine a few areas that could be improved before we implement our new method.

	public function index(array $documents) {
		$documenthash = array(); // so we can process multiple documents faster

		foreach($documents as $document) {
			$con = $this->_concordance($this->_cleanDocument($document));
			$id = $this->documentstore->storeDocument(array($document));

			foreach($con as $word => $count) {
				if(!$this->_inStopList($word)) {
					if(!array_key_exists($word,$documenthash)) {
						$ind = $this->index->getDocuments($word);
						$documenthash[$word] = $ind;
					}

					// Rank the Document
					$rank = $this->ranker->rankDocument($word,$document);

					if(count($documenthash[$word]) == 0) {
						$documenthash[$word] = array(array($id,$rank,0));
					}
					else {
						$documenthash[$word][] = array($id,$rank,0);
					}
				}
			}
		}

		foreach($documenthash as $key => $value) {
			usort($value, array($this->ranker, 'rankIndex'));
			$this->index->storeDocuments($key,$value);
		}
		return true;
	}

The above is our index method. The first thing I noticed is that during the main foreach over the documents we can free memory by unsetting each document after it has been processed. Doing this is a fairly simple change; we just change the foreach loop to the following,

	$cnt = count($documents); // count once, since we unset elements as we go
	for($i = 0; $i < $cnt; $i++) {
		$document = $documents[$i];
		unset($documents[$i]); // free the original array entry immediately

We are now unsetting each document as we go, which should free some memory as we go along. The next thing to do is convert it over to storing the index on disk rather than in memory. I was thinking about this for a while and came up with the following idea: for each new word we find, we load the existing index from disk, flush it to a scratchpad file, append the new details and keep track of which flush files exist. If it’s an existing word we simply append to its file. After we are done we loop over each flush file, read the whole thing into memory, sort it and then save it into the index. Keep in mind you cannot actually keep the file pointers open at all times and just append, as there is usually a limit on how many open files you can have. Below is the final implementation,

	public function index(array $documents) {
		if(!is_array($documents)) {
			return false;
		}

		$indexcache = array(); // So we know if the flush file exists

		foreach($documents as $document) {
			// Clean up the document and create stemmed text for ranking down the line
			$con = $this->_concordance($this->_cleanDocument($document));

			// Save the document and get its ID
			$id = $this->documentstore->storeDocument(array($document));

			foreach($con as $word => $count) {

				if(!$this->_inStopList($word) && strlen($word) >= 2) {
					if(!array_key_exists($word, $indexcache)) {
						$ind = $this->index->getDocuments($word);
						$this->_createFlushFile($word, $ind);
						$indexcache[$word] = $word;
					}

					// Rank the Document
					$rank = $this->ranker->rankDocument($word,$document);

					$this->_addToFlushFile($word,array($id,$rank,0));
				}
			}
		}

		foreach($indexcache as $word) {
			$ind = $this->_readFlushFile($word);
			unlink(INDEXLOCATION.$word.INDEXER_DOCUMENTFILEEXTENTION);

			usort($ind, array($this->ranker, 'rankIndex'));
			$this->index->storeDocuments($word,$ind);
		}

		return true;
	}

The _createFlushFile and _addToFlushFile methods are pretty much copies of the methods used in the multifolderindex class. They could presumably be combined at some point, however this is fine for the moment. This takes care of the memory issues, with any luck. The results were… promising. It ended up working and using far less memory than before, but because of the constant disk thrashing it ended up being more disk bound than CPU bound, which is not a good sign when it comes to indexers. This is pretty easy to rectify though: buffer the results to be flushed until they hit some threshold, then dump the lot to disk and start over again.
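For what it is worth, that buffering might look something like the following sketch, wrapped around the existing _addToFlushFile method. The threshold of 1000 entries per word is an arbitrary number I have not tested.

	// Rough sketch of the buffering described above. Entries queue up in
	// memory and are only written once a word has enough of them.
	private $flushbuffer = array();

	private function _bufferedAddToFlushFile($word, array $entry) {
		$this->flushbuffer[$word][] = $entry;

		if(count($this->flushbuffer[$word]) >= 1000) {
			foreach($this->flushbuffer[$word] as $buffered) {
				$this->_addToFlushFile($word, $buffered);
			}
			$this->flushbuffer[$word] = array();
		}
	}

	// Would need to be called in index() before the flush files are read
	// back in, so nothing still sitting in memory is lost.
	private function _flushRemaining() {
		foreach($this->flushbuffer as $word => $entries) {
			foreach($entries as $buffered) {
				$this->_addToFlushFile($word, $buffered);
			}
		}
		$this->flushbuffer = array();
	}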

Fixing the above flaw wouldn’t take too much work, but this project has taken enough time as it is. If someone wants to fork it and fix the above, feel free.

Anyway I tried the above code, and it worked. The full index took about 24 hours to build (due to the issue mentioned before). Searches were reasonably fast and, once I increased the number of documents read from disk, quite accurate. Some sample searches like

“source code hosting”

turned up Bitbucket as the main result. Something else like “microsoft windows” shows the official Windows site at the top, followed by windows.net and various other Microsoft sites.

In fact here are a few queries I tried that I found interesting to play around with, most of which produce fairly decent results.

digital asset management
ebay ireland
antivirus
celebrity gossip
weight loss
microsoft
microsoft windows
youtube converter mp3
domain registration
news for nerds
spam free search
dictionary urban
online tv free
photography review
compare hotel prices
apple mac
the white house
free ebooks for kindle
computer parts shipping
os x
kardashian
radio social network
homes for rent
iron man
wine making
apple picking
php error
rent a car

SAMPLE AND SCREENSHOT

I was going to include a link to a working sample of the site, but have currently run out of time to set it up etc… Setting it up yourself isn’t too difficult however; see the instructions below in the downloads section. Here is a screenshot which shows the whole thing running on my virtual machine.

searchscreenshot

CONCLUSION

Well there it is. As promised, a full search engine (crawler, indexer, document store, ranker) over a limited set of online documents. The initial goals were to create the following,

1. Web crawler, indexer and document storage capable of handling 1 million or so documents.
2. Use test driven development to help enforce good design and modular code.
3. Have the ability to use multiple strategies for things like the index, document store, search etc…
4. Have something I am willing to show off in terms of code.

The first is done without any real issues. The second I did for the first few steps (before life issues got in the way and I started cheating). The third is passed with flying colours. As for the fourth, well nobody is ever happy with the code they are releasing, but this is good enough for the moment.

If you want to run this yourself, check out the downloads below. Download a copy of the code, do some crawling and index away. Feel free to fork the code and modify it to your heart’s content. Some things you might want to consider would include removing the redundant duplicate methods that are hanging around; things like cleanDocument are repeated all over the place and should live in a single place. Updating the code to buffer its flushes to disk as mentioned above would be another good step. So would modifying it to actually use the index’s third int to store metadata about the word, or fixing the indexing to remove HTML correctly, or perhaps parsing out what might be interesting such as bold tags etc…

That’s just a few things off the top of my head that would work.

If you have gotten this far, please check out my current project searchcode which is a source code and documentation search engine, similar to Google Code and Koders. If you want to get in contact you can reach me on Twitter @boyter or email me at bboyte01@gmail.com

All of the downloads mentioned through these articles are here. To get started quickly, download the latest source from GitHub, download the Top 10,000 crawled site data, unpack it to the crawler/documents/ directory and then run add.php. This will index the top 10,000 sites allowing you to start searching and playing around with things.

DOWNLOADS

Source Code – https://github.com/boyter/Phindex Latest Version

Progress Code 1 – PHPSearch_Implementation_1.zip
Progress Code 2 – PHPSearch_Implementation_2.zip
Progress Code 3 – PHPSearch_Implementation_3.zip
Progress Code 4 – PHPSearch_Implementation_4.zip

Top 10,000 Crawled Sites – documents.tar.gz

Quantcast top million list – http://ak.quantcast.com/quantcast-top-million.zip