Want to write a search engine? Have some links

A recent comment I left on Hacker News managed to get quite a lot of up-votes which surprised me since it was effectively just a collection of links about search engines. You can read the full thread at http://news.ycombinator.com/item?id=5129530

Anyway, since it did do so well I thought I would flesh it out with some more information. Here is a collection of posts/blogs/discussions which go into the details of how to write a search engine.

http://blog.algolia.com/search-ranking-algorithm-unveiled/

Algolia is a search-as-a-service provider; this blog post discusses the ranking algorithm they use.

http://www.yioop.com/blog.php

This one is fairly fresh and talks about building and running a general purpose search engine in PHP.

http://www.gigablast.com/rants.html

This has been defunct for a long time now but is written by Matt Wells (Gigablast and Procog) and gives a small amount of insight into the issues and problems he worked through while writing Gigablast.

http://queue.acm.org/detail.cfm?id=988407

This is probably the most famous of all search engine articles with the exception of the original Google paper. Written by Anna Patterson (Cuil) it really explores the basics of how to get a search engine up and running from crawler to indexer to serving results.

http://queue.acm.org/detail.cfm?id=988401

A fairly interesting interview with Matt Wells (Gigablast and Procog) which goes into some details of problems you will encounter running your own search engine.

http://blog.procog.com/

Sadly it appears that this has been shut down and the content is gone. It was a new blog written by Matt Wells (Gigablast) and while there wasn’t much content there I had hopes for it. Matt really does know his stuff and was promoting an open approach to ranking, so it stood to reason there would be more decent content in time.

http://www.thebananatree.org/

This has a few articles written about creating a search engine from scratch. It appears to have been on hold for years, but some of the content is worth reading. If nothing else it’s another view of someone starting down the search engine route.

http://blog.blekko.com/

Blekko’s engineering blog is usually interesting and covers all sorts of material applicable to search engines.

http://www.boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/

This is a shameless plug, but I will even suggest my own small implementation. It’s essentially a walkthrough of a ground-up write of a search engine in PHP. I implemented it and it worked quite well with 1 million pages.

http://infolab.stanford.edu/~backrub/google.html

The granddaddy of search papers. It’s very old but outlines how the original version of Google was designed and written.

https://github.com/gigablast/open-source-search-engine

Gigablast, mentioned above, has since become an open source project hosted on GitHub. Personally I have yet to look through the source code, but you can find out how to run it on the developer and administration pages.

http://highscalability.com/blog/2013/1/28/duckduckgo-architecture-1-million-deep-searches-a-day-and-gr.html

http://highscalability.com/blog/2012/4/25/the-anatomy-of-search-technology-blekkos-nosql-database.html

http://highscalability.com/blog/2008/10/13/challenges-from-large-scale-computing-at-google.html

http://highscalability.com/blog/2010/9/11/googles-colossus-makes-search-real-time-by-dumping-mapreduce.html

http://highscalability.com/blog/2011/8/29/the-three-ages-of-google-batch-warehouse-instant.html

The above are all fairly interesting. The Blekko one is the most technical; if you only have time to read one, go with it.

http://blog.saush.com/2009/03/17/write-an-internet-search-engine-with-200-lines-of-ruby-code/

Article about using Ruby to write a small scale internet search engine. Covers crawling as well as indexing using a custom indexer in MySQL.

http://www.ideaeng.com/write-search-engine-0402

The anti “write a search engine” article. Probably worth reading in case you feel doing so is going to be easy.

Have another one I have missed here? I would love to read it. Please add a link in the comments below.

Code a Search Engine in PHP Part 5

This is part 5 of a 5 part series.

Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Downloads/Code

So we need to convert the indexer to a method that won’t consume as much memory. Looking at how it works now we can determine a few areas that could be improved before we implement our new method.

	public function index(array $documents) {
		$documenthash = array(); // so we can process multiple documents faster

		foreach($documents as $document) {
			$con = $this->_concordance($this->_cleanDocument($document));
			$id = $this->documentstore->storeDocument(array($document));

			foreach($con as $word => $count) {
				if(!$this->_inStopList($word)) {
					if(!array_key_exists($word,$documenthash)) {
						$ind = $this->index->getDocuments($word);
						$documenthash[$word] = $ind;
					}

					// Rank the Document
					$rank = $this->ranker->rankDocument($word,$document);

					if(count($documenthash[$word]) == 0) {
						$documenthash[$word] = array(array($id,$rank,0));
					}
					else {
						$documenthash[$word][] = array($id,$rank,0);
					}
				}
			}
		}

		foreach($documenthash as $key => $value) {
			usort($value, array($this->ranker, 'rankIndex'));
			$this->index->storeDocuments($key,$value);
		}
		return true;
	}

The above is our index method. The first thing I noticed is that during the main foreach over the documents we can free memory by unsetting each document after it has been processed. This is a fairly simple change; we just swap the foreach for the following loop,

	$cnt = count($documents); // take the count once, before unsetting shrinks the array
	for($i = 0; $i < $cnt; $i++) {
		$document = $documents[$i];
		unset($documents[$i]); // free each document as soon as we have a local copy

We are now unsetting each document as we go, which should free some memory along the way. The next thing to do is convert the indexer over to storing the index on disk rather than in memory. After thinking about this for a while I came up with the following idea: for each new word we find, we load the existing index from disk and flush it to a scratchpad file, then append the new details. If it is a word we have already seen we simply append to its file. Once we are done we loop over each scratchpad file, read the whole thing into memory, sort it and save it into the index. Keep in mind you cannot actually keep every file pointer open at all times and just append, as there is usually a limit on how many files a process can have open. Below is the final implementation,

	public function index(array $documents) {
		if(!is_array($documents)) {
			return false;
		}

		$indexcache = array(); // So we know if the flush file exists

		foreach($documents as $document) {
			// Clean up the document and create stemmed text for ranking down the line
			$con = $this->_concordance($this->_cleanDocument($document));

			// Save the document and get its ID
			$id = $this->documentstore->storeDocument(array($document));

			foreach($con as $word => $count) {

				if(!$this->_inStopList($word) && strlen($word) >= 2) {
					if(!array_key_exists($word, $indexcache)) {
						$ind = $this->index->getDocuments($word);
						$this->_createFlushFile($word, $ind);
						$indexcache[$word] = $word;
					}

					// Rank the Document
					$rank = $this->ranker->rankDocument($word,$document);

					$this->_addToFlushFile($word,array($id,$rank,0));
				}
			}
		}

		foreach($indexcache as $word) {
			$ind = $this->_readFlushFile($word);
			unlink(INDEXLOCATION.$word.INDEXER_DOCUMENTFILEEXTENTION);

			usort($ind, array($this->ranker, 'rankIndex'));
			$this->index->storeDocuments($word,$ind);
		}

		return true;
	}

The _createFlushFile and _addToFlushFile methods are pretty much copies of the methods used in the multifolderindex class. They could presumably be combined at some point, however this is fine for the moment. With any luck this takes care of the memory issues. The results were… promising. It ended up working and using far less memory than before, but because of the constant disk thrashing it ended up being disk bound rather than CPU bound, which is not a good sign when it comes to indexers. This is pretty easy to rectify though: buffer the results in memory until they hit some threshold, then dump the lot to disk and start over again.
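To sketch what that buffering might look like (illustrative only; nothing like this exists in the downloadable code, the per-word buffer and threshold are my assumptions, and only the constants reuse what the code above already defines):

	private $flushbuffer = array();

	public function _addToFlushFile($word, array $document) {
		// collect entries in memory and only touch the disk once a word has enough of them
		$this->flushbuffer[$word][] = $document;

		if(count($this->flushbuffer[$word]) >= 500) { // threshold is arbitrary
			$fp = fopen(INDEXLOCATION.$word.INDEXER_DOCUMENTFILEEXTENTION, 'a');
			foreach($this->flushbuffer[$word] as $entry) {
				fwrite($fp, serialize($entry)."\n");
			}
			fclose($fp);
			$this->flushbuffer[$word] = array(); // start collecting again
		}
		// a final flush of whatever is left in the buffer would still be needed at the end of index()
	}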

Fixing the above flaw wouldn’t take too much work, but this project has taken enough time as it is. If someone wants to fork it and fix the above, feel free.

Anyway I tried the above code, and it worked. The full index took about 24 hours to build (due to the disk issue mentioned before). Searches were reasonably fast and, once I increased the number of documents read from disk, quite accurate. A sample search like

“source code hosting”

turned up Bitbucket as the main result. Something else like “microsoft windows” shows the official Windows site at the top, followed by windows.net and various other Microsoft sites.

In fact here are a few queries I tried that I found interesting to play around with, most of which produce fairly decent results.

digital asset management
ebay ireland
antivirus
celebrity gossip
weight loss
microsoft
microsoft windows
youtube converter mp3
domain registration
news for nerds
spam free search
dictionary urban
online tv free
photography review
compare hotel prices
apple mac
the white house
free ebooks for kindle
computer parts shipping
os x
kardashian
radio social network
homes for rent
iron man
wine making
apple picking
php error
rent a car

SAMPLE AND SCREENSHOT

I was going to include a link to a working sample of the site, but have currently run out of time to set it up. Setting it up yourself isn’t too difficult however; see the instructions below in the downloads section. Here is a screenshot which shows the whole thing running on my virtual machine.

[Screenshot: search results running on my virtual machine]

CONCLUSION

Well there it is. As promised, a full search engine (crawler, indexer, document store, ranker) over a limited set of online documents. The initial goals were the following,

1. WebCrawler, indexer and document storage capable of handling 1 million or so documents.
2. Use test driven development to help enforce good design and modular code.
3. Have ability to have multiple strategies for things like the index, documentstore, search etc…
4. Have something I am willing to show off in terms of code.

The first is done, without any real issues. The second I did for the first few steps (before life got in the way and I started cheating). The third passed with flying colours. As for the fourth, well, nobody is ever happy with the code they are releasing, but this is good enough for the moment.

If you want to run this yourself, check out the downloads below. Download a copy of the code, do some crawling and index away. Feel free to fork the code and modify it to your heart’s content. Some things you might want to consider: removing the redundant duplicate methods that are hanging around (things like _cleanDocument are repeated all over the place and should live in a single spot); updating the code to buffer flushes to disk as mentioned above; modifying it to actually use the index’s third int to store metadata about the word; and fixing the indexing to strip HTML correctly, or perhaps parse out what might be interesting such as bold tags.
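As an illustration of that third-int suggestion, it could be treated as a simple bitfield of signals set at indexing time (a sketch only; the flag names and values here are made up and nothing like this exists in the code):

	// Hypothetical signal flags packed into the third int of each index entry
	define('FLAG_IN_TITLE', 1 << 0);
	define('FLAG_IN_URL',   1 << 1);
	define('FLAG_IS_PORN',  1 << 2);

	$meta = 0;
	$meta |= FLAG_IN_TITLE;            // the word appeared in the title
	$meta |= FLAG_IS_PORN;             // the document tripped the porn filter

	$entry = array($id, $rank, $meta); // same 3 int layout the index already uses

	// At search time a flag can be tested without re-running the regex
	$isporn = ($entry[2] & FLAG_IS_PORN) != 0;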

That’s just a few ideas off the top of my head that would work.

If you have gotten this far please check out my current project searchcode, which is a source code and documentation search engine similar to Google Code and Koders. If you want to get in contact with me you can reach me on Twitter @boyter or email me at bboyte01@gmail.com

All of the downloads mentioned throughout these articles are here. To get started quickly, download the latest source from GitHub, download the top 10,000 crawled site data, unpack it into the crawler/documents/ directory and then run add.php. This will index the top 10,000 sites, allowing you to start searching and playing around with things.

DOWNLOADS

Source Code – https://github.com/boyter/Phindex Latest Version

Progress Code 1 – PHPSearch_Implementation_1.zip
Progress Code 2 – PHPSearch_Implementation_2.zip
Progress Code 3 – PHPSearch_Implementation_3.zip
Progress Code 4 – PHPSearch_Implementation_4.zip

Top 10,000 Crawled Sites – documents10000.tar.gz

Quantcast top million list – http://ak.quantcast.com/quantcast-top-million.zip

Code a Search Engine in PHP Part 4

This is part 4 of a 5 part series.

Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Downloads/Code

So previously we had 4 main issues identified. I’m going to go with the low-hanging fruit first and add a simple porn filter.

Probably the easiest way to remove porn at this point is to just blacklist it unless someone explicitly asks to see it. I had a quick look at the sort of terms that ranked for porn and came up with the following blacklist, using what I had seen plus some terms from this porn keyword filter list: http://www.gnutellaforums.com/limewire-tips-tricks/60590-keywords-filter.html

	porn|naked|teens|pussy|sex|nasty|mature|crossdresser|couples|girlfriend|wives|pornstar|cock|fuck|shit|cunt|nude|lesbian|sexy|ass|ladyboy|granny|cum|boob|breast|exposing|milf|erotic|bdsm|live|penis|horny|slut|nudist|upskirt|boobs|tits|amateur|hottest|adult|teen|babe|1yo|2yo|3yo|4yo|5yo|6yo|7yo|8yo|9yo|10yo|11yo|12yo|13yo|14yo|15yo|16yo|17yo|incest|jailbait|kdv|kiddie|kiddy|kinder|Lolita|lsm|mbla|molested|ninfeta|pedo|phat|pjk|pthc|ptsc|premature|preteen|pthc|qsh|qwerty|r@ygold|raped|teensex|yovo|Pr0nStarS|tranny|transvest|XXX|Anal|Asshole|Bangbros|Barely|Blow|Blowjob|Bondage|brazzers|Camera_Phone|Centerfold|Clitoris|Cock|Cum|Cunt|Deepthroat|Diaper|Drilled|EROTRIX|Facial|Femjoy|Fetish|Fisting|fotos|FTV|Fuck|Gangbang|Gay|Handjob|Hardcore|Headjob|hidden_cam|Hustler|Jenna|Lesbo|Masturbat|MILF|nackte|naken|Naturals|Nipple|Nubile|Onlytease|Orgasm|Orgy|Penis|Penthouse|Playboy|Porn|Profileasian|Profileblond|Pussy|Scroops|selfpic|spunky_teens|strapon|strappon|Suck|TeenTraps|tittie|titty|tranny|transvest|twat|vagina|webcam|Whore|XPUSS|Amateur|Blonde|Brunette|Naked|Naughty|Private|Redhead|Sex|Slut|Strips|Teen|Young|wet|girl|video|taboo|nastiest

Keep in mind the above is a VERY aggressive filter and will most likely filter out some non-porn sites as well.
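In the code the list ends up behind a constant, something along these lines wrapped as a case-insensitive regex (a sketch with the list heavily abbreviated; the constant name matches what the search class uses below, but the exact regex form is an assumption):

	// Abbreviated here; the real define would contain the full pipe-separated list above
	define('SEARCH_PORNFILTER', '/\b(porn|naked|pussy|sex|nude|xxx|hardcore)\b/i');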

Chucking this through a simple regex inside the search class allows us to eliminate 99% of the porn results based on my simple tests. The relevant change is below,

	preg_match_all(SEARCH_PORNFILTER, $document[0][1].$document[0][2], $matches);

	// if they want to see porn, or its not porn
	if($seeporn || count($matches[0]) <= 1) {
		$doc[] = $document;
		$count++;
		if($count == SEARCH_DOCUMENTRETURN) {
			break;
		}
	}

Pretty easy: we check if the user wants to see porn and if so just add the result, otherwise we check it against the above blacklist. If there are 2 or more matches against the blacklist then we consider the document porn and ignore it. The seeporn flag comes in as a parameter on dosearch with a configurable default, like so,

	function dosearch($searchterms,$seeporn=SEARCH_DISPLAYPORN)

Trying it out on some questionable searches shows that it works fairly well. One example I tried was “sex” and the first few results were,

http://modernman.com/ Advice and info for men on pop culture, love & sex, cars & gear, men’s health & grooming, and more.
http://thefrisky.com/ Celebrity gossip, relationship advice, sex tips and more for real women everywhere!
http://countyjailinmatesearch.com/ The county jail inmate search provides facility addresses, phone numbers and options to lookup inmates, arrest warrants and sex offenders. 1-888-786-5245.

Which looks fairly innocuous to me. *NB* This isn’t implemented as some sort of moral crusade. I am just dealing with an issue that any serious search engine has to overcome at some point. Besides, I didn’t want to see those results while testing as they can be distracting.

Ok, moving on to the other issues: when searching for things like “what the internet is talking about right now” the results are very slow to return, and we don’t get the result we would expect, which would be Digg.com.

The easiest way to fix this with our current implementation is to improve our ranking algorithm. Because we pre-rank the documents stored against each term, sites such as Digg will bubble to the top of searches for terms like “what the internet is talking about right now”, which should fix the issue.

Ranking however isn’t the easiest thing in the world. Google and Bing use hundreds of signals to determine a page’s rank. These include terms, page speed, user behaviour, incoming links, how large the site is, term locations, term weight in the HTML etc… We have almost none of that. However everyone needs to start somewhere, so here are a few ideas I had off the top of my head to use as signals.

1. If the URL contains the search term, and how much of the URL it is. IE a search for “microsoft” should rank “http://microsoft.com” higher than “http://trymicrosoftoffice.com/”.

2. If the title contains the search term, and how much of the title it is. IE a search for “google email” will match the title “Gmail: Email from Google” more so than “The Tea Party”, which is currently ranking above it.

3. If the meta content contains the search term, rank it more highly than those where it just appears in the content.

4. Use the quantcast id as a factor in the ranking.

All of the above look good to me. Let’s do the lot. Our ranker interface needs an additional method which can rank an individual document against a term. This is defined in the iranker interface like so,

	public function rankDocument($term, $document);

And now for the implementation.

	define('RANKER_TITLEWEIGHT', 1000);
	define('RANKER_URLWEIGHT', 6000);
	define('RANKER_URLWEIGHTLOOSE', 2000);
	define('RANKER_TERMWEIGHT', 100);

	public function rankDocument($term, $document) {
		$cleanurl = $this->_cleanString($document[0]);
		$cleantitle = $this->_cleanString($document[1]);
		$cleanmeta = $this->_cleanString($document[2]);
		$rank = $document[3];

		preg_match_all('/ '.$term.' /i', $cleanurl, $urlcount);
		preg_match_all('/'.$term.'/i', $cleanurl, $urlcountloose);
		preg_match_all('/ '.$term.' /i', $cleantitle, $titlecount);
		preg_match_all('/ '.$term.' /i', $cleanmeta, $pagecount);

		$words_in_url 			= count($urlcount[0]);
		$words_in_url_loose 	= count($urlcountloose[0]);
		$words_in_title 		= count($titlecount[0]);
		$words_in_meta 			= count($pagecount[0]);

		$weight = (   $words_in_meta * RANKER_TERMWEIGHT
					+ $words_in_title * RANKER_TITLEWEIGHT
					+ $words_in_url * RANKER_URLWEIGHT
					+ $words_in_url_loose * RANKER_URLWEIGHTLOOSE
				);

		// Normalise between 1 and 10 and then invert so
		// top 100 sites are 9.9 something and bottom 100 are 0.1
		$normailise = 10-(1 + ($rank-1)*(10-1)) / (1000000 - 1);
		$newweight = intval($weight * $normailise);

		return $newweight;
	}

We just break things up and, based on how many times we find the term, increase the rank. This means that this search is keyword heavy, IE if you want to rank highly for anything just keyword-stuff your URL, title and meta tags with the term you are targeting. The reason we normalise the Quantcast rank is to get a basic PageRank-like signal, except rather than calculate our own PageRank we use the ranks we already know about.
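To make what that normalisation line does a little clearer, here is the same arithmetic split into two steps, using the same $rank and $weight as in the method above (note it actually bottoms out near 1 rather than the 0.1 the comment suggests):

	// Scale the Quantcast rank (1 .. 1,000,000) onto roughly (0 .. 9) and subtract from 10
	$scaled    = (1 + ($rank - 1) * 9) / (1000000 - 1); // ~0.000001 for rank 1, ~9 for rank 1,000,000
	$normalise = 10 - $scaled;                          // ~10 for rank 1, ~1 for rank 1,000,000
	$newweight = intval($weight * $normalise);          // popular sites get the bigger multiplier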

One other thing we can do to speed things up is to add in some stop words, and finally a stemmer. The first is just a list of words we won’t index, such as “AND”, “THE” etc… A stemmer is an algorithm which reduces words to their stem; IE “searches” can be reduced to “search” since they are pretty much the same. Doing so really cuts down on our index size. I took the stop word list from http://norm.al/2009/04/14/list-of-english-stop-words/ and the Porter stemmer implementation from http://tartarus.org/~martin/PorterStemmer/
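Wired together, the two look something like this at indexing time (a sketch; it assumes the stop word list has been loaded into an array and that the PHP Porter stemmer port linked above exposes a static Stem method, which may differ from the exact class used in the download):

	// Hypothetical helper: returns the stemmed word, or null if it should be skipped
	function prepareWord($word, array $stopwords) {
		$word = strtolower(trim($word));
		if($word == '' || in_array($word, $stopwords)) {
			return null; // stop words never make it into the index
		}
		return PorterStemmer::Stem($word); // "searches", "searching" and "search" collapse together
	}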

Of course all of this ranking means nothing for search terms such as “google mail”, as it will just rank based on the last term rather than both terms, so let’s add that in. This actually turned out to be a little more complex than expected. I’m not going to go through all of the changes required as I want to wrap up this series of articles; you can however look through the code to see what was implemented.

At this point the final step really is to squeeze as much performance out of the system as we can. A few profiling checks show that most of the time is spent stemming. Thankfully we can cache the stem results to cut this down both in the indexing stage and when serving results. You could also save the stemmed words at indexing time and trade disk space for CPU burn, which would speed up searching quite a bit. Another one to consider would be storing the porn flag in the index itself, which would allow us to skip the regex test and discard those results before even pulling them from disk. Both of the above are things to consider for the future.
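The caching itself can be as simple as memoising the stemmer behind a static array, along these lines (a sketch; the function name is illustrative and it assumes the same Porter stemmer class as above):

	function cachedStem($word) {
		static $cache = array();
		if(!isset($cache[$word])) {
			$cache[$word] = PorterStemmer::Stem($word); // only pay the stemming cost once per distinct word
		}
		return $cache[$word];
	}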

So at this point everything is pretty much fine and works with 100,000 documents, so I assumed everything would be fine at scale. I then ran it over the full million documents and hit… an out of memory error. Essentially during the indexing stage the in-memory index filled all the available memory and crashed. Just as a test I tried it on a far more powerful machine with 16 gig of memory and hit the same issue. Obviously this needs to be fixed to reach our goal.

If you remember, back in part two I highlighted that this might become an issue and outlined a second way to overcome it,

2. Flush outputs to disk occasionally, and post process.

Looks like we need to do this to reach our 1 million document goal, which we will do in the next and final step.

PHPSearch_Implementation_4.zip

Code a Search Engine in PHP Part 3

This is part 3 of a 5 part series.

Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Downloads/Code

At this point I had a large chunk of the web sitting on my disk waiting to be indexed. The first thing we need to think about is what we are going to index. Since we are dealing with HTML files we have marked-up content and can pick, to a certain extent, what we want to index on. For our search I’m going to index on the title and the meta description tags. Titles are still a part of search engine ranking algorithms, as many people search for things like “Yahoo” and “Facebook” and the title/URL usually contains this information. The meta description has fallen out of favor as something search engines use these days due to being gamed, but since it’s unlikely the top 1 million sites are doing this I am going to use it as well. These two fields should work fairly well at our scale and ensure our index isn’t too massive. We can always add the rest of the page content at a later point.

Our first step is to loop over all of the downloaded documents, process them (IE extract the content we want to index) and then add them into our index. Having a look at Google we can see that all we really need is the title, meta description and URL to serve up a result. Let’s build our documents so that they represent the above. A sample would be something like the below,

	$document = array('http://search.yahoo.com//',
					  'Yahoo! Search - Web Search',
					  'The search engine that helps you find exactly what you\'re looking for. Find the most relevant information, video, images, and answers from all across the Web.');

This, when stored, allows us to display things nicely on the page, similar to Google’s results. To accommodate this let’s change add.php to perform the above. The below is what I came up with.

	foreach(new RecursiveIteratorIterator (new RecursiveDirectoryIterator ('./crawler/documents/')) as $x) {
		$filename = $x->getPathname();
		if(is_file($filename)) {
			$handle = fopen($filename, 'r');
			$contents = fread($handle, filesize($filename));
			fclose($handle);
			$unserialized = unserialize($contents);

			$url = $unserialized[0];
			$content = $unserialized[1];

			// Try to extract out the title. Using a regex because its easy
			// however not correct, see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for more details
			preg_match_all('/<title.*?>.*?<\/title>/i',$content, $matches);
			$title = isset($matches[0][0]) ? trim(strip_tags($matches[0][0])) : '';

			// Turns out PHP has a function for extracting meta tags for us, the only
			// catch is that it works on files, so we fake a file using the data://
			// wrapper, base64 encoding and string concatenation
			$mime = 'text/html';
			$tmp = get_meta_tags("data://$mime;base64,".base64_encode($content));
			$desc = isset($tmp['description']) ? trim($tmp['description']) : '';

			// This is the rest of the content. We try to clean it somewhat using
			// the custom function html2txt which works 90% of the time
			$content = trim(strip_tags(html2txt($content)));

			// If values arent set lets try to set them here. Start with desc
			// using content and then try the title using desc
			if($desc == '' && $content != '') {
				$desc = substr($content,0,200).'...';
			}
			if($title == '' && $desc != '') {
				$title = substr($desc,0,50).'...';
			}

			// If we dont have a title, then we dont have desc or content
			// so lets not add it to the index
			if($title != '') {
				$toindex[] = array($url, $title, $desc);
			}
		}
	}

This needs some explaining. The foreach recursively iterates over our crawled documents. We point at the directory where they live rather than copy them somewhere else (why bother moving them?). We then check that we are looking at a file, and if so open it up, unserialize the contents and extract the URL and the content. We then go about parsing out the stuff we want. Firstly the title, which we extract using a simple regex. Keep in mind parsing HTML with regex is a very bad idea; thankfully on this sample set it works well enough that we can get away with it. Ideally we should use an XML parser. The next thing we pull out is the meta description tag. Thankfully this functionality already exists in PHP. Lastly we stuff the stripped content into a variable. The next step is to check if anything is missing and, assuming we have a title, add the document to our list to index. The final step of course would be to actually index everything.

The next thing we need to do is modify our indexer. I created a new indexer called indexer, based on the slightlynaieveindexer. The only real change other than the name at this point is inside the _cleanDocument function, as below.

	public function _cleanDocument($document) {

		$contents = $document[0].' '.$document[1].' '.$document[2];

		$cleandocument = strip_tags(strtolower($contents));
		$cleandocument = preg_replace('/\W/i',' ',$cleandocument);
		$cleandocument = preg_replace('/\s\s+/', ' ', $cleandocument);
		if($cleandocument != ''){
			return explode(' ',trim($cleandocument));
		}
		return array();
	}

The change is that we concatenate the 3 passed in fields URL, Title, Description and then build our concordance based on that.

So with that all done, running it over about 5,000 documents works without too many issues. I modified search.php at this point to look like the below,

	echo '<ul style="list-style:none;">';
	foreach($search->dosearch($_GET['q']) as $result) {
		?>
		<li>
			<a href="<?php echo $result[0][0]; ?>"><?php echo $result[0][1]; ?></a><br>
			<a style="color:#093; text-decoration:none;" href="<?php echo $result[0][0]; ?>"><?php echo $result[0][0]; ?></a>
			<p><?php echo $result[0][2]; ?></p>
		</li>
		<?php
	}
	echo '</ul>';

which displays the results in a similar manner to Google. Of course we still have the same issue where, if you search for something like “the”, you get 5,000 links on the page. Let’s fix this issue.

There are a few things to keep in mind with this. When you search for a term you essentially load that term’s list of documents into memory. This can take some time because you are loading a file from disk and processing it. One thing you can do to speed this up is only read the top 1000 or so documents from the file. This is slightly problematic however when it comes to ranking: if someone searches for “Yahoo” and the Yahoo homepage is document 1001 in that list, it will never appear in your results. One possible solution is to sort the documents in the index by rank against the term. IE we know that the Yahoo homepage will rank very highly for the term “Yahoo” so let’s make it first in line. This way when we pull back 1000 results we know that Yahoo will be at the top. While we are at it we should probably split the terms into separate folders as well. The new index we will call multifolderindex. Changing the index to work like this is very simple; we just modify _getFilePathName like so,

	public function _getFilePathName($name) {
		$md5 = md5($name);
		$one = substr($md5,0,2);
		if(!is_dir(INDEXLOCATION.$one.'/')) { // only create the folder if it is missing
			mkdir(INDEXLOCATION.$one.'/');
		}
		return INDEXLOCATION.$one.'/'.$name.SINGLEINDEX_DOCUMENTFILEEXTENTION;
	}

Which then creates the folders (if required) and returns the location correctly. Nothing complex there. The other change is to limit the number of documents pulled back for each search. To do so we edit getDocuments and add the following,

	$count++;
	if($count == 200) {
		break;
	}

With the variable $count initialized to 0 outside the loop we now cut down on the processing time in a big way. In the future a limit and offset should be passed in so we can step through results (IE get deeper pages), but it is fine for the moment. *NB* For the final implementation I upped this number to 10,000 and got good results.
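For what it’s worth, that future version might look something like this (a sketch only; the _readIndexEntries helper is hypothetical and stands in for however the index file is actually walked):

	public function getDocuments($name, $limit = 200, $offset = 0) {
		$documents = array();
		$position = 0;
		foreach($this->_readIndexEntries($name) as $entry) {
			if($position++ < $offset) {
				continue; // skip entries belonging to earlier pages
			}
			$documents[] = $entry;
			if(count($documents) == $limit) {
				break;    // we have a full page, stop reading from disk
			}
		}
		return $documents;
	}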

Now in order for the above changes to work we need to modify the indexer to presort our documents before passing them off to be saved. This is simple: we just pass in our existing ranker, have it rank the documents and we are done. The lines to modify are inside index in our indexer,

	foreach($documenthash as $key => $value) {
		usort($value, array($this->ranker, 'rankDocuments'));
		$this->index->storeDocuments($key,$value);
	}

Of course we need to add the ranker into the constructor as well. Now when the index hits disk it will already be presorted in favor of documents which contain the word more often than other documents. Later on we will adjust this to allow for the word being in the title or the URL and rank based on that. Since we are modifying a lot of stuff let’s also make things an AND engine by default. This involves the following change in our search class.

	function dosearch($searchterms) {
		$indresult = array(); // AND results 
		$indorresult = array(); // OR results IE everything

		$interlists = array();

		foreach($this->_cleanSearchTerms($searchterms) as $term) {

			$ind = $this->index->getDocuments($term);
			if($ind != null) {
				$tmp = array();
				foreach($ind as $i) {
					$indorresult[$i[0]] = $i[0];
					$tmp[] = $i[0];
				}
				$interlists[] = $tmp;
			}
		}

		// Get the intersection of the lists (guard against no terms matching at all)
		$indresult = isset($interlists[0]) ? $interlists[0] : array();
		foreach($interlists as $lst) {
			$indresult = array_intersect($indresult, $lst);
		}

		$doc = array();
		$count = 0;
		foreach($indresult as $i) {
			$doc[] = $this->documentstore->getDocument($i);
			$count++;
			if($count == 20) {
				break;
			}
		}
		if($count != 20) { // If we dont have enough results to AND default to OR
			foreach($indorresult as $i) {
				$tmp = $this->documentstore->getDocument($i);
				if(!in_array($tmp, $doc)) { # not already in there
					$doc[] = $tmp;
					$count++;
					if($count == 20) {
						break;
					}
				}
			}
		}

		return $doc;
	}

There is a lot going on in there but essentially we build the list as before (the OR list) while also keeping each term’s set separately. We then take each of those sets and get the intersection of them all. This is used for the top results and, assuming we don’t find enough, we revert to OR logic. Something to keep in mind is that it’s actually easier to get AND logic when the lists are sorted by ID: you can just step through each list, popping IDs off as needed, until you find enough intersecting results.

The final thing to update before testing is to move the document store over to a multi-folder document store. Thankfully changing this one is pretty easy as well, as the design is similar to the multi-folder index; all we need do is update the _getFilePathName method, which looks like the following.

	public function _getFilePathName($name) {
		$md5 = md5($name);
		$one = substr($md5,0,2);
		if(!is_dir(DOCUMENTLOCATION.$one.'/')) { // only create the folder if it is missing
			mkdir(DOCUMENTLOCATION.$one.'/');
		}
		return DOCUMENTLOCATION.$one.'/'.$name.DOCUMENTSTORE_DOCUMENTFILEEXTENTION;
	}

INITIAL RESULTS

With that done the document store and index are good enough to try some much larger searches. At the time I had about 20,000 documents that I could index, so I indexed the lot and ran a few test searches. A few observations from this:

1. When searching for things like “what the internet is talking about right now” the results are very slow to return.
2. When searching for things like the above we don’t get the result we would expect, which would be Digg.com.
3. The ranking can still be pretty stupid at times; for example “reverse image search” ranks emailfinder.com above tineye.com.
4. There are a lot of porn results returned, which is rather annoying; for example for the query “flat”.

All of the above need to be resolved before we can say that this implementation works well, and they will be the focus of Part 4.

You can download what we have from here.

PHPSearch_Implementation_3.zip

Code a Search Engine in PHP Part 2

This is part 2 of a 5 part series.

Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Downloads/Code

The first implementation has a few issues that are pretty apparent. Chief among them is performance. Secondly, it stores the index in a single folder, which is loosely related to the first. Finally, it shows duplicates in the search results. Let’s fix the above and test things again to see how far we can go.

PERFORMANCE

I’m not sure what video I was watching but it had the following line about performance: “Don’t ever guess. You will be wrong. Measure.” In my opinion (for whatever that is worth) this is 100% correct. If you aren’t a human compiler/runtime (and if you are, why are you reading this?) there is no way you can know what is actually slow in your application. You can guess, and you might be right, but generally it’s faster to run your application through a profiler, and you will quite often be surprised.

* One example came when I was profiling http://searchcode.com/ and trying to get the result templates to load faster. Considering it does all sorts of regex/formatting/etc… I expected the issue to be string processing. What it actually turned out to be was a method I was calling to get the root location of the website (IE http://searchcode.com/); it was called many times and ate most of the time. Simply calling it once and storing the value in a variable decreased page load times by about 30%. *

PHP actually does have some decent profiling tools, even if they can be a pain to set up. My personal choice is to use webgrind because it’s pretty easy to set up and understand. I did some profiling of the indexing using the following scenarios.

1. Empty index. Add/index single document.
2. 100 documents in index. Add/index single document.
3. 1000 documents in index. Add/index single document.

The results showed that most of the time is spent writing things to disk, mostly in the building of the index rather than saving the documents themselves. This is because for each document it loads the index shard from disk, appends to it and then saves it, without considering it might need to do the same again really soon. This means a single index shard for a common word such as “the” will be loaded and saved to disk once for each document added. When doing 10,000 documents this is a massive bottleneck.

INDEX IMPROVEMENTS

We need to consider how to spend less time writing or reading to disk while building the index. Off hand I can think of a few ways, of which we will have a look at two for the moment.

1. Keep the index in memory. Save every now and then or once at the end of the indexing process.

This approach has a few issues, the main one being that if we are indexing a popular word such as “the” the index is going to get VERY large very quickly and consume all available memory. Stop words (https://en.wikipedia.org/wiki/Stop_words) would hold this issue off for a while, but eventually we are going to need more memory than we can fit in a machine. Sure, we could take this approach only for words which are not common and keep the common words on the existing approach, but we won’t get much performance gain as it’s the frequent reads/writes which are killing performance. That said I think it’s worth at least trying, as it should be very easy to implement. It may also get us over the million document mark, which means we can spend more time focused on cool ranking algorithms.

2. Flush outputs to disk occasionally, and post process.

This approach is more scalable, but has the issue that the index is not searchable while we are building it. I think this one is the correct approach though, as the previous one will probably run out of steam when we look at indexing a million documents. Essentially we have a two-step index process. The first step is where we add documents to the index: the process holds a lock over a single file and every now and then pushes more of the index to disk. As it’s a single file, and because we only flush to disk occasionally, performance should be reasonably fast and not constrained by the disk too much. When we have finished adding documents we run a post-process over the new file to build our searchable index. Of course this is all in theory; the only way to be sure is to try it and benchmark. Assuming the first option fails, that is.
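In rough terms the first step of that idea might look like the following (a sketch only; nothing like this exists in the code yet, and the file format and function name are made up):

	// Append one (word, document id, rank) posting to a single scratch file under a lock;
	// a later post-process pass would group the lines by word into the real index.
	function flushPosting($scratchfile, $word, $id, $rank) {
		$fp = fopen($scratchfile, 'ab');
		if(flock($fp, LOCK_EX)) {
			fwrite($fp, $word."\t".$id."\t".$rank."\n");
			flock($fp, LOCK_UN);
		}
		fclose($fp);
	}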

OPTION 1 A SLIGHTLY NAIVE INDEXER

So in the list of improvements I identified two ways to speed things up and increase our ability to scale, the first being holding the index in memory while processing and flushing it to disk at the end of the process. Thankfully we already catered for this in the design: when we call the indexer’s index method we don’t have to pass one document, we can actually pass multiple. To implement the in-memory saving we just need to change the index method to look like the below.

	public function index(array $documents) {
		if(!is_array($documents)) {
			return false;
		}

		$documenthash = array(); // so we can process multiple documents faster

		foreach($documents as $document) {

			// Save the document and get its ID then clean it up for processing
			$id = $this->documentstore->storeDocument(array($document));
			$con = $this->_concordance($this->_cleanDocument($document));

			foreach($con as $word => $count) {
				// Get and cache the word if we dont have it
				if(!array_key_exists($word,$documenthash)) {
					$ind = $this->index->getDocuments($word);
					$documenthash[$word] = $ind;
				}

				if(count($documenthash[$word]) == 0) {
					$documenthash[$word] = array(array($id,$count,0));
				}
				else {
					$documenthash[$word][] = array($id,$count,0);
				}
			}
		}

		foreach($documenthash as $key => $value) {
			$this->index->storeDocuments($key,$value);
		}

		return true;
	}

The changes here are to store all the index shards in memory using $documenthash and then at the end dump them all to disk. This means rather than reading and writing to disk all the time we only do it in a single step per index run. Of course it does mean we need to change how we add to the index, like the below.

	$indexer->index(
		array(
			'Setting the AuthzUserAuthoritative directive explicitly to Off allows for user authorization to be passed on to lower level modules (as defined in the modules.c files) if there is no user matching the supplied userID.',
			'The Allow directive affects which hosts can access an area of the server. Access can be controlled by hostname, IP address, IP address range, or by other characteristics of the client request captured in environment variables.',
			'This directive allows access to the server to be restricted based on hostname, IP address, or environment variables. The arguments for the Deny directive are identical to the arguments for the Allow directive.',
			'The Order directive, along with the Allow and Deny directives, controls a three-pass access control system. The first pass processes either all Allow or all Deny directives, as specified by the Order directive. The second pass parses the rest of the directives (Deny or Allow). The third pass applies to all requests which do not match either of the first two.',
			'The AuthDBDUserPWQuery specifies an SQL query to look up a password for a specified user.  The users ID will be passed as a single string parameter when the SQL query is executed.  It may be referenced within the query statement using a %s format specifier.',
			'The AuthDBDUserRealmQuery specifies an SQL query to look up a password for a specified user and realm. The users ID and the realm, in that order, will be passed as string parameters when the SQL query is executed.  They may be referenced within the query statement using %s format specifiers.'
		)
	);

All of the documents are now indexed in a single step. Of course with a case as small as our 6 documents the overhead saved is very little; however when you compare this implementation at 100 or even 10,000 documents it’s considerably faster. It still slows down as you add more documents though: adding 1 document to an index containing 10,000 documents is going to be slower than adding 1 document to an index with 10.

A few checks of the above showed that we could index 200 documents considerably faster in a single sweep rather than one document at a time. Of course adding the next 200 documents was much slower than the first 200, but there is a huge improvement here. The nice thing is that if we add 10,000 documents in a single go it’s almost as fast as adding 1, as the speed penalty is paid on the initial read and the final write.

The above change is probably enough for us to go and test with. Assuming it holds together to 100,000 documents it could probably hold together to 1,000,000 documents.

BENCHMARKING

So with the above simple fix let’s load it up and see how it goes. I ran the following code to index 100,000 documents and it managed to finish in about 20 minutes (on a very under-powered virtual machine running in VirtualBox).

	$toindex = array();
	for($j=0;$j<100000; $j++) {
		$toindex[] = 'The AuthDBDUserRealmQuery specifies an SQL query to look up a password for a specified user and realm. The users ID and the realm, in that order, will be passed as string parameters when the SQL query is executed.  They may be referenced within the query statement using %s format specifiers.';
	}
	$indexer->index($toindex);

Keep in mind this is saving the documents to disk as well (which we will remove in our final version), and the performance isn’t too bad.

Searches work as well, however you need to add a time limit of 0

	set_time_limit(0);

to avoid hitting timeouts. This is because any search then goes on to display all 100,000 results.

I’m going to say that this totally surprised me. I really did expect that we would have to explore better ways of saving the index to disk. Since it does work however, and since it’s highly likely that it will hold together for the full 1,000,000 documents, let’s move on to more interesting code.

A BETTER SEARCH

One of the other issues we have is that duplicate results are returned when searching for more than one term, IE a search for “AuthDBDUserRealmQuery SQL” will return the single document which contains both terms twice. Let’s implement some logic to stop duplicate documents from being returned.

The implementation is basically a copy of naievesearch, and we modify the dosearch method like so,

	function dosearch($searchterms) {
		$indresult = array();
		foreach($this->_cleanSearchTerms($searchterms) as $term) {

			$ind = $this->index->getDocuments($term);
			if($ind != null) {
				usort($ind, array($this->ranker, 'rankDocuments'));
				foreach($ind as $i) {
					$indresult[$i[0]] = $i[0];
				}
			}
		}

		$doc = array();
		foreach($indresult as $i) {
			$doc[] = $this->documentstore->getDocument($i);
		}

		return $doc;
	}

In this case all we need to do is loop through the results for each term, sort them and add them to a hash keyed on the document id. This ensures we get no duplicate results. Because the sorting takes place per term, however, documents are ranked according to their frequency for the first term, then the second etc… Not ideal, but good enough for the moment. This also means our engine is still an OR engine by default.

To convert it to an AND engine the logic is similar and fairly simple, assuming the lists are sorted by document id. For each document list, peek at the top document. If that document id sits at the head of every list it is a match, so remember it. If not, pop the smallest id and check again. Repeat till you run out of documents (or hit some limit) and then rank and display.
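For the curious, that merge step might look something like the following (a sketch only; it assumes each posting list is already sorted by document id, which is not how the index above is currently sorted, and the function name is made up):

	function andMerge(array $lists, $limit = 20) {
		$results = array();

		while(!empty($lists) && count($results) < $limit) {
			$heads = array();
			foreach($lists as $key => $list) {
				if(empty($list)) {
					return $results; // one list is exhausted, nothing more can match every term
				}
				$heads[$key] = $list[0][0]; // peek at the smallest remaining document id
			}

			if(count(array_unique($heads)) == 1) {
				// the same id heads every list, so the document matches all terms
				$results[] = reset($heads);
				foreach($heads as $key => $id) {
					array_shift($lists[$key]);
				}
			} else {
				// the smallest id cannot be in every list, discard it wherever it appears
				$min = min($heads);
				foreach($heads as $key => $id) {
					if($id == $min) {
						array_shift($lists[$key]);
					}
				}
			}
		}

		return $results;
	}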

INITIAL RESULTS

At this point everything is going well. We can index 100,000 documents without any major issues and we know that the search holds together, so let’s assume that everything should be good for us to hit our 1,000,000 document goal.

PHPSearch_Implementation_2.zip

This is the end of Part 2. In Part 3 we are going to modify our document store to use our downloaded documents, modify our indexer to deal with web documents and finally modify our ranker to rank with a little more intelligence.

Code a Search Engine in PHP Part 1

This is part 1 of a 5 part series.

Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Downloads/Code

I imagine that if you have landed on this page you probably have an interest in search and search engines. Not only that, you have probably read Why Writing Your Own Search Engine is Hard by Anna Patterson (of Cuil fame) http://queue.acm.org/detail.cfm?id=988407 — if not, go read it and come back here. I have always had an interest in search and indexing, so I thought I would take the time to write a simple indexer as a way of testing whether I had actually learnt anything.

Ok let’s do it. Let’s write a small search engine. By small I mean let’s make it so we can index about a million documents and search on them. I say only a million because you can pretty easily download this number of documents and store them on a single machine these days.

First things first let’s state our goals so we know what we are aiming for.

1. WebCrawler, indexer and document storage capable of handling 1 million or so documents.
2. Use test driven development to help enforce good design and modular code.
3. Have ability to have multiple strategies for things like the index, documentstore, search etc…
4. Have something I am willing to show off in terms of code.

To start, what we need is some thought and design. I suspect that pretty much any search engine you can name started with some simple ideas which were then executed on. The first thing to choose is what language I’m going to write things in. I considered doing this in Java, C# or Python, but in the end decided to go with PHP for a few reasons: there isn’t really an existing implementation out there in PHP, and more importantly I want to use this as a way of proving I have some PHP skills.

A search engine really only has a few parts: a crawler to pull external documents down ready to be indexed, an index which maps terms to documents (an inverted index), and a documentstore for keeping the documents themselves.

For the WebCrawler I am just going to use a very simple FOR(X) GET X; style PHP script. This should be more than enough to get us started indexing. There are other crawlers out there written in PHP, such as Sphider, but really we only need something simple.
For the actual implementation of the index and documentstore I’m going to build the following parts, each designed to be independent of the others.

index – This is going to handle creating and reading the “fast” lookups.
documentstore – This is going to handle the management of documents.
indexer – This is going to be responsible for taking a document, breaking it apart and passing it off to the index and possibly documentstore for storage.
search – This is going to be responsible for parsing the query, querying the index and documentstore to get the results
ranker – This is going to be responsible for ranking the results somehow and will be used by search to order results

That should be pretty much all we need to download a portion of the web, start indexing and serve up some results.

THE CRAWLER

In order to crawl you need to come up with a list of URLs. There are a few common ways to do this. The first is to seed your crawler with a few links which contain lots of other links (eg Yahoo, Digg, Slashdot, Hacker News), crawl them and harvest links as you go. Another is to download a list of millions of URLs and use that. Off the top of my head I can think of two sources, the first being DMOZ.org, whose human-created list of URLs you can download for free. The other is to download the Quantcast list of the top 1 million websites. For this I’m going to go with the Quantcast option, simply because for basic searches such as “Youtube” or “Yahoo Finance” it should have the results we need.

Downloading the file shows it to be in the following format,

# Quantcast Top Million U.S. Web Sites
# Rankings estimated as of Nov 27, 2012
# Copyright 2012 Quantcast Corporation
# This data is subject to terms of use described at http://www.quantcast.com/docs/measurement-tos

Rank    Site
1       google.com
2       youtube.com
3       facebook.com
4       msn.com
5       twitter.com
6       yahoo.com
7       amazon.com
8       wikipedia.org
9       microsoft.com
10      huffingtonpost.com

Since we are only interested in the actual sites let’s write a simple parser to pull the correct data out. It’s a fairly simple script really,

	$file_handle = fopen("Quantcast-Top-Million.txt", "r");

	while (!feof($file_handle)) {
		$line = fgets($file_handle);
		if(preg_match('/^\d+/',$line)) { # if it starts with some amount of digits
			$tmp = explode("\t",$line);
			$rank = trim($tmp[0]);
			$url = trim($tmp[1]);
			if($url != 'Hidden profile') { # Hidden profile appears sometimes, just ignore them
				echo $rank.' http://'.$url."/\n";
			}
		}
	}
	fclose($file_handle);

All the above does is read the file line by line, check if the line starts with a digit (IE 0-9) and if so split the line on tabs and print out the rank and the URL. We keep the rank because it may be useful down the line when indexing. Using some basic shell scripting we can run it like so,

	php parse_quantcast.php > urllist.txt

It takes a few minutes to run but at the end we have our seed crawl list, which should be almost the same size as the initial file. I wanted to see what we were working with following this step, so I ran a few checks on the file,

	wc -l urllist.txt
	995941 urllist.txt

	cat urllist.txt | sort | uniq | wc -l
	995941

So there are 995941 lines in the file and 995941 unique lines as well, so a unique URL on each line. Perfect.

The other option, the DMOZ data, can be parsed using the following steps. Go download the content (about 300 meg), unpack it and you can begin collecting the links. I did look into using grep’s -o functionality but couldn’t get it to accept a regex allowing me to pull out just the links I wanted. I was then going to write a parser in PHP, but since Python just eats these sorts of jobs, in the end I used a small Python script,

	grep "http" content.rdf.u8 | python parse.py | sort | uniq > urllist.txt

Where parse.py is just the following,

	import re
	import sys

	match = re.compile("""http.*\"""")

	for line in sys.stdin:
		t = match.findall(line)
		if len(t) != 0:
			print t[0].replace('"','')

The above DMOZ data gives about 4 million unique URLs to crawl and use. Sadly a lot of the stuff in there is no longer current or points to sites people aren’t interested in, so I will stick with the Quantcast list.

DOWNLOADING

Downloading the data is going to take a while, so be prepared for a long wait. You can write a very simple crawler in PHP just by using file_get_contents and sticking in a URL. I like simple, so let’s do that.

	$file_handle = fopen("urllist.txt", "r");

	while (!feof($file_handle)) {
		$url = trim(fgets($file_handle));
		$content = file_get_contents($url);
		$document = array($url,$content);
		$serialized = serialize($document);
		$fp = fopen('./documents/'.md5($url), 'w');
		fwrite($fp, $serialized);
		fclose($fp);
	}
	fclose($file_handle);

The above is essentially a stupid single-threaded crawler. It just loops over each URL in the file, pulls down the content and saves it to disk. The only thing of note here is that it stores the URL along with the content in the document, because we might want to use the URL for ranking and because it’s helpful to know where the document came from. One thing to keep in mind about the above is that you may run into filesystem limits when trying to store a million documents in one folder.

At this point I quickly found that the crawler is going to take forever to download even a small portion of the web, the reason being that it is single-threaded. We can fix this pretty easily with some xargs and command line fu. Another issue is that dropping about a million files into a single directory, as I mentioned before, is probably a bad plan. So I am going to use the hash of the URL to spread the files out over directories. This would also make it easier to shard in the future should the need arise.

*NB* Be careful when using hashes of URLs. While the number of URLs you can store before collisions become likely (roughly the square root of the number of possible MD5 values, see http://www.skrenta.com/2007/08/md5_tutorial.html) is still a lot bigger than the current web, if you do get a collision you are going to get pages about Britney Spears when you were expecting pages about Bugzilla. It’s probably a non-issue in our case, but for billions of pages I would opt for a much larger hashing algorithm such as SHA-256 or avoid hashing altogether.

First we change the crawler to be the below.

	$url = trim($argv[1]);

	$md5 = md5($url);
	$one = substr($md5,0,2);
	$two = substr($md5,2,2);

	if(!file_exists('./documents/'.$one.'/'.$two.'/'.$md5)) {
		echo 'Downloading - '.$url."\r\n";
		$content = file_get_contents($url);
		$document = array($url,$content);
		$serialized = serialize($document);
		if(!is_dir('./documents/'.$one.'/'.$two.'/')) {
			mkdir('./documents/'.$one.'/'.$two.'/', 0777, true); // create both levels, skip if they already exist
		}
		$fp = fopen('./documents/'.$one.'/'.$two.'/'.$md5, 'w');
		fwrite($fp, $serialized);
		fclose($fp);
	}

This simply checks the arguments it’s supplied and expects the first argument (the one following the script name itself) to be the URL it is going to download. It then hashes the URL so we know the filename and which directories need to be created. We then check if the file already exists at the save location, and if it doesn’t we crawl the site, create the directories and save the content to disk.

Then with the following,

	cat urllist.txt | xargs -L1 -P32 php crawler.php

We have a 32 process crawler. You can spawn more processes if you wish; just increase the -P32 to whatever number you prefer. Because a crawler spends most of its time waiting on the network for external connections you can bump the number quite high, certainly 100 or so on a single core machine should be fine, however that may crash your router (like it did to mine), so be careful and keep an eye on it. After crashing my router a few times I ran the process on a proper server and let it run for a day or so.

Run it for a few days or however long it takes and you will have a large chunk of web data sitting on your disk ready for indexing.

You may want to split the urllist using “split -l 10000” or some other line count and pull down each chunk separately. This is also a good idea if you are sharing your internet connection and don’t want to saturate it totally for days on end.

One thing to keep in mind is that dumping a million small files all over the disk is terribly inefficient. You should really be packing the documents into large files and keeping a record of where each document lives, which makes processing much faster as you can read the disk in long sequential sweeps (although for an SSD this is probably a moot point). For a real search engine, IE billions of documents, do not do the above. However for something small scale it is perfectly fine.
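To give a flavour of that packing approach, appending to a big store file and recording offsets might look something like this (a sketch only; the file names and format are made up and this is not what the series’ code does):

	function appendDocument($storefile, $locationfile, $url, $content) {
		$blob = serialize(array($url, $content));

		// the document starts wherever the store file currently ends
		clearstatcache(); // filesize results are cached by PHP, so clear before asking
		$offset = file_exists($storefile) ? filesize($storefile) : 0;

		$fp = fopen($storefile, 'ab');
		fwrite($fp, $blob);
		fclose($fp);

		// record url hash, offset and length so the document can be read back with fseek/fread
		file_put_contents($locationfile, md5($url)."\t".$offset."\t".strlen($blob)."\n", FILE_APPEND);
	}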

With that done I think we can cross off the crawler. Real-world crawling of course isn’t quite that simple: eventually you need to consider refreshing your content, finding new links, avoiding overtaxing servers (which might host multiple domains), avoiding crawling admin pages which have delete links, improving performance (caching DNS lookups) etc… But for our purposes, to refresh the index you can just re-crawl things the same way, and for a small portion of the web that should be fine.

THE INDEX

When I approach things in a test driven development manner I like to use a bottom up approach. It makes sense to me to begin with the feature that is least coupled to everything else. The standard search engine index is usually an inverted index of which there are two main types. The first is a record level index which contains a list of references to documents for each word/key. The second is a full inverted index which includes the locations/positions of each word in the document. For the sake of simplicity I am going to build the first. This means we cannot do a fast proximity search for terms, but we can always add this later.

I decided that the index I was going to create should only have a few very simple responsibilities. The first is that it needs to store its contents to disk and retrieve them. It also needs to be able to clear itself when we decide to regenerate things. The last thing that would be useful is to have it validate the documents that it is storing. With these simple tasks defined I wrote the following interface,

	interface iindex {
		public function storeDocuments($name,array $documents);
		public function getDocuments($name);
		public function clearIndex();
		public function validateDocument(array $document);
	}

which does pretty much every task I require. The reason for the interface is that I can write unit tests against just the interface without worrying about the implementation, and if required change how the backend works. Now that we have something to code against we need to have a long think about how we are going to implement it.
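
As a rough illustration of what testing against the interface looks like, a sketch assuming PHPUnit and the singlefolderindex implementation described below, with INDEXLOCATION pointing at a scratch directory,

	class SingleFolderIndexTest extends PHPUnit_Framework_TestCase {
		public function testValidateDocumentRejectsWrongSize() {
			$index = new singlefolderindex();
			// documents must contain exactly 3 non negative integers
			$this->assertFalse($index->validateDocument(array(1, 2)));
		}

		public function testStoreAndGetDocumentsRoundTrip() {
			$index = new singlefolderindex();
			$index->clearIndex();
			$documents = array(array(1, 5, 0), array(2, 3, 0));
			$this->assertTrue($index->storeDocuments('cat', $documents));
			$this->assertEquals($documents, $index->getDocuments('cat'));
		}
	}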

For storing the documents I decided to go with the simplest thing that could possibly work, which is a single folder index where each word's list of documents is stored in a separate file in a single directory. The documents in the index will be represented by 3 integers in an array like so,

	array($int1, $int2, $int3);

This gives us enough space to store a document id for lookups, and another two slots to store a simple pre-rank, a word count, or possibly use some bit logic to store 32 or so signals when indexing (more on this in the indexing part). This means that each single document we are going to store requires 12 bytes. Let's assume that a document contains about 500 unique keywords and we can use some simple maths to guess the size of the index,

	((12 x 1,000,000 x 500) / 1024) / 1024 = 5722 Megabytes

Almost 6 gigabytes, which isn't an outlandish size and is certainly easily stored on a single machine. Even if we double both the number of documents and the assumed number of unique keywords we are still looking at less than 30 gigabytes, which is easily done. Keep in mind however that we are writing a very slow index solution at this point and it's unlikely that it will scale to this size. We will adjust it later to cope.
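
Incidentally, the 12 byte figure can be sanity checked with PHP's pack function, which we will use later when writing the index to disk (a quick throwaway snippet),

	// three integers packed with 'i' should come to 12 bytes on a typical machine
	$packed = pack('i', 42).pack('i', 7).pack('i', 0);
	echo strlen($packed)."\r\n"; // prints 12 where an integer packs to 4 bytes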

So having defined the interface and how the documents look on disk, let's look at implementing it. I'm not going to refer to the tests here, but suffice to say the whole thing was developed with them in your standard TDD approach. The first method I think we need to tackle is validateDocument.

	define('SINGLEINDEX_DOCUMENTCOUNT', 3);

	public function validateDocument(array $document=null) {
		if(!is_array($document)) {
			return false;
		}
		if(count($document) != SINGLEINDEX_DOCUMENTCOUNT) {
			return false;
		}
		for($i=0; $i < SINGLEINDEX_DOCUMENTCOUNT;$i++) {
			if(!is_int($document[$i]) || $document[$i] < 0) {
				return false;
			}
		}
		return true;
	}

Essentially this is what I came up with. Basically it just checks that the document supplied is an array and that it contains 3 integers which are zero or greater. The next function we need is storeDocuments.

	public function storeDocuments($name, array $documents = null) {
		if($name === null || $documents === null || trim($name) == '') {
			return false;
		}
		if(!is_string($name) || !is_array($documents)) {
			return false;
		}
		foreach($documents as $doc) {
			if(!$this->validateDocument($doc)){
				return false;
			}
		}
		$fp = fopen($this->_getFilePathName($name),'w');
		foreach($documents as $doc) {
			$bindata1 = pack('i',intval($doc[0]));
			$bindata2 = pack('i',intval($doc[1]));
			$bindata3 = pack('i',intval($doc[2]));
			fwrite($fp,$bindata1);
			fwrite($fp,$bindata2);
			fwrite($fp,$bindata3);
		}
		fclose($fp);
		return true;
	}

The above is what I came up with. It takes in a name (the word we are storing) and an array of documents to store. Essentially all it is doing is writing the integers (4 bytes each) to a binary file. The only unusual thing happening is the new method _getFilePathName which, when supplied a name (the word we are storing), returns a string containing where the index file is to be stored and its name. This function just looks like this,

	public function _getFilePathName($name) {
		return INDEXLOCATION.$name.SINGLEINDEX_DOCUMENTFILEEXTENTION;
	}

Where INDEXLOCATION needs to be defined somewhere (enforced in the constructor) and the file extension is defined in the index file. The reason for INDEXLOCATION to be defined outside this file is that we don't want to work it out ourselves, but instead have the developer using this code decide where the index should live.

The next method we need is to get a document list, getDocuments.

	public function getDocuments($name) {
		if(!file_exists($this->_getFilePathName($name))) {
			return array();
		}
		$fp = fopen($this->_getFilePathName($name),'r');
		$filesize = filesize($this->_getFilePathName($name));
		if($filesize%SINGLEINDEX_DOCUMENTBYTESIZE != 0) {
			throw new Exception('Filesize not correct index is corrupt!');
		}
		$ret = array();
		for($i=0;$i<$filesize/SINGLEINDEX_DOCUMENTBYTESIZE;$i++) {
			$bindata1 = fread($fp,SINGLEINDEX_DOCUMENTINTEGERBYTESIZE);
			$bindata2 = fread($fp,SINGLEINDEX_DOCUMENTINTEGERBYTESIZE);
			$bindata3 = fread($fp,SINGLEINDEX_DOCUMENTINTEGERBYTESIZE);
			$data1 = unpack('i',$bindata1);
			$data2 = unpack('i',$bindata2);
			$data3 = unpack('i',$bindata3);
			$ret[] = array($data1[1],
							$data2[1],
							$data3[1]);
		}
		fclose($fp);
		return $ret;
	}

It's actually more complex than the write function because we need to know the number of bytes to read. An integer on a 32 bit x86 system in PHP is 4 bytes, so a document consists of 12 bytes since we said it would have 3 integers. We then just suck it all into a large array and return the whole thing. I did add a simple naive corrupt index check here just to make sure that we don't have any unexpected issues caused by bad writes. Of course this just means it will throw corrupt index errors, which isn't much better.
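
The two byte size constants referenced in getDocuments aren't defined in the listing above; presumably they sit alongside SINGLEINDEX_DOCUMENTCOUNT as something like,

	// 4 bytes per integer packed with 'i', 12 bytes per document of 3 integers
	define('SINGLEINDEX_DOCUMENTINTEGERBYTESIZE', 4);
	define('SINGLEINDEX_DOCUMENTBYTESIZE', SINGLEINDEX_DOCUMENTINTEGERBYTESIZE * SINGLEINDEX_DOCUMENTCOUNT);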

The last method defined is clearIndex. This is the simplest function to write in this case and looks like the below,

	public function clearIndex() {
		$fp = opendir(INDEXLOCATION);
		while(false !== ($file = readdir($fp))) {
			if(is_file(INDEXLOCATION.$file)){
				unlink(INDEXLOCATION.$file);
			}
		}
		closedir($fp); // close the directory handle once every index file is removed
	}

It just loops over each file in the index directory and deletes them all. Simple stuff really.

Phew! Not too much code, with the whole file being just over 100 lines, and it provides us with something that will store an index and pull it back for us. I actually ended up writing 300 lines of test code to really stress this logic and make sure everything works as expected, and it seems pretty solid. Some of the tests are pure unit tests and a lot actually write through to disk. Not pure unit tests, but since they are pretty fast we can consider them unit enough. Besides, without mocking the file-system this class would be pretty pointless to test otherwise.

THE DOCUMENT STORE

The document store is a somewhat unusual beast in that, odds are, if you are going to index things you probably already have them stored somewhere. The most obvious cases being documents in a database, or files that already exist. This means we really do need an interface so we can change the underlying structure of the document store as we go on without having to rewrite things over and over. Hence I came up with the following,

	interface idocumentstore {
		public function storeDocument(array $document);
		public function getDocument($documentid);
		public function clearDocuments();
	}

Nice and simple. Store a document, which is just an array of values, and return a document id; get a document by id; and clear all the documents stored. The implementation is similar to the index store above and the key parts are included below. For testing purposes I created a very simple single folder document store. I will use this implementation to get everything working at first and then hook into our crawler's downloaded documents.

storeDocument

	$docid = $this->_getNextDocumentId();
	$serialized = serialize($document);
	$fp = fopen($this->_getFilePathName($docid), 'a');
	fwrite($fp, $serialized);
	fclose($fp);
	return $docid;

getDocument

	$filename = $this->_getFilePathName($documentid);
	if (!file_exists($filename)) {
		return null;
	}
	$handle = fopen($filename, 'r');
	$contents = fread($handle, filesize($filename));
	fclose($handle);
	$unserialized = unserialize($contents);
	return $unserialized;

clearDocuments

	$fp = opendir(DOCUMENTLOCATION);
	while(false !== ($file = readdir($fp))) {
		if(is_file(DOCUMENTLOCATION.$file)){
			unlink(DOCUMENTLOCATION.$file);
		}
	}
	closedir($fp); // as with the index, close the handle when done

Each one is pretty simple to follow. storeDocument just serialises the document to disk, getDocument unserialises it, and clearDocuments deletes every document in the document store. Odds are we will want a database version down the line as this is massively inefficient, but for the moment it should serve us well.
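
The _getNextDocumentId helper called in storeDocument isn't shown; a naive way it could work for a single folder store (just a sketch that counts the files already on disk) is,

	public function _getNextDocumentId() {
		// count the document files already stored and hand out the next number
		$count = 0;
		$fp = opendir(DOCUMENTLOCATION);
		while(false !== ($file = readdir($fp))) {
			if(is_file(DOCUMENTLOCATION.$file)) {
				$count++;
			}
		}
		closedir($fp);
		return $count + 1;
	}

Counting files like this is slow and not safe if two processes store documents at once, which is another reason a database backed version would be worth having later.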

THE INDEXER

The next step in building our search is to create an indexer. Essentially it takes in a document, breaks it apart and feeds it into the index, and possibly the document store depending on your implementation. A simple interface like the following should get us going,

	interface iindexer {
		public function index(array $documents);
	}

With that defined let's build a naive indexer. We are going to take in a list of documents, save them to disk, clean them up, split them by spaces, build a concordance (a list of words and how often they occur in the document) and then use this to feed into our index and our document store.

	public function index(array $documents) {
		if(!is_array($documents)) {
			return false;
		}
		foreach($documents as $document) {
			$id = $this->documentstore->storeDocument(array($document));
			$con = $this->_concordance($this->_cleanDocument($document));
			foreach($con as $word => $count) {
				$ind = $this->index->getDocuments($word);
				if(count($ind) == 0) {
					$this->index->storeDocuments($word,array(array($id,$count,0)));
				}
				else {
					$ind[] = array($id,$count,0); // keep the word count so it can be used for ranking later
					$this->index->storeDocuments($word,$ind);
				}
			}
		}
		return true;
	}

Pretty simple and it does exactly what we just mentioned. We take in an array of documents, loop over them, store them and create our concordance, then for each word add an entry to our index with the document id and the word count for that document.
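
The _cleanDocument and _concordance helpers aren't shown in the listing above; a minimal sketch of what they might look like, assuming they simply lowercase the text, strip everything but word characters and count the remaining words, is,

	public function _cleanDocument($document) {
		// lowercase and strip anything that is not a word character, collapsing the spaces left behind
		$clean = strtolower($document);
		$clean = preg_replace('/\W/i', ' ', $clean);
		$clean = preg_replace('/\s\s+/', ' ', $clean);
		return explode(' ', trim($clean));
	}

	public function _concordance(array $words) {
		// build an array of word => number of times it occurs in the document
		$concordance = array();
		foreach($words as $word) {
			if(!array_key_exists($word, $concordance)) {
				$concordance[$word] = 0;
			}
			$concordance[$word]++;
		}
		return $concordance;
	}

PHP's built in array_count_values would do the same job as the loop in _concordance, but the explicit version makes the intent clearer.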

INDEXING

Now that we have the ability to store and index some documents, let's throw some code together which actually does so. The first thing is to set everything up and then pass in some documents to index.

	set_time_limit(0);
	define('INDEXLOCATION',dirname(__FILE__).'/index/');
	define('DOCUMENTLOCATION',dirname(__FILE__).'/documents/');

	include_once('./classes/naieveindexer.class.php');
	include_once('./classes/singlefolderindex.class.php');
	include_once('./classes/singlefolderdocumentstore.class.php');

	$index = new singlefolderindex();
	$docstore = new singlefolderdocumentstore();
	$indexer = new naieveindexer($index,$docstore);

The first thing we do is set the time limit to unlimited as the indexing might take a long time. Then we define where the index and the documents are going to live to avoid errors being thrown further down the line. If you are following along you will need to allow whatever user PHP is running under to write to these directories. The next 3 lines include the relevant classes and the last 3 set up the index, document store and indexer.

With that done we can finally add some documents to be indexed.

	$indexer->index(array('Setting the AuthzUserAuthoritative directive explicitly to Off allows for user authorization to be passed on to lower level modules (as defined in the modules.c files) if there is no user matching the supplied userID.'));
	$indexer->index(array('The Allow directive affects which hosts can access an area of the server. Access can be controlled by hostname, IP address, IP address range, or by other characteristics of the client request captured in environment variables.'));
	$indexer->index(array('This directive allows access to the server to be restricted based on hostname, IP address, or environment variables. The arguments for the Deny directive are identical to the arguments for the Allow directive.'));
	$indexer->index(array('The Order directive, along with the Allow and Deny directives, controls a three-pass access control system. The first pass processes either all Allow or all Deny directives, as specified by the Order directive. The second pass parses the rest of the directives (Deny or Allow). The third pass applies to all requests which do not match either of the first two.'));
	$indexer->index(array('The AuthDBDUserPWQuery specifies an SQL query to look up a password for a specified user.  The users ID will be passed as a single string parameter when the SQL query is executed.  It may be referenced within the query statement using a %s format specifier.'));
	$indexer->index(array('The AuthDBDUserRealmQuery specifies an SQL query to look up a password for a specified user and realm. The users ID and the realm, in that order, will be passed as string parameters when the SQL query is executed.  They may be referenced within the query statement using %s format specifiers.'));

The above are just some Apache directives taken from this list http://searchcode.com/lists/list-of-apache-directives. If you have a look in the folders defined, those being documents and index, you can see the documents actually being stored. At this point we can pat ourselves on the back for a job well done or we can knuckle down and write the search class so we can actually see if it works. Let's go with the latter.

SEARCHING

Searching requires a relatively simple implementation. So simple in fact we only require a single method.

	interface isearch {
	  public function dosearch($searchterms);
	}

Of course the actual implementation can get pretty hairy. The reason for this is that a search is actually more complex than it first appears. Take for example the search term "cool stuff". When someone searches for "cool stuff" they are expecting a list of documents containing the words cool and stuff. What actually happens under the hood however is a lot more work. The search term firstly needs to be parsed into a query. In the case of a default AND engine you need to find all documents containing the word "cool", then all documents containing the word "stuff", get the intersection of those documents, rank them in order of relevance and return them to the user.
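
For reference, the intersection step of an AND engine could be done over the document ids with array_intersect; something like the following sketch (not what we build below, and assuming $index is an instance of our index class),

	// collect just the document ids for each term
	$coolIds = array();
	foreach($index->getDocuments('cool') as $doc) {
		$coolIds[] = $doc[0];
	}
	$stuffIds = array();
	foreach($index->getDocuments('stuff') as $doc) {
		$stuffIds[] = $doc[0];
	}

	// ids of documents that contain both "cool" and "stuff"
	$matching = array_intersect($coolIds, $stuffIds);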

Sticking with keeping it very simple, we are going to do the following. Clean up the search term (removing characters that have no meaning), break it into individual words separated by spaces and then for each term just return all the documents that match. No ranking or AND logic, which makes things much simpler to work with. So we will effectively end up with an OR logic search engine.

	public function dosearch($searchterms) {
		$doc = array();
		foreach($this->_cleanSearchTerms($searchterms) as $term) {
			$ind = $this->index->getDocuments($term);
			if($ind != null) {
				foreach($ind as $i) {
					$doc[] = $this->documentstore->getDocument($i[0]);
				}
			}
		}
		return $doc;
	}

	public function _cleanSearchTerms($searchterms) {
		$cleansearchterms = strtolower($searchterms);
		$cleansearchterms = preg_replace('/\W/i',' ',$cleansearchterms);
		$cleansearchterms = preg_replace('/\s\s+/', ' ', $cleansearchterms);
		return explode(' ',trim($cleansearchterms));
	}

The function _cleanSearchTerms takes in the terms, lowercases them, removes all non word characters using a regex and then collapses duplicate spaces. It then splits the result by a single space and returns the array. In dosearch we then just loop over each of the terms, query the index for the matching documents, fetch each matching document from the document store and finally return the list. It does mean we will get duplicate results where multiple terms appear in the same document however.
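
If the duplicates bother you, one small variation (not what the implementation above does) is to key the matches by document id before fetching, so each document is only returned once,

	public function dosearch($searchterms) {
		$ids = array();
		foreach($this->_cleanSearchTerms($searchterms) as $term) {
			foreach($this->index->getDocuments($term) as $i) {
				$ids[$i[0]] = $i[0]; // keying by document id drops duplicate matches
			}
		}
		$doc = array();
		foreach($ids as $id) {
			$doc[] = $this->documentstore->getDocument($id);
		}
		return $doc;
	}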

With the ability to search, let's plug it into a simple PHP page and actually display the results.

	define('INDEXLOCATION',dirname(__FILE__).'/index/');
	define('DOCUMENTLOCATION',dirname(__FILE__).'/documents/');

	include_once('./classes/naieveindexer.class.php');
	include_once('./classes/naievesearch.class.php');
	include_once('./classes/singlefolderindex.class.php');
	include_once('./classes/singlefolderdocumentstore.class.php');

	$index = new singlefolderindex();
	$docstore = new singlefolderdocumentstore();
	$indexer = new naieveindexer($index,$docstore);
	$search = new naievesearch($index,$docstore);

	echo '<ul>';
	foreach($search->dosearch($_GET['q']) as $result) {
		echo '<li>'.htmlspecialchars($result[0]).'</li>'; // escape in case the stored content contains HTML
	}
	echo '</ul>';

Nothing fancy here, just set everything up, then do a search through our new function and loop over the results. A sample search for “AuthzUserAuthoritative” gives back the correct result, whereas searching for “user” gives back the 3 expected results.

Hurray! At this point we have done it. We have a minimalist implementation of a search engine. I did some further tests and discovered that it actually holds together to about 50,000 documents. Searches are still reasonably quick at this level, however indexing a single document took minutes, which is hardly optimal. For the moment we can index 1,000 or so documents while we work on giving the search some additional functionality such as ranking.

RANKER

The ranking of search results is the core of a search engine's secret sauce. Nobody with the knowledge of how it works really discusses how Google and Bing rank things. *NB* ProCog has an open algorithm, see here http://procog.com/help/rank_details We know they use many signals such as page speed, keywords, links, title etc… but the details are still a mystery. Thankfully there are many published methods of ranking which we can explore. Any sort of search on this will turn up papers on things like BM25 and Vector Space models. Once again we are going to take the simplest approach and just use the number of times a word occurs in the document. So if you search for "cat", a document with two instances of the word "cat" will be ranked higher than one with a single instance. We are already storing this information in the index so it should be fairly simple to implement. Later we will look at some more complex ways of ranking.

Essentially the ranker can be very simple as it only needs to provide a function which we can plug into PHP’s usort function.

	interface iranker {
		public function rankDocuments($document,$document2);
	}

The implementation is pretty simple really,

	public function rankDocuments($document,$document2) {
		if(!is_array($document) || !is_array($document2)) {
			throw new Exception('Document(s) not array!');
		}
		if(count($document) != 3 || count($document2) != 3) {
			throw new Exception('Document not correct format!');
		}
		if(	$document[1] == $document2[1] ) {
			return 0;
		}
		if(	$document[1] <= $document2[1] ) {
			return 1;
		}
		return -1;
	}

All we do is validate the input and then, based on the word count (which we kept in the second slot of each document), return 1, -1 or 0. This is then fed into usort like the following,

	usort($results, array($this->ranker, 'rankDocuments'));

Which sorts $results in place. With that we can add it to our naievesearch implementation and see how it all works together. At line 21 of the naievesearch implementation we just insert the following,

	usort($ind, array($this->ranker, 'rankDocuments'));

We can now go to our search page and see how things look. A sample search for the word "order" returns two documents, with the first containing two instances of the word order and the second only one. Trying again with the word "the" shows the same behaviour, with the first result containing the word "the" 7 times compared to the last result with 4 instances.
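
As a standalone check of which way the comparator sorts (hypothetical document arrays, not pulled from the index, and assuming the ranker implementation above lives in a class called naieveranker),

	$ranker = new naieveranker();
	$results = array(array(1, 2, 0), array(2, 7, 0), array(3, 4, 0));
	usort($results, array($ranker, 'rankDocuments'));
	// $results is now ordered by word count descending: document ids 2, 3 then 1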

INITIAL RESULTS

The nice thing at this point is we actually have a fully functioning search index written in pure PHP. In theory this should be good to integrate with any small website. In its present form the search should hold up to fewer than 10,000 documents without too many issues. My thinking is that the read/write time of each of the index shards during the indexing process is the main bottleneck in its present form. If you index as you add things, i.e. don't do it in batch, this indexer might serve to a much larger size. Other problems are that the ranking algorithm is fairly crude and that duplicate results are returned where two keywords match the same document. So while you could use this on a live site, you wouldn't want to.

I did a quick test of this implementation using 5,000 posts from various RSS feeds to see how things fared. While searches remained reasonably quick, the index time became slower and slower as it progressed. Running it through a profiler showed my previous assumption to be correct: a lot of time is spent reading and writing shards. This makes the implementation totally impractical for indexing more than say 10,000 documents; it's just too slow.

Of course this isn’t what we set out to do. The goal was to index one million documents and search on them.

Over the next few steps we will profile the application, work out where the performance issues lie and then work on improving our implementation so that we can index and serve queries on a million documents. To start we will run things through a profiler and see what can be improved.

You can download a copy of the application as it currently stands here.

PHPSearch_Implementation_1.zip

This is the end of Part 1. In Part 2 we are going to see if we can improve some of the main issues and index up to 100,000 documents.

*NB* I stopped developing the unit tests at this point mostly due to time pressures caused by family life. I still wanted to release the whole article though and something had to give.