Installing Phindex

This is a follow-on piece to my five-part series about writing a search engine from scratch in PHP, which you can read at http://www.boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/

I get a lot of email requests asking how to set up Phindex on a new machine and start indexing the web. Since the articles and code were aimed at someone with a reasonable knowledge of PHP, this is somewhat understandable. What follows is how to set things up and start crawling and indexing from scratch.

The first thing to do is set up some way of running PHP and serving pages. The easiest way to do this is to install Apache and PHP. If you are doing this on Windows or OSX then go and install XAMPP from https://www.apachefriends.org/index.html and for Linux follow whatever guide applies to your distribution. Be sure to follow the directions correctly and verify that you can create a file with phpinfo(); inside it which runs in your browser correctly.

For this I am using Ubuntu Linux and all folder paths will reflect this.

With this set up, the next thing to do is create a folder where we can place all of the code we are going to work with. I have created a folder called phindex and ensured that I can edit and write files inside it.

Inside this folder we need to unpack the code from GitHub https://github.com/boyter/Phindex/archive/master.zip

boyter@ubuntu:/var/www/phindex$ unzip master.zip
Archive:  master.zip
2824d5fa3e9c04db4a3700e60e8d90c477e2c8c8
   creating: Phindex-master/
.......
  inflating: Phindex-master/tests/singlefolderindex_test.php
boyter@ubuntu:/var/www/phindex$

At this point everything should be running, however as nothing is indexed you won't get any results if you browse to the search page. To resolve this without running the crawler, download the following https://www.dropbox.com/s/vf4uif4yfj8junf/documents.tar.gz?dl=0 and unpack it into the crawler directory.

boyter@ubuntu:/var/www/phindex/Phindex-master/crawler$ tar zxvf documents10000.tar.gz
......
boyter@ubuntu:/var/www/phindex/Phindex-master/crawler$ ls
crawler.php  documents  documents10000.tar.gz  parse_quantcast.php
boyter@ubuntu:/var/www/phindex/Phindex-master/crawler$

The next step is to create two folders. The first is called “documents” and the second “index”. These are where the processed documents and the index itself will be stored. Once these are created we can run the indexer. The folders need to be created in the root folder like so,
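boyter@ubuntu:/var/www/phindex/Phindex-master$ mkdir documents index
boyter@ubuntu:/var/www/phindex/Phindex-master$

After which the root folder should look like this,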

boyter@ubuntu:/var/www/phindex/Phindex-master$ ls
add.php  crawler    index       README.md   tests
classes  documents  interfaces  search.php
boyter@ubuntu:/var/www/phindex/Phindex-master$

With that done, let's run the indexer. If you cannot run PHP from the command line, just browse to add.php using your browser and the index will be built.

boyter@ubuntu:/var/www/phindex/Phindex-master/$ php add.php
INDEXING 1
INDEXING 2
.....
INDEXING 10717
INDEXING 10718
Starting Index
boyter@ubuntu:/var/www/phindex/Phindex-master/$

This step is going to take a while depending on how fast your computer is. What's happening is that each of the crawled documents is processed, saved to the document store, and then finally each of the documents is indexed.
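To give a feel for what is happening under the hood, here is a minimal sketch of that pipeline in Python (the folder names follow this guide, everything else is illustrative; the real logic lives in add.php and the Phindex classes),

import os
import pickle

index = {}  # word -> list of document ids, a naive inverted index
crawled = "crawler/documents"
for doc_id, name in enumerate(sorted(os.listdir(crawled)), start=1):
    with open(os.path.join(crawled, name)) as f:
        contents = f.read()
    # Save the processed document into the document store.
    with open(os.path.join("documents", str(doc_id)), "w") as f:
        f.write(contents)
    # Then index it, word by word.
    for word in set(contents.lower().split()):
        index.setdefault(word, []).append(doc_id)
    print("INDEXING", doc_id)

# Finally persist the index so search can use it later.
with open(os.path.join("index", "index.pickle"), "wb") as f:
    pickle.dump(index, f)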

At this point everything is good. You should be able to perform a search by browsing to search.php like so,

Phindex Screenshot

Now that everything is working, I would suggest you start looking at the code under the hood to see how it all fits together. Start with add.php, which gives a reasonable idea of how to process the crawled documents and how to index them. Then look at search.php to get an idea of how to use the created index. I will be expanding on this guide over time based on feedback, but there should be enough here for you to get started.

More interview snippets…

Since I wrote the code for these snippets I thought I may as well add them here in case I ever need them again or want to review them. As with the other interview ones, they are answers to questions I was asked, slightly modified to protect the innocent. These ones are written in Python.

Q. Write a function to reverse each word in a string.

def reverse_each_word(words):
    '''
    Reverse each word in a string 
    '''
    return " ".join([x[::-1] for x in words.split(' ')])

The only thing of note in here is x[::-1], which is extended slice syntax that reverses a string. You could also use "".join(reversed(x)), although I believe at the time of writing it is MUCH slower.
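If you want to check the speed claim yourself, timeit makes this easy; a quick sketch (the numbers will vary by machine and Python version),

import timeit

setup = "s = 'hello world' * 100"
print(timeit.timeit("s[::-1]", setup=setup, number=100000))
print(timeit.timeit("''.join(reversed(s))", setup=setup, number=100000))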

Q. Given two arrays, find which elements of the first are not in the second.

def find_not_in_second(first, second): 
    '''
    Find which numbers are not in the
    second array
    '''
    return [x for x in first if x not in second]

I am especially proud of the second snippet, as it is very easy to read and rather Pythonic. It takes two lists such as [1,2,3] and [2,3,6] and returns a new list containing the elements of the first that are missing from the second, in this case [1].
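One caveat is that the in check against a list makes it O(N*M). If the inputs were large, a set would keep it just as readable; a small variant (not part of the original answer),

def find_not_in_second_fast(first, second):
    '''
    Same result, but membership tests against a set
    are O(1), making the whole thing roughly O(N + M)
    '''
    second_set = set(second)
    return [x for x in first if x not in second_set]

print(find_not_in_second_fast([1, 2, 3], [2, 3, 6]))  # [1]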

Another day another interview…

Another day, another interview. I have actually been getting some good results from them so far, in particular the last two I have been on, which I will discuss briefly.

The first had an interesting coding test. Rather than asking me to solve FizzBuzz or implement a depth-first traversal of a binary tree (seriously, I have been programming for 10 years and have never needed to do that; I can, but it's something I did at uni and it is not really applicable to anything I have done since), it was to implement a simple REST service.

You created your service, hosted it online (Heroku was suggested as it's free), passed the URL into a form, and submitted; the form then hit your service checking for correct error codes and correct responses to its input. Since you could implement it in any language you wanted, I went with Python/Django and produced the following code.

def parse_json(self, data):
    # Filter the raw payload, then shape what is left for the response.
    filtered = self.filter_drm(data['payload'])
    filtered = self.filter_episode_count(filtered)

    return self.format_return(filtered)

def filter_drm(self, data):
    # Keep only the entries that are DRM protected.
    if data is None or data == []:
        return []

    return [x for x in data if 'drm' in x and x['drm'] == True]

def filter_episode_count(self, data, count=0):
    # Keep only the entries with more than `count` episodes.
    if data is None or data == []:
        return []

    return [x for x in data if 'episodeCount' in x and x['episodeCount'] > count]

def format_return(self, data):
    # Return just the image, slug and title fields, wrapped for the response.
    if data is None or data == []:
        return {"response": []}

    result = [{"image": x['image']['showImage'],
               "slug": x['slug'],
               "title": x['title']} for x in data
              if 'image' in x and 'slug' in x and 'title' in x]
    return {"response": result}

Essentially it's the code from the model I created. It takes in some JSON data, filters it by the DRM field and the episode count, and then returns a subset of the data. The corresponding view is very simple, with just some JSON parsing (with error checks) before calling the above code. I did throw in quite a few unit tests though to ensure it was all working correctly.
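To give an idea of the shape of the data, here is a minimal sketch of exercising those methods. The sample payload is made up (the field names match the code above, the values are illustrative) and the class is just a stand-in for the Django model, assuming the four functions are defined at module level as shown,

class Shows(object):
    parse_json = parse_json
    filter_drm = filter_drm
    filter_episode_count = filter_episode_count
    format_return = format_return

payload = {"payload": [
    {"drm": True, "episodeCount": 3, "slug": "show/one", "title": "Show One",
     "image": {"showImage": "http://example.com/one.jpg"}},
    # Dropped by filter_drm: not DRM protected.
    {"drm": False, "episodeCount": 4, "slug": "show/two", "title": "Show Two",
     "image": {"showImage": "http://example.com/two.jpg"}},
]}

print(Shows().parse_json(payload))
# {'response': [{'image': 'http://example.com/one.jpg',
#                'slug': 'show/one', 'title': 'Show One'}]}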

Thankfully, after writing the logic and some basic testing (using curl to fake requests) it all looked OK to me. I uploaded it to Heroku (I had never used it before, and that took most of the time) and submitted the form. First go, everything worked correctly, passing all of the requirements listed, which made me rather happy.

As for the second interview, it raised a good question which highlighted the fact that while I know how to write a closure and a lambda, I cannot actually say what they are. It also highlighted that I really need to get better at Javascript, since while I am pretty comfortable with it on the front end, for back-end processes such as node.js I am an absolute novice.

For the first, I was right about a lambda, which is just an anonymous function. As for the second part, a closure is a function which closes over the environment, allowing it to access variables that are not in its parameter list. An example would be,

def function1(h):
    def function2():
        return h          # h comes from the enclosing scope
    return function2      # return the closure itself, not its result

In the above, function2 closes over function1's environment, allowing it to access the variables defined there, such as h, even after function1 has returned.
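To make that concrete, a quick usage sketch,

closure = function1("hello")
print(closure())  # prints "hello"; h is still reachable through the closure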

The other thing that threw me was implementing a SQL-like join in a nice way. The thing is, I have been spoilt by C#, which makes this very simple using LINQ: you literally join the two lists the same way SQL would and it just works. Not only that, the implementation is really easy to read.

I came up with the following, which is ugly for two reasons:

1. it's not very functional
2. it has very bad O(N^2) runtime performance.

var csv1 = [
    {'name': 'one'},
    {'name': 'two'}
];

var csv2 = [
    {'name': 'one', 'address': '123 test street'},
    {'name': 'one', 'address': '456 other road'},
    {'name': 'two', 'address': '987 fake street'},
];

// A naive nested-loop join on the name field; $.each here is jQuery's.
function joinem(csv1, csv2) {
    var ret = [];
    $.each(csv1, function(index, value) {
        $.each(csv2, function(index2, value2) {
            if(value.name == value2.name) {
                ret.push(value2); // keep every csv2 row whose name matches
            }
        });
    });

    return ret;
}

var res1 = joinem(csv1, csv2);

Assuming I get some more time later, I want to come back to this. I am certain there is a nice way to do this in Javascript using underscore.js or something similar which is just as expressive as the LINQ version.
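For what it is worth, the usual fix for the quadratic runtime is to index one side by the join key first and then do a single pass over the other. Here is a rough sketch in Python (to match the earlier snippets; the names are mine, not from any library),

from collections import defaultdict

def join_on_name(left, right):
    # Index the right-hand rows by the join key once: O(M).
    by_name = defaultdict(list)
    for row in right:
        by_name[row["name"]].append(row)
    # Then a single pass over the left-hand rows: O(N).
    return [match for row in left for match in by_name[row["name"]]]

# Produces the same rows as joinem above, in O(N + M) rather than O(N^2).
print(join_on_name(
    [{"name": "one"}, {"name": "two"}],
    [{"name": "one", "address": "123 test street"},
     {"name": "one", "address": "456 other road"},
     {"name": "two", "address": "987 fake street"}]))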