Vector Space Search Model Explained

A mate of mine was looking at a previous article I wrote about Decoding CAPTCHA’s where I pointed people to the following article (PDF)

He was having some difficulty understanding it so I thought I would write up a very simple explanation of what’s actually happening in the vector space.

The vector space isn’t actually that complicated, but getting your head around how it works takes a few steps. Let’s imagine for the moment that we are going to search over a collection of documents which only contain one word “boffin”. Each document has a different count of our word but only contain those words. We can represent this visually using a simple graph.

              1 2 3  4

The above is trying to show along the X axis the count of our word “boffin” that appears in the document whose id appears above the line. You can see that document 1 has 1 “boffin” words in it and document 4 has 8. You can also see that document 1 is closer to document 2 then it is to document 4. Let’s say we decide to search for boffin twice IE our search term is “boffin boffin”.

               +------+ "boffin boffin"
              1X2 3  4

What you can see quite clearly is that the closest document to our search (labelled in X) is document 1 and 2 followed by document 3 and 4. Let’s try searching for “boffin boffin boffin boffin”

                 +------+ "boffin boffin boffin boffin"
              1 2X3  4

You can now see that documents 2 and 3 are equally close to our search document, followed by document 1 and document 4.

So far so good. We can work out how close any two “boffin” documents are and thus work out how relevent they are to a search. Let’s add another word to our index. Our new documents are going to contain the words “boffin” and “head”

           8|     3                                                      
           4|        4                                                   
           2| 1                                                          
           0|   2                                                        

Along the X axis we still have the count for the word boffin. However we also have a Y axis which represents the count of the term “head”. Thus document 1 has 1 “boffin” word in it and 2 “head” words. You can also see that document 1 is closer to document 2 then document 3. Those of you who remember high school should remember this fellow.

Bust of Pythagoras

Yes Pythagoras, a philosopher and mathematician came up with a theorem (although I suspect it should be a law now) that you could use to work out the length of a triangle’s side assuming the triangle is right angled and you know the length of the other two sides. We can do that now to work out the distance between document 1 and every other document in our collection.

Document 1 Distances

1 to 2 = 2.83
1 to 3 = 7.28
1 to 4 = 7.21

Assuming now that we had a search which had the terms “boffin head head” we can work out that the next most relevant documents would be 1 (exact match) followed by 2, 4 and 3, simply by calculating out the distances and ordering by them.

So far so good. Now to the next step, adding a third dimension which would represent a third word. It follows the same basic idea as before, just we need to do a few extra calculations’ to work out which documents are closest to each other. I am not going to do this here but im sure you can see how its a natural progression of the idea.

Now the “magic” of the vector space is realising that if you can work out the distance between two documents when cast in 1d (first example) 2d (second example) and 3d space, you can continue to do so in 4d, 5d, 6d etc… You can’t represent it in the physical world, but in the mathematical world it works perfectly. This is the trick behind the vector space model and how it calculates how close any documents are.

Additional reading over the topic can be found here,

Grep Match a Tab

Ever wanted to match a tab while using grep for some reason? The trick (under bash anyway) is to Ctrl+V and then press the tab key so you get whatever you are looking for.

$ cat file_to_grep.txt | grep "^log    "

I was trying to match a file for the exact match of log and then a tab. Without the tab I ended up getting back a bunch of junk results like “logger” “logging” “login” etc…

The above gives me what I want although I suspect there is a better way to do this. I did look into using [[:space:]] but it matches spaces as well which ended up in my case not being accurate enough.

Clojure 1.3 Now Avaliable

Added Clojure 1.3 documentation today. The Clojure language itself is pretty tight, but its nice to have something to browse through when looking for some elusive method.

You can view it by searching for clojure

One thing I have discovered while doing this is that if I have a slight amount of familiarity with the language it really makes parsing it easier. This stands to reason as I can pick up errors more easily, but somewhat surprising. Another thought is that I think there should be a standard for programming language documentation. Every language has its own which means each one needs a specialised parser.

The bad thing about about standards? There are so many.

Link Love

With the fall from grace of the TWiT podcast (less Dvorak and no Calacanis makes for boring shows) I went looking for new podcast’s to keep me entertained over the last couple of months. Here are a few that I highly recommend.

Tech Podcast TechZing Live

The boys from tech-zing are full of energy, always come up with new stuff and usually manage to do one thing technical each show that makes me want to scream with frustration. All in all good stuff. They also interviewed both Calacanis and Dvorak which were probably 2 of the better shows they did.


Leading on from the fact that Dvorak and Calacanis make for a good show X3 is quite the goods. Its basically Cranky Geeks reborn.

This Week In Start-ups

The whole “you ripped me off” thing aside TWIST is worth listening too. Not as tech heavy as I normally prefer in a podcast but quite good.

That’s all I have on my plate for the moment due to time constraints and that since I bought a Kindle I have spent more of my time reading these days.