Building a Search Result Extract Generator in PHP

During some recent contracting work there was a requirement to implement some search logic using only PHP. There are no issues with that, but it turned out I couldn’t find a decent extract generator to hand, as usually one would just plug into the version provided by the search engine.

Off the top of my head I could only think of one example, which lives in Sphider (for the record it lives in searchfuncs.php from line 529 to 566). Sadly it has a few issues. Firstly the code is rather difficult to understand, and more importantly it usually has accuracy issues. A quick search turned up this link http://stackoverflow.com/questions/1436582/how-to-generate-excerpt-with-most-searched-words-in-php on Stack Overflow. The second answer looked promising, but it’s even more difficult to understand and a bit of profiling showed some performance issues with all of the regex going on in there.

Since I couldn’t find a solution I was happy with, I naturally decided to write my own. The nice thing about reinventing the wheel is you can get a round one. The algorithm is fairly simple:

1. Identify all the matching word locations.
2. Work out a section of text that best matches the terms.
3. Based on the snip location, trim the string around it, ensuring we don’t skip whole words and don’t remove the first or last word if that’s the actual match.
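
To make the three steps concrete, here is a minimal sketch of how they could fit together. It is not the code from my repository: the function names (find_term_locations, pick_window_start, extract_snippet), the default length of 100 and the "most matches inside the window" scoring are all assumptions made for illustration. It works on bytes and only breaks on spaces, so treat it as a starting point rather than a finished implementation.

```php
<?php
// Hypothetical helper names throughout; this is an illustrative sketch only.

// Step 1: collect the byte offset of every case-insensitive occurrence of every term.
function find_term_locations(array $terms, string $text): array
{
    $locations = [];
    foreach ($terms as $term) {
        if ($term === '') {
            continue;
        }
        $offset = 0;
        while (($pos = stripos($text, $term, $offset)) !== false) {
            $locations[] = $pos;
            $offset = $pos + strlen($term);
        }
    }
    $locations = array_unique($locations);
    sort($locations);
    return $locations;
}

// Step 2: pick the match whose following $length bytes contain the most matches.
// This scoring rule is an assumption; ties go to the earliest match.
function pick_window_start(array $locations, int $length): int
{
    $bestStart = 0;
    $bestCount = 0;
    foreach ($locations as $start) {
        $count = 0;
        foreach ($locations as $loc) {
            if ($loc >= $start && $loc < $start + $length) {
                $count++;
            }
        }
        if ($count > $bestCount) {
            $bestCount = $count;
            $bestStart = $start;
        }
    }
    return $bestStart;
}

// Step 3: trim around the chosen location without cutting words in half.
function extract_snippet(string $text, array $terms, int $length = 100, string $marker = '...'): string
{
    $textLength = strlen($text);
    if ($textLength <= $length) {
        return $text;
    }
    $start = pick_window_start(find_term_locations($terms, $text), $length);

    // Move the start back to the beginning of the word it falls in.
    $space = strrpos(substr($text, 0, $start), ' ');
    $start = ($space === false) ? 0 : $space + 1;

    // Push the end forward to the next space so the last word stays whole.
    // The snippet can therefore run slightly longer than $length.
    $end = min($textLength, $start + $length);
    if ($end < $textLength) {
        $space = strpos($text, ' ', $end);
        $end = ($space === false) ? $textLength : $space;
    }

    $snippet = substr($text, $start, $end - $start);
    return ($start > 0 ? $marker : '') . $snippet . ($end < $textLength ? $marker : '');
}

// Usage: split the query on spaces and ask for roughly 200 bytes of context.
$text = 'CloudSponge provides an interface to import contacts from Yahoo, Gmail and Outlook.';
echo extract_snippet($text, explode(' ', 'yahoo and outlook'), 200), "\n";
```

Extending the end to the next space rather than cutting back is a deliberate choice in this sketch: it keeps a matching last word intact at the cost of a snippet that is a few bytes over the requested length.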

Sounds good in theory, but let’s see the results.

Sample Text

“Welcome to Yahoo!, the world’s most visited home page. Quickly find what you’re searching for, get in touch with friends and stay in-the-know with the latest news and information. CloudSponge provides an interface to easily enable your users to import contacts from a variety of the most popular webmail services including Yahoo, Gmail and Hotmail/MSN as well as popular desktop address books such as Mac Address Book and Outlook.”

Search Term “yahoo and outlook”

Sphider Snippet
“get in touch with friends and stay in-the-know with the latest news and information. CloudSponge provides an interface to easily enable your users to import contacts from a variety of the most popular webmail services including Yahoo, Gmail and Hotmail/MSN as well as popular desktop address books such as Mac Address Book and”

Stackoverflow Snippet
“Welcome to Yahoo!, the world’s most visited home page. Quickly find what you’re searching for, get in touch with friends and stay in-the-know with the latest news and information. CloudSponge provides an interface to easily enable your users to import contacts from a variety of the most…”

My Snippet
“..an interface to easily enable your users to import contacts from a variety of the most popular webmail services including Yahoo, Gmail and Hotmail/MSN as well as popular desktop address books such as Mac Address Book and Outlook.”

I consider the results to be equally good in the worst case and better in most cases I tried. I also tried each over much larger portions of text, and both the Sphider and Stack Overflow versions seemed either to produce nothing relevant or to miss what I thought was the best match.

As always the code is on GitHub.

BATF – Big Arse Text File

Ever needed the ability to track bugs and features without using a full featured bug/feature tracker? What about storing all your random notes such as server details, blog ideas, books to read, URLs etc, without using a full featured CMS or the like? Want to have everything searchable and in the most platform independent format possible?

Enter the BATF. I have always been a fan of the big arse text file (BATF) for keeping track of the above. The catch being I wanted it centralised so I could get at it from any machine I was on (assuming internet access). I also wanted it to provide a simplistic version system. Tags would be useful too.

You can of course do this using SVN, Git, or any other versioning system. The problem with that is that it brought me close to what I didn’t want to do (set up lots of stuff). So after a few beers I decided that what I really wanted was a BATF that had versioning (simple versioning anyway) built in, was web based so I could access it anywhere, and was lightweight. Since my 5 minute web search didn’t turn up anything that could do this I thought I would create one.


Behold the online BATF. Everything you add or modify is viewable in a nice timeline of versions. Explore your thought process as you add/modify things. Have something important you want to preserve for some amount of time? Tag it and it will always be accessible. You can also explore changes through a simplistic diff viewer that diffs against the current version.

Since I have been using it for a while and found it useful I thought I would give back to the community which provided the language (PHP), database (MySQL), JavaScript framework (jQuery + plugins) and icons (famfamfam icons) by releasing this as free software. You can get a copy at GitHub https://github.com/boyter/BATF. The install instructions are included (pretty simple really). Feel free to fork it and send back patches if you find any bugs etc…