Python Snippet

Below is a quick Python snippet which I use on a day-to-day basis for weeks at a time, then promptly forget. Essentially it reads from standard input and then does something with each line. Very useful when you are trying to process data on the command line, have forgotten how to use awk/sed properly, and grep has run out of steam.

import sys

# usage: cat data.csv | python snippet.py
for line in sys.stdin:
  values = line.rstrip('\n').split(',')
  print("%s\t%s" % (values[0], values[1]))

The above just takes standard input, splits each line on commas and prints the first two fields with a tab between them. A useless example, but it shows the concept quite well.
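
Since the whole point is stepping in where grep runs out of steam, I usually end up pairing the loop with a regular expression. Here is a sketch of that variant; the pattern is just a made-up example, so swap in whatever you are actually matching on:

import re
import sys

# Filter lines with a regex before splitting, for when you need
# grep-style matching and field extraction at the same time.
pattern = re.compile(r'^ERROR,')
for line in sys.stdin:
  if pattern.search(line):
    values = line.rstrip('\n').split(',')
    print("%s\t%s" % (values[0], values[1]))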

How did Cuil get $33 million in funding?

Well, it appears that Cuil, the troubled search engine that didn't, is dead.

While I am not that surprised by this considering its lackluster results, I do feel that this is bad for the web in general. With the Yahoo/Bing deal we now have very few independent indexes that power search on the web. It would seem the big players are down to the following:

Google
Bing
Ask
Gigablast
Blekko

What did surprise me about this though was the question people were asking about how the founders of Cuil ever managed to secure $33 million in funding. The answer is actually pretty simple.

1. The founders Anna Patterson and Tom Costello (not sure about the 3rd founder) are well known in the world of search and have a proven track record of creating the large indexes which are the basis of a search engine. Anna held the record for creating the world's largest index at one stage.

2. They probably had a tech demo which showed how they could scale this out from one machine to many. You can't actually see how well it's going to work on the web until you index most of it, but the initial start looked good. The time between stealth mode and launch was spent on the scale-out and crawling. This is just my gut feeling, but it would be similar to the way Google secured its initial $100,000 worth of funding.

3. The potential payoffs are huge. Jason Calacanis was quoted as saying that each 1% of the search market is worth $1 billion a year. If Cuil had been able to deliver and grab just 5% of the market, then that's a huge return on investment, which is exactly what VCs are looking for.

Hats off to them for the achievements they made and for at least trying to take on the big boys of search. While new engines like DuckDuckGo are attempting the same, Gabriel Weinberg uses the existing indexes of Bing and Google rather than maintaining a huge custom one. I can only hope at this point that Blekko delivers the goods (and supposedly it's getting close!) and that the other players like Gigablast continue to innovate.

PHP Entity Generator

A while ago I was using the Django Framework and was a big fan of the parts of it which save time. One part which I both loved and hated was the ORM. The bit I loved was creating new database entities, loading them and modifying them. The ability to just load up an object, modify it and then call its save method saved me a lot of time. What I hated, however, was using it for any form of complex query (since I am very comfortable with SQL) and working backwards by designing the model and then the database.

Get the source at PHP Entity Generator on Google Code

A while ago I needed to do some coding in PHP and missed the ORM part of Django so much that I created a simple script to generate entities based on the tables in a database. In order to make things a bit simpler I added the following requirements:

  1. Would only support the following types: int, varchar, text, datetime
  2. Would generate entities off the database and not the reverse
  3. Would require an auto-increment id used as the primary key
  4. Would try to be as type-safe as possible
  5. Would generate correct injection-safe SQL
  6. Would generate code that could be extended if, say, I needed a method to get by some other field
  7. Would require that a PDO database object be passed in for database operations

With those requirements I got to coding, and after a while I had something that outputs the following example:

<?php
//////////////////////////////////////////////////////////////
// This class generated by a tool on 2010-09-15 at 08:58:58 //
//////////////////////////////////////////////////////////////
class article {
    private $_id = null;
    private $_rssid = null;
    private $_title = null;
    private $_link = null;
    private $_date = null;
    private $_description = null;
    private $_author = null;
    private $_publisheddatetime = null;
    private $_image = null;

    function getid() {
        return $this->_id;
    }
    function getrssid() {
        return (int)$this->_rssid;
    }
    function gettitle() {
        return $this->_title;
    }
    function getlink() {
        return $this->_link;
    }
    function getdate() {
        return $this->_date;
    }
    function getdescription() {
        return $this->_description;
    }
    function getauthor() {
        return $this->_author;
    }
    function getpublisheddatetime() {
        return strtotime($this->_publisheddatetime);
    }
    function getimage() {
        return $this->_image;
    }

    function setrssid($newvalue) {
        if(is_null($newvalue)) {
            throw new Exception("Value cannot be null - rssid");
        }
        if(gettype($newvalue) != "integer") {
            throw new Exception("Value not integer - ".$newvalue);
        }
        if(strlen($newvalue) > 10) {
            throw new Exception("Value size larger than 10 - ".$newvalue);
        }
        $this->_rssid = (int)$newvalue;
    }
    function settitle($newvalue) {
        if(is_null($newvalue)) {
            throw new Exception("Value cannot be null - title");
        }
        if(gettype($newvalue) != "string") {
            throw new Exception("Value not string - ".$newvalue);
        }
        if(strlen($newvalue) > 1000) {
            throw new Exception("Value size larger than 1000 - ".$newvalue);
        }
        $this->_title = $newvalue;
    }
    function setlink($newvalue) {
        if(is_null($newvalue)) {
            throw new Exception("Value cannot be null - link");
        }
        if(gettype($newvalue) != "string") {
            throw new Exception("Value not string - ".$newvalue);
        }
        if(strlen($newvalue) > 500) {
            throw new Exception("Value size larger than 500 - ".$newvalue);
        }
        $this->_link = $newvalue;
    }
    function setdate($newvalue) {
        if(is_null($newvalue)) {
            throw new Exception("Value cannot be null - date");
        }
        if(gettype($newvalue) != "string") {
            throw new Exception("Value not string - ".$newvalue);
        }
        if(strlen($newvalue) > 100) {
            throw new Exception("Value size larger than 100 - ".$newvalue);
        }
        $this->_date = $newvalue;
    }
    function setdescription($newvalue) {
        if(is_null($newvalue)) {
            throw new Exception("Value cannot be null - description");
        }
        if(gettype($newvalue) != "string") {
            throw new Exception("Value not string - ".$newvalue);
        }
        $this->_description = $newvalue;
    }
    function setauthor($newvalue) {
        if(is_null($newvalue)) {
            throw new Exception("Value cannot be null - author");
        }
        if(gettype($newvalue) != "string") {
            throw new Exception("Value not string - ".$newvalue);
        }
        if(strlen($newvalue) > 100) {
            throw new Exception("Value size larger than 100 - ".$newvalue);
        }
        $this->_author = $newvalue;
    }
    function setpublisheddatetime($newvalue) {
        if(is_null($newvalue)) {
            throw new Exception("Value cannot be null - publisheddatetime");
        }
        if(gettype($newvalue) != "integer") {
            throw new Exception("Value not integer - ".$newvalue);
        }
        $this->_publisheddatetime = date("Y-m-d H:i:s",(int)$newvalue);
    }
    function setimage($newvalue) {
        if(is_null($newvalue)) {
            throw new Exception("Value cannot be null - image");
        }
        if(gettype($newvalue) != "string") {
            throw new Exception("Value not string - ".$newvalue);
        }
        if(strlen($newvalue) > 500) {
            throw new Exception("Value size larger than 500 - ".$newvalue);
        }
        $this->_image = $newvalue;
    }

    function save($db) {
        if(is_null($this->_id)) {
            $query = $db->prepare("INSERT INTO `article` VALUES (NULL,?,?,?,?,?,?,?,?);");
            $query->execute(array($this->_rssid,$this->_title,$this->_link,$this->_date,$this->_description,$this->_author,$this->_publisheddatetime,$this->_image));
            $this->_id = $db->lastInsertId();
        }
        else {
            $query = $db->prepare("UPDATE `article` SET `rssid`=?,`title`=?,`link`=?,`date`=?,`description`=?,`author`=?,`publisheddatetime`=?,`image`=? WHERE `id`=? LIMIT 1;");
            $query->execute(array($this->_rssid,$this->_title,$this->_link,$this->_date,$this->_description,$this->_author,$this->_publisheddatetime,$this->_image,$this->_id));
        }
    }

    function delete($db) {
        if(is_null($this->_id)) {
            return;
        }
        else {
            $query = $db->prepare("DELETE FROM `article` WHERE `id`=? LIMIT 1;");
            $query->execute(array($this->_id));
        }
    }

    function get($db,$id) {
        if(gettype($id) != "integer") {
            throw new Exception("Value not integer");
        }
        else {
            $query = $db->prepare("SELECT `id`,`rssid`,`title`,`link`,`date`,`description`,`author`,`publisheddatetime`,`image` FROM `article` WHERE `id`=? LIMIT 1;");
            $query->execute(array($id));
            foreach ($query->fetchAll() as $row_id => $row_data) {
                $this->_id = $row_data["id"];
                $this->_rssid = $row_data["rssid"];
                $this->_title = $row_data["title"];
                $this->_link = $row_data["link"];
                $this->_date = $row_data["date"];
                $this->_description = $row_data["description"];
                $this->_author = $row_data["author"];
                $this->_publisheddatetime = $row_data["publisheddatetime"];
                $this->_image = $row_data["image"];
            }
        }
    }
}
?>

Not too shabby I think! I have added the source to Google Code under the new BSD licence, which should make it open enough for anyone to do whatever they want with it. It is missing unit tests, any form of objects and even comments, but it is quite usable. You can find the PHP Entity project in the link hosted at Google Code.
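
For anyone curious about the approach rather than the tool itself, the generator boils down to reading each table's column metadata and printing out a class. Purely as an illustration of that idea (the real tool is PHP and lives at the link above), here is a minimal sketch in Python, where the column metadata is a hypothetical stand-in for what you would pull from DESCRIBE or information_schema:

# Hypothetical column metadata: (name, type, max size) per column.
columns = [("rssid", "int", 10), ("title", "varchar", 1000)]

def generate_entity(table, columns):
  # Emit a PHP class skeleton: one private field plus a getter
  # per column, in the same shape as the output shown above.
  lines = ["class %s {" % table]
  for name, ctype, size in columns:
    lines.append("    private $_%s = null;" % name)
  for name, ctype, size in columns:
    lines.append("    function get%s() { return $this->_%s; }" % (name, name))
  lines.append("}")
  return "\n".join(lines)

print(generate_entity("article", columns))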

Google’s “Colossus”

So Google has called their new indexing system Caffeine, which is powered by Google's BigTable or, as they call it internally, "Colossus". I guess now all we need is for Microsoft to announce that their Bing back-end is called "Guardian" and the world is over.

Actually, looking at all of the information available shows that while everyone was chasing MapReduce, Google was implementing a distributed database where each update triggers an update to the index. I am certain that this wouldn't be as efficient as running a MapReduce index build over the whole cluster, but it would allow for real-time updates.
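
To make the difference concrete, here is a toy sketch (nothing like Colossus internally, just the shape of the idea) of an inverted index that is updated one document at a time rather than rebuilt in bulk:

import collections

# Toy inverted index updated per document, so new pages become
# searchable immediately instead of waiting for a batch rebuild.
index = collections.defaultdict(set)

def add_document(doc_id, text):
  # Naive whitespace tokenisation; real systems do far more here.
  for term in text.lower().split():
    index[term].add(doc_id)

add_document(1, "distributed database triggers update the index")
add_document(2, "mapreduce rebuilds the index in batches")
print(sorted(index["index"]))  # prints [1, 2]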

Interestingly this is what I implemented in my own Search Service (which is still in testing mode), which has a delay of only a minute or two between something being crawled and it appearing in the index. I suspect that Google noticed that this was the best way of implementing things on one machine and just figured out how to replicate it across thousands.

This also sounds somewhat similar to how Gigablast works, although Matt Wells' Rants Page never goes into specific details.

First Failure at Selling an Application Online

So I guess now is about the time that I write about my first failure. Although I realised the project was a failure quite a while ago, I never wrote anything admitting so. I guess this can be considered my cleansing moment.

So about a year ago, when everyone was jumping on the Twitter bandwagon, I remember reading about a simple app called MyTwitterButler that a .NET developer coded up in a few hours and was selling for $10. It was a desktop app that let you type in words to search for, and it would then follow users who tweeted those words.

At the time I was looking to improve my .NET skills, so I thought that if he could make $50 a day selling that, I could write something similar and hopefully recoup my initial cost. So over 8 hours or so I wrote a simple Twitter follower application which I thought had some better features than MyTwitterButler, bought http://www.tweet-follow.com/ and tried to sell it. I figured with my time and the domain I needed to make about $400 to cover my costs (assuming a $50/hour billing rate). You know what? I didn't sell a single copy.

Tweet-Follow Screenshot

Why did it fail? That's the question I asked myself. It was in the same space, and I did all the same marketing tricks others had (adding it to blogs and application stores, emailing copies to bloggers) and got nothing. You know what? I still don't know. The few people that I did get to use it said it worked pretty well and did what they wanted, so it wasn't a lack of functionality.

Just recently I bought and read the excellent Startup Book by Rob Walling. One of the takeaways is that marketing always wins where there are two similar products. When I realised I was trying to compete with MyTwitterButler without considering all the other power Twitter clients out there, I could see that I was really competing with millions of dollars. With this in mind it's not really a surprise I didn't succeed in my initial goal of covering costs. The other, and probably most important, thing is that I didn't really chase success. Looking back I expected the project to fail, and guess what, it did. That said, at least it was a cheap failure, which is the kind you probably want to have.

EDIT – I just realised I should probably release the source code to this application in the hope that someone finds it useful. I will be doing so over the next few days.

Small Steps 2 – Teaching a Neural Network to Learn the Letter A from B-Z

So in the previous article we managed to get our neural network to learn the difference between A and B. I mentioned at the end that I was going to train and test it next on various versions of A and B to see how effective it is, but rather than that I figured teaching a network to tell A apart from every other letter would be more interesting.

Get the source for everything below in Step2

The code below loads each of the letters and then trains the network to learn that an A is an A and that every other letter is not an A. I had initially tried to teach it to recognise each individual letter, however I found this resulted in a huge neural network which was slow to train. For the moment, teaching the network what an A is should be fine.

import string

import bpnn
import Loader

if __name__ == '__main__':
  cla = Loader.Loader()

  hiddennodes = 3
  x = 5
  y = 5

  # Build the training patterns: A maps to an output of [1],
  # every other letter B through Z maps to [0].
  adata = cla.loadimagedata("./letters/A.gif",x,y)
  apat = [[adata,[1]]]
  for letter in string.ascii_uppercase[1:]:
    data = cla.loadimagedata("./letters/%s.gif"%(letter),x,y)
    apat.append([data,[0]])

  an = bpnn.NN(len(adata),hiddennodes,1)
  an.train(apat)

  cla.savenn(an,filename='aznn.n')

Again like before, what the above does is open up each of our sample images and then train the network on them. I ended up playing around with the number of nodes and managed to get a low error rate with 25 inputs and 3 hidden nodes. This is interesting as the last network used 400 inputs and 3 hidden nodes, and at first I was skeptical that the network had learnt this pattern correctly.
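
For reference, Loader comes from the previous article. Roughly speaking, loadimagedata resizes the image to x by y pixels and flattens it into a list of pixel values, which is where the 25 inputs (5 x 5) come from. A sketch of the idea using PIL, not the actual source:

from PIL import Image

def loadimagedata(path, x, y):
  # Convert to grayscale, resize to x by y, and flatten into
  # a list of values scaled into the range 0..1.
  im = Image.open(path).convert("L").resize((x, y))
  return [pixel / 255.0 for pixel in im.getdata()]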

Of course we need something to test the effectiveness of our network, so I created the test script below which should let us see whether the network works correctly.

import unittest
import Loader

class TestClassifyAfromB(unittest.TestCase):
  def setUp(self):
    self.c = Loader.Loader()
    self.x = 5  # must match the 5x5 size the network was trained on
    self.y = 5

  def testLearnA(self):
    n = self.c.loadnn(filename='aznn.n')
    guess = n.guess(self.c.loadimagedata("./letters/A.gif",self.x,self.y))
    self.assertTrue(guess[0] > 0.95)

  def testLearnB(self):
    n = self.c.loadnn(filename='aznn.n')
    guess = n.guess(self.c.loadimagedata("./letters/B.gif",self.x,self.y))
    self.assertTrue(guess[0] < 0.05)

  def testLearnC(self):
    n = self.c.loadnn(filename='aznn.n')
    for let in 'B2 B3 C D E F G H I J K L M N O P Q R S T U V W X Y Z'.split(' '):
      guess = n.guess(self.c.loadimagedata("./letters/%s.gif"%(let),self.x,self.y))
      self.assertTrue(guess[0] < 0.05)

if __name__ == '__main__':
  unittest.main()

The above is just a quick and dirty test, and the results are:

$ python TestStep2.py
...
----------------------------------------------------------------------
Ran 3 tests in 0.015s

OK

All good! The next goal is to build a large sample of different letters in different fonts and get the network to pick out the letter A from many examples. This will indicate that it has learnt the pattern of what an A looks like, rather than just the specific letter A images given in the above examples.
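
As a starting point for that sample set, letters can be rendered in multiple fonts with PIL. A sketch only; the font path is an assumption and will differ from system to system:

import os
from PIL import Image, ImageDraw, ImageFont

# Assumed font list; add whatever TrueType fonts are available.
fonts = ["/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"]

os.makedirs("./samples", exist_ok=True)
for fontindex, fontpath in enumerate(fonts):
  font = ImageFont.truetype(fontpath, 20)
  for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    im = Image.new("L", (25, 25), 255)  # white 25x25 canvas
    draw = ImageDraw.Draw(im)
    draw.text((2, 2), letter, font=font, fill=0)
    im.save("./samples/%s_%d.gif" % (letter, fontindex))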