Small Steps 1 – Teaching a Neural Network to Learn the Letter A from B

I’m going to assume that if you are reading this you already know what a neural network (NN) is and that you are trying to do some sort of image recognition. I’m also going to assume you are somewhat familiar with programming, preferably in Python, since that’s what all the examples will be using.

Get the source to everything below in Step1.zip

To get started we are going to need the following,

  1. A neural network implementation
  2. An imaging library to read images
  3. Sample images to train and test on.

The images we are going to use are the following,

The Letter A

The Letter B

Our goal is to teach a neural network to tell the difference between the above. There are two ways we can look at feeding data into our network. The first is to just convert each of the pixels in the image into an input. So for a 10×10 image you would need 100 input neurons. The second is to try and describe what the letter looks like. This can mean describing the thickness of lines at various cut points etc…

The first approach is the easiest to code, so I am going to go with that. What we need is a way to open the image and convert it to a list of data. I am going to iterate over each row and, for each pixel that’s black, add a 1 to a list, otherwise a 0. I also added some methods to save and load the network to disk.
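
The Loader class itself lives in Step1.zip, but the below is a rough sketch of what it could look like, assuming PIL for reading the image and pickle for writing the network to disk; the real version in the zip may differ in the details.

import pickle
from PIL import Image

class Loader:
  def loadimagedata(self, filename, xsize=20, ysize=20):
    # Force the image to black and white at a known size, then walk over
    # every pixel turning black into 1 and anything else into 0
    image = Image.open(filename).convert('1').resize((xsize, ysize))
    data = []
    for y in range(ysize):
      for x in range(xsize):
        # in mode '1' black pixels are 0 and white pixels are 255
        data.append(1 if image.getpixel((x, y)) == 0 else 0)
    return data

  def savenn(self, nn, filename='nn.n'):
    # dump the trained network to disk so it can be loaded again later
    with open(filename, 'wb') as f:
      pickle.dump(nn, f)

  def loadnn(self, filename='nn.n'):
    with open(filename, 'rb') as f:
      return pickle.load(f)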

The next step is to write some code which takes the above pixel data and feeds it into the Neural Network.

import bpnn
import Loader

if __name__ == '__main__':
  cla = Loader.Loader()
  hiddennodes = 2

  adata = cla.loadimagedata("./letters/A.gif",xsize=20,ysize=20)
  bdata = cla.loadimagedata("./letters/B.gif",xsize=20,ysize=20)
  apat = [
    [adata,[1,0]],
    [bdata,[0,1]],
  ]

  an = bpnn.NN(len(adata),hiddennodes,len(apat[0][1]))
  an.train(apat)
  cla.savenn(an,filename='abnn.n')

You can see that in the above we create a loader object which allows us to read images and networks. We then load our data, put it into the correct pattern for our Neural Network to train on, then train it and save the network.

The important thing to look at here is the number of hidden nodes (hiddennodes in the code above). This is something you will have to play with in order to make your network efficient. I have set it to 2 for the moment.

Now running the above code we get the following,

$python Step1.py
error 4.42059
error 0.71834
error 0.71834
error 0.71834
error 0.71834
error 0.71834
error 0.71834
error 0.71834
error 0.71834
error 0.71834

What the above shows is the network learning. The important thing is that the error should continue to decrease. In this case it drops a bit, then gets stuck. The reason for this could be one of the following,

  • Not enough hidden nodes – so not enough capacity to “learn” the difference
  • Not enough training time – unlikely, since the error has stopped decreasing rather than still heading down
  • Not enough inputs – unlikely since we have 400 inputs, but it is possible

To remedy this I have bumped the number of hidden nodes up to 3 to give the network more learning power, and then run it again.

$python Step1.py
error 2.93246
error 0.00048
error 0.00023
error 0.00015
error 0.00012
error 0.00010
error 0.00008
error 0.00007
error 0.00006
error 0.00006

This is much better! The error continues to decrease and almost reaches 0. This means that our network has learnt the difference between our sample letters A and B.

We can test this pretty easily with the following test script, which uses unit tests to ensure we don’t break the network in the future.

import unittest
import Loader

class TestStep1(unittest.TestCase):
  def setUp(self):
    self.c = Loader.Loader()

  def testLearnA(self):
    n = self.c.loadnn(filename='abnn.n')
    guess = n.guess(self.c.loadimagedata("./letters/A.gif",20,20))
    self.assertTrue(guess[0] > 0.95)
    self.assertTrue(guess[1] < 0.05)


  def testLearnB(self):
    n = self.c.loadnn(filename='abnn.n')
    guess = n.guess(self.c.loadimagedata("./letters/B.gif",20,20))
    self.assertTrue(guess[1] > 0.95)
    self.assertTrue(guess[0] < 0.05)

if __name__ == '__main__':
  unittest.main()

Running this gives,

$python TestStep1.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.031s

OK

Looks like all is well. The next step, and what will be in the next blog post, is training our network so that it can identify characters it has never seen before.

Always Go To First Principles

Recently I was having an issue with some code I was working on for my pet project (a website search solution). Essentially my problem was that Smarty PHP wouldn’t loop through an array I had passed in. After much swearing and complaining I decided to take a step back and run through all of the newbie mistakes. In other words I looked at the problem from first principles.

Turns out the issue was a missing $ before the variable I was trying to loop over. DOH!!!!!

This made me think back to TechZingLive where Justin and Jason were discussing “Tell it to the bear”. You know what? They are 100% on the money in this case.

I think one of the traps I had fallen into (and if I have, other developers probably have too) is that we think we know what’s going on with the code we write. If something breaks it’s the compiler/interpreter’s fault, or the 3rd party software, or something to do with the alignment of your computer towards Mecca. We lose sight of the fact that the computer only does what we tell it to do. That is, whatever is playing up is most likely our own fault. I think this might be what causes NIH (not invented here) syndrome, since I can recall thinking “I should write my own template language!” while looking for the issue.

By taking that step backwards I was not only able to approach the problem correctly, I also fixed it within 30 seconds. For me it’s certainly something to keep in mind when trying to fix something that should just work.

Building a Vector Space Indexing Engine in Python

Ever wanted to code a search engine from scratch? Well actually it’s a pretty simple thing to do. Here is an example indexer I coded up in less than an hour using Python.

The first thing we need is a way to take the documents we want to search over and turn them into a concordance. A concordance, for those not in the know, is a count of every word that occurs in a document.

def concordance(document):
  if type(document) != str:
    raise ValueError('Supplied Argument should be of type string')
  con = {}
  for word in document.split(' '):
    if con.has_key(word):
      con[word] = con[word] + 1
    else:
      con[word] = 1
  return con

The above method simply allows us to pass in a clean text document and get back a concordance of the words in that document.
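
As a quick sanity check, here is what it produces for a tiny document (the dictionary ordering will vary),

print concordance('the cat sat on the mat')
# {'on': 1, 'mat': 1, 'cat': 1, 'the': 2, 'sat': 1}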

The only other thing we need is a vector space comparison. A vector space, for those not in the know, is a way of calculating how close two points are to each other. Essentially it works the same way as calculating the third side of a triangle, except that instead of 2 dimensions (x and y) or even 3 (x, y, z) you can have as many dimensions as you want, one per word in our case. The actual idea takes a while to understand, but you can read about it here, Vector Space Search Engine Theory (PDF).

Thankfully I have already implemented the algorithm in my Decoding CAPTCHA’s post and can just copy and paste it from there. I have modified it a little to avoid divide by zero issues, check types, and to pull in the above concordance method since it really belongs with the rest of the class.

import math

class VectorCompare:
  def magnitude(self,concordance):
    if type(concordance) != dict:
      raise ValueError('Supplied Argument should be of type dict')
    total = 0
    for word,count in concordance.iteritems():
      total += count ** 2
    return math.sqrt(total)

  def relation(self,concordance1, concordance2):
    if type(concordance1) != dict:
      raise ValueError('Supplied Argument 1 should be of type dict')
    if type(concordance2) != dict:
      raise ValueError('Supplied Argument 2 should be of type dict')
    relevance = 0
    topvalue = 0
    for word, count in concordance1.iteritems():
      if concordance2.has_key(word):
        topvalue += count * concordance2[word]
    if (self.magnitude(concordance1) * self.magnitude(concordance2)) != 0:
      return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))
    else:
      return 0

  def concordance(self,document):
    if type(document) != str:
      raise ValueError('Supplied Argument should be of type string')
    con = {}
    for word in document.split(' '):
      if con.has_key(word):
        con[word] = con[word] + 1
      else:
        con[word] = 1
    return con

To use it you just supply two concordances (one for the document and the other for the query) and it returns a number between 0 and 1 indicating how related they are. The higher the number, the more relevant the search terms are to the document.
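
To make that concrete, here is a quick example comparing a couple of tiny concordances. Under the hood the score is just the dot product of the two word count vectors divided by the product of their magnitudes, in other words the cosine of the angle between them,

v = VectorCompare()

doc = v.concordance('the quick brown fox jumped over the lazy dog')

print v.relation(v.concordance('brown fox'), doc)        # roughly 0.43
print v.relation(v.concordance('purple elephant'), doc)  # 0.0, no words in common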

So now all we need to do is take every document, build a concordance for it, compare each one to our search terms, sort the results by the number returned, and we are set. The documents I decided to use are the titles and first paragraphs of the last 7 blog posts I have made here.

v = VectorCompare()

documents = {
  0:'''At Scale You Will Hit Every Performance Issue I used to think I knew a bit about performance scalability and how to keep things trucking when you hit large amounts of data Truth is I know diddly squat on the subject since the most I have ever done is read about how its done To understand how I came about realising this you need some background''',
  1:'''Richard Stallman to visit Australia Im not usually one to promote events and the like unless I feel there is a genuine benefit to be had by attending but this is one stands out Richard M Stallman the guru of Free Software is coming Down Under to hold a talk You can read about him here Open Source Celebrity to visit Australia''',
  2:'''MySQL Backups Done Easily One thing that comes up a lot on sites like Stackoverflow and the like is how to backup MySQL databases The first answer is usually use mysqldump This is all fine and good till you start to want to dump multiple databases You can do this all in one like using the all databases option however this makes restoring a single database an issue since you have to parse out the parts you want which can be a pain''',
  3:'''Why You Shouldnt roll your own CAPTCHA At a TechEd I attended a few years ago I was watching a presentation about Security presented by Rocky Heckman read his blog its quite good In it he was talking about security algorithms The part that really stuck with me went like this''',
  4:'''The Great Benefit of Test Driven Development Nobody Talks About The feeling of productivity because you are writing lots of code Think about that for a moment Ask any developer who wants to develop why they became a developer One of the first things that comes up is I enjoy writing code This is one of the things that I personally enjoy doing Writing code any code especially when its solving my current problem makes me feel productive It makes me feel like Im getting somewhere Its empowering''',
  5:'''Setting up GIT to use a Subversion SVN style workflow Moving from Subversion SVN to GIT can be a little confusing at first I think the biggest thing I noticed was that GIT doesnt have a specific workflow you have to pick your own Personally I wanted to stick to my Subversion like work-flow with a central server which all my machines would pull and push too Since it took a while to set up I thought I would throw up a blog post on how to do it''',
  6:'''Why CAPTCHA Never Use Numbers 0 1 5 7 Interestingly this sort of question pops up a lot in my referring search term stats Why CAPTCHAs never use the numbers 0 1 5 7 Its a relativity simple question with a reasonably simple answer Its because each of the above numbers are easy to confuse with a letter See the below''',
}

index = {
  0:v.concordance(documents[0].lower()),
  1:v.concordance(documents[1].lower()),
  2:v.concordance(documents[2].lower()),
  3:v.concordance(documents[3].lower()),
  4:v.concordance(documents[4].lower()),
  5:v.concordance(documents[5].lower()),
  6:v.concordance(documents[6].lower()),
}

searchterm = raw_input('Enter Search Term: ')
matches = []

for i in range(len(index)):
  relation = v.relation(v.concordance(searchterm.lower()),index[i])
  if relation != 0:
    matches.append((relation,documents[i][:100]))

matches.sort(reverse=True)

for i in matches:
  print i[0],i[1]

Now running it and trying some searches gives the following,

Enter Search Term: captcha
0.124034734589 Why You Shouldnt roll your own CAPTCHA At a TechEd I attended a few years ago I was watching a prese
0.0957826285221 Why CAPTCHA Never Use Numbers 0 1 5 7 Interestingly this sort of question pops up a lot in my referr

Enter Search Term: mysql stallman
0.140028008403 Richard Stallman to visit Australia Im not usually one to promote events and the like unless I feel
0.110096376513 MySQL Backups Done Easily One thing that comes up a lot on sites like Stackoverflow and the like is

Results are not too bad I think! Now there are some problems with this technique. Firstly it doesn’t support boolean searches, which can be an issue, although most people tend to just type some terms. Secondly it has problems with larger documents: the way the vector space works is biased towards smaller documents, since they are closer to the search term space. You can get around this by breaking larger documents up into smaller ones though. The final and biggest issue is that it is pretty CPU intensive. I have tested a search like this with 50,000 documents and it was OK, but you wouldn’t want to go much further than that. It is a pretty naive implementation though; with some caching and by checking which documents are worth comparing you could take this up to millions of documents.

I remember reading somewhere (no source sorry) that Altavista and some of the other early search engines used a technique similar to the above for calculating rankings, so it seems the idea really can be taken to a large scale.

By now I am sure someone is thinking, “Hang on, if it’s that simple then why is it so hard to make the next Google?”. Well the answer is that while it’s pretty easy to index 10,000 to 100,000,000 pages, it gets considerably more difficult to index 1,000,000,000+ pages. You have to shard out to multiple computers and the margin for error is pretty low. You can read the post Why Writing a Search Engine is Hard, written by Anna Patterson (one of the co-founders of Cuil), which explains the problem nicely.

A few people have expressed difficulty getting the above to run. To do so just copy it all into a single file and run it.

Why CAPTCHAs Never Use the Numbers 0 1 5 7

Interestingly this sort of question pops up a lot in my referring search term stats.

Why CAPTCHA’s never use the numbers 0 1 5 7

It’s a relatively simple question with a reasonably simple answer: each of the above numbers is easy to confuse with a letter. See the below,

CAPTCHA With 0 and O

CAPTCHA With 1 and I

CAPTCHA With 5 and S

CAPTCHA With 7 and J L I

Are you able to tell the difference? For some of them yes, for others certainly not. For those wondering, the first character is the number and the rest are letters, in the format “number dash letter letter”.

They all look fairly similar to a human, especially when they are warped, made fuzzy and subjected to all the other things a CAPTCHA does to make OCR (optical character recognition) difficult. Interestingly, you can end up with the unusual situation where the CAPTCHA is easier for a computer to decode than for a human, since the computer can just churn through thousands of attempts, get a majority right and still successfully spam a website.

The CAPTCHA used to create the images in this post can be found here: http://milki.erphesfurt.de/captcha/, which I discovered in a comment by Mario on my own post about why you shouldn’t write your own CAPTCHAs. It’s a pretty good CAPTCHA as far as CAPTCHAs go, and I had to modify it to produce the results above; out of the box it never displays similar-looking text like this. If you do insist on using a CAPTCHA on your site I highly suggest having a look at it.

Setting up GIT to use a Subversion (SVN) style workflow

Moving from Subversion (SVN) to GIT can be a little confusing at first. I think the biggest thing I noticed was that GIT doesn’t have a specific work-flow; you have to pick your own. Personally I wanted to stick to my Subversion-like work-flow with a central server which all my machines would pull from and push to. Since it took a while to set up I thought I would throw up a blog post on how to do it.

First, on your server or wherever you are going to host your repository, you need to create a bare repository. I created a new user on my server called repo which holds all of my repositories. I then created a directory which would hold my repository. This is similar to naming your repository in Subversion.

$pwd
/home/repo
$mkdir newrepository

I then change into the newly created directory and create a bare git repository.

$cd newrepository
$git --bare init
Initialized empty Git repository in /home/repo/newrepository/

This creates a bare git repository which has no branches (not even master!). You can now log off your server and, from your client, clone the repository. I use SSH for this, but you can clone in any other way you choose.

$git clone ssh://repo@servername/home/repo/newrepository
repo@servername's password:
warning: You appear to have cloned an empty repository.

Now that we have our repository cloned, the next thing to do is add whatever files you want to have under source control, and then push them.

$git add .
$git commit -a -m 'Initial Commit'
$git push
repo@servername's password:
No refs in common and none specified; doing nothing.
Perhaps you should specify a branch such as 'master'.

Whoops. What went wrong there? Remember I said that when you create a bare repository it doesn’t even have any branches? You need to push your master branch up to create it. Run the following.

$git push origin master
repo@servername's password:
Counting objects: 3, done.
Writing objects: 100% (3/3), 219 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
To ssh://repo@servername/home/repo/newrepository
 * [new branch]      master -> master
$

Done. Everything you wanted to commit will be pushed up and you are set to go. Now this and any other machine you are working on can clone as per normal and start taking advantage of fast local branches.

Why You Shouldn’t roll your own CAPTCHA

At a TechEd I attended a few years ago I watched a presentation on security by Rocky Heckman (read his blog, it’s quite good). In it he was talking about security algorithms. The part that really stuck with me went like this,

“Don’t write your own Crypto algorithms unless you have a Doctorate in Cryptography.” Interestingly someone there did have said qualification, and Rocky had to make an exception for that single person.

Nonetheless I think this sort of advice can be applied to all sorts of situations. In particular, one that strikes close to my heart is CAPTCHAs. So, following the words of Rocky, I will make a simple statement.

“Don’t write your own CAPTCHAs unless you have a Doctorate in Machine Vision”

ReCAPTCHA Example

A Difficult CAPTCHA to break

Now you are probably going to ask why. The reason is quite simple really: unless you know what sort of attacks your CAPTCHA is going to face, you don’t know how to defend against them.

Sample trivial CAPTCHA

A trivial CAPTCHA to break.

I’m going to pull a figure out of the air here, but I would say that 90% of the home grown CAPTCHAs out there on the internet are trivial to crack. Now the owners of these CAPTCHAs will point to a reduction in spam since they implemented one as proof of the success of their CAPTCHA, but frankly that’s a flawed argument. I implemented a simple CAPTCHA on another site of mine where all you have to do is enter the word “human” into a text box. Guess what? 100% spam eradication.

See, the thing is, if there is money to be gained by defeating your CAPTCHA then someone out there will defeat it. Personally I have written CAPTCHA crackers for people from time to time. Guess what? Most of them took less than an hour to break, including the time for downloading samples and tweaking to get better results.

Sample Captcha

A trivial CAPTCHA to break.

Another thing to consider is accessibility. About 99% of the home grown CAPTCHAs out there don’t even consider the fact that there are sight-impaired people around who need text to speech. This becomes a huge issue in countries like England, which requires that websites be accessible.

Finally, it’s well known that you can pay people to crack CAPTCHAs for you, or even offer them porn or something and have them crack CAPTCHAs for you without knowing it.

So what’s the conclusion to all of this? If you have a simple blog or website and a problem with automated spam, just add a simple “Enter the word human” text-box. It will be 100% effective, is easy to implement and won’t annoy your users. If you have something to protect and your CAPTCHA is being targeted, use an external service, which will provide a good accessible CAPTCHA that will be updated when it gets broken (which it will!). A custom CAPTCHA might seem like a good idea at the time, but it’s only a roadblock to someone who has any incentive to break in.
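
For what it’s worth, the server side of the “enter the word human” approach really is as dumb as it sounds. A minimal sketch in Python (the form field name is made up for illustration),

def is_probably_human(form):
  # The whole "CAPTCHA" is just checking the visitor typed the magic word.
  # 'are_you_human' is a made-up field name; use whatever your form posts.
  answer = form.get('are_you_human', '')
  return answer.strip().lower() == 'human'

If the check fails you simply throw the submission away.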

A Sample CAPTCHA

A trivial CAPTCHA to break.

If however you are the sort of person who looks at ReCAPTCHA and thinks “I can break that”, knows when to apply Neural Networks or Support Vector Machines, knows what GIMPY is, and has done postgraduate study in the field of machine vision, then by all means create your own CAPTCHA. Just don’t complain when you have to update it every 6 months because someone with something to gain has defeated it.

For those interested, my postgrad honours thesis was on applying CAPTCHA decoding techniques to web images as a method of improving search results. You can find a simple tutorial with code on how it was done here, Decoding CAPTCHA’s.

MySQL Backups Done Easily

One thing that comes up a lot on sites like Stackoverflow is how to back up MySQL databases. The first answer is usually to use mysqldump. This is all fine and good, till you start wanting to dump multiple databases. You can do this all in one file using the --all-databases option, however this makes restoring a single database an issue, since you have to parse out the parts you want, which can be a pain.

So the question is, have you ever wanted to script mysqldump to dump each database into a separate gzipped file?

The below script is what I use for backing up multiple databases, and does the above.

#!/bin/sh
date=`date -I`
for I in $(mysql -u root -pPASSWORD -e 'show databases' -s --skip-column-names);
do
  mysqldump -u root -pPASSWORD $I | gzip > "$date-$I.sql.gz";
done

It’s a simple shell script which connects to MySQL, prints out all the databases, and then uses each line as a separate argument to mysqldump. Each database is saved in its own file and restoring a single database is easy.

At Scale You Will Hit Every Performance Issue

I used to think I knew a bit about performance, scalability and how to keep things trucking when you hit large amounts of data. Truth is I know diddly squat on the subject, since the most I have ever done is read about how it’s done. To understand how I came to realise this you need some background.

Essentially what I have been working on, and hope to launch soon, is a highly vertical search engine that websites can employ on their site to get highly relevant search results. Something like Google’s website search, but custom for your website, with tight API integration or just a simple “index my website and stick a search box here” sort of thing. While doing this I have learnt more about operating at scale than I would have ever imagined.

So to begin with I had the idea, and all was good. I coded up the initial implementation pretty quickly and had it working pretty well for my initial runs of a couple of hundred pages. The next thing to do was point it at a live website and see how it goes. I seeded the crawler with about 50,000 URLs and set it loose. This is where problems initially began.

The first issue I discovered was in my crawler. I initially set it up to run as one long process which pulled down the list of URLs to crawl. The issue with this was that it consumed massive amounts of memory and CPU pretty much all the time, so I made the decision to change it to a short-lived process that ran every minute. All was well for a while. It fired up every minute (as a cron job) and sucked down 20 pages or so. This was fine till the website it was crawling slowed down a bit and it took over 1 minute to pull down the pages. The next process then kicked off, slowing the site down even more. Within 20 minutes about 15 instances of the crawler were hammering the site, and eventually it died under the pressure.

So naturally I needed to think about this again. I added a file lock to ensure only a single instance of the process could run at any one time. That works fine, till your crawler dies for some reason without releasing the lock. So I switched to a port bind and everything is hunky dory. Considering the crawler done, I moved on to other issues.
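
The port bind trick is nothing fancy. Below is a rough sketch of the idea in Python (not the crawler’s actual code, and the port number is arbitrary): if the bind fails another copy is already running, and because the operating system releases the socket when the process dies there is no stale lock to clean up.

import socket
import sys

def ensure_single_instance(port=47200):
  # Only one process can bind the port at a time, and the OS frees it
  # automatically if the crawler crashes, unlike a lock file
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  try:
    s.bind(('127.0.0.1', port))
  except socket.error:
    print 'Another crawler instance is already running, exiting.'
    sys.exit(0)
  return s  # keep the returned socket alive for the life of the process

if __name__ == '__main__':
  lock = ensure_single_instance()
  # ... do the actual crawl here ...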

I then did some trial runs against websites, crawling and indexing anywhere from 1,000 to 50,000 pages without any problems following the latest changes.

So I fired up the next step: a full index of a website. This involved loading my crawler up with a single seed URL and telling it to harvest links as it goes. The next issue I ran into was the crawler not parsing crawled pages correctly. Trying to anticipate every form of HTML content in a page is more difficult than you would think. The thing is, when pulling down a page you need to extract the useful information you want to index on and clear out the rest. Since people search on words you need to remove all the other crap: JavaScript, CSS, etc… Something I neglected to consider is that you can have in-line styles for CSS. It’s one of those things I never encountered on my run of 50,000 pages.

So some quick modifications and I’m getting clean content back.
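
To give an idea of what “clean content” means here, below is a cut down sketch of the sort of thing involved (not the crawler’s actual code, and regular expressions are hardly a bulletproof way to handle HTML): throw away script and style blocks along with their contents, then strip whatever tags remain.

import re

def clean_html(html):
  # Scripts and style sheets are code, not content, so drop the whole block
  # including everything between the tags. In-line <style> blocks sitting in
  # the page body are the case that caught me out.
  html = re.sub(r'(?is)<script.*?</script>', ' ', html)
  html = re.sub(r'(?is)<style.*?</style>', ' ', html)
  # Drop any remaining tags and collapse the whitespace down to single spaces
  html = re.sub(r'<[^>]+>', ' ', html)
  return ' '.join(html.split())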

Everything was fine till I hit the next hurdle. When getting the next batch of URLs to crawl I run a simple bit of SQL to pull back the URLs that need to be crawled, i.e. those which haven’t been hit in a while, don’t have issues (failed to respond the last 3 times), aren’t marked as deleted, or have been asked to be re-crawled. It’s a pretty simple bit of SQL. Guess what? All of a sudden it started to slow down. What was taking 1 second at 50,000 URLs was suddenly taking 4 minutes at 300,000 URLs. Now partly this was due to me changing the schema as I went, but mostly it was down to poor indexes and pulling back too much data. Some index fixes, a dump of the database and a re-import later, and the query is down to 1 second again.

So what have I learnt from all of this?

1. Never assume. Profile, profile, profile! The cause of a performance issue is never what you expect it to be. My theory that it was MySQL insert performance was way off; in fact I wasted a few hours looking into something that wasn’t even an issue. I can’t afford to waste time like that.

2. You will never hit any of the big issues until you actually go to a “live” state. Be prepared to spend time looking at things you wouldn’t have expected to slow down or cause issues.

3. Unit test your code! Write unit tests to prove a bug exists, then fix it. This saves time in the long run.

Why Jason of TechZing is Wrong

Let me preface this by saying that I am a big fan of TechZing; I listen to it regularly on my daily commute (along with other pod-casts) and find most of Jason’s ideas interesting. However, in the latest episode, Jasonism, Jason made a rather large statement that I feel is wrong. Discussing blogs, and how he was going to start one for his application Appignite, he mentioned that blog features like search/tags/dates make no difference. He said that he was going to go with HTML files and hand write everything. With all due respect I disagree, and here is why.

To begin with, one of the basic principles of development, and indeed start-ups, is to automate everything you can. Using a pre-built or indeed custom system benefits you in a few ways. Firstly, you can have your RSS auto-generated. Sure you can do RSS by hand, and while the format isn’t that complex, you will by the law of averages make a mistake and bork everyone’s RSS readers. Not only is this not nice, it may cause people to unsubscribe. Secondly, you can add comments in an easy way. Now Jason mentioned using Disqus for his comment system, which is fine, but you still need to generate a unique key for your blog to hook it up. This is something you get for free by adding a unique id to your database. Finally, you get the benefit of having all your links generated correctly for you. While it’s pretty easy to add a link to your other pages, as with the RSS eventually you will make a mistake, people will get lost on your site and Google will miss your older content.

Now, search is something close to my heart. Jason mentioned specifically that nobody uses site search/links. On one hand Google is pretty good at sucking up everything you throw on the web, so anything you post is likely to appear in it. On the other hand this isn’t a certainty, and what happens when someone is looking for that awesome post you had about using Appignite for their web application and can’t find it? Do you trust Google enough that you are willing to bet solely on their ability to find content on your site? Why not just implement a search option? It’s not difficult and could give someone looking at your site an incentive to investigate further. You never know, doing this might encourage someone to subscribe to your feed or newsletter.

Justin mentioned something about NIH (not invented here) syndrome, which Jason seems afflicted with. Fair enough, since most developers have the same thoughts. Coding is fulfilling, and even writing unit tests can make you feel awesome for having produced something. However, how fulfilling is it to write another HTML page and add some links at the top and bottom? If you don’t like WordPress then write your own simple blog, which should take a few minutes at most, and automate everything I mentioned above. By all means write your HTML by hand, but then run a script which fixes your RSS feed and generates everything else for you. It should take a few minutes for a coder of your skill and will make everything much smoother.

You know Jason, this would be an excellent way to showcase your project AppIgnite. It seems quite powerful, so why not generate a blog using it? Dogfooding really can be the best way to show off your stuff. Failing that, I don’t see why you wouldn’t invest (yes, it is an investment) an hour or so to create a simple PHP script which automates at least the above for you.