Why you should never ask permission to clean up code

This is something that took me two years or so to learn. One day I realised nobody was really looking at my timecards in depth, so I started allocating extra time to things and using the extra time to fix the things I thought needed fixing. Once I started delivering on this I showed my manager, who agreed that it was a good use of time. I was given free rein to fix anything I felt would add maximum value, provided the bug fixes continued to be delivered without any major compromise.

Since that time I have refactored quite a few code-bases, added unit tests, fixed some build processes, improved performance and generally feel happier at work for getting things done that are important to me.

Don’t get stuck in constant bug fix mode. If you can’t get approval to fix things then change jobs, because bug fix after bug fix is depressing and will bring you down.

Why is storing, tracking and managing billions of tiny files directly on a file system a nightmare?

It’s a real pain when you want to inspect the files, delete them or copy them.

Try taking 300,000 files and copying them somewhere. Then copy 1 file which is the size of the 300,000 combined. The single file is MUCH faster (it’s also why we usually tar things up before copying them, even if the contents are already compressed). Any database that’s not a toy will usually lay the 300,000 records out in a single file (depending on settings, sizes and filesystem limits).

The 300,000 files end up sitting all over the drive and disk seeks kill you at run-time. This may not be true for an SSD but I don’t have any evidence to suggest it one way or the other.

Even if the physical storage is fine with this, I suspect you may run into filesystem issues when you lay out millions if not hundreds of millions of files across a directory structure and then hit it hard.

I have played with 1,000,000 files before when experimenting with crawling/indexing things and it becomes a real management pain. It may seem cleaner to lay each record out as a single file, but in the long run, once you hit a large size, the benefits aren’t worth it.

Counter-counter argument: TDD

The following is taken from my response to a Hacker News comment. The comment follows (quoted) and my response below.

“I will start doing TDD when,

1. It is faster than developing without it.
2. It doesn’t result in a ton of brittle tests that can’t survive an upgrade or massive change in the API that is already enough trouble to manage on the implementation-side- even though there may be no functional changes!

Unit tests that test trivial methods are evil because the LOC count goes up”

1. It can be. For something like a standard C# MVC application (I’m working on one now) the time taken to spin up Cassini or deploy to IIS is far greater than running the tests. For something like PHP where you are just hitting F5, TDD can slow you down. As with most things, it depends.

2. If you are writing brittle tests you are doing it wrong.

Increasing LOC (lines of code) isn’t always a bad thing. If those increased LOC improve quality then I consider it worthwhile. Yes, it can be more maintenance, but we know the cost of catching bugs in development is much cheaper than in production.

Mocking isn’t as bad as it’s been made out to be. Yes, you can over-mock things (a design anti-pattern), but that should be taken as a sign of code smell and you should be refactoring to make it simpler. If you can’t refactor and you can’t easily mock then consider whether you really need to test it. In my experience things that are hard to mock and cannot be refactored usually shouldn’t be tested.

The exception being legacy code, but we are talking about TDD here, which usually means greenfield development, or else it would have tests already.

Unit testing does NOT promote 100% coverage. People using unit tests as a metric promote this. Sometimes it’s worth achieving, and sometimes it’s not. Use common sense when picking a unit test coverage target. I have written applications with close to 100% coverage, such as web-services, and been thankful for it when something broke and I needed to fix it. I have also written applications with no more than 20% coverage over the critical methods (simple CRUD screens). Use common sense; testing simple getters and setters is probably a waste of time so don’t do it.

Unit testing isn’t all about writing tests. It’s also about enforcing good design. Code that’s easily testable is usually good code. You don’t have to have tests to have testable code, but if you are going to that effort anyway why not add tests where they can add value and provide you with a nice safety harness?
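
To make that concrete, here is the sort of small behavioural test I have in mind, sketched with PHPUnit (the Basket class is a made-up example, purely for illustration):

use PHPUnit\Framework\TestCase;

// A hypothetical class under test
class Basket
{
    private $items = array();

    public function add($name, $price)
    {
        $this->items[$name] = $price;
    }

    public function total()
    {
        return array_sum($this->items);
    }
}

class BasketTest extends TestCase
{
    public function testTotalSumsAllItemPrices()
    {
        $basket = new Basket();
        $basket->add('book', 10.00);
        $basket->add('pen', 2.50);

        // Assert on the behaviour (the total), not on trivial getters
        $this->assertEquals(12.50, $basket->total());
    }
}

A test like this survives internal refactoring of Basket as long as the behaviour stays the same, which is the opposite of brittle.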

Most of the issues with unit tests come with people preaching that they are a silver bullet. For specific cases they can provide great value and increase development speed. Personally I will continue to write unit tests, but only where my experience leads me to believe they will provide value.

Can anyone explain how this regex [ -~] matches ASCII characters?

Since I am pulling most of my content from other sites such as Mahalo and Quora I thought I would pull back some of my more interesting HN comments.

Can anyone explain how this regex [ -~] matches ASCII characters?

It’s pretty simple. Assuming you know regex… I’m going to assume you don’t since you are asking.

The bracket expression [ ] defines a single character to match, however you can have more than one character inside, any of which will match.

[a] matches a
[ab] matches either a or b
[abc] matches either a or b or c
[a-c] matches either a or b or c.

The - allows us to define a range. You could just as easily use [abc], but for long sequences such as [a-z] consider it shorthand.

In this case [ -~] means every character between <space> and <tilde>, which just happens to be all the ASCII printable characters (see chart in the article). The only bit you need to keep in mind is that <space> is a character as well, and hence you can match on it.
You could rewrite the regex like so (note I haven’t escaped anything in this so it’s probably not valid)

[ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~]

but that’s not quite as clever or neat.
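
As a quick sanity check, here is a minimal PHP sketch of the short form in action (illustrative only):

$printable = '/^[ -~]+$/';

var_dump(preg_match($printable, 'Hello, World! 123'));   // int(1) - every character is printable ASCII
var_dump(preg_match($printable, "Tabs\tand\nnewlines")); // int(0) - control characters fall outside the range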

Quora answer about writing a search engine

The following is a response I posted on Quora to the question “I am planning to make a small scale search engine on my local system, but I don’t know from where to start?”. It’s a reasonable answer, so like my Mahalo one I thought I would keep a copy for myself.

I agree with Wolf Garbe that in your case you are better off starting with existing technologies; have a look at http://yacy.net/ and SphinxSearch as well. However, if you are doing this to learn and not just to deliver a product, I can provide a few links for you.

For your specific questions,

1. How do I use hashing for efficient search operations?

You are talking about having an inverted index, I suspect. Have a look at the articles linked below which discuss the inverted index. Keep in mind you have options here, such as a plain inverted index or a full inverted index; the latter also records word positions, which is useful if you want to do things like proximity searches. For hashing itself,

Which hashing algorithm is best for uniqueness and speed?

We Worship MD5, the GOD of HASH

Be careful when using hashes with URLs. While the square root of the number of possible hashes is still a lot bigger than the current web size, if you do get a collision you are going to get pages about Britney Spears when you were expecting pages about Bugzilla. Look into using bloom filters to avoid these issues (assuming you get to sufficient scale).

2. How will I manage the data?

Up to you. For small scale I would just use whatever database you are most familiar with. Any SQL database will scale up to hundreds of millions of records without too many issues.

3. How would my searching algorithm work?

This is up to you as well. You are the one in control here. Assuming you want to get something up and running as soon as possible I would do the following.

Write a simple crawler and start crawling. (for url; get url; find urls;) is about all you need. For seeding use Wikipedia’s data dump, the Alexa top lists or DMOZ data.
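
A minimal sketch of that loop in PHP might look something like the following (the seed URL and the href regex are placeholders of my own; a real crawler also needs politeness delays, robots.txt handling and persistence):

$queue = array('http://en.wikipedia.org/'); // seed URLs
$seen  = array();

while (!empty($queue)) {
    $url = array_shift($queue);
    if (isset($seen[$url])) {
        continue; // already crawled
    }
    $seen[$url] = true;

    // get url
    $html = @file_get_contents($url);
    if ($html === false) {
        continue;
    }

    // find urls and push them onto the queue
    preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $matches);
    foreach ($matches[1] as $link) {
        $queue[] = $link;
    }

    // TODO store $html somewhere for the indexer
}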

Build a simple inverted index indexer and index as you go. Limit your index to small portions of text (title, meta tags etc.) for the moment until you get the kinks ironed out. If your indexer is not using 100% of the CPU, rethink your approach as it is wrong.
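
To make the inverted index idea concrete, here is a very small sketch (the tokenisation is deliberately naive and the document IDs are made up; a real indexer would also store term frequencies or positions):

$index = array(); // term => list of document ids

function indexDocument($docId, $text, &$index) {
    // Normalise and split the text into terms
    $terms = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    foreach (array_unique($terms) as $term) {
        $index[$term][] = $docId;
    }
}

indexDocument(1, 'PHP search engine', $index);
indexDocument(2, 'Writing a search engine is hard', $index);

print_r($index['search']); // Array ( [0] => 1 [1] => 2 )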

Build a simple ranker (just rank by the number of query words found in each document for the moment). DO NOT DO PAGE RANK! Skipping it will save you a lot of time while getting everything else working.

Build it by default to be an OR engine (this saves you writing a query parser or working out how to intersect two 10 million document lists quickly).
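
Putting the last two points together, a naive OR query with count-based ranking over an index like the one sketched above could look like this (again, just an illustration):

// A tiny inverted index of the kind built in the previous sketch
$index = array(
    'search' => array(1, 2),
    'engine' => array(2),
    'php'    => array(1),
);

function searchOr($query, $index) {
    $scores = array(); // document id => number of query terms matched

    foreach (preg_split('/\s+/', strtolower(trim($query))) as $term) {
        if (isset($index[$term])) {
            foreach ($index[$term] as $docId) {
                $scores[$docId] = isset($scores[$docId]) ? $scores[$docId] + 1 : 1;
            }
        }
    }

    arsort($scores); // best matching documents first
    return $scores;
}

print_r(searchOr('search engine', $index)); // Array ( [2] => 2 [1] => 1 )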

Be sure to use a stemmer from the following Stemming Algorithm. Implement a fairly large list of stop words and ignore anything less than 3 characters in length.
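
The stop word and length filtering is simple enough to sketch (the stop word list here is only a tiny sample, and the stemmer itself is best taken from an existing Porter stemmer implementation):

$stopWords = array('the', 'and', 'for', 'that', 'with'); // sample only

function filterTerms($terms, $stopWords) {
    $keep = array();
    foreach ($terms as $term) {
        // Drop stop words and anything shorter than 3 characters
        if (strlen($term) >= 3 && !in_array($term, $stopWords)) {
            $keep[] = $term; // a stemmer would be applied here
        }
    }
    return $keep;
}

print_r(filterTerms(array('the', 'search', 'is', 'fast'), $stopWords));
// Array ( [0] => search [1] => fast )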

The above should be enough to occupy you for several weeks at least.

Here is a link to a collection of articles on how to start building a search engine.

Want to write a search engine? Have some links

I have copied the article below, but the above link I tend to update from time to time as I find new articles.

PHP Search Engine – Yioop!

This one is fairly fresh and talks about building and running a general purpose search engine in PHP.

About Us - Gigablast

This one has been defunct for a long time now, but it was written by Matt Wells (Gigablast and Procog) and gives a small amount of insight into the issues and problems he worked through while writing Gigablast.

Why Writing Your Own Search Engine Is Hard

This is probably the most famous of all search engine articles with the exception of the original Google paper. Written by Anna Patterson (Cuil), it really explores the basics of how to get a search engine up and running, from crawler to indexer to serving results.

A Conversation with Matt Wells

A fairly interesting interview with Matt Wells (Gigablast and Procog) which goes into some details of problems you will encounter running your own search engine.

Building a Search Engine

This has a few articles written about creating a search engine from scratch. It appears to have been on hold for years but some of the content is worth reading. If nothing else it’s another view of someone starting down the search engine route.

blekko | spam free search

Blekko’s engineering blog is usually interesting and covers all sorts of material applicable to search engines.

http://www.boyter.org/2013/01/co…

This is a shameless plug but I will even suggest my own small implementation. It’s essentially a walk-through of writing a search engine in PHP. I implemented it and it worked quite well, with 1 million pages serving up reasonable results. It actually covers everything you want: crawling, indexing, storing and ranking, with articles explaining why I did certain things and the full source code here: Phindex

The Anatomy of a Search Engine

The granddaddy of search papers. It’s very old but outlines how the original version of Google was designed and written.

open-source-search-engine

Gigablast, mentioned above, has since become an open source project hosted on GitHub. Personally I am yet to look through the source code, but you can find how to run it on the developer page and administration page.

High Scalability – DuckDuckGo Architecture - 1 Million Deep Searches a Day and Growing
High Scalability – The Anatomy of Search Technology: blekko’s NoSQL database
High Scalability – Challenges from large scale computing at Google
High Scalability – Google’s Colossus Makes Search Real-time by Dumping MapReduce
High Scalability – The Three Ages of Google – Batch, Warehouse, Instant

The above are fairly interesting. The blekko one is the most technical. If you only have time to read one, go with the blekko one.

Another thing you might want to consider is looking through the source of existing indexing engines like Solr and Sphinx. I am personally running through the initial version of the Sphinx engine and will one day write a blog post about how it works.

Here are a few other links (disclaimer: I wrote all of these) showing how to implement the vector space model (a good thing to start with as it does the ranking for you):

Vector Space Search Model Explained
Building a Vector Space Indexing Engine in Python
GoLang Vector Space Implementation
C# Vector Space Implementation

and here is a much better article which explains the math behind it,

Page on La2600

For snippet extraction I have another article here,

Building a Search Result Extract Generator in PHP

For crawling here is another one,

Why Writing a Web Crawler isn’t Easy

Lastly, if you do go ahead and write your own search engine, please write blog posts or articles about it. It’s quite hard to find this sort of information, especially from the big boys (Google, Bing, Yandex, Yahoo), and I would love to see more articles about it.

Introducing SingleBugs the Bug Tracker for Single Developers

SingleBugs Logo
Introducing the first beta release of SingleBugs, the bug tracker aimed at single/solo developers. Are you a solo developer? Do you find setting up Mantis/FogBugz/Bugs.net/et al. too complex and a waste of your time? Do you want a single solution that easily syncs and backs up between your machines? Try SingleBugs.

Guaranteed to save you time and money setting up a bug tracker. Guaranteed to be the fastest bug tracker you have ever used.

No set-up required. Simple backups. Instant searching. Easily exportable.

With the marketing spiel over, give it a shot. It is currently only available for Windows x64 and Linux x64. Download the demo below.


demo_singlebugs_x64_linux

demo_singlebugs_x64_windows

SingleBugs screenshots

The Fizzbuzz Bug Tracker: A Bug Tracker for Single Developers

This serves as the announcement of my new bug tracker product I am working on.

My needs are pretty specific and none of the existing bug trackers I have tried have met the following goals.

  • Speed. Searching, adding projects/issues/comments should be instant. Any time waiting on the bug tracker is wasted time.
  • Outlook style view of projects/issues/comments. This should allow me to get an overview of how I am tracking.
  • Not require me to spin up a full web-server. I don’t want to have to install a full stack web-server just to track bugs.
  • Sync across all devices. I want it to just appear on every device I own without me having to worry about backups or keeping things in sync.
  • Be single user. I don’t want to have to login or perform user management.
  • Support multiple projects. I also want to be able to have multiple issues per project.

With that in mind let me introduce the Fizzbuzz Bug Tracker. Fizzbuzz is a single executable that works on Windows, Linux and OSX (once I have a machine to compile it on). A suggested workflow is to copy it into your Dropbox folder (or equivalent) where it will be synced across all devices. Fizzbuzz has a strong emphasis on speed. Adding projects or issues, searching across projects and issues, and drilling into projects or issues are all near instant.

Fizzbuzz spins up its own web-server running on whatever port you choose (defaults to 8080), allowing you to use it without installing any additional software. Fizzbuzz has no user logins so getting started is very fast.

Interested? Let me know. It’s getting very close to a public release at which point I will probably choose a real name for it. If you would like to get access sooner rather than later, email me or bitmessage me (details on the right bar and about page).

Collection of Letters for Neural Network OCR Training

I was looking for this on Google the other day and was unable to find it. Essentially what I needed was a collection of images which are all the same size, but of different fonts, so that I could use them for training neural networks and testing other OCR techniques. Since I couldn’t find any I thought I would upload my own collection.

I used the below images when working on my thesis. From memory, over 20 different fonts and sizes were used to create about 200 examples of each letter. Networks trained on the full data set proved to be pretty accurate when it came to recognizing most examples of text I found on the web.

The attached collection of images was generated using a script. It essentially just generates a number of images, each of which has a letter contained in it. Another script then finds the location of the letter in the image, crops to just that letter, resizes it to a specific size and saves it into an appropriate directory. The full training set can be downloaded below.

Collection of letters for CAPTCHA/OCR/Neural Network training.

The PHP program for generating the images is included below. All you need to do is add some fonts into the referenced fonts directory and it should generate the images for you.

set_time_limit(0);

// Read in all of the fonts, dropping the "." and ".." directory entries
$files1 = scandir("./fonts/");
array_splice($files1, 0, 2);
$file1totalcount = count($files1);

$letters = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z";
//$letters = "a b c d e f g h i j k l m n o p q r s t u v w x y z";
$array = explode(" ", $letters);
$number = 200; // examples to generate per letter

foreach ($array as $letter) {
    for ($i = 0; $i < $number; $i++) {
        $im = imagecreatetruecolor(500, 300);

        // Create some colors and fill the background with black
        $white = imagecolorallocate($im, 255, 255, 255);
        $grey = imagecolorallocate($im, 128, 128, 128);
        $black = imagecolorallocate($im, 0, 0, 0);
        imagefilledrectangle($im, 0, 0, 800, 800, $black);

        // Draw the letter in a random font, size and position
        $font = './fonts/' . $files1[rand(0, $file1totalcount - 1)];
        imagettftext($im, rand(15, 30), 0, rand(30, 200), rand(20, 250), $white, $font, $letter);

        // Save the image into a directory named after the letter
        if (!is_dir($letter)) {
            mkdir($letter);
        }
        imagegif($im, "./" . $letter . "/$i.gif");
        imagedestroy($im);
    }
}

Saving Resources for Humans in PHP

One of the issues I have run into running searchcode.com is that a huge amount of the time spent serving pages is spent serving them to bots (about 900,000 hits a day are from bots). I have been watching the load averages and they can spike to over 30.00 occasionally, which is pretty bad for a 4 core/HT system. I don’t have any problems with bots really, other than the fact that I cannot really control them. Sure, you can specify crawl delays, but it’s of limited use if the bot chooses to ignore it.

Another solution is to slow them down. Of course you don’t want to go too crazy with this as I do want my site indexed but not at the expense of real users. Thankfully with PHP this can be accomplished using the sleep and usleep functions.

if(isHuman() == false) {
  usleep(500000); // sleep for half a second (1,000,000 microseconds = 1 second)
}

The above is how you can implement a simple limiter to slow the bots down. Based on whether the hit is human or not, it will sleep for half a second. How you implement the isHuman method is up to you, but a quick search shows the most common way is to check the user agent using $_SERVER['HTTP_USER_AGENT'] and based on that work out if the user is a bot. Check your framework of choice as well, as it’s a pretty common thing to want to do. Something else to consider is adding a much larger delay for bots that visit pages of your site you do not want them visiting. It should encourage them to crawl elsewhere.
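
For illustration, one way isHuman could be written is sketched below (the list of bot keywords is just a guess on my part; tune it against the user agents you actually see in your logs):

function isHuman() {
  $agent = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';

  // No user agent at all is almost always a script of some sort
  if ($agent === '') {
    return false;
  }

  // Common substrings that crawlers put in their user agents
  foreach (array('bot', 'crawl', 'spider', 'slurp', 'curl', 'wget') as $needle) {
    if (strpos($agent, $needle) !== false) {
      return false;
    }
  }

  return true;
}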

The results? Well, the average load for searchcode is below 2.10 most of the time, which is a massive improvement. I have not seen it spike to any more than 7.00, which is acceptable for peak periods.

C# Vector Space Implementation

Since I am writing lots of Vector Space implementations in Go, Python etc… I thought I would add another one in C#. This one is a little more verbose than either the Python or Go implementations. The verbosity is mostly due to not using any of the nice C# LINQ functionality, which would really reduce the size.

In any case here it is in case you are looking for a simple implementation of this useful class.

class Program
{
	static void Main(string[] args)
	{
		var v = new VectorCompare();

		var con1 = v.Concordance("this is a test");
		var con2 = v.Concordance("this is another test");

		var t = v.Relation(con1, con2);

		Console.WriteLine(t);
		Console.ReadLine();
	}
}

public class VectorCompare
{
	public double Magnitude(Dictionary<string, int> con)
	{
		Double total = 0;
		foreach (var t in con)
		{
			total += Math.Pow(Convert.ToDouble(t.Value), 2);
		}

		return Math.Sqrt(total);
	}

	public double Relation(Dictionary<string, int> con1, Dictionary<string, int> con2)
	{
		// Dot product of the two concordances over the words they share
		var topvalue = 0;

		foreach(var t in con1)
		{
			if(con2.ContainsKey(t.Key))
			{
				topvalue += t.Value * con2[t.Key];
			}
		}

		var mag = Magnitude(con1) * Magnitude(con2);

		// Cosine similarity; guard against dividing by zero for empty documents
		if(mag != 0)
		{
			return topvalue / mag;
		}
		return 0;
	}

	public Dictionary<string, int> Concordance(string document)
	{
		var con = new Dictionary<string, int>();

		foreach (var word in document.ToLower().Trim().Split(' '))
		{
			if (!string.IsNullOrWhiteSpace(word))
			{
				if (con.ContainsKey(word))
				{
					con[word] = con[word] + 1;
				}
				else
				{
					con[word] = 1;
				}
			}
		}

		return con;
	}
}