Saving Resources for Humans in PHP

One of the issues I have run into running searchcode.com is that a huge amount of the time spent serving pages goes to bots (about 900,000 hits a day are from bots). I have been watching the load averages and they occasionally spike to over 30.00, which is pretty bad for a 4 core/HT system. I don’t really have any problem with bots other than the fact that I cannot control them. Sure, you can specify crawl delays, but that is of limited use if the bot chooses to ignore them.

Another solution is to slow them down. Of course you don’t want to go too crazy with this, as I do want my site indexed, just not at the expense of real users. Thankfully with PHP this can be accomplished using the sleep and usleep functions.

if (!isHuman()) {
  usleep(500000); // microseconds: 500,000 is half a second, 1,000,000 is 1 second
}

The above is how you can implement a simple limiter to slow the bots down. If the hit is not from a human, it sleeps for half a second. How you implement the isHuman function is up to you, but a quick search shows the most common way is to check the user agent using $_SERVER['HTTP_USER_AGENT'] and work out from that whether the visitor is a bot. Check your framework of choice as well, as it’s a pretty common thing to want to do. Something else to consider is adding a much larger delay for bots that visit pages of your site you do not want them visiting. It should encourage them to crawl elsewhere.

The results? Well, the average load for searchcode is below 2.10 most of the time, which is a massive improvement. I have not seen it spike any higher than 7.00, which is acceptable for spiky periods.

C# Vector Space Implementation

Since I am writing lots of Vector Space implementations in Go, Python etc… I thought I would add another one in C#. This one is a little more verbose than either the Python or Go implementations. The verbosity is mostly due to not using any of the nice C# LINQ functionality, which would really reduce the size.

In any case, here it is if you are looking for a simple implementation of this useful class.

using System;
using System.Collections.Generic;

class Program
{
	static void Main(string[] args)
	{
		var v = new VectorCompare();

		var con1 = v.Concordance("this is a test");
		var con2 = v.Concordance("this is another test");

		var t = v.Relation(con1, con2);

		Console.WriteLine(t);
		Console.ReadLine();
	}
}

public class VectorCompare
{
	public double Magnitude(Dictionary<string, int> con)
	{
		Double total = 0;
		foreach (var t in con)
		{
			total += Math.Pow(Convert.ToDouble(t.Value), 2);
		}

		return Math.Sqrt(total);
	}

	public double Relation(Dictionary<string, int> con1, Dictionary<string, int> con2)
	{
		var topvalue = 0; // dot product of the two word-count vectors

		foreach(var t in con1)
		{
			if(con2.ContainsKey(t.Key))
			{
				topvalue += t.Value * con2[t.Key];
			}
		}

		var mag = Magnitude(con1) * Magnitude(con2);

		if(mag != 0)
		{
			return topvalue / mag;
		}
		return 0;
	}

	public Dictionary<string, int> Concordance(string document)
	{
		var con = new Dictionary<string, int>();

		foreach (var word in document.ToLower().Trim().Split(' '))
		{
			if (!string.IsNullOrWhiteSpace(word))
			{
				if (con.ContainsKey(word))
				{
					con[word] = con[word] + 1;
				}
				else
				{
					con[word] = 1;
				}
			}
		}

		return con;
	}
}
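As a worked check of what Relation computes (the cosine of the angle between the two word-count vectors), here is the arithmetic for the two strings used in Main. The symbols A and B are notation introduced here for the two concordances; they do not appear in the code.

```latex
% General form: dot product divided by the product of the magnitudes
\[
\mathrm{Relation}(A, B) =
  \frac{\sum_{w} A_w B_w}{\sqrt{\sum_{w} A_w^2}\;\sqrt{\sum_{w} B_w^2}}
\]

% A = "this is a test", B = "this is another test"
% Shared words: this, is, test (each with count 1 in both documents)
% Each document has four distinct words with count 1, so each magnitude is sqrt(4) = 2
\[
\mathrm{Relation}(A, B) =
  \frac{1 \cdot 1 + 1 \cdot 1 + 1 \cdot 1}{\sqrt{4}\;\sqrt{4}}
  = \frac{3}{4} = 0.75
\]
```

So the program above should report a relation of 0.75 for its two test strings.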

GoLang Vector Space Implementation

UPDATE – This is now actually available as a real Golang import with tests. Get it at https://github.com/boyter/golangvectorspace

I have mentioned this before somewhere, but one of the first things I usually attempt to implement in any programming language I want to play with is a vector space. It’s my own personal FizzBuzz implementation. It usually covers everything I need to know in a language (imports, functions, string manipulation, math functions, iteration, maps etc…) so I consider it a good thing to get started with.

You can see my Python implementation in a previous post.

Anyway, I have been playing with Go recently. After skimming through the tutorials I thought I would give my standard test, the vector space, a go. The below is my implementation. It’s probably full of bugs and various other issues but seems to work alright for the few tests I tried.

package main

import (
	"fmt"
	"math"
	"strings"
)

func magnitude(con map[string]float64) float64 {
	var total float64 = 0

	for _, v := range con {
		total = total + math.Pow(v, 2)
	}

	return math.Sqrt(total)
}

func concordance(document string) map[string]float64 {
	con := make(map[string]float64)

	// strings.Fields splits on runs of whitespace, so repeated spaces
	// never produce an empty "word" that would be counted by mistake.
	for _, key := range strings.Fields(strings.ToLower(document)) {
		// A missing key reads as 0, so this both inserts and increments.
		con[key]++
	}

	return con
}

func relation(con1 map[string]float64, con2 map[string]float64) float64 {
	var topvalue float64 = 0

	for name, count := range con1 {
		_, ok := con2[name]

		if ok {
			topvalue = topvalue + (count * con2[name])
		}
	}

	mag := magnitude(con1) * magnitude(con2)

	if mag != 0 {
		return topvalue / mag
	}
	return 0
}

func main() {
	var con = concordance("this is a  test of stuff yes stuff")
	var con2 = concordance("This is a     test")

	fmt.Println(relation(con, con2))
}
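A few sanity checks are worth running against any vector space like this: identical documents should score 1, documents with no words in common should score 0, and an empty document should not cause a division by zero. The sketch below is a compact standalone variant rather than the listing above verbatim; it assumes strings.Fields for splitting words, and almostEqual is a helper introduced here for comparing floats.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// concordance counts lowercased words. strings.Fields splits on any run of
// whitespace, so repeated spaces never produce empty words.
func concordance(document string) map[string]float64 {
	con := make(map[string]float64)
	for _, word := range strings.Fields(strings.ToLower(document)) {
		con[word]++
	}
	return con
}

// magnitude is the Euclidean length of the word-count vector.
func magnitude(con map[string]float64) float64 {
	total := 0.0
	for _, v := range con {
		total += v * v
	}
	return math.Sqrt(total)
}

// relation is the cosine of the angle between two word-count vectors.
func relation(con1, con2 map[string]float64) float64 {
	dot := 0.0
	for word, count := range con1 {
		dot += count * con2[word] // a word missing from con2 reads as 0
	}
	mag := magnitude(con1) * magnitude(con2)
	if mag == 0 {
		return 0
	}
	return dot / mag
}

// almostEqual compares floats with a small tolerance (hypothetical helper).
func almostEqual(a, b float64) bool {
	return math.Abs(a-b) < 1e-9
}

func main() {
	same := relation(concordance("this is a test"), concordance("This  is a   TEST"))
	none := relation(concordance("apples pears"), concordance("cats dogs"))
	empty := relation(concordance(""), concordance("anything"))

	// Each check should print true.
	fmt.Println(almostEqual(same, 1), none == 0, empty == 0)
}
```

The first check also confirms that case and extra spaces do not affect the score.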