C# as a Language from old Google+ Post

The more I use C# as a language for writing things, the more I am convinced that its approach really is the best out there.

The unit test support is excellent, which allows development to be just as fast as in any dynamic language (Python, PHP, Perl).

The static typing catches so many issues before you get to runtime and allows sweeping changes without breaking things.

Unlike Java it has the var keyword (which saves time and improves readability) and many more useful functions which, yes, you could replicate yourself, but which are just built in and work correctly.

Then you get to the really good stuff. LINQ is awesome. Lazy evaluation lets you implement a repository pattern over your database, which is just awesome. Set up the basic select-everything query, then add extension methods allowing you to chain whatever you need, e.g.

var person = _dbContext.GetPerson().ByUserName(username).ByPassword(password);

100% elegant: easy to test, easy to write, easy to read and understand, and it generally works exactly as you would expect without any hidden gotchas. And because it's lazy it doesn't chew resources sucking back everything from the database.

You can use functional programming techniques if you wish, and with the new async/await support you can work in a node.js style if you want, with static typing and all existing library support.

Or you can continue to work in a C-like manner, or mix it up with objects, procedural code and functional style.

I switched back to Java not that long ago to write a simple server using Jetty, and even with things like Guice (the best DI implementation I have used so far) and Guava it was still painful. Less painful, but I really felt that the language was stopping me from doing things in an elegant manner most of the time. Even adding the “var” keyword would improve Java in a massive way. Add some functional programming in there and I would be pretty happy.

I just wish C# would run on the JVM as I would use it for pretty much everything in a heartbeat. As it is, the Mono support is missing the stuff I really want and isn't as seamless as the experience should be. A pity really, as C# is, in my experience, the nicest production-ready language to work with today.

A story about Hubris and Integration Tests

Philip Dormer Stanhope, 4th Earl of Chesterfield (pictured) managed to embarrass me in front of my peers once. Sort of. In truth it was my hubris that caused the incident. Here is how it happened and what I learnt through the process.

Philip Dormer Stanhope, 4th Earl of Chesterfield

In the summer of 2010 I was tasked with developing a new application where I worked. The requirement was fairly simple: “We need a web application to upload a CSV”. Requirements such as this aren't exactly conducive to a good outcome, but I was confident that given the data requirements it would be fairly easy to do.

The data requirements came in and I got to work. At the time the only option for custom software where I was working was C#, LINQ to SQL, Webforms and SQL Server. Not a huge problem as I like all of those except for Webforms. Thankfully since it was only a simple file upload I didn't have much to do there. I had just jumped on the TDD bandwagon, and I quickly mocked away the data context (harder to do than you would think with LINQ) and tested the heck out of the application. Ignoring the Webform component we were looking at 99% test coverage. I even threw in some mutation testing. Dates were checked from the year 1 to the year 9999, integers parsed correctly, string lengths verified. Everything was above board and I proudly stated that our tester would not find any issues in the code.

2.5 minutes. That’s approximately how long it took for him to find a bug that crashed the application. I was bright red and scrambling to figure out the problem.

Integration is hard. Really hard. Ask anyone where most of their debugging time goes and odds are they will say during integration. My tester, like any good tester, had started with some boundary tests: integers over 2,147,483,647 (the signed maximum), strings over the max length, and dates in the years 1 and 9999.

Wait a minute, didn't you just say that was tested? Yep, I did. For the exact condition that threw the error too. Turns out that SQL Server only supports dates from 1753, whereas .NET supports the year 1 through 9999. Why does SQL Server only support dates from 1753? That's due to the Calendar (New Style) Act 1750, which our friend Philip Stanhope debated for. Turns out the Sybase developers didn't want to add the additional code to calculate dates correctly before 1753, so they made that date the epoch. This of course caused my code to blow up, and me to get very embarrassed.

The fix, BTW, to check if it's a valid SQL Server date is pretty simple (SqlDateTime lives in System.Data.SqlTypes),


    // Valid range for SQL Server's datetime type: 1753-01-01 to 9999-12-31
    static bool IsValidSqlDate(DateTime date)
    {
        return date >= (DateTime)SqlDateTime.MinValue
            && date <= (DateTime)SqlDateTime.MaxValue;
    }
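To make the boundary concrete, here is a quick sketch exercising that helper at the edges of the range (a sketch only; it assumes a reference to System.Data for SqlDateTime):

```csharp
using System;
using System.Data.SqlTypes;

class SqlDateCheck
{
    // Same helper as above: the range SQL Server's datetime type accepts
    static bool IsValidSqlDate(DateTime date)
    {
        return date >= (DateTime)SqlDateTime.MinValue
            && date <= (DateTime)SqlDateTime.MaxValue;
    }

    static void Main()
    {
        Console.WriteLine(IsValidSqlDate(new DateTime(1753, 1, 1)));   // True, the epoch itself
        Console.WriteLine(IsValidSqlDate(new DateTime(1752, 12, 31))); // False, before the epoch
        Console.WriteLine(IsValidSqlDate(DateTime.MinValue));          // False, the year 1
    }
}
```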

So what did I learn from this experience?

Well, number one is that unit tests, no matter how thorough, can never be a replacement for integration tests. Even PERFECT unit tests (which I believed mine to be) would not account for something like this. The takeaway being: unit test by all means, but don't consider the code rock solid till you have actually run it. Integration test that sucker to be sure everything will work at run time.

The second is that it pays to be quietly confident rather than vocally confident. Sure, hubris is sometimes considered a good programmer trait (debatable), and it's usually fine, till you end up looking like a fool like I did.

For fun I have included my code comment below I added once I figured this all out. Hail to ye o’ merry man indeed.


    /*
     *      Philip Dormer Stanhope, 4th Earl of Chesterfield
     *      
     *      it is because of him we need to validate dates before 1753 in SQL Server
     *      hail to ye o' merry man.
     * 
     *      http://en.wikipedia.org/wiki/Chesterfield%27s_Act
     *
     */

C# XML Cleaner Regex

One of the most annoying things I deal with is XML documents with invalid characters inside them. Usually caused by copy-pasting from MS Word, they end up with invisible characters that you cannot easily find and that cause XML parsers to choke. I have encountered this problem enough times that I thought a quick blog post would be worth the effort.

As such, here mostly for my own reference is a regular expression for C# .NET that will clean invalid XML characters from any XML file. Note that .NET regular expressions use \uXXXX escapes for characters beyond a single byte (the \x escape takes exactly two hex digits), and that characters above U+FFFF are stored as surrogate pairs in .NET strings, so this pattern strips those as well; handle surrogates separately if you need to keep them.

const string InvalidXmlChars = @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]";

Simply take the XML document as a string and do a simple regular expression replace over it to remove the characters. You can then import into whatever XML document structure you want and process it as normal.
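A sketch of that replace step, using the pattern above (the class and method names here are just illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

class XmlCleaner
{
    // Everything outside the XML 1.0 valid ranges is matched and stripped.
    // Note this also strips surrogate pairs (characters above U+FFFF).
    const string InvalidXmlChars = @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]";

    public static string Clean(string xml)
    {
        return Regex.Replace(xml, InvalidXmlChars, string.Empty);
    }

    static void Main()
    {
        // \v (vertical tab, U+000B) is invalid in XML and gets removed
        Console.WriteLine(Clean("<doc>hello\vworld</doc>")); // prints <doc>helloworld</doc>
    }
}
```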

For the record, the reason I encountered this was when I was building a series of WSDL web services over HR data. Since the data had multiple sources merged during an ETL process, and some of them were literally CSV files, I hit this issue a lot. Attempts to sanitize the data at the source failed, as it was overwritten every few hours and changing the source generation involved multiple teams. In the end I just ran the cleaner over every field before it was returned and everything worked perfectly.

Implementing C# Linq Distinct on Custom Object List

Ever wanted to implement a Distinct over a custom object list in C#? You quickly discover that it fails to work. Sadly there is a lack of decent documentation about this and a lot of FUD. Since I lost a bit of time to it, hopefully this blog post can be picked up as the answer.

Thankfully it's not as difficult as you would imagine. Assuming you have a simple custom object which contains an Id, and you want to use that Id to get a distinct list, all you need to do is add the following to the class.

public override bool Equals(object obj)
{
	// Two instances are equal when their Ids match
	var other = obj as CustomObject;
	return other != null && this.Id == other.Id;
}

public override int GetHashCode()
{
	return this.Id.GetHashCode();
}

You need both due to the way LINQ works. Under the hood Distinct buckets elements by GetHashCode and only calls Equals when two hash codes match, which is why overriding one without the other is not enough.
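For instance, given a hypothetical CustomObject with the two overrides in place, Distinct collapses duplicates by Id:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CustomObject
{
    public int Id { get; set; }

    // Two instances are equal when their Ids match
    public override bool Equals(object obj)
    {
        var other = obj as CustomObject;
        return other != null && this.Id == other.Id;
    }

    public override int GetHashCode()
    {
        return this.Id.GetHashCode();
    }
}

class Program
{
    static void Main()
    {
        var list = new List<CustomObject>
        {
            new CustomObject { Id = 1 },
            new CustomObject { Id = 1 }, // duplicate by Id
            new CustomObject { Id = 2 },
        };

        var distinct = list.Distinct().ToList();
        Console.WriteLine(distinct.Count); // prints 2
    }
}
```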

C# Vector Space Implementation

Since I am writing lots of Vector Space implementations in Go, Python etc… I thought I would add another one in C#. This one is a little more verbose than either the Python or Go implementations. The verbosity is mostly due to not using any of the nice C# LINQ functionality, which would really reduce the size.

In any case here it is in case you are looking for a simple implementation of this useful class.

class Program
{
	static void Main(string[] args)
	{
		var v = new VectorCompare();

		var con1 = v.Concordance("this is a test");
		var con2 = v.Concordance("this is another test");

		var t = v.Relation(con1, con2);

		Console.WriteLine(t);
		Console.ReadLine();
	}
}

public class VectorCompare
{
	public double Magnitude(Dictionary<string, int> con)
	{
		double total = 0;
		foreach (var t in con)
		{
			total += Math.Pow(Convert.ToDouble(t.Value), 2);
		}

		return Math.Sqrt(total);
	}

	public double Relation(Dictionary<string, int> con1, Dictionary<string, int> con2)
	{
		var topvalue = 0;

		foreach(var t in con1)
		{
			if(con2.ContainsKey(t.Key))
			{
				topvalue += t.Value * con2[t.Key];
			}
		}

		var mag = Magnitude(con1) * Magnitude(con2);

		if(mag != 0)
		{
			return topvalue / mag;
		}
		return 0;
	}

	public Dictionary<string, int> Concordance(string document)
	{
		var con = new Dictionary<string, int>();

		foreach (var word in document.ToLower().Trim().Split(' '))
		{
			if (!string.IsNullOrWhiteSpace(word))
			{
				if (con.ContainsKey(word))
				{
					con[word] = con[word] + 1;
				}
				else
				{
					con[word] = 1;
				}
			}
		}

		return con;
	}
}
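Since the verbosity above is mostly down to avoiding LINQ, here is a sketch of the same calculation compressed with it. The behaviour is the same; for the two example documents the three shared terms (this, is, test) and magnitudes of 2 give 3 / (2 × 2) = 0.75.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class VectorCompareLinq
{
    // Word counts for a document, e.g. "this is a test" -> {this:1, is:1, a:1, test:1}
    public static Dictionary<string, int> Concordance(string document)
    {
        return document.ToLower().Trim().Split(' ')
            .Where(w => !string.IsNullOrWhiteSpace(w))
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static double Magnitude(Dictionary<string, int> con)
    {
        return Math.Sqrt(con.Values.Sum(v => (double)v * v));
    }

    public static double Relation(Dictionary<string, int> con1, Dictionary<string, int> con2)
    {
        // Dot product over the shared terms only
        double top = con1.Where(t => con2.ContainsKey(t.Key))
                         .Sum(t => (double)t.Value * con2[t.Key]);

        var mag = Magnitude(con1) * Magnitude(con2);
        return mag != 0 ? top / mag : 0;
    }

    static void Main()
    {
        var relation = Relation(Concordance("this is a test"),
                                Concordance("this is another test"));
        // 3 shared terms / (magnitude 2 * magnitude 2)
        Console.WriteLine(relation); // prints 0.75
    }
}
```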

Clean Repository Data Access in C#

Mostly as a self-reference, here is an extremely clean data access pattern possible using C# and Entity Framework. It saves you the effort of mocking the database context, as the code you end up writing is so simple it is all compile-time checked.

Essentially you define a very simple class which provides a single method for getting data (although you may want a save data method too) and make sure you add an interface to make unit testing/mocking easier.

public interface IUrlRepository
{
	IQueryable<Url> GetUrl();
	void Save(Url url);
}

public class UrlRepository : IUrlRepository
{
	private readonly DbContext _context;

	public UrlRepository()
	{
		_context = new DbContext();
	}

	public IQueryable<Url> GetUrl()
	{
		return from u in _context.Urls
			   select u;
	}

	public void Save(Url url)
	{
		_context.Urls.Add(url);
		_context.SaveChanges();
	}
}

As you can see, rather than returning a list you return an IQueryable<Url>. Because Entity Framework is lazy you can then add extension methods over the return like so.

public static class UrlRepositoryExtension
{
	public static IQueryable<Url> ByCreatedBy(this IQueryable<Url> url, string user)
	{
		return url.Where(p => p.Created_By.Equals(user));
	}

	public static IQueryable<Url> OrderByCreateDate(this IQueryable<Url> url)
	{
		return url.OrderByDescending(x => x.Create_Date);
	}
}

With this you end up with a very nice method of running queries over your data.

var url = _urlRepo.GetUrl().OrderByCreateDate();

Since it can all be chained you can just add more filters easily as well.

var url = _urlRepo.GetUrl().OrderByCreateDate().ByCreatedBy("Ben Boyter");

What about joins, I hear you ask? Well thankfully this pattern takes care of those too. Just have two repositories, pull the queryable from each and do the following.

var users = _userRepo.GetUser();
var locations = _locationRepo.GetLocation();

var result = from user in users
             join location in locations on user.locationid equals location.id
             where location.name == "Parramatta"
             select user;

The best thing is that it's all lazy evaluation, so you don't end up pulling the full data set back into memory. Of course at a large enough scale you will probably hit some sort of leaky abstraction issue and end up rewriting to use pure SQL at some point, but for getting started this method of data access is incredibly powerful with few chances of errors.

Finally you get the advantage that you can write pure unit tests over your joins. Because you can easily mock the response from your repository, you don't have to create a seed database and provide a connection. This is fantastic for TDD, especially when running offline or on your local machine.
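As a sketch of that idea, an in-memory fake of the repository interface is enough to test query logic with no database at all (the Url shape and FakeUrlRepository here are illustrative, assuming the IUrlRepository interface from above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative entity matching the examples above
public class Url
{
    public string Created_By { get; set; }
    public DateTime Create_Date { get; set; }
}

public interface IUrlRepository
{
    IQueryable<Url> GetUrl();
}

// In-memory fake: no database connection or seed data needed
public class FakeUrlRepository : IUrlRepository
{
    private readonly List<Url> _urls;

    public FakeUrlRepository(List<Url> urls)
    {
        _urls = urls;
    }

    public IQueryable<Url> GetUrl()
    {
        return _urls.AsQueryable();
    }
}

class Program
{
    static void Main()
    {
        IUrlRepository repo = new FakeUrlRepository(new List<Url>
        {
            new Url { Created_By = "Ben Boyter" },
            new Url { Created_By = "Someone Else" },
        });

        // The same query chains used against the real repository
        // run unchanged against the in-memory fake
        var count = repo.GetUrl().Count(u => u.Created_By == "Ben Boyter");
        Console.WriteLine(count); // prints 1
    }
}
```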