Flaky Tests

A test is considered flaky (or flakey) if it fails occasionally without any change to the code under test. Flaky tests are generally considered a bad thing and should be modified so that they work correctly every time. This is because a test that is not trustworthy will be ignored even when it indicates a real failure.

There are many situations that can cause a test to become flaky. Integration and acceptance tests are generally the tests in your suite most likely to become flaky. They have more integrations across your software stack, and as such there are more things that can go wrong. We are going to go through a few of the main reasons, then look in detail at what you can do about one of them.

The first thing that causes a test to be flaky is when it depends on any external resource. This could include an API call to a third party, accessing a database or interacting with a file on the filesystem. Generally these sorts of interactions work most of the time but on occasion will not. The reasons for these are numerous.

A test making an API call can fail for all sorts of reasons: API changes, API limits, API key rotation, network connectivity, a third party with poor uptime, or concurrency issues. Any of these can cause a previously perfect test to fail. I was recently working with some tests for importing tweets into a system using the Twitter API. A developer had hard-coded a specific search and Twitter ID into the test. It turned out that Twitter can choose to drop tweets from their search index, and hence the API, at whim. The test became flaky and needed to be refactored to compensate.
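One way to take the third party out of a test like this is to replace the API client with a stub. The sketch below is illustrative only: `fetch_tweet_text` and the client's `get_tweet` method are hypothetical names invented for the example, not part of any real Twitter library.

```python
from unittest import mock


def fetch_tweet_text(client, tweet_id):
    """Return the text of a tweet, or None if the API no longer has it."""
    # client.get_tweet is a hypothetical method used only for this sketch.
    tweet = client.get_tweet(tweet_id)
    return tweet["text"] if tweet else None


def test_fetch_tweet_text_uses_stub():
    # The stub stands in for the real client, so no network call is made and
    # the result cannot change when the third party drops a tweet.
    client = mock.Mock()
    client.get_tweet.return_value = {"id": 42, "text": "hello world"}

    assert fetch_tweet_text(client, 42) == "hello world"
    client.get_tweet.assert_called_once_with(42)


def test_fetch_tweet_text_handles_missing_tweet():
    # The "tweet was dropped" case becomes an explicit, repeatable test.
    client = mock.Mock()
    client.get_tweet.return_value = None

    assert fetch_tweet_text(client, 42) is None
```

The trade-off is that a stubbed test no longer tells you when the real API changes, so you may still want a small number of real integration tests that are run separately.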

Database calls tend to fail for the same reasons. They can also fail because of incorrect dynamic SQL (note that if you have this problem you are probably open to SQL injection and should use bind parameters instead!) and because of errors such as the database missing the expected data.
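As a rough illustration of bind parameters, here is a minimal sketch using Python's built-in `sqlite3` module. The `users` table and the `find_user_ids` function are assumptions made up for the example.

```python
import sqlite3


def find_user_ids(connection, name):
    """Look up user ids by name using a bound parameter, not string building."""
    # The "?" placeholder passes the value separately from the SQL text,
    # so quotes in the input cannot change the statement.
    rows = connection.execute("SELECT id FROM users WHERE name = ?", (name,))
    return [row[0] for row in rows]


# An in-memory database keeps the example self-contained.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
connection.execute("INSERT INTO users (name) VALUES (?)", ("O'Brien",))
connection.commit()

# A name containing a quote works, and an injection attempt matches nothing.
print(find_user_ids(connection, "O'Brien"))
print(find_user_ids(connection, "x' OR '1'='1"))
```

Had the query been built with string concatenation, the first lookup would have produced invalid SQL and the second would have matched every row.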

Tests which interact with the filesystem, despite seeming solid, can fail at inopportune moments as well. The reasons are many, but the first things to look at would be concurrency issues, tests not cleaning up files, read/write permissions and file locks not being released. Without mocking away the filesystem (a solution which can fix these issues and improve performance) these tests can easily become flaky.
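To give an idea of what mocking away the filesystem can look like, here is a small sketch using `mock_open` from Python's standard `unittest.mock` module. The `read_setting` function and the file path are hypothetical and exist only for the example.

```python
from unittest import mock


def read_setting(path):
    """Return the contents of a small settings file with whitespace trimmed."""
    with open(path) as file:
        return file.read().strip()


def test_read_setting_without_touching_disk():
    # mock_open builds a fake open() that serves read_data from memory,
    # so no file, permission or lock on the real filesystem is involved.
    fake_open = mock.mock_open(read_data="timeout=30\n")
    with mock.patch("builtins.open", fake_open):
        assert read_setting("/etc/app/timeout.conf") == "timeout=30"
    fake_open.assert_called_once_with("/etc/app/timeout.conf")
```

Because nothing is written to or read from disk, the test cannot be broken by leftover files or by another test running at the same time.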

Let's go through a concrete example using a test designed to check if a file has been written. I am using a pseudocode language similar to Python, but the ideas should be the same for all languages.

Here we have a function which writes a heartbeat file to the temp directory with the current date and time. It's a commonly used pattern for daemons and other background tasks to confirm they are still running.

    import datetime

    def writeheartbeat():
        with open('/tmp/website_heartbeat.txt', 'w+') as file:
            file.write(str(datetime.datetime.now()))

Here are some tests which verify that the file is missing and that, once the function has been called, it exists.

    import os

    def testheartbeatmissing():
        assert not os.path.isfile('/tmp/website_heartbeat.txt')

    def testheartbeatexists():
        writeheartbeat()
        assert os.path.isfile('/tmp/website_heartbeat.txt')

The problem with the above is simple. Assuming the tests run in order, everything should be fine for the first test run. On the second run, however, the first test will assume that the heartbeat file is missing, but as it was created by the previous run the test will now fail! Worse still, if the tests run out of order, or someone reorganises them such that the second becomes the first, the missing-file test will fail every time.

You could fix the above problems by cleaning up the file at the end of each test. This, however, will not cater for the situation where the tests run concurrently, as one test can create or delete the file while another is checking for it. This is an even worse outcome, as the failure depends on timing, which makes it hard to catch and close to impossible to replicate using a debugger.

A better way to test this function would be to rewrite it such that it writes to a unique file for every test run and that file is cleaned up by the test. Such a function could be written like so.

    def writeheartbeat(filelocation='/tmp/website_heartbeat.txt'):
        with open(filelocation, 'w+') as file:
            file.write(str(datetime.datetime.now()))

By default the function still writes to the same location when called without an argument, but now we can write our heartbeat check test to work correctly every time.

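    def testheartbeatexists():
        tempfilelocation = '/tmp/testheartbeatexits.tmp'
        writeheartbeat(tempfilelocation)
        exists = os.path.isfile(tempfilelocation)
        os.remove(tempfilelocation)  # leave the filesystem as we found it
        assert exists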

Perfect. Now the test sets itself up correctly, performs the test and cleans up after itself. It should now be able to run concurrently with our other tests without issue. As mentioned, however, for these situations you may want to look into mocking away the filesystem itself as a way to avoid the above issues.
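The test above uses a fixed path, so two copies of the same test running at once could still collide. If you are using real Python, one possible way to get a location that is unique for every run is the standard `tempfile` module. This is only a sketch; the `heartbeat.tmp` file name is an arbitrary choice.

```python
import os
import tempfile
from datetime import datetime


def writeheartbeat(filelocation="/tmp/website_heartbeat.txt"):
    # Same idea as the function above: write the current date and time.
    with open(filelocation, "w+") as file:
        file.write(datetime.now().isoformat())


def testheartbeatexists():
    # TemporaryDirectory creates a uniquely named directory and removes it,
    # along with everything inside, when the "with" block exits.
    with tempfile.TemporaryDirectory() as tempdir:
        tempfilelocation = os.path.join(tempdir, "heartbeat.tmp")
        assert not os.path.isfile(tempfilelocation)
        writeheartbeat(tempfilelocation)
        assert os.path.isfile(tempfilelocation)
    # After the block the directory is gone, so nothing is left behind.
    assert not os.path.exists(tempdir)
```

The cleanup happens even if an assertion inside the block fails, which a manual delete at the end of the test would not guarantee.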

The one thing to keep in mind when testing integrations is this:

“Leave it as you found it.” This applies to system state, memory, filesystems or the database. If you make a change, no matter how small, be sure to reverse it. This simple rule will help cut down on flaky tests, saving you a lot of time.
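The same rule can be applied to system state such as environment variables. The following is a minimal sketch of one way to do it in Python; `temporary_env` and the `HEARTBEAT_DIR` variable are hypothetical names used only for illustration.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set an environment variable and restore the previous state afterwards."""
    missing = object()
    previous = os.environ.get(name, missing)
    os.environ[name] = value
    try:
        yield
    finally:
        # Runs even if the test body raises, so state is always put back.
        if previous is missing:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


def test_reads_setting():
    with temporary_env("HEARTBEAT_DIR", "/tmp/example"):
        assert os.environ["HEARTBEAT_DIR"] == "/tmp/example"
```

Putting the restore step in a `finally` block matters: a test that fails halfway through is exactly the one most likely to leave a mess for the next test.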

Clean Testable Repository Data Access in C Sharp

Below is an implementation of an extremely clean data access pattern using C# and Entity Framework. It saves you the effort of mocking the database context, as the code you end up writing is so simple and is all checked at compile time.

The advantages of this are firstly that everything is very easy to test as you can perform all joins in your service layer with mocks of the repository. Secondly it makes your data layer stupidly simple allowing you to forgo writing many tests which would provide little value.

Essentially you define a very simple class which provides a single method for getting data (although you may want a save data method too) and make sure you add an interface to make unit testing and mocking easier. Let's step through the code showing how you can achieve this.

    public interface IUrlRepository { IQueryable<Url> GetUrl(); void Save(Url url); }

    public class UrlRepository : IUrlRepository
    {
        public DbContext _context = null;
        public UrlRepository() { _context = new DbContext(); }

        public IQueryable<Url> GetUrl()
        {
            return from u in _context.Urls
                   select u;
        }

        public void Save(Url url)
        {
            // Persist the url through _context (implementation omitted).
        }
    }

As you can see, rather than returning a list you return an IQueryable. Because Entity Framework evaluates queries lazily, you can then add extension methods over the return value like so. Note that you would probably want to consider injecting your DbContext through your DI framework of choice.

    public static class UrlRepositoryExtention
    {
        public static IQueryable<Url> ByCreatedBy(this IQueryable<Url> url, string User)
        { return url.Where(p => p.Created_By.Equals(User)); }

        public static IQueryable<Url> OrderByCreateDate(this IQueryable<Url> url)
        { return url.OrderByDescending(x => x.Create_Date); }
    }

With this you end up with a very nice method of running queries over your data.

    var url = _urlRepo.GetUrl().OrderByCreateDate();

Since it can all be chained you can just add more filters easily as well.

    var url = _urlRepo.GetUrl().OrderByCreateDate().ByCreatedBy("Ben Boyter");

What about joins, I hear you ask? Thankfully this pattern takes care of those too. Just have two repositories, ask each for its full data set and do the following.

    var users = _userRepo.GetUser();
    var locations = _locationRepo.GetLocation();
    var result = from user in users
                 join location in locations on user.locationid equals location.id
                 where location.name == "Parramatta"
                 select user;

The best thing is that it’s all lazily evaluated, so you don’t end up pulling the full data set back into memory. Of course at a large enough scale you will probably hit some sort of leaky abstraction issue and end up rewriting to use pure SQL at some point, but for getting started this method of data access is incredibly powerful with few chances for error.

Finally you get the advantage that you can provide pure unit tests over your joins. Because you can mock the response from your repository easily you don’t have to create a seed database and provide a connection. This is fantastic for TDD especially when running offline or on your local machine.
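To show the shape of such a test without a database, here is a rough analogue written in Python rather than C#. Everything in it is an assumption made for illustration: the fake repositories, the field names and `users_in_location` are invented, and Python generators only mimic the laziness of IQueryable; they do not translate to SQL.

```python
class FakeUserRepo:
    """Stands in for the real repository and yields canned rows lazily."""

    def get_user(self):
        yield from [
            {"name": "Ann", "locationid": 1},
            {"name": "Bob", "locationid": 2},
        ]


class FakeLocationRepo:
    def get_location(self):
        yield from [
            {"id": 1, "name": "Parramatta"},
            {"id": 2, "name": "Sydney"},
        ]


def users_in_location(user_repo, location_repo, location_name):
    """Join users to locations in the service layer and filter by name."""
    location_ids = {
        location["id"]
        for location in location_repo.get_location()
        if location["name"] == location_name
    }
    # A generator expression keeps the result lazy until it is iterated.
    return (u for u in user_repo.get_user() if u["locationid"] in location_ids)


def test_users_in_parramatta():
    result = users_in_location(FakeUserRepo(), FakeLocationRepo(), "Parramatta")
    assert [u["name"] for u in result] == ["Ann"]
```

In C# the equivalent would be a mock of each repository interface returning an in-memory collection converted with `AsQueryable()`.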

Have your own method of writing a clean testable repository layer in C#? If so please comment below as I would love to read about it.

The Unsung Benefits of Software Testing

One benefit that is generally not talked about when discussing testing is the feeling of productivity that comes from writing lots of code.

Think about that for a moment. Ask any developer why they became a developer. One of the first things that comes up is “I enjoy writing code”. This is one of the things that I personally enjoy doing. Writing code, any code, especially when it’s solving my current problem, makes me feel productive. It makes me feel like I’m getting somewhere. It’s empowering.

Now think about how test driven development, or indeed any testing methodology, fits into this. You write a test, you write a method to pass the test. You write another test, you modify the method. You write another test, you modify the method. Lather, rinse, repeat. After you have met all your requirements you have quite a bit of code.

Now why is this important? Well, firstly you feel like you have done a lot because you have a lot of code. Secondly it plugs straight into the part of the brain that likes rewards. It’s the same way that any game becomes addictive. You begin with a specific goal. To reach it you take small steps, and each one rewards you as you get there. When you finally clear the last hurdle you can look back at each of the steps you took, and you feel a great deal of accomplishment.

The above, for me, is what keeps me coming back to test driven development. I love the feeling of success each time I run the tests and everything comes back fine. I love being able to finish off a method knowing it’s as good as I can make it. Finally, I love having a lot of code which, while the user will never see it, at least makes me feel like I’m getting somewhere.

What are your feelings about this? I would love to hear your experiences with software testing.

Testing In Software Engineering

Testing in software engineering is a method of providing information about code quality when developing a piece of software. The intent of writing and running tests is to enforce good software design and identify software bugs and defects. These defects can include specification or requirement errors as well as developer mistakes.

The general aim of software testing is to ensure that the software:

* Meets the requirements gathered before its design and implementation
* Responds correctly to the appropriate inputs
* Is usable, such that it performs tasks in reasonable time and can be run in the appropriate settings
* Achieves the end goals of the stakeholders

There are many different types of software testing including, but not limited to, A/B, Acceptance, Alpha, Beta, Compatibility, Development, Functional, Performance, Regression, Security, Smoke and Usability. All of these can be used individually or together to help achieve the above goals.

Software testing can show a given piece of software to be correct under certain assumptions. However, it is worth keeping in mind that the total number of combinations of inputs to even a simple product is enormous, and as such it is unlikely that testing will be able to find every possible defect. It has been estimated, however, that the cost of software bugs could be reduced by roughly a third if proper software testing was implemented [NIST The Economic Impacts of Inadequate Infrastructure for Software Testing][1].
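To get a feel for how quickly input combinations grow, consider a back-of-the-envelope calculation. The numbers are assumptions chosen for illustration: a function taking two independent 32-bit integers, and an optimistic rate of one billion tests per second.

```python
# Assumed example: a function taking two independent 32-bit integers.
input_combinations = (2 ** 32) ** 2          # 2**64 distinct input pairs
tests_per_second = 1_000_000_000             # assumed, optimistic test rate
seconds_per_year = 60 * 60 * 24 * 365

years_to_test_everything = input_combinations / tests_per_second / seconds_per_year

print(f"{input_combinations:,} combinations")
print(f"about {years_to_test_everything:.0f} years to try them all")
```

Even under these generous assumptions, exhaustively testing a two-argument function would take centuries, which is why tests concentrate on representative values and edge cases.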

[1]: http://www.nist.gov/director/planning/upload/report02-3.pdf

Grouping Tests: Unit/Integration vs Fast/Slow Tests

There is a great deal of argument in the testing community over how to label tests. One camp likes to label tests using levels, such that unit tests are in one group, integration tests in another and so forth. The other likes to label them based on how long they take to run, ignoring what level they are in. Fast tests are those that run in milliseconds, while slow tests take longer than this. The reason this is important is that when adopting a testing process, slow and flaky tests (those which fail often) are the enemy. Slow tests tend to be run less often in the development process. A delay of a few seconds can seriously interrupt a developer’s or tester’s workflow. Also, the more often your tests fail randomly, the less confidence you are likely to have in them, and the more likely you are to ignore genuine errors until they fail multiple times.

It’s worth keeping in mind that unit tests are less likely to become flaky or fail, but if they take a long time to run it’s unlikely developers will run them before every check-in.

Personally I prefer the latter approach of grouping tests by how fast they run.

Getting hung up on the pureness of your unit tests is usually an impediment to progress. Not all unit tests run quickly (although they should run consistently!) just as not all integration tests are slow. By dividing tests into those expected to run quickly and those slowly you can ensure that you run the majority of your tests more often.
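One possible way to label tests by speed, sketched here with Python's standard `unittest` module, is a decorator that skips slow tests unless they are explicitly requested. The `RUN_SLOW_TESTS` environment variable and the test names are assumptions made up for the example.

```python
import os
import unittest

# Hypothetical switch: slow tests only run when RUN_SLOW_TESTS=1 is set.
RUN_SLOW = os.environ.get("RUN_SLOW_TESTS") == "1"
slow = unittest.skipUnless(RUN_SLOW, "slow test; set RUN_SLOW_TESTS=1 to run")


class ExampleTests(unittest.TestCase):
    def test_fast_check(self):
        # Fast tests carry no label and run on every invocation.
        self.assertEqual(2 + 2, 4)

    @slow
    def test_many_edge_cases(self):
        # Labelled slow regardless of whether it is a unit or integration test.
        for value in range(100_000):
            self.assertEqual(value * 2, value + value)


if __name__ == "__main__":
    unittest.main()
```

Developers can then run the fast group constantly while the full set, slow tests included, runs before check-in or on the build server.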

An example of a slow running unit test I encountered was for some image recognition software I was working on. There was some fairly involved math in a few functions that slowed down the test to hundreds of milliseconds. Not a huge issue by itself, but it was compounded by having to test this function thousands of times for various edge cases. The result was a test that took almost a minute to run. Technically it was correct (the best kind of correct) to call it a unit test but it was not something you wanted to run for every change. Labelling it as a slow test sped up the development process considerably.

An example of a fast running integration test was a function responsible for writing the current date and time to a file on disk as a heartbeat check. Doing a pure unit test of this function would have required either mocking away the file system or building wrappers over low level disk access methods. However, the test only needed to be run once and ran in less than 50 milliseconds on every piece of hardware tested. It was categorised as a fast test and as such was run very frequently. Incidentally this turned out to be a boon later on, as it was discovered after several months that there were a few exception cases not handled correctly that caused the test to fail. It turned out that this simple error could have caused some cascading failures of the production stack and some serious downtime! It is unlikely that this would have been picked up had the test been run less often.

Sanity Testing

Software sanity tests are closely associated with smoke tests. They attempt to determine if it is reasonable to continue with testing a given piece of software. The objective is not to test all functionality, but to determine if there is value in doing so. You can consider it a “Should I continue to test this?” check. Sanity tests differ from smoke tests in that they exist to check whether new functionality works and existing bugs have been resolved.

* Sanity tests are a way to avoid wasting time testing obviously flawed software, i.e. is this software sane?
* They are usually not automated at first but can later be made into regression tests
* Sanity tests generally follow smoke tests in the build pipeline
* They differ from smoke tests in that they check if planned functionality works or that a bug has been resolved
* Sanity tests usually have a narrow focus on a few pieces of functionality or bugs

The following examples are considered sanity tests.

* Compiling and running a “Hello world!” program for a new developer environment
* Checking that a calculator when given 2 + 2 produces 4 as the result
* Confirming that a dialog box closes when the close button is clicked
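The calculator example could be automated along these lines. This is only a sketch: the `add` function is a hypothetical stand-in for whatever calculator is being tested.

```python
def add(a, b):
    """Hypothetical calculator operation under test."""
    return a + b


def sanity_check():
    """Return True if the build is sane enough to justify deeper testing."""
    try:
        return add(2, 2) == 4
    except Exception:
        # Any crash at this stage means further testing would waste time.
        return False


if __name__ == "__main__":
    print("continue testing" if sanity_check() else "reject build")
```

The check is deliberately narrow. It says nothing about whether the calculator is correct in general, only whether it is worth running the rest of the tests.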

Many consider sanity testing to be a subset of acceptance testing and one of the first layers in ensuring software quality.