Types of Testing in Software Engineering

There are many different types of testing in software engineering. They should not be confused with the test levels: unit testing, integration testing, component interface testing, and system testing. However, each type of testing may make use of the different test levels as a way of checking for software quality.

The following are all different types of tests in software engineering.

A/B
: A/B testing compares two outputs where a single variable has changed. It is commonly used when trying to increase conversion rates for websites. A real genius in this space is Patrick McKenzie, and a few very worthwhile articles to read about it are A-B Testing Made Me a Small Fortune and A-B Testing.

Acceptance
: Acceptance tests usually refer to tests performed by the customer, also known as user acceptance testing (UAT). Smoke tests are considered a form of acceptance test.

Accessibility
: Accessibility tests are concerned with checking that the software is able to be used by those with vision, hearing, or other impairments.

Alpha
: Alpha testing consists of operational testing by potential users or an independent test team before the software is feature complete. It usually consists of an internal acceptance test before the software is released into beta testing.

Beta
: Beta testing follows alpha testing and is a form of external user acceptance testing. Beta software is usually feature complete but with unknown bugs.

Concurrent
: Concurrent tests attempt to simulate the software in use under normal concurrent activity, such as several users performing actions at the same time. The idea is to discover defects that occur in this situation but are unlikely to appear in other, more granular tests.

Conformance
: Conformance testing verifies that software conforms to specified standards. An example would be checking a compiler or interpreter to see if it works as expected against the language standard.

Compatibility
: Compatibility tests check that software is compatible with other software on a system. Examples include checking the Windows version, the Java runtime version, or that other software it interfaces with exposes the appropriate API hooks.

Destructive
: Destructive tests attempt to cause the software to fail. The idea is to check that the software continues to work even when given unexpected conditions. This is usually done through fuzz testing and by deliberately breaking subsystems, such as the disk, while the software is under test.

Development
: Development testing is testing done by both the developers and testers during the development of the software. The idea is to prevent bugs during the development process and increase the quality of the software. Methodologies to do so include peer reviews, unit tests, code coverage and others.

Functional
: Functional tests generally consist of stories focused on the user's ability to perform actions, or use cases checking that functionality works. An example would be "can the user save the document with changes?".

Installation
: Ensures that software is installed correctly and works as expected on a new piece of hardware or system. Commonly seen after software has been installed as a post check.

Internationalisation
: Internationalisation tests check that localisation for other countries and cultures is correct and inoffensive. Checks can include currency conversions, word range checks, font checks, timezone checks and the like.

Non-functional
: Non-functional tests cover the aspects of the software not covered by functional tests. These include things such as security or scalability, which generally determine the quality of the product.

Performance / Load / Stress
: Performance, load, or stress testing is used to see how a system performs under certain high or low workload conditions. The idea is to see how the system behaves under these conditions, and it can be used to measure scalability and resource usage.

Regression
: Regression tests are an extension of sanity checks which aim to ensure that previous defects which had a test written for them do not re-occur in a given software product (a minimal example follows this list).

Realtime
: Realtime tests check systems which have specific timing constraints, for example trading systems or heart monitors.

Smoke / Sanity
: Smoke testing ensures that the software works for most of the functionality and can be considered a verification or acceptance test. Sanity testing determines if further testing is reasonable having checked a small set of functionality for flaws.

Security
: Security testing is concerned with checking that software protects against unauthorised access to confidential data.

Usability
: Usability tests are manual tests used to check that the user interface, if any, is understandable.
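As a small illustration of the regression entry above, a regression test is often just an ordinary unit test that pins a past defect in place. A minimal JUnit 4 sketch, where both the past defect and the slugify helper are hypothetical and inlined so the example is self-contained:

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class SlugRegressionTest {

        // Hypothetical production code, inlined here to keep the sketch
        // self-contained.
        static String slugify(String title) {
            return title.trim().toLowerCase()
                        .replaceAll("[^a-z0-9]+", "-")
                        .replaceAll("(^-)|(-$)", "");
        }

        // Pins a (hypothetical) past defect: titles with trailing spaces
        // once produced slugs ending in "-". If the bug is reintroduced,
        // this test fails and the regression is caught immediately.
        @Test
        public void trailingWhitespaceDoesNotLeaveDanglingDash() {
            assertEquals("hello-world", slugify("Hello World  "));
        }
    }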

What is Chaos Testing / Engineering

A blog post by the excellent technical people at Netflix about chaos engineering, and further posts on the subject by Microsoft about Azure Search, prompted me to ask the question: what is chaos engineering and how can chaos testing be applied to help me?

What is Chaos Testing?

First coined in the aforementioned Netflix blog post, chaos engineering takes the approach that regardless of how encompassing your test suite is, once your code is running on enough machines and reaches enough complexity, errors are going to happen. Since failure is unavoidable, why not deliberately introduce it to ensure your systems and processes can deal with the failure?

To accomplish this, Netflix created the Netflix Simian Army, which consists of a series of tools known as "monkeys" (AKA chaos monkeys) that deliberately inject failure into their services and systems. Microsoft adopted a similar approach by creating their own monkeys which were able to inject faults into their test environments.

What are the advantages of Chaos Testing?

The advantage of chaos engineering is that you can quickly smoke out issues that other testing layers cannot easily capture. This can save you a lot of downtime in the future and help design and build fault tolerant systems. For example, Netflix runs in AWS and as a response to a regional failure changed their systems to become region agnostic. The easiest way to confirm this works is to regularly take down important services in separate regions, which is all done through a chaos monkey designed to replicate this failure.

While it is possible to sit down and anticipate some of the issues you can expect when a system fails, knowing what actually happens is another thing entirely.

The result of this is that you are forced to design and build highly fault tolerant systems which can withstand massive outages with minimal downtime. Accepting that your systems will not have 100% uptime and planning accordingly can be a tremendous competitive advantage.

One thing commonly overlooked with chaos engineering is its ability to find issues caused by cascading failures. You may be confident that your application still works when the database goes down, but would you be so sure if it went down along with your caching layer?

Should I be Chaos Testing?

This really depends on what your tolerances for failure are and the likelihood of those failures happening. If you are writing desktop software, chaos testing is unlikely to yield any value. Much the same applies if you are running a financial system where failures are acceptable so long as everything reconciles at the end of the day.

If however you are running large distributed systems on cloud computing (think 50 or more instances) with a variety of services and processes designed to scale up and out, injecting some chaos will potentially be very valuable.

How to start Chaos Testing?

Thankfully, with cloud computing and the APIs provided, it can be relatively easy to begin chaos testing. By allowing you to control the infrastructure through code, these APIs let you replicate a host of errors that are not easily reproducible when running on bare hardware. This does not mean that bare hardware systems cannot be chaos tested, just that some classes of errors will be harder to reproduce.
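As a sketch of what injecting chaos can look like in practice, the following uses the AWS SDK for Java to pick a random instance carrying an opt-in tag and terminate it. The tag name and the overall shape are assumptions for illustration, not Netflix's or Microsoft's actual tooling, and you would of course point something like this at a test account long before it went anywhere near production:

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.DescribeInstancesRequest;
    import com.amazonaws.services.ec2.model.Filter;
    import com.amazonaws.services.ec2.model.Instance;
    import com.amazonaws.services.ec2.model.Reservation;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;

    public class SimpleChaosMonkey {
        public static void main(String[] args) {
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            // Only consider instances explicitly opted in to chaos testing
            // via a (hypothetical) "chaos=enabled" tag.
            DescribeInstancesRequest request = new DescribeInstancesRequest()
                    .withFilters(new Filter("tag:chaos", Arrays.asList("enabled")));

            List<Instance> candidates = new ArrayList<>();
            for (Reservation reservation : ec2.describeInstances(request).getReservations()) {
                candidates.addAll(reservation.getInstances());
            }
            if (candidates.isEmpty()) {
                System.out.println("No chaos-enabled instances found, nothing to do.");
                return;
            }

            // Pick a victim at random and terminate it, then watch whether
            // monitoring, failover and recovery behave as designed.
            Instance victim = candidates.get(new Random().nextInt(candidates.size()));
            System.out.println("Terminating " + victim.getInstanceId());
            ec2.terminateInstances(new TerminateInstancesRequest()
                    .withInstanceIds(victim.getInstanceId()));
        }
    }

The interesting part is never the termination itself; it is watching whether monitoring fires, traffic fails over and the instance is replaced without anyone having to intervene.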

Let's start by looking at the way Microsoft and Netflix classify their "monkeys".

Low chaos
: This refers to failures that our system can recover from gracefully with minimal or no interruption to service availability.

Medium chaos
: Are failures that can also be recovered from gracefully, but may result in degraded service performance or availability.

High chaos
: Are failures that are more catastrophic and will interrupt service availability.

Extreme chaos
: Are failures that cause ungraceful degradation of the service, result in data loss, or simply fail silently without raising alerts.

Microsoft found that by setting up a testing environment and letting the monkeys loose, they were able to identify a variety of issues with provisioning instances and services as well as scaling them to suit. They also split the environments into chaos periods, where the monkeys ran, and dormant periods, where they did not. Errors found in dormant periods were considered bugs and flagged to be investigated and fixed. During chaos periods any low chaos issues were also considered bugs and scheduled to be investigated and fixed. Medium chaos issues raised low priority alerts for on-call staff to investigate, as did high chaos issues. Extreme chaos operations, once identified, were not run again until a fix had been introduced.

The process for fixing issues identified this way was the following,

* Discover the issue, identify the impacts if any and determine the root cause.
* Mitigate the issue to prevent data loss or service impact in any customer facing environments
* Reproduce the error through automation
* Fix the error and verify through the previous step it will not reoccur

Once done, the monkey created through the automation step could be added to the regular suite of tests, ensuring that whatever issue was identified would not occur again.

Netflix uses a similar method for fixing issues, but by contrast runs their monkeys in their live environments rather than in a pure testing environment. They have also released some information on the monkeys they use to introduce failures.

Latency Monkey
: Induces artificial delays into the client-server communication layer to simulate service degradation and determine how consumers respond in this situation. By making the delays very large they are able to simulate a node, or even an entire service, being down. This can be useful as bringing an entire instance down can be problematic when an instance hosts multiple services and when it is not possible to do so through APIs (a minimal sketch of the idea follows this list).

Conformity Monkey / Security Monkey
: Finds instances that don't adhere to best practices and shuts them down. An example would be checking that instances in AWS are launched into permission-limited roles, and shutting down any that are not. This forces the owner of the instance to investigate and fix the issue. Security Monkey is an extension that performs SSL certificate validation / expiry and other security best practice checks.

Doctor Monkey
: Monitors the existing health checks that run on each instance to detect unhealthy instances. Unhealthy instances are removed from service.

Janitor Monkey
: Checks for unused resources and deletes or removes them.

10-18 Monkey (Localisation monkey)
: Ensures that services continue to work in different international environments by checking that languages other than the base system language continue to work.

Chaos Gorilla
: Similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone.
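A latency monkey does not need to be sophisticated to be useful. The sketch below is an illustration of the idea rather than anything Netflix has published: a wrapper that injects a random artificial delay in front of a downstream call so that timeouts, retries and fallbacks get exercised.

    import java.util.Random;
    import java.util.concurrent.Callable;

    // Wraps any downstream call and injects a random artificial delay
    // before it executes, simulating a slow or unresponsive dependency.
    // The probability and delay bound are illustrative values only.
    public class LatencyInjector {
        private final Random random = new Random();
        private final double injectionProbability;
        private final long maxDelayMillis;

        public LatencyInjector(double injectionProbability, long maxDelayMillis) {
            this.injectionProbability = injectionProbability;
            this.maxDelayMillis = maxDelayMillis;
        }

        public <T> T call(Callable<T> downstreamCall) throws Exception {
            if (random.nextDouble() < injectionProbability) {
                // A very large delay effectively simulates the dependency
                // being down, as described in the Latency Monkey entry.
                Thread.sleep((long) (random.nextDouble() * maxDelayMillis));
            }
            return downstreamCall.call();
        }
    }

A client would then route calls through the injector, for example new LatencyInjector(0.1, 5000).call(() -> userService.lookup(id)), where userService is whatever dependency you want to degrade, and observe how the consumer copes.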

Well hopefully that explains what Chaos Testing / Engineering is for those who were previously unsure. Feel free to contact me over twitter or via the comments for further queries or information!

Running three hours of Ruby tests in under three minutes

Recently the very cool, hard working developers at Stripe released a post about how they modified their build/test pipeline to reduce their test suite runtime from three hours to about three minutes.

The article is very much worth reading, as are the discussions that have come out of it, including those on Hacker News.

A few key takeaways,

* For dynamic languages such as Ruby or Python consider forking to run tests in parallel
* Forks are usually faster than threads in these cases and provide good test isolation
* For integration tests use Docker, which allows you to revert the file system easily

The above ensures that the tests are generally more reliable and you avoid having to write your own teardown code which restores state, both in memory for the forks and on disk using Docker.

A Culture of Quality

The best working environment I have had the pleasure to work in had a strong emphasis on testing and software quality in general. Product teams were encouraged to spend extra time ensuring that everything worked rather than shipping before it was ready. The transformation the organisation went through was incredible, having come from a very wild west culture to where it ended up. An example of the advantages this brought: before adopting this culture, website launches were a traumatic event. Teams would block out 48 hour stretches and work solidly fixing bugs at go live. It was very stressful for all involved and not a healthy working environment. Contrast that to where things ended up, with several websites launched on the same afternoon without a hitch. Very impressive considering the scale of the websites being dealt with (several million uniques a day).

I attribute most of these improvements to cultural changes that started in upper management and filtered down. In short, the organisation adopted a culture of writing quality software, implemented in part by insisting on a solid development process backed by testing. This culture was so successful that I remember the following happening one afternoon. Two individuals were discussing a new piece of functionality to be worked on. After several minutes of discussion they convinced themselves that for a few situations they perhaps didn't need tests. The conversation was overheard by a less senior developer in another team who happened to sit behind them. Without pause he turned around and calmly insisted that not only would they be writing the tests they had tried to convince themselves were not required, but that he would show them how to implement them.

This sort of culture, where people take ownership of quality not just for their own work but for others', can produce incredible results. As far as I am aware it is still being practiced there as I experienced it, and I can attest to how effective it has been.

A/B Testing

A/B testing compares two outputs where a single variable has changed. It is commonly used when trying to increase conversion rates for websites. Also known as split testing, an example would be trying to increase user clicks on a specific button on a website. You may have a theory that red buttons work better than green. You would try out both against real users and see which one performs better.

You should use A/B testing when you have some existing traffic to a website (or mailing list), know what outcomes you are trying to achieve and most importantly are able to track what actions resulted in the outcome. Without these three things split testing will not achieve the outcomes you wish.

Numerous things can be A/B split tested, limited only by what you have to pivot on and your imagination. A few examples include,

* Pricing tiers and structures
* Free trial time lengths
* Titles E.G. length and word placement
* Headlines and Sub-Headlines E.G. size, length, word placement
* Forms E.G. adding and removing requested fields
* Paragraph Text E.G. adding more content, changing spacing
* Testimonials E.G. adding more or less, increasing the length
* Call to Action (text, links, buttons or images) E.G. “add to cart” vs “buy now”
* Movement of content around the page
* Press mentions and awards

One thing to keep in mind when A/B testing is that running your tests for a long time can result in SEO penalties from Google and other search engines. Quoting from the Google Webmaster Central Blog post on website testing,

“If we discover a site running an experiment for an unnecessarily long time, we may interpret this as an attempt to deceive search engines and take action accordingly. This is especially true if you’re serving one content variant to a large percentage of your users.”

It is highly recommended to read and understand the post mentioned above to ensure you are following best practices. The consequences can be dire indeed, with being blacklisted by Google and other search engines the worst possible result.

A/B testing can be implemented in a variety of ways. Perhaps the best known is using Google Analytics, however there are other free and paid solutions. Visual Website Optimizer is one example of a paid service, and if you are using Ruby on Rails there are many libraries to help you out.

A few things to keep in mind when doing A/B testing.

* Test all the variations of a single entity at the same time. If you test one variant for a week and another the following week, your data is likely to be skewed. It's possible that you had some excellent high value links added the second week even though that variant had a lower conversion rate.
* Keep the test running long enough to have confidence in your results, but not so long as to be penalised by Google. You need to have statistical significance (see the sketch after this list). Any A/B tool worth money will be able to report on this metric for you and let you know when to finish the test. It is very useful to know how long you are going to run the test before starting.
* Trust the data over feeling. If the data is telling you that an ugly button works better than your beautiful one, either trust the data or run the test again at a later date to confirm. It can be hard to do what feels counter intuitive, but you need to remember that humans generally are not rational and will not behave how you expect.
* If a user complains about seeing a different price, offer them the better deal. Always respect your users and customers; it builds goodwill. Another thing to do is avoid split testing paying customers. Adobe Omniture runs a lot of A/B tests in their online product and it drives some customers crazy as everything they need moves around on a regular basis. Just don't do it.
* Don't A/B test multiple things at the same time. If you are going to A/B test a new design, test the new design against the existing one. Don't chop and change various parts of the website, as it will be confusing.
* Keep trying. It's possible a single test will produce no meaningful results. If so, try again. Not only will you get better with experience, you are more likely to find the correct things to optimise.
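On the statistical significance point above, any decent tool will report this for you, but the underlying check is straightforward. A rough sketch of a two-proportion z-test, where an absolute z score above roughly 1.96 corresponds to 95% confidence (the threshold and the numbers in main are illustrative only):

    public class SignificanceCheck {

        // Two-proportion z-test: given visitors and conversions for each
        // variant, returns the z score. |z| above ~1.96 is roughly 95%
        // confidence that the difference is not due to chance.
        static double zScore(long visitorsA, long conversionsA,
                             long visitorsB, long conversionsB) {
            double pA = (double) conversionsA / visitorsA;
            double pB = (double) conversionsB / visitorsB;
            double pooled = (double) (conversionsA + conversionsB) / (visitorsA + visitorsB);
            double standardError = Math.sqrt(pooled * (1 - pooled)
                    * (1.0 / visitorsA + 1.0 / visitorsB));
            return (pA - pB) / standardError;
        }

        public static void main(String[] args) {
            // Illustrative numbers: variant A converted 120/2000, B 150/2000.
            double z = zScore(2000, 120, 2000, 150);
            System.out.printf("z = %.2f, significant at 95%%: %b%n", z, Math.abs(z) > 1.96);
        }
    }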

A real genius in this space is Patrick McKenzie and a few very worthwhile articles to read about it are A-B Testing Made Me a Small Fortune and A-B Testing. Other articles worth reading include Practical Guide to Controlled Experiments on the Web by Microsoft Research (PDF), Writing Decisions: Headline Tests on the Highrise Sign-Up Page, "You Should Follow Me on Twitter Here", How We Increased our Conversion Rate by 72%, and Human Photos Double your Conversion Rate.

Five ways to avoid and control flaky tests

Having a reliable test suite should always be the goal in software development. After all if you can’t trust the tests then why bother running them at all? This is especially important in a shared coding environment and when running through Continuous Integration (CI).


1. Test in Isolation

It may seem obvious, but writing focused tests which do a single thing is one of the most effective ways to avoid them being flaky. Tests which do multiple things increase the chance of failure and can make the tests non-deterministic. Always remember to test features and issues in isolation.

2. Write Helpful Errors

When a test does fail, having an error such as "Error 23: Something went wrong ¯\_(ツ)_/¯" is incredibly frustrating. You need to run the test again with either a debugger or some code modifications to spot the bug, which slows down development; it is also unprofessional. Write meaningful error messages. For example, "Error: The value 'a' was unexpected for this input" is a far better error. Another thing to remember is to avoid swallowing exceptions in languages which support exception handling.
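A small sketch of the difference in practice, using JUnit 4 style assertions (the HTTP status check is purely an illustrative example):

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;

    public class HelpfulErrorExamples {

        void unhelpful(int actualStatus) {
            // Fails with nothing more than "expected true but was false".
            assertTrue(actualStatus == 200);
        }

        void helpful(int actualStatus, String requestUrl) {
            // Fails with the URL plus the expected and actual status codes,
            // so the failure can usually be diagnosed straight from the log.
            assertEquals("Unexpected HTTP status for GET " + requestUrl,
                    200, actualStatus);
        }
    }

The second assertion reports the request, the expected status and the actual status in the failure output, which is usually enough to diagnose the problem directly from the CI log.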

3. Control the Environment

Regularly run tests should execute in a controlled environment which will be the same for the current test and all future tests. This usually means a clean deploy, restoring databases and generally ensuring that however the application was originally set up is done again, so the tests always start with the same conditions. It also ensures you have a good CI process and are able to recreate environments from scratch when required, which is good development practice.

4. Fix it, delete it or mark it

A test that fails is proving its value, unless it's flaky. Tests that fail randomly slow down your development process, and in time they will be ignored and neglected. The moment a test is identified as flaky it should be fixed. If that will take time, mark it as flaky, remove it from the CI pipeline and investigate it as part of paying down technical debt. If after some time it still isn't resolved, it should be reviewed to see if it is providing any actual value. Odds are if it hasn't been fixed for a month it may be a test you can live without.

5. Be forgiving but verify

For any integration test you need your tests to be forgiving in how the application responds. After all, submitting an image into a text field may result in an error, which is probably acceptable. Another thing to keep in mind is that there will be timeouts you need to deal with. Be sure to wait a reasonable length of time for a response and only fail once that time has expired. Be wary of any test that waits forever for something to happen.
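For the timeout point, the usual pattern is to poll for the expected condition with a bounded deadline rather than sleeping for a fixed period or waiting forever. A minimal sketch:

    import java.util.function.BooleanSupplier;

    public class WaitFor {

        // Polls the condition every 250ms until it is true or the deadline
        // passes. Returns true if the condition was met, false on timeout,
        // so the test can fail with a clear message instead of hanging.
        public static boolean upTo(long timeoutMillis, BooleanSupplier condition)
                throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                if (condition.getAsBoolean()) {
                    return true;
                }
                Thread.sleep(250);
            }
            return condition.getAsBoolean();
        }
    }

A test can then fail with a clear message, for example assertTrue("order never reached the queue", WaitFor.upTo(10_000, () -> queue.contains(orderId))), where queue and orderId stand in for whatever the test is actually checking.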

Why Does Software Contain Bugs?

"Why does all software contain bugs?" This was a question recently asked of me. My response at the time was that all software is not perfect, but is this true?

Let's take a very simple example.


    public class Hello {
        public static void main(String[] args) {
            System.out.println("Hello World!");
        }
    }

The above example is arguably the simplest program that can be written using Java. It also happens to be the first program usually written by any Java programmer. It simply prints out the text "Hello World!" when it is run. Surely this program is so simple that it is perfect and therefore bug free?

Ignoring the obvious point that this program does nothing useful, let's assume for the moment that we have been tasked with writing a "Hello World!" program in Java. Surely the above is 100% bug free?

Yes. The application is 100% bug free. But that's not the whole story. What happens when this application runs?

The first thing that happens is that it needs to be compiled. This takes the application from its text form and converts it into something that the computer can understand. In the case of Java it turns it into bytecode that the Java Virtual Machine can understand. This allows you to take the same compiled program and, in theory, run it on your computer, phone, PlayStation, Blu-ray player, iPad or any other device that runs Java.

The Java Virtual Machine or JVM is itself a compiled program running on a device. The catch is that it is compiled using a different compiler for every platform (computer, phone etc…). When it runs it takes your compiled application and converts the instructions into something that the computer understands.

However, underneath the JVM usually sits the operating system, which hands out resources to the programs that are running. So the JVM runs on top of the operating system, and the operating system talks to the physical hardware.

Actually, while the operating system does talk to the hardware directly, these days there is usually software inside the hardware itself which controls the actual hardware. Not only does the hardware contain software, the hardware itself, such as a CPU, is essentially software turned into hardware. This means CPUs and the like can also contain bugs.

This means in order for your perfect application to run perfectly the following pieces of software also need to run perfectly,


    Your Program -> Java Compiler -> JVM -> JVM Compiler -> Operating System -> Operating System Compiler -> Hardware Software -> Hardware Software Compiler -> Hardware Itself

As you can see, it really is an iceberg, with your perfect program at the top and a lot of things going on underneath. A bug at any level can result in your "perfect" software not working as expected, making it flawed.

This is why perfect software does not currently exist. The only way to achieve it would be to write perfect software at every level, which is a monumental undertaking. There are estimates suggesting the cost to rewrite the Linux kernel would be around 500 billion dollars, and that's without accounting for making it "perfect"; as shown above, the kernel is just one small piece of the puzzle.

So should we just give in? Well no. At every level there are thousands of testers and processes designed to make the software as bug free as possible. Just because we cannot reach perfection does not mean it is not at least worth trying.

The benefit of testing for Developers, Managers and the Business

“Fixing regression bugs is analogous to digging a hole only to find the next day it has been filled in and having to dig it out again”

Ask any manager, developer or tester working on software without tests what the main pain points are. Nearly every time the main one mentioned is dealing with regressions: bugs that were fixed a year ago which have returned. Regression bugs cost the software industry billions of dollars a year. Worse still, they are demoralising to everyone involved. Finding or fixing the same bug over and over causes you to start looking for new projects or new jobs.

A good regression test suite generally solves these problems. It may not prevent every bug from reoccurring, but it will catch most of them. It also gives peace of mind that you have not reintroduced these bugs once they have been fixed. Lastly, it saves time by checking whether the bug is fixed for you.

“Software is never finished, only abandoned”

Another advantage to come out of having a good test suite is the improvement to the software itself. Not only is testable software generally better written than non-testable software, a collection of tests provides a nice safety net when you start to improve your product. Software as a general rule is either constantly in flux with changes and improvements, or left as is. If your software is never improved or built upon you may want to consider if you really need tests. That said, there are other reasons to test software beyond what is mentioned above.

If none of the above points are selling testing to you, consider this. As mentioned, testable software is usually well designed, modular software. This allows greater code reuse, saving time and money. It also allows new developers to quickly understand what they are dealing with, helping them become productive faster. If nothing else, writing testable software will save you time and money in the long run by making things easier to understand.

AWS EC2 Instance Types to Use as Test Agents

When you are running test agents on AWS (for TeamCity or otherwise), knowing what instance type to use can involve a lot of trial and error. Not only are there great savings to be made by picking the correct instance type, you can speed up your builds and get test feedback faster, which can be far more valuable than the cost of a few additional cents an hour.

The following are some results I have found when playing with different instance types. Before Amazon released the burstable t2 instance types, one of the most common choices I had seen was a general purpose instance such as the m3.medium. This always seemed like a good choice, as most tests tend to use a mixture of CPU/disk/network and that's what those instances are supposed to be good at.

The moment the burstable instances were released, several agents were relaunched as t2.mediums and left in the cluster for a week.

The outcome was that not only were they saving money, since they cost less per month, they were able to run tests and builds faster than the previous agents. This was a surprise at first, until we observed that with very few exceptions every test was CPU bound. This included browser tests, which we had expected to be more network bound. The performance increase was mostly down to the instances accumulating CPU credits faster than they could be spent. See the image below, taken from a live instance, where you can clearly see how this works.

[Image: t2 instance CPU credit usage over time]

For the record, this agent runs 24/7, running builds and tests for dozens of different projects, including a lot of Selenium tests across multiple browsers.

There were however a few tests which consumed considerably more CPU than expected. These tests consisted of a collection of very heavy math operations and integrations all running on the same machine. A single agent was boosted to a c4.large to take care of these tests and everything has been working fine since. Build times were down and the developers had feedback sooner.

We also tried relaunching instances as a newer generation, such as an m3.large into an m4.large, and the result was far faster builds. This is probably due to the underlying hardware AWS uses being faster. It was however still worth using t2 agents due to the cost saving and roughly equivalent performance.

Conclusions

It really depends on your environment and how much you are using the agents. I think the following guidelines apply fairly well though.

* For any test agents running Windows you want a minimum of a t2.medium on AWS, or the equivalent with 2 CPUs and 4 GB of RAM.
* Test agents running Linux want a minimum of a t2.small on AWS, or the equivalent with a single CPU and 2 GB of RAM.
* For agents that run tests infrequently, as in fewer than 6 times an hour, stick with the lower end t2 instances.
* For agents that run heavy loads consider using a c4.large, as the increased CPU will really cut down on the test time.
* Always go for the latest generation in AWS, i.e. use a c4.large over a c3.large, for increased performance.

The main takeaway however is to ensure you can relaunch your instances as different types easily (a sketch of doing this through the API is below). Try out different types and see what happens. The winning strategy I found was to launch as a t2.medium at first, dial it down to a t2.small if it was overpowered (which was never the case for Windows), and relaunch as a c4.large if it was underpowered.

The result was much faster builds saving developer time and frustration.
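As an illustration of how little code relaunching as a different type needs when the agents are managed through the API, here is a sketch using the AWS SDK for Java. The instance id is a placeholder, and in real use you would wait for the instance to actually reach the stopped state before changing its type:

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.ModifyInstanceAttributeRequest;
    import com.amazonaws.services.ec2.model.StartInstancesRequest;
    import com.amazonaws.services.ec2.model.StopInstancesRequest;

    public class ResizeAgent {
        public static void main(String[] args) {
            String instanceId = "i-0123456789abcdef0"; // placeholder agent id
            String newType = "c4.large";               // the type to trial

            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            // An EBS-backed instance must be stopped before its type can change.
            ec2.stopInstances(new StopInstancesRequest().withInstanceIds(instanceId));
            // In real use, poll describeInstances here until the state is "stopped".

            ec2.modifyInstanceAttribute(new ModifyInstanceAttributeRequest()
                    .withInstanceId(instanceId)
                    .withInstanceType(newType));

            ec2.startInstances(new StartInstancesRequest().withInstanceIds(instanceId));
        }
    }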

Issues with Google’s Bug Prediction Algorithm

In December 2011 the Google engineering team published a blog post about bug prediction at Google. The topic caused quite a lot of discussion at the time on forums such as Hacker News and the Reddit programming subreddit.

How bug prediction works

In a nutshell, the prediction works by ranking files using their commit history, counting how many changes to each file have been flagged as bug fixes. Of course this means that code which was previously buggy, but has since settled down, would still appear in the list. This was also addressed in the post: the results are weighted over time to deal with this issue.
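A rough sketch of this style of time-weighted scoring is below. The exact weighting function is an assumption here, modelled on the logistic decay curve used by open source reimplementations such as bugspots: each bug fix commit contributes more the more recent it is, so files that were only buggy long ago fall down the ranking.

    import java.util.Arrays;
    import java.util.List;

    public class HotspotScore {

        // Scores a single file given the timestamps (epoch seconds) of the
        // commits to it that were flagged as bug fixes, normalised against
        // the first and last commit in the repository.
        static double score(List<Long> bugFixTimestamps, long firstCommit, long lastCommit) {
            double total = 0.0;
            for (long timestamp : bugFixTimestamps) {
                // Normalise the commit time into the range [0, 1].
                double t = (double) (timestamp - firstCommit) / (lastCommit - firstCommit);
                // Logistic decay: old fixes contribute almost nothing,
                // recent fixes contribute far more.
                total += 1.0 / (1.0 + Math.exp(-12.0 * t + 12.0));
            }
            return total;
        }

        public static void main(String[] args) {
            // Illustrative numbers only: three bug fix commits, the last one recent.
            long first = 1_000_000L, last = 2_000_000L;
            List<Long> fixes = Arrays.asList(1_100_000L, 1_500_000L, 1_990_000L);
            System.out.println(score(fixes, first, last));
        }
    }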

Issues with bug prediction

Since that time the topic has been reposted a few times, and we have since discovered that the system has been discontinued at Google. Thankfully the author of the original post was able to respond and has given one of the main reasons why it was discontinued.

TL;DR is that developers just didn’t find it useful. Sometimes they knew the code was a hot spot, sometimes they didn’t. But knowing that the code was a hot spot didn’t provide them with any means of effecting change for the better. Imagine a compiler that just said “Hey, I think this code you just wrote is probably buggy” but then didn’t tell you where, and even if you knew and fixed it, would still say it due to the fact it was maybe buggy recently. That’s what TWR essentially does. That became understandably frustrating, and we have many other signals that developers can act on (e.g. FindBugs), and we risked drowning out those useful signals with this one.

Some teams did find it useful for getting individual team reports so they could focus on places for refactoring efforts, but from a global perspective, it just seemed to frustrate, so it was turned down.

From an academic perspective, I consider the paper one of my most impactful contributions, because it highlights to the bug prediction community some harsh realities that need to be overcome for bug prediction to be useful to humans. So I think the whole project was quite successful… Note that the Rahman algorithm that TWR was based on did pretty well in developer reviews at finding bad code, so it’s possible it could be used for automated tools effectively, e.g. test case prioritization so you can find failures earlier in the test suite. I think automated uses are probably the most fruitful area for bug prediction efforts to focus on in the near-to-mid future.

Another paper which analyses the results of the Google bug predictor is Does Bug Prediction Support Human Developers? Findings From a Google Case Study (PDF).

Another interesting thought is that the system requires a reasonable amount of commit message discipline. Some people and teams use bug fix commits to resolve feature requests. The result is that this system would mark actively developed code as a bug hot spot, while a poorly organised team or individual would not see their code appear at all. This is especially problematic where performance is being measured against these sorts of metrics.

If you are interested in trying this yourself, there are open source implementations of the algorithm presented in the Google paper. See bugspots on GitHub as an example of one which will work against any Git repository.