Types of Testing in Software Engineering

Many different types of testing exist in software engineering. They should not be confused with the test levels: unit testing, integration testing, component interface testing and system testing. However, each type of testing may make use of the different test levels as a way of checking for software quality.

The following are all different types of tests in software engineering.

A/B
: A/B testing compares two outputs where a single variable has been changed, to see which performs better. It is commonly used when trying to increase conversion rates for websites. A real genius in this space is Patrick McKenzie, and a few very worthwhile articles to read about it are How Stripe and AB Made me A Small Fortune and AB Testing.

Acceptance
: Acceptance tests usually refer to tests performed by the customer, also known as user acceptance testing or UAT. Smoke tests are considered a form of acceptance test.

Accessibility
: Accessibility tests are concerned with checking that the software is able to be used by those with vision, hearing or other impairments.

Alpha
: Alpha testing consists of operational testing by potential users or an independent test team before the software is feature complete. It usually consists of an internal acceptance test before the software is released into beta testing.

Beta
: Beta testing follows alpha testing and is a form of external user acceptance testing. Beta software is usually feature complete but with unknown bugs.

Concurrent
: Concurrent tests attempt to simulate the software in use under normal activity. The idea is to discover defects that occur in this situation that are unlikely to occur in other more granular tests.

Conformance
: Conformance testing verifies that software conforms to specified standards. An example would be checking a compiler or interpreter to see if it works as expected against the language standard.

Compatibility
: Checks that software is compatible with other software on a system. Examples would be checking the Windows version, the Java runtime version, or that other software to be interfaced with has the appropriate API hooks.

Destructive
: Destructive tests attempt to cause the software to fail. The idea is to check that the software continues to work even when given unexpected conditions. This is usually done through fuzz testing and by deliberately breaking subsystems, such as the disk, while the software is under test.

Development
: Development testing is testing done by both developers and testers during the development of the software. The idea is to prevent bugs during the development process and increase the quality of the software. Methodologies to do so include peer reviews, unit tests, code coverage and others.

Functional
: Functional tests generally consist of stories focused on the user's ability to perform actions, or use cases checking that functionality works. An example would be “can the user save the document with changes”; a minimal test for this story is sketched after this list.

Installation
: Ensures that software is installed correctly and works as expected on a new piece of hardware or system. Commonly seen after software has been installed as a post check.

Internationalisation
: Internationalisation tests check that localisation for other countries and cultures in the software is correct and inoffensive. Checks can include currency conversions, word ranges, fonts, timezones and the like.

Non-functional
: Non-functional tests cover the parts of the software that are not covered by functional tests. These include things such as security or scalability, which generally determine the quality of the product.

Performance / Load / Stress
: Performance, load or stress testing is used to see how a system performs under certain high or low workload conditions. The idea is to see how the system behaves under these conditions, and it can be used to measure scalability and resource usage.

Regression
: Regression tests are an extension of sanity checks which aim to ensure that previously fixed defects, for which a test was written, do not reoccur in a given software product.

Realtime
: Realtime tests check systems which have specific timing constraints, for example trading systems or heart monitors.

Smoke / Sanity
: Smoke testing ensures that the software works for most of the functionality and can be considered a verification or acceptance test. Sanity testing determines if further testing is reasonable having checked a small set of functionality for flaws.

Security
: Security testing is concerned with checking that software protects against unauthorised access to confidential data.

Usability
: Usability tests are manual tests used to check that the user interface, if any, is understandable.
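
To make the functional testing entry above concrete, here is a minimal JUnit 4 style sketch of the “can the user save the document with changes” story. The in-memory DocumentStore is a made up stand-in for whatever your application actually provides.

    import static org.junit.Assert.assertEquals;

    import java.util.HashMap;
    import java.util.Map;

    import org.junit.Test;

    public class SaveDocumentTest {

        // Made up in-memory store standing in for the real application.
        static class DocumentStore {
            private final Map<String, String> documents = new HashMap<>();

            void save(String name, String content) {
                documents.put(name, content);
            }

            String load(String name) {
                return documents.getOrDefault(name, "");
            }
        }

        @Test
        public void userCanSaveDocumentWithChanges() {
            DocumentStore store = new DocumentStore();

            // The user edits the document and saves it.
            store.save("report.txt", "Hello World with changes");

            // Loading it again should return the changed content.
            assertEquals("Hello World with changes", store.load("report.txt"));
        }
    }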

What is Chaos Testing / Engineering

A blog post by the excellent technical people at Netflix about Chaos Engineering, and further posts about the subject by Microsoft in Azure Search, prompted me to ask the question: what is chaos engineering and how can chaos testing be applied to help me?

What is Chaos Testing?

First coined by the aforementioned Netflix blog post, chaos engineering takes the approach that regardless of how encompassing your test suite is, once your code is running on enough machines and reaches enough complexity, errors are going to happen. Since failure is unavoidable, why not deliberately introduce it to ensure your systems and processes can deal with the failure?

To accomplish this, Netflix created the Netflix Simian Army, which consists of a series of tools known as “monkeys” (AKA Chaos Monkeys) that deliberately inject failure into their services and systems. Microsoft adopted a similar approach by creating their own monkeys which are able to inject faults into their test environments.
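
To get a feel for what this looks like in practice, below is a rough sketch of a very small “chaos monkey”. The CloudApi interface is entirely hypothetical; in a real setup you would back it with your cloud provider's SDK and only let it loose in environments and at times where failure is acceptable.

    import java.util.List;
    import java.util.Random;

    public class TinyChaosMonkey {

        // Hypothetical abstraction over whatever cloud SDK you actually use.
        interface CloudApi {
            List<String> listInstances(String group);
            void terminateInstance(String instanceId);
        }

        private final CloudApi cloud;
        private final Random random = new Random();

        TinyChaosMonkey(CloudApi cloud) {
            this.cloud = cloud;
        }

        // Randomly terminate one instance in the group to verify the
        // service recovers without human intervention.
        void unleash(String group) {
            List<String> instances = cloud.listInstances(group);
            if (instances.isEmpty()) {
                return;
            }
            String victim = instances.get(random.nextInt(instances.size()));
            System.out.println("Chaos monkey terminating " + victim);
            cloud.terminateInstance(victim);
        }
    }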

What are the advantages of Chaos Testing?

The advantage of chaos engineering is that you can quickly smoke out issues that other testing layers cannot easily capture. This can save you a lot of downtime in the future and help design and build fault tolerant systems. For example, Netflix runs in AWS and as a response to a regional failure changed their systems to become region agnostic. The easiest way to confirm this works is to regularly take down important services in separate regions, which is all done through a chaos monkey designed to replicate this failure.

While it is possible to sit down and anticipate some of the issues you can expect when a system fails, knowing what actually happens is another thing entirely.

The result is that you are forced to design and build highly fault tolerant systems which can withstand massive outages with minimal downtime. Expecting your systems not to have 100% uptime and planning accordingly can be a tremendous competitive advantage.

One thing commonly overlooked with chaos engineering is its ability to find issues caused by cascading failure. You may be confident that your application still works when the database goes down, but would you be so sure if it went down along with your caching layer?

Should I be Chaos Testing?

This really depends on what your tolerances for failure are and on the likelihood of failures happening. If you are writing desktop software, chaos testing is unlikely to yield any value. Much the same applies if you are running a financial system where failures are acceptable so long as everything reconciles at the end of the day.

If however you are running large distributed systems using cloud computing (think 50 or more instances) with a variety of services and processes designed to scale up and out, injecting some chaos will potentially be very valuable.

How to start Chaos Testing?

Thankfully, with cloud computing and the APIs provided, it can be relatively easy to begin chaos testing. By allowing you to control the infrastructure through code, these tools allow the replication of a host of errors that are not easily reproducible when running on bare hardware. This does not mean that bare hardware systems cannot perform chaos testing, just that some classes of errors will be harder to reproduce.

Let's start by looking at the way Microsoft and Netflix classify their “monkeys”.

Low chaos
: This refers to failures that our system can recover from gracefully with minimal or no interruption to service availability.

Medium chaos
: Are failures that can also be recovered from gracefully, but may result in degraded service performance or availability.

High chaos
: Are failures that are more catastrophic and will interrupt service availability.

Extreme chaos
: Are failures that cause ungraceful degradation of the service, result in data loss, or simply fail silently without raising alerts.

Microsoft found that by setting up a testing environment and letting the monkeys loose they were able to identify a variety of issues with provisioning instances and services, as well as scaling them to suit. They also split the environments into periods of chaos, where the monkeys ran, and dormant periods, where they did not. Errors found in dormant periods were considered bugs and flagged to be investigated and fixed. During chaos periods any low chaos issues were also considered bugs and scheduled to be investigated and fixed. Medium chaos issues raised low priority alerts for on-call staff to investigate, as did high chaos issues. Extreme chaos operations, once identified, were not run again until a fix had been introduced.

The process for fixing issues identified this way was the following,

* Discover the issue, identify the impacts if any and determine the root cause.
* Mitigate the issue to prevent data loss or service impact in any customer facing environments
* Reproduce the error through automation
* Fix the error and verify, via the previous step, that it will not reoccur

Once done, the monkey created through the automation step could be added to the regular suite of tests, ensuring that whatever issue was identified would not occur again.

Netflix uses a similar method for fixing issues, but by contrast runs their monkeys in their live environments rather than in a pure testing environment. They have also released some information on some of the monkeys they use to introduce failures.

Latency Monkey
: Induces artificial delays into the client-server communication layer to simulate service degradation and determine how consumers respond. By making the delays very large they are able to simulate a node, or even an entire service, being down. This can be useful because bringing an entire instance down can be problematic when the instance hosts multiple services, or when it is not possible to do so through APIs. A minimal sketch of this idea appears after this list.

Conformity Monkey / Security Monkey
: Finds instances that don’t adhere to best practices and shuts them down. An example would be checking that instances in AWS are launched into permission limited roles, and shutting down any that are not. This forces the owner of the instance to investigate and fix the issue. Security Monkey is an extension that performs SSL certificate validation/expiry checks and other security best practice checks.

Doctor Monkey
: Checks the existing health checks that run on each instance to detect unhealthy instances. Unhealthy instances are removed from service.

Janitor Monkey
: Checks for unused resources and deletes or removes them.

10-18 Monkey (Localisation monkey)
: Ensures that services continue to work in different international environments by checking that languages other than the base language continue to work.

Chaos Gorilla
: Similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone.
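
The latency idea in particular is easy to sketch. Below is a rough example of a wrapper that injects an artificial delay in front of a downstream call, which is more or less what a latency monkey does at the network layer; the UserService interface and the delay figure are made up for illustration.

    import java.util.concurrent.ThreadLocalRandom;

    public class LatencyMonkeyExample {

        // Made up downstream dependency used for illustration.
        interface UserService {
            String lookup(String userId);
        }

        // Wraps the real service and injects an artificial delay before each call.
        static class LatencyInjectingUserService implements UserService {
            private final UserService delegate;
            private final long maxDelayMillis;

            LatencyInjectingUserService(UserService delegate, long maxDelayMillis) {
                this.delegate = delegate;
                this.maxDelayMillis = maxDelayMillis;
            }

            @Override
            public String lookup(String userId) {
                try {
                    // A very large delay effectively simulates the service being down.
                    Thread.sleep(ThreadLocalRandom.current().nextLong(maxDelayMillis));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return delegate.lookup(userId);
            }
        }
    }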

Well hopefully that explains what Chaos Testing / Engineering is for those who were previously unsure. Feel free to contact me over twitter or via the comments for further queries or information!

Running three hours of Ruby tests in under three minutes

Recently the very cool, hard working developers at Stripe released a post about how they modified their build/test pipeline to reduce their test suite runtime from 3 hours to about 3 minutes.

The article is very much worth reading, as are the discussions that have arisen around it, including those on Hacker News.

A few key takeaways,

* For dynamic languages such as Ruby or Python consider forking to run tests in parallel
* Forks are usually faster than threads in these cases and provide good test isolation
* For integration tests use Docker, which allows you to revert the file system easily

The above ensures that the tests are generally more reliable and you avoid having to write your own teardown code which restores state, both in memory for the forks and on disk using Docker.
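
Ruby and Python can fork cheaply; on the JVM the rough equivalent is to split the suite across separate processes. The sketch below is one way to do that with ProcessBuilder, assuming a Maven style build where running mvn test with -Dtest= selects which test classes to run; the partition contents are made up and you would tune them to your own suite.

    import java.util.ArrayList;
    import java.util.List;

    public class ParallelTestRunner {
        public static void main(String[] args) throws Exception {
            // Each entry is a partition of the test suite run in its own
            // process, giving similar isolation to forking in Ruby or Python.
            List<String> partitions = List.of(
                    "UserTests,BillingTests",
                    "SearchTests,ReportTests");

            List<Process> processes = new ArrayList<>();
            for (String partition : partitions) {
                processes.add(new ProcessBuilder("mvn", "test", "-Dtest=" + partition)
                        .inheritIO()
                        .start());
            }

            boolean allPassed = true;
            for (Process p : processes) {
                allPassed &= p.waitFor() == 0;
            }
            System.exit(allPassed ? 0 : 1);
        }
    }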

A Culture of Quality

The best working environment I have had the pleasure to work in had a strong emphasis on testing and software quality in general. Product teams were encouraged to spend extra time ensuring that everything worked rather than shipping before it was ready. The transformation the organisation went through was incredible, having come from a very wild west culture to where it ended up. An example of the advantages this brought: before adopting this culture, website launches were a traumatic event. Teams would block out 48 hour stretches and work solidly fixing bugs at go live. This was very stressful for all involved and not a healthy working environment. Contrast this with where things ended up, where several websites were launched on the same afternoon without a hitch. Very impressive considering the scale of the websites being dealt with (several million uniques a day).

I attribute most of these improvements to cultural changes that started in upper management and filtered down. In short, the organisation adopted a culture of writing quality software, implemented in part by insisting on a solid development process backed by testing. This culture was so successful that I remember the following happening one afternoon. Two individuals were discussing a new piece of functionality to be worked on. After several minutes of discussion they convinced themselves that, for a few situations, perhaps they didn't need tests. This conversation was overheard by a less senior developer in another team who happened to sit behind them. Without pause he turned around and calmly insisted that not only would they be writing the tests they had tried to convince themselves were not required, but that he would show them how to implement them.

This sort of culture, where people take ownership of quality not just for their own work but for others', can produce incredible results. As far as I am aware it is still being practised there as I experienced it, and I can attest to how effective it has been.

Five ways to avoid and control flaky tests

Having a reliable test suite should always be the goal in software development. After all if you can’t trust the tests then why bother running them at all? This is especially important in a shared coding environment and when running through Continuous Integration (CI).


1. Test in Isolation

It may seem obvious, but writing focused tests which do a single thing is one of the most effective ways to avoid them being flaky. Tests which do multiple things increase the chance of failure and can make the tests non-deterministic. Always remember to test features and issues in isolation.
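
As a small illustration, the sketch below keeps each behaviour in its own focused test rather than bundling them together; the Calculator class is a made up stand-in for the code under test.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class CalculatorTest {

        // Made up class under test, kept inline so the example is self contained.
        static class Calculator {
            int add(int a, int b) { return a + b; }
            int divide(int a, int b) { return a / b; }
        }

        // One focused test per behaviour. If division breaks, the addition
        // test still passes and the failure points at exactly one thing.
        @Test
        public void addsTwoNumbers() {
            assertEquals(5, new Calculator().add(2, 3));
        }

        @Test
        public void dividesTwoNumbers() {
            assertEquals(2, new Calculator().divide(10, 5));
        }
    }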

2. Write Helpful Errors

When a test does fail, having an error such as “Error 23: Something went wrong ¯\_(ツ)_/¯” is incredibly frustrating. Firstly you need to run the test again with either a debugger or some code modifications to spot the bug, which slows down development; it is also unprofessional. Write meaningful error messages. For example “Error: The value “a” was unexpected for this input” is a far better error. Another thing to remember is to avoid swallowing exceptions in languages which support exception handling.
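
In JUnit, for example, the optional message argument is an easy way to make a failure self explanatory; the parsePort function below is made up purely for illustration.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class HelpfulErrorTest {

        // Made up function under test.
        static int parsePort(String value) {
            return Integer.parseInt(value.trim());
        }

        @Test
        public void parsesPortFromConfig() {
            String input = " 8080 ";
            // Without the message a failure only reads "expected:<8080> but was:<...>",
            // which says nothing about where the value came from.
            assertEquals("Port parsed from config value '" + input + "' was unexpected",
                    8080, parsePort(input));
        }
    }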

3. Control the Environment

Regularly run tests should run in a controlled environment which will be the same for the current test run and future test runs. This usually means a clean deploy, restoring databases and generally ensuring that however the application was set up originally is done again. This ensures the tests always start with the same conditions. It also ensures you have a good CI process and are able to recreate environments from scratch when required, which is good development practice.
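
With JUnit this usually means doing the reset in a setup method rather than trusting whatever the previous test left behind. The in-memory map below is a stand-in for restoring a real database or redeploying the application.

    import static org.junit.Assert.assertTrue;

    import java.util.HashMap;
    import java.util.Map;

    import org.junit.Before;
    import org.junit.Test;

    public class ControlledEnvironmentTest {

        // Stand-in for a real database; a real suite would restore a known
        // snapshot or redeploy the application here instead.
        private Map<String, String> database;

        @Before
        public void resetEnvironment() {
            database = new HashMap<>();
            database.put("admin", "active");
        }

        @Test
        public void startsFromKnownState() {
            assertTrue(database.containsKey("admin"));
        }
    }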

4. Fix it, delete it or mark it

A test that fails is proving its value, unless it's flaky. Tests that fail randomly slow down your development process. In time they will be ignored and neglected. The moment a test is identified as failing it should be fixed. If that will take time, then mark it as flaky, remove it from the CI pipeline and investigate it as part of paying down technical debt. If after some time it still isn't resolved, it should be investigated to see if it is providing any actual value. Odds are if it hasn't been fixed for a month it may be a test you can live without.
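
In JUnit 4 the Ignore annotation (or the equivalent skip mechanism in your framework) is a reasonable way to take a flaky test out of the CI signal while it is being investigated; the ticket reference below is made up.

    import org.junit.Ignore;
    import org.junit.Test;

    public class FlakyCheckoutTest {

        // Removed from the CI signal but still visible in the codebase,
        // with a pointer to the investigation.
        @Ignore("Flaky: fails roughly 1 in 20 runs, see ticket QA-123")
        @Test
        public void checkoutCompletesWithinTimeout() {
            // test body omitted while the flakiness is investigated
        }
    }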

5. Be forgiving but verify

For any integration test you need your tests to be forgiving in how the application responds. After all, submitting an image into a text field may result in an error, which is probably acceptable. Another thing to keep in mind is that there will be timeouts you need to deal with. Be sure to allow a reasonable length of time to wait for a response, and only fail once this has expired. Be wary of any test that waits forever for something to happen.
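
A common pattern is to poll for the expected condition with an explicit deadline, rather than waiting forever or failing on the first slow response. A small helper along these lines is usually enough; the timeout values are whatever is reasonable for your system.

    import java.util.function.BooleanSupplier;

    public final class WaitFor {

        // Polls until the condition is true or the timeout expires, so the
        // test neither fails on one slow response nor hangs forever.
        public static boolean condition(BooleanSupplier condition, long timeoutMillis)
                throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                if (condition.getAsBoolean()) {
                    return true;
                }
                Thread.sleep(100);
            }
            return false;
        }
    }

A test would then assert that the condition became true within the timeout and fail with a clear message otherwise.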

Why Does Software Contain Bugs?

“Why does all software contain bugs?” was a question recently asked of me. My response at the time was that all software is not perfect, but is this true?

Let's take a very simple example.


    public class Hello {
        public static void main(String[] args) {
            System.out.println("Hello World!");
        }
    }

The above example is arguably the simplest program that can be written in Java. It also happens to be the first program usually written by any Java programmer. It simply prints out the text “Hello World!” when it is run. Surely this program is so simple that it is perfect and therefore bug free?

Ignoring the obvious fact that this program does nothing useful, let's assume for the moment that we have been tasked with writing a “Hello World!” program in Java. Surely the above is 100% bug free.

Yes. The application is 100% bug free. But that's not the whole story. What happens when this application is run?

The first thing to happen is that it needs to be compiled. This takes the application from its text form and converts it into something that the computer can understand. In the case of Java it turns it into something the Java Virtual Machine can understand. This allows you to take the same compiled program and, in theory, run it on your computer, phone, PlayStation, Blu-ray player, iPad or any other device that runs Java.

The Java Virtual Machine or JVM is itself a compiled program running on a device. The catch is that it is compiled using a different compiler for every platform (computer, phone etc…). When it runs it takes your compiled application and converts the instructions into something that the computer understands.

However underneath the JVM is usually the Operating System. This hands out resources to the programs that are running. So the JVM runs inside the operating system and the operating system talks to the physical hardware.

Actually, while the operating system does talk to the hardware directly, these days there is usually software inside the hardware itself which controls the actual hardware. Not only does the hardware contain software, the hardware itself, such as a CPU, is essentially software (a hardware description) turned into hardware. This means CPUs and the like can also contain bugs.

This means in order for your perfect application to run perfectly the following pieces of software also need to run perfectly,


    Your Program -> Java Compiler -> JVM -> JVM Compiler -> Operating System -> Operating System Compiler -> Hardware Software -> Hardware Software Compiler -> Hardware Itself

As you can see it really is an iceberg, with your perfect program at the top and a lot of things going on underneath. A bug at any level can result in your “perfect” software not working as expected, making it flawed.

This is why perfect software does not currently exist. The only way to achieve it would be to write perfect software at every level, which is a monumental undertaking. There are estimates floating around that put the cost of rewriting the Linux kernel at around 500 billion dollars; that's not really accounting for making it “perfect”, and as shown it is literally one small piece of the puzzle.

So should we just give in? Well no. At every level there are thousands of testers and processes designed to make the software as bug free as possible. Just because we cannot reach perfection does not mean it is not at least worth trying.

The benefit of testing for Developers, Managers and the Business

“Fixing regression bugs is analogous to digging a hole only to find the next day it has been filled in and having to dig it out again”

Ask any manager, developer or tester working on software without tests what the main pain points are. Nearly every time the main one mentioned is dealing with regressions: bugs that were fixed a year ago which have returned. Regression bugs cost the software industry billions of dollars a year. Worse still, they are demoralising to everyone involved. Finding and fixing the same bug over and over causes people to start looking for new projects or new jobs.

A good regression test suite generally solves these problems. It may not prevent every bug from reoccurring but it will catch most of them. It also gives peace of mind that you have not reintroduced these bugs again once fixed. Lastly it saves time by checking if the bug is fixed for you.

“Software is never finished, only abandoned”

Another advantage to come out of having a good test suite is the improvement to the software itself. Not only is testable software generally better written than non-testable software, a collection of tests provides a nice safety net when you start to improve your product. Software as a general rule is constantly in flux with changes and improvements, or is left as is. If your software is never improved or built upon you may want to consider whether you really need tests. That said, there are other reasons to test software beyond what is mentioned above.

If none of the above points are selling testing to you, consider this. As mentioned, testable software is usually well designed, modular software. This allows greater code reuse, saving time and money. It also allows new developers to quickly understand what they are dealing with, helping them become productive faster. If nothing else, writing testable software will save you time and money in the long run by making things easier to understand.

What is Usability Testing?

Usability tests are manual tests used to check that the user interface is understandable. The focus of the tests is to ensure that the product meets its intended purpose. These sorts of tests can be subjective and are usually impossible to automate. It is important to differentiate usability testing from simply showing an interface to someone and asking them “Do you understand how this works?”. It is usually done by creating a scenario such as “Can you find and add this song to a new playlist?” and observing the steps that the user takes to perform the task.

Usability tests can be valuable for a variety of reasons. For online applications and sales, if a website is difficult to use or the product is hard to find, the user will leave. Remember, your biggest enemy in these cases is the back button. For any software made to be used, usability improves productivity. This can be especially important in your sales pitch or when trying to move to a new process internally.

Before performing any usability testing there are five components relating to quality you should keep in mind.

* Learnability. How easy is it for users to accomplish basic tasks the first time they encounter the interface?
* Efficiency. Once users have learned the interface, how quickly can they perform tasks?
* Memorability. When users return to the interface after a period of not using it, how easily can they re-establish proficiency?
* Errors. How many errors do users make, how severe are these errors, and how easily can they recover from the errors?
* Satisfaction. How pleasant is it to use the design?

One of the quickest ways to perform usability testing is to select random individuals and ask them to use the product or service. This is also known as hallway testing, since it can include asking people passing by in the hallway. It can be a very effective method of finding serious problems in some software. Of course it is probably not the most effective way to test specialist software such as ultrasound controllers, but for anything consumer facing it can be an effective technique. Generally you can convince people to do this sort of testing for free or for very low cost if you are polite about it.

Expert reviews are another form of usability testing which can overcome the limitations of hallway usability testing. In an expert review, experts in a given field are asked to evaluate a product. Generally the following 10 usability heuristics by Nielsen are used (taken from Wikipedia):

* Visibility of system status. The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.
* Match between system and the real world. The system should speak the user’s language, with words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order.
* User control and freedom. Users often choose system functions by mistake and will need a clearly marked “emergency exit” to leave the unwanted state without having to go through an extended dialogue. Support undo and redo.
* Consistency and standards. Users should not have to wonder whether different words, situations, or actions mean the same thing. Follow platform conventions.
* Error prevention. Even better than good error messages is a careful design which prevents a problem from occurring in the first place. Either eliminate error-prone conditions or check for them and present users with a confirmation option before they commit to the action.
* Recognition rather than recall. Minimise the user’s memory load by making objects, actions, and options visible. The user should not have to remember information from one part of the dialogue to another. Instructions for use of the system should be visible or easily retrievable whenever appropriate.
* Flexibility and efficiency of use. Shortcuts, unknown by the novice user, may often speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users. Allow users to tailor frequent actions.
* Aesthetic and minimalist design. Dialogues should not contain information which is irrelevant or rarely needed. Every extra unit of information in a dialogue competes with the relevant units of information and diminishes their relative visibility.
* Help users recognise, diagnose, and recover from errors. Error messages should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution.
* Help and documentation. Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, focused on the user’s task, list concrete steps to be carried out, and not be too large.

Of course this is a far more formal method of usability testing. Usually it will involve paying the testers for their time, however some professionals will be happy to do it without cost if you are able to make things convenient for them. Tools such as GoToMeeting can help with this.

Mutation Testing or How to Test Tests

Mutation testing is a technique used to verify that tests are providing value. It involves modifying the program in small ways, for example flipping a boolean check from True to False. A mutated version of the code is known as a mutant. The test suite is then run against each mutant, and a good suite should fail for a certain percentage of mutants. Where a mutant is not caught, additional tests can be written to cover that case.
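
As a tiny illustration, consider an original method and a mutant of it; a useful test suite should contain at least one test that fails against the mutant. The discount example is made up.

    public class DiscountCalculator {

        // Original version.
        public static double discountedPrice(double price, boolean isMember) {
            if (isMember) {
                return price * 0.9;
            }
            return price;
        }

        // Mutant: the condition has been flipped. A test such as
        // assertEquals(90.0, discountedPrice(100.0, true), 0.001)
        // "kills" this mutant because the mutated version returns 100.0 instead.
        public static double discountedPriceMutant(double price, boolean isMember) {
            if (!isMember) {
                return price * 0.9;
            }
            return price;
        }
    }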

Mutation testing works well where you have a reasonable level of code coverage and can be quite effective when combined with test driven development. In my experience and that of others, well written tests should have about a 70% failure rate against code that has been mutated.

Quite a few mutation testing frameworks are out there, such as Heckle, Insure++ and Nester, however you can get away with a simple find and replace in most cases. Because mutation testing requires a lot of manual intervention to review the results, it is usually something you would put into the code review process rather than into your continuous integration system. This is because even after mutating the code many of the tests may still pass; this might be due to the mutated code not being conditional and still producing the correct output, rather than the tests being ineffective.

I have written a very simple mutation tester which can be used against most languages and hosted it on github for your convenience https://github.com/boyter/Mutator/

Who is Responsible for Software Quality?

At the beginning of my software development career I interviewed for an intern position at Microsoft. I never did get the job, but one thing out of that interview process really stuck with me. The second interviewer, after the usual getting to know you chat, asked me the following question: “On any given software project we have developers, software testers / quality assurance and managers involved. Who is responsible for the quality of the software?”. Being young and naive I confidently responded that the QA/testers were. After a long discussion, artfully steered by the interviewer, I came to change my opinion. Below is the line of reasoning I went through with him.

Software testers / quality assurance write code to find bugs and prevent them reoccurring. They are also responsible for performing exploratory testing to identify issues developers did not anticipate. Other testers verify that requirements were implemented correctly. Finally they write bug reports used by developers to fix any issues found in the software. As such testers are responsible for software quality since they catch the bugs. However software testers usually do not fix the code itself; their bug reports are the developers' main feedback and drive how the bugs are resolved. So this would mean that the developers are actually responsible for software quality.

Developers write the code that makes the software do anything. As such they are responsible for implementing any bug fixes and following processes to ensure that a minimum number of defects are delivered. They also have to take the requirements for the software and ensure they are implemented correctly. Given this, of course developers are responsible for software quality, as they are the ones who actually fix any issues and implement the requirements. Failure to do either produces at best a flawed outcome and at worst a totally broken one. Naturally, for any code issues the responsibility for quality then lies with the developer. However what if the requirements are wrong? The developers work against the requirements, as do the testers. Managers usually write the requirements, so is management responsible for software quality?

Managers, project or otherwise, do not usually write code or work to identify bugs. They are however responsible for the processes used to gather requirements, for ensuring that the team has sufficient time to fix issues, and for getting any problems out of the developers' and testers' way. We still have the issue that management doesn't write code or identify bugs; that's the testers' and developers' job! That means that the testers and developers are responsible for code quality!

Of course you can now see the circular logic occurring here.

So Who is Responsible?

The answer is now obvious. Who is responsible for software quality? The answer is everyone.