A/B Testing

A/B testing compares two versions of something where a single element has changed. It is commonly used when trying to increase conversion rates on websites. Also known as split testing, an example would be trying to increase user clicks on a specific button on a website. You may have a theory that red buttons work better than green ones. You would try out both against real users and see which one performs better.

You should use A/B testing when you have some existing traffic to a website (or mailing list), know what outcomes you are trying to achieve and, most importantly, are able to track which actions led to those outcomes. Without these three things split testing will not achieve the results you want.
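
At its core the mechanism is simple: each visitor is deterministically assigned to one variant so that they always see the same version on repeat visits. A minimal sketch of that idea in Java (the hash-based bucketing and the userId parameter are illustrative assumptions, not any particular tool's API):

    import java.util.UUID;

    public class SplitTest {
        // Deterministically assign a visitor to variant "A" or "B" so the
        // same visitor always sees the same variant on repeat visits.
        public static String variantFor(String userId) {
            return Math.floorMod(userId.hashCode(), 2) == 0 ? "A" : "B";
        }

        public static void main(String[] args) {
            String visitor = UUID.randomUUID().toString();
            System.out.println("Visitor " + visitor + " sees the "
                    + variantFor(visitor) + " button");
        }
    }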

Numerous things can be A/B split tested, limited only by what you have to pivot on and your imagination. A few examples include:

* Pricing tiers and structures
* Free trial time lengths
* Titles, e.g. length and word placement
* Headlines and sub-headlines, e.g. size, length, word placement
* Forms, e.g. adding and removing requested fields
* Paragraph text, e.g. adding more content, changing spacing
* Testimonials, e.g. adding more or fewer, increasing the length
* Call to action (text, links, buttons or images), e.g. “add to cart” vs “buy now”
* Movement of content around the page
* Press mentions and awards

One thing to keep in mind when A/B testing is that running your tests for too long can result in SEO penalties from Google and other search engines. Quoting the Google Webmaster Central Blog post on website testing:

“If we discover a site running an experiment for an unnecessarily long time, we may interpret this as an attempt to deceive search engines and take action accordingly. This is especially true if you’re serving one content variant to a large percentage of your users.”

It is highly recommended to read and understand that post to ensure you are following best practices. The consequences of getting it wrong can be dire, with being blacklisted by Google and other search engines the worst possible result.

A/B testing can be implemented in a variety of ways. Perhaps the best known is using Google Analytics, but there are other free and paid solutions. Visual Website Optimizer is one example of a paid service, and if you are using Ruby on Rails there are many libraries to help you out.

A few things to keep in mind when doing A/B testing:

* Test all the variations of a single element at the same time. If you test one variant for a week and another the following week, your data is likely to be skewed. It is possible, for example, that the second week benefited from some excellent high value links even though its variant actually had the lower conversion rate.
* Keep the test running long enough to have confidence in your results, but not so long as to be penalised by Google. You need statistical significance; any A/B tool worth paying for will report this metric for you and let you know when to finish the test (see the sketch after this list for what the calculation involves). It is very useful to know how long you are going to run the test before starting.
* Trust the data over feelings. If the data is telling you that an ugly button works better than your beautiful one, either trust the data or run the test again at a later date to confirm it. It can be hard to do what feels counterintuitive, but remember that humans are generally not rational and will not behave how you expect.
* If a user complains about seeing a different price, offer them the better deal. Always respect your users and customers; it builds goodwill. Also avoid split testing paying customers. Adobe Omniture runs a lot of A/B tests in their online product and it drives some customers crazy as everything they need moves around on a regular basis. Just don’t do it.
* Don’t A/B test multiple things at the same time. If you are going to A/B test a new design then test the new design against the existing one. Don’t chop and change various parts of the website at once; it will be confusing.
* Keep trying. It is possible a single test will produce no meaningful results. If so, try again. Not only will you get better with experience, you are more likely to find the right things to optimise.
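
To give a rough idea of what statistical significance involves here, comparing two conversion rates is commonly done with a two-proportion z-test, which is roughly the sort of check the A/B tools perform for you. A minimal sketch (the visitor and conversion numbers are made up):

    public class SignificanceCheck {
        // Two-proportion z-test: is the difference between the two
        // conversion rates likely to be more than random noise?
        public static double zScore(int conversionsA, int visitorsA,
                                    int conversionsB, int visitorsB) {
            double rateA = (double) conversionsA / visitorsA;
            double rateB = (double) conversionsB / visitorsB;
            double pooled = (double) (conversionsA + conversionsB) / (visitorsA + visitorsB);
            double standardError = Math.sqrt(pooled * (1 - pooled)
                    * (1.0 / visitorsA + 1.0 / visitorsB));
            return (rateA - rateB) / standardError;
        }

        public static void main(String[] args) {
            double z = zScore(120, 2400, 90, 2400); // made-up numbers
            // |z| > 1.96 corresponds to roughly 95% confidence
            System.out.println("z = " + z + ", significant: " + (Math.abs(z) > 1.96));
        }
    }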

A real genius in this space is Patrick McKenzie, and two of his articles very much worth reading are A-B Testing Made Me a Small Fortune and A-B Testing. Other articles worth reading include Practical Guide to Controlled Experiments on the Web by Microsoft Research (PDF), Writing Decisions: Headline Tests on the Highrise Sign-Up Page, “You Should Follow Me on Twitter Here”, How We Increased our Conversion Rate by 72% and Human Photos Double your Conversion Rate.

Five ways to avoid and control flaky tests

Having a reliable test suite should always be the goal in software development. After all, if you can’t trust the tests then why bother running them at all? This is especially important in a shared coding environment and when running tests through Continuous Integration (CI).


1. Test in Isolation

It may seem obvious, but writing focused tests which do a single thing is one of the most effective ways to avoid flakiness. Tests which do multiple things increase the chance of failure and can make the tests non-deterministic. Always remember to test features and issues in isolation.
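
As a trivial illustration, two small tests that each check one behaviour are usually preferable to a single test that checks several unrelated things, because a failure points directly at what broke. A minimal JUnit 4 sketch using a standard list as the thing under test:

    import java.util.ArrayList;
    import java.util.List;
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;

    public class ListTest {
        // Each test exercises exactly one behaviour, so a failure points
        // at the feature that broke rather than somewhere in a long chain
        // of unrelated assertions.
        @Test
        public void addingAnItemIncreasesTheSize() {
            List<String> items = new ArrayList<>();
            items.add("book");
            assertEquals(1, items.size());
        }

        @Test
        public void clearingTheListRemovesAllItems() {
            List<String> items = new ArrayList<>();
            items.add("book");
            items.clear();
            assertTrue(items.isEmpty());
        }
    }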

2. Write Helpful Errors

When a test does fail, having an error such as “Error 23: Something went wrong ¯\_(ツ)_/¯” is incredibly frustrating. You need to run the test again with either a debugger or some code modifications to spot the bug, which slows down development; it is also unprofessional. Write meaningful error messages. For example, “Error: The value ‘a’ was unexpected for this input” is a far better error. Also avoid swallowing the exception in languages which support exception handling.
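
Most test frameworks let you attach a message to an assertion, which is often the cheapest way to get meaningful errors. A minimal JUnit 4 sketch (the discount method is a made-up example, shown inline so the sketch compiles):

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class ErrorMessageTest {
        @Test
        public void discountIsAppliedToTheTotal() {
            double total = applyDiscount(100.0, 0.25);
            // The message explains what was being checked and with which
            // inputs, so a failure can be understood without a debugger.
            assertEquals("25% discount on a $100.00 order should give $75.00",
                    75.0, total, 0.001);
        }

        // Hypothetical method under test, included so the example is runnable.
        private static double applyDiscount(double total, double rate) {
            return total * (1 - rate);
        }
    }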

3. Control the Environment

Regularly run tests should run in a controlled environment which is the same for the current test run and future runs. This usually means a clean deploy, restoring databases and generally ensuring that however the application was set up originally, it is set up the same way again. This ensures the tests always start with the same conditions. It also forces you to have a good CI process and to be able to recreate environments from scratch when required, which is good development practice.
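
A small-scale version of the same idea applies inside the test suite itself: recreate the state each test depends on before every run and throw it away afterwards. A minimal JUnit 4 sketch, using an in-memory map as a stand-in for a real database:

    import java.util.HashMap;
    import java.util.Map;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class UserStoreTest {
        private Map<String, String> userStore; // stand-in for a real database

        // Recreate the state the tests depend on before every test so each
        // run starts from the same known conditions.
        @Before
        public void setUpCleanState() {
            userStore = new HashMap<>();
            userStore.put("admin", "admin@example.com");
        }

        // Throw the state away afterwards so nothing leaks into the next test.
        @After
        public void tearDown() {
            userStore.clear();
        }

        @Test
        public void theSeededAdminUserIsPresent() {
            assertEquals("admin@example.com", userStore.get("admin"));
        }
    }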

4. Fix it, delete it or mark it

A test that fails is proving its value, unless it is flaky. Tests that fail randomly slow down your development process, and in time they will be ignored and neglected. The moment a test is identified as flaky it should be fixed. If that will take time then mark it as flaky, remove it from the CI pipeline and investigate as part of paying down technical debt. If after a while it still isn’t resolved, look at whether it is providing any actual value. Odds are if it hasn’t been fixed for a month it is a test you can live without.

5. Be forgiving but verify

For any integration test you need your tests to be somewhat forgiving in how the application responds. After all, submitting an image into a text field may result in an error, which is probably acceptable. Keep in mind that there will also be timeouts to deal with. Be sure to wait a reasonable length of time for a response and only fail once that time has expired. Be wary of any test that waits forever for something to happen.
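
One simple way to avoid waiting forever is to poll for the expected condition with a hard deadline. A minimal sketch (the condition and timings are placeholders):

    import java.util.concurrent.TimeUnit;
    import java.util.function.BooleanSupplier;

    public class WaitFor {
        // Poll a condition until it becomes true or the timeout expires.
        // Returning false (rather than looping forever) lets the test fail
        // with a clear timeout instead of hanging the build.
        public static boolean waitFor(BooleanSupplier condition, long timeoutMillis)
                throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                if (condition.getAsBoolean()) {
                    return true;
                }
                TimeUnit.MILLISECONDS.sleep(100); // back off briefly between polls
            }
            return false;
        }

        public static void main(String[] args) throws InterruptedException {
            long start = System.currentTimeMillis();
            // Placeholder condition: "the page has loaded" after two seconds.
            boolean loaded = waitFor(() -> System.currentTimeMillis() - start > 2000, 5000);
            System.out.println("Condition met before timeout: " + loaded);
        }
    }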

Why Does Software Contain Bugs?

“Why does all software contain bugs?” was a question recently asked of me. My response at the time was that all software is not perfect, but is this true?

Let’s take a very simple example.


    public class Hello {
        public static void main(String[] args) {
            System.out.println("Hello World!");
        }
    }

The above example is arguably the simplest program that can be written using Java. It also happens to be the first program usually written by any Java programmer. It simply prints out the text “Hello World!” when it is run. Surely this program is so simple that it is perfect and therefore bug free?

Ignoring the obvious point that this program does nothing useful, let’s assume for the moment that we have been tasked to write a “Hello World!” program in Java. Surely the above is 100% bug free.

Yes. The application is 100% bug free. But that’s not the whole story. What happens when this application is run?

The first thing to happen is that it needs to be compiled. This takes the application from its text form and converts it into something that the computer can understand. In the case of Java it turns it into something the Java Virtual Machine can understand. This allows you to take the same compiled program and, in theory, run it on your computer, phone, PlayStation, Blu-ray player, iPad or any other device that runs Java.

The Java Virtual Machine or JVM is itself a compiled program running on a device. The catch is that it is compiled using a different compiler for every platform (computer, phone etc…). When it runs it takes your compiled application and converts the instructions into something that the computer understands.

However underneath the JVM is usually the Operating System. This hands out resources to the programs that are running. So the JVM runs inside the operating system and the operating system talks to the physical hardware.

Actually, while the operating system does talk to the hardware, these days there is usually software inside the hardware itself which controls it. Not only does the hardware contain software, the hardware itself, such as a CPU, is essentially software turned into hardware. This means CPUs and the like can also contain bugs.

This means in order for your perfect application to run perfectly the following pieces of software also need to run perfectly:


    Your Program -> Java Compiler -> JVM -> JVM Compiler -> Operating System -> Operating System Compiler -> Hardware Software -> Hardware Software Compiler -> Hardware Itself

As you can see it really is an iceberg, with your perfect program at the top and a lot of things going on underneath. A bug at any level can result in your “perfect” software not working as expected, making it flawed.

This is why perfect software does not currently exist. The only way to achieve it would be to write perfect software at every level, which is a monumental undertaking. There are estimates that put the cost of rewriting the Linux kernel at around 500 billion dollars, and that is without really accounting for making it “perfect”; as shown above, the kernel is just one small piece of the puzzle.

So should we just give in? Well no. At every level there are thousands of testers and processes designed to make the software as bug free as possible. Just because we cannot reach perfection does not mean it is not at least worth trying.

searchcode: the path to profitability

One of the things that has always bothered me about searchcode.com is that it never generated any money. Not a huge problem in itself for a side project, but the costs to run it are not insignificant due to the server requirements. I had looked into soliciting donations, but I considered it highly unlikely to produce enough revenue to cover costs, considering that a site such as gwern.net was unable to cover even basic costs through Patreon (although since a recent HN post this has jumped from around $20 a month to over $150).

This had caused me, back in the early days, to use BuySellAds to attempt to cover some costs. While this certainly helped, there was usually not enough revenue due to the way the ads are sold. The issue with BuySellAds is that you have to pitch your website as a good place to sell ads against, which is not something I had any great desire to do. Simply put, if I am going to spend my time marketing something it is going to be something more directly marketable than an advertising funded website. The other issue is that BuySellAds does not work with HTTPS, which became a deal breaker for me once Google announced that they were going to use HTTPS as a ranking signal.

This led me, just a few months ago, to consider shutting down the site. However, while doing some poking around I noticed a newish advertising platform called Carbon Ads. It is invite only, so I decided to email them with my pitch. Being a developer/designer focused ad platform it seemed like a natural fit. After a bit of back and forth I can now happily say that searchcode.com is running Carbon Ads. I have no numbers to report at this time, but based on some estimates made with Carbon Ads I am very hopeful that searchcode will cover its costs and hopefully produce some profit, all of which will be used to improve the service.

The benefit of testing for Developers, Managers and the Business

“Fixing regression bugs is analogous to digging a hole only to find the next day it has been filled in and having to dig it out again”

Ask any manager, developer or tester working on software without tests what the main pain points are. Nearly every time the main one mentioned is dealing with regressions: bugs that were fixed a year ago returning. Regression bugs cost the software industry billions of dollars a year. Worse still, they are demoralising to everyone involved. Finding or fixing the same bug over and over causes people to start looking for new projects or new jobs.

A good regression test suite generally solves these problems. It may not prevent every bug from reoccurring, but it will catch most of them. It also gives peace of mind that you have not reintroduced old bugs once they are fixed. Lastly, it saves time by checking whether the bug is fixed for you.
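
In practice a regression test is usually just an ordinary unit test named after the bug it guards against, kept in the suite permanently. A hedged sketch (the bug number and the rounding behaviour are invented for illustration):

    import java.math.BigDecimal;
    import java.math.RoundingMode;
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class InvoiceRoundingRegressionTest {
        // Hypothetical bug: totals were once truncated instead of rounded.
        // Keeping this test in the suite stops the bug quietly returning.
        @Test
        public void totalsAreRoundedToTwoDecimalPlaces_bug1234() {
            BigDecimal total = new BigDecimal("19.999")
                    .setScale(2, RoundingMode.HALF_UP);
            assertEquals(new BigDecimal("20.00"), total);
        }
    }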

“Software is never finished, only abandoned”

Another advantage of having a good test suite is the improvement to the software itself. Not only is testable software generally better written than non-testable software, a collection of tests provides a nice safety net when you start to improve your product. Software as a general rule is constantly in flux; it is either being changed and improved or left as is. If your software is never improved or built upon you may want to consider whether you really need tests. That said, there are other reasons to test software beyond what is mentioned above.

If none of the above points sell testing to you, consider this. As mentioned, testable software is usually well designed, modular software. This allows greater code reuse, and it allows new developers to quickly understand what they are dealing with and become productive faster. If nothing else, writing testable software will save you time and money in the long run by making things easier to understand.

AWS EC2 Instance Types to Use as Test Agents

When you are running test agents on AWS, knowing what instance type to use (for TeamCity or otherwise) can involve a lot of trial and error. Not only can there be great savings from picking the correct instance type, you can also speed up your builds and get test feedback faster, which can be far more valuable than the cost of a few additional cents an hour.

The following are some results I have found when playing with different instance types. Before Amazon released the burstable t2 instance types, one of the most common instances I had seen used was a general purpose instance such as the m3.medium. This always seemed like a good choice as most tests tend to use a mixture of CPU/disk/network, and that is what those instances are supposed to be good at.

The moment the burstable instances were released, several agents were relaunched as t2.mediums and left in the cluster for a week.

The outcome was that not only were they saving money, since they cost less per month, they were also able to run tests and builds faster than the previous agents. This was a surprise at first, until we observed that, with very few exceptions, every test was CPU bound. This included browser tests, which we had expected to be more network bound. The performance increase was mostly down to the instances accumulating CPU credits faster than they could be spent. See the image below, taken from a live instance, where you can clearly see how this works.

(Image: CPU credit usage and balance over time for a t2 test agent)

For the record, this agent runs 24/7, running builds and tests for dozens of different projects, including a lot of Selenium tests across multiple browsers.

There were, however, a few tests which consumed considerably more CPU than expected. These tests consisted of a collection of very heavy math operations and integrations all running on the same machine. A single agent was boosted to a c4.large to take care of these tests and everything has been working fine since. Build times were down and the developers got feedback sooner.

We also tried relaunching instances as the next generation type, such as moving from an m3.large to an m4.large, and the result was far faster builds. This is probably due to the underlying hardware AWS uses being faster. It was, however, still worth using t2 agents due to the cost saving and roughly equivalent performance.

Conclusions

It really depends on your environment and how much you are using the agents. I think the following guidelines apply fairly well though.

* For any test agents running Windows you want a minimum of a t2.medium on AWS, or the equivalent with 2 vCPUs and 4 GB of RAM.
* Test agents running Linux want a minimum of a t2.small on AWS, or the equivalent with a single vCPU and 2 GB of RAM.
* For agents that run tests infrequently, as in fewer than 6 times an hour, stick with the lower end t2 instances.
* For agents that run heavy loads consider using a c4.large as the increased CPU will really cut down on the test time.
* Always go for the latest generation in AWS, i.e. use a c4.large over a c3.large, for increased performance.

The main takeaway, however, is to ensure you can relaunch your instances as different types easily. Try out different types and see what happens. The winning strategy I found was to launch as a t2.medium at first, then dial it down to a t2.small if it was overpowered (which was never the case for Windows) or relaunch as a c4.large if it was underpowered.

The result was much faster builds saving developer time and frustration.
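
For completeness, relaunching an agent as a different type is essentially a stop, modify, start cycle. A rough sketch using the AWS SDK for Java (the v1 API is assumed here; waiting for the instance to fully stop and all error handling are omitted):

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.ModifyInstanceAttributeRequest;
    import com.amazonaws.services.ec2.model.StartInstancesRequest;
    import com.amazonaws.services.ec2.model.StopInstancesRequest;

    public class ResizeAgent {
        public static void main(String[] args) {
            String instanceId = args[0]; // the build agent's instance id
            String newType = args[1];    // e.g. "t2.medium" or "c4.large"
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            // The instance must be fully stopped before its type can change;
            // a real script would poll until it reaches the stopped state.
            ec2.stopInstances(new StopInstancesRequest().withInstanceIds(instanceId));

            ec2.modifyInstanceAttribute(new ModifyInstanceAttributeRequest()
                    .withInstanceId(instanceId)
                    .withInstanceType(newType));

            ec2.startInstances(new StartInstancesRequest().withInstanceIds(instanceId));
        }
    }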

Issues with Google’s Bug Prediction Algorithm

In December 2011 the Google engineering team published a blog post about bug prediction at Google. The topic caused quite a lot of discussion at the time on forums such as Hacker News and the Reddit programming subreddit.

How bug prediction works

In a nutshell, the prediction works by ranking files according to their commit history, checking how many changes to each file have been flagged as bug fixes. Of course this means that code which was previously buggy but has since stabilised could still appear high in the list. This was addressed in the post by weighting the results over time, so that recent bug fixes count for more than old ones.
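
The open source reimplementations of this approach score each file by summing a time-decayed weight over its bug-fixing commits, so recent fixes count for far more than old ones. A rough sketch of that scoring (the constant follows the formula used by the bugspots implementation mentioned later; the commit timestamps are made up):

    import java.util.Arrays;
    import java.util.List;

    public class HotspotScore {
        // Score a file from the normalised timestamps of its bug-fixing
        // commits, where 0.0 is the oldest commit in the repository and
        // 1.0 is the newest. Recent fixes dominate the score.
        public static double score(List<Double> bugFixTimestamps) {
            double total = 0.0;
            for (double t : bugFixTimestamps) {
                total += 1.0 / (1.0 + Math.exp(-12.0 * t + 12.0));
            }
            return total;
        }

        public static void main(String[] args) {
            // A file with three old fixes scores lower than one with two recent fixes.
            System.out.println("old fixes:    " + score(Arrays.asList(0.1, 0.2, 0.3)));
            System.out.println("recent fixes: " + score(Arrays.asList(0.9, 0.95)));
        }
    }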

Issues with bug prediction

Since that time the topic has been reposted a few times, and it has emerged that the system has been discontinued at Google. Thankfully the author of the original post was able to respond and gave one of the main reasons why it was discontinued.

TL;DR is that developers just didn’t find it useful. Sometimes they knew the code was a hot spot, sometimes they didn’t. But knowing that the code was a hot spot didn’t provide them with any means of effecting change for the better. Imagine a compiler that just said “Hey, I think this code you just wrote is probably buggy” but then didn’t tell you where, and even if you knew and fixed it, would still say it due to the fact it was maybe buggy recently. That’s what TWR essentially does. That became understandably frustrating, and we have many other signals that developers can act on (e.g. FindBugs), and we risked drowning out those useful signals with this one.

Some teams did find it useful for getting individual team reports so they could focus on places for refactoring efforts, but from a global perspective, it just seemed to frustrate, so it was turned down.

From an academic perspective, I consider the paper one of my most impactful contributions, because it highlights to the bug prediction community some harsh realities that need to be overcome for bug prediction to be useful to humans. So I think the whole project was quite successful… Note that the Rahman algorithm that TWR was based on did pretty well in developer reviews at finding bad code, so it’s possible it could be used for automated tools effectively, e.g. test case prioritization so you can find failures earlier in the test suite. I think automated uses are probably the most fruitful area for bug prediction efforts to focus on in the near-to-mid future.

Another paper which analyses the results of the Google bug predictor is Does Bug Prediction Support Human Developers? Findings From a Google Case Study (PDF).

Another interesting thought is that the system requires a reasonable amount of commit message discipline. Some people and teams use bug fix commits to resolve feature requests, with the result that this system would mark actively developed code as a bug hot spot. Meanwhile a poorly organised team or individual would not see their code appear at all. This is especially problematic where performance is being measured against these sorts of metrics.

If you are interested in trying this yourself there are some open source implementations of the algorithm presented in the Google paper. See bugspots on GitHub as an example of one which will work against any git repository.

What is Usability Testing?

Usability tests are manual tests used to check that the user interface is understandable. The focus of the tests is to ensure that the product meets its intended purpose. These sorts of tests can be subjective and are usually impossible to automate. It is important to differentiate usability testing from simply showing an interface to someone and asking them “Do you understand how this works?”. It is usually done by creating a scenario such as “Can you find and add this song to a new playlist?” and observing the steps the user takes to perform the task.

Usability tests can be valuable for a variety of reasons. For online applications and sales, if a website is difficult to use or the product hard to find, the user will leave; remember your biggest enemy in these cases is the back button. For any software made to be used, usable software will improve productivity. This can be especially important in your sales pitch or when trying to move to a new process internally.

Before performing any usability testing, keep in mind that there are five components relating to quality:

* Learnability. How easy is it for users to accomplish basic tasks the first time they encounter the interface?
* Efficiency. Once users have learned the interface, how quickly can they perform tasks?
* Memorability. When users return to the interface after a period of not using it, how easily can they re-establish proficiency?
* Errors. How many errors do users make, how severe are these errors, and how easily can they recover from the errors?
* Satisfaction. How pleasant is it to use the design?

One of the quickest ways to perform usability testing is to select random individuals and ask them to use the product or service. This is also known as hallway testing, since it can include asking people passing by in the hallway. It can be a very effective method of finding serious problems in some software. Of course it is probably not the most effective way to test specialist software such as ultrasound controllers, but for anything consumer facing it can be an effective technique. Generally you can convince people to do this sort of testing for free, or for very low cost, if you are polite about it.

Expert reviews are another form of usability testing which can overcome the issues with hallway usability testing. In an expert review, experts in a given field are asked to evaluate a product. Generally the following 10 usability heuristics by Nielsen are used (taken from Wikipedia):

* Visibility of system status. The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.
* Match between system and the real world. The system should speak the user’s language, with words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order.
* User control and freedom. Users often choose system functions by mistake and will need a clearly marked “emergency exit” to leave the unwanted state without having to go through an extended dialogue. Support undo and redo.
* Consistency and standards. Users should not have to wonder whether different words, situations, or actions mean the same thing. Follow platform conventions.
* Error prevention. Even better than good error messages is a careful design which prevents a problem from occurring in the first place. Either eliminate error-prone conditions or check for them and present users with a confirmation option before they commit to the action.
* Recognition rather than recall. Minimise the user’s memory load by making objects, actions, and options visible. The user should not have to remember information from one part of the dialogue to another. Instructions for use of the system should be visible or easily retrievable whenever appropriate.
* Flexibility and efficiency of use. Shortcuts, unknown to the novice user, may often speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users. Allow users to tailor frequent actions.
* Aesthetic and minimalist design. Dialogues should not contain information which is irrelevant or rarely needed. Every extra unit of information in a dialogue competes with the relevant units of information and diminishes their relative visibility.
* Help users recognise, diagnose, and recover from errors. Error messages should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution.
* Help and documentation. Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, focused on the user’s task, list concrete steps to be carried out, and not be too large.

Of course this is a far more formal method of usability testing. Usually it will involve paying the testers for their time, although some professionals will be happy to do it without cost if you are able to make things convenient for them. Tools such as GoToMeeting can help with this.

Mutation Testing or How to Test Tests

Mutation testing is a technique used to verify that tests are providing value. It involves modifying the given program in small ways, such as changing a boolean check from True to False. A mutated version of the code is known as a mutant. For each mutant the test suite is run against it, and some percentage of the tests should fail. Where a mutant is not caught, additional tests can be written to cover those cases.

Mutation testing works well where you have a reasonable level of code coverage and can be quite effective when done with test driven development. In my experience, and that of others, well written tests should have about a 70% failure rate against code that has been mutated.

Quite a few mutation testing frameworks are out there, such as Heckle, Insure++ and Nester, but you can get away with a simple find and replace in most cases. Because mutation testing requires a lot of manual intervention to review the results, it is usually something you would put into the code review process rather than into your continuous integration system. Even after mutating the code many of the tests may still pass; this might be because the mutated code was not conditional and still produces the correct output, rather than because the tests are ineffective.
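
As a concrete illustration of the find and replace approach, the sketch below produces a mutant by flipping a single equality check; you would then run your normal test suite against the mutated copy (the file paths are placeholders):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class NaiveMutator {
        public static void main(String[] args) throws IOException {
            Path original = Paths.get(args[0]);           // source file to mutate
            Path mutant = Paths.get(args[0] + ".mutant"); // where to write the mutant

            String source = new String(Files.readAllBytes(original), StandardCharsets.UTF_8);

            // Flip the first equality check; if the test suite still passes
            // against the mutant, that code path is probably under-tested.
            String mutated = source.replaceFirst("==", "!=");

            Files.write(mutant, mutated.getBytes(StandardCharsets.UTF_8));
            System.out.println("Wrote mutant to " + mutant + ", now run the tests against it.");
        }
    }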

I have written a very simple mutation tester which can be used against most languages and hosted it on GitHub for your convenience: https://github.com/boyter/Mutator/

Who is Responsible for Software Quality?

At the beginning of my software development career I was interviewing for an intern position at Microsoft. I never did get the job, but one thing out of that interview process really stuck with me. The second interviewer, after the usual getting to know you chat, asked me the following question: “On any given software project we have developers, software testers / quality assurance and managers involved. Who is responsible for the quality of the software?”. Being young and naive I confidently responded that the QA/testers were. After a long discussion, artfully controlled by the interviewer, I came to change my opinion. Below is the line of reasoning I went through with him.

Software testers / quality assurance write code to find bugs and prevent them from reoccurring. They are also responsible for performing exploratory testing to identify issues developers did not anticipate. Other testers verify that requirements were implemented correctly. Finally, they write the bug reports used by developers to fix any issues found in the software. As such, testers are responsible for software quality since they catch the bugs. However, software testers usually do not fix the code itself. Their bug reports are the developers’ main feedback on how the bugs should be resolved. So this would mean that the developers are actually responsible for software quality.

Developers write the code that makes the software do anything. As such they are responsible for implementing any bug fixes and following processes to ensure that a minimum number of defects are delivered. They also have to take the requirements for the software and ensure they are implemented correctly. Given this, of course developers are responsible for software quality, as they are the ones who actually fix any issues and implement the requirements. Failure to do either produces at best a flawed outcome and at worst a totally broken one. Naturally, for any code issues the responsibility for quality lies with the developer. However, what if the requirements are wrong? The developers work against the requirements, as do the testers. Managers usually write the requirements, so is management responsible for software quality?

Managers, project or otherwise, usually do not write code or work to identify bugs. They are however responsible for the processes used to gather requirements, for ensuring that the team has sufficient time to fix issues, and for getting any problems out of the developers’ and testers’ way. But we still have the issue that management doesn’t write code or identify bugs; that’s the testers’ and developers’ job! That means that the testers and developers are responsible for code quality!

Of course you can now see the circular logic occurring here.

So Who is Responsible?

The answer is now obvious. Who is responsible for software quality? Everyone.