The Worst Individual I Ever Worked With

Taken from a comment I posted on HN in a thread about a Soccer Con Man.

Not actually a programmer. The guy was hired to be a project manager.

After joining things were as expected but after a few weeks we noticed that he was rarely around after lunch and never around after lunch on a Friday.

We would email him at those times deliberately to catch him out and I recall starting to put sticky notes on his laptop “Came to see you a X time”. He would come back and just dump all the notes in the rubbish and claim he never got them. He would often claim to be working from home, despite his laptop being on his desk and usually closed. He would also never responding during those times to email or IM.

A classic seagull manager he would appear when something went wrong, making a lot of noise, writing a lot of emails and then vanishing. He would also be sure to be seen when something was delivered often staying back late on those times.

It got so bad one friend of mine started tracking when he was around and then tracking when one of his relatives died. During his tenure the following incidents occurred,

– Hot water system blew up. 4 times. He had pictures which he would show all the time.

– Uncles, aunts, and various over family members died to the total of 20 individuals.

– Our time tracking him showed him to be in the office less than 15 hours a week on average.

We started to suspect he had a second job and was pulling the same con on them. This was never proved, but we did find someone who had worked with him previously and they reported the same behavior.

The worst thing was it was raised with management at least several dozen times and nothing ever happened. He managed to pull this scam off for 4 years. I could not believe the waste of money this guy was, literally $500,000 burnt on a useless individual.

Types of Testing in Software Engineering

There are many different types of testing which exist in software engineering. They should not be confused with the test levels, unit testing, integration testing, component interface testing, and system testing. However the different test levels may be used by each type as a way of checking for software quality.

The following are all different types of tests in software engineering.

: A/B testing is testing the comparison of two outputs where a single unit has changed. It is commonly used when trying to increase conversion rates for online websites. A real genius in this space is Patrick McKenzie and a few very worthwhile articles to read about it are How Stripe and AB Made me A Small Fortune and AB Testing

: Acceptance tests usually refer to tests performed by the customer. Also known as user acceptance testing or UAT. Smoke tests are considered an acceptance test.

: Accessibility tests are concerned with checking that the software is able to be used by those with vision, hearing or other impediments.

: Alpha testing consists of operational testing by potential users or an independent test team before the software is feature complete. It usually consists of an internal acceptance test before the software is released into beta testing.

: Beta testing follows alpha testing and is form of external user acceptance testing. Beta software is usually feature complete but with unknown bugs.

: Concurrent tests attempt to simulate the software in use under normal activity. The idea is to discover defects that occur in this situation that are unlikely to occur in other more granular tests.

: Conformance testing verifies that software conforms to specified standards. An example would checking a compiler or interpreter to see if it will work as expect against the language standards.

: Checks that software is compatible with other software on a system. Examples would be checking the Windows version, Java runtime version or that other software to be interfaced with have the appropriate API hooks.

: Destructive tests attempt to cause the software to fail. The idea being to check that software continues to work even with given unexpected conditions. Usually done through fuzzy testing and deliberately breaking subsystems such as the disk while the software is under test.

: Development testing is testing done by both the developer and tests during the development of the software. The idea is to prevent bugs during the development process and increase the quality of the software. Methodologies to do so include peer reviews, unit tests, code coverage and others.

: Functional tests generally consist of stories focussed around the users ability to perform actions or use cases checking if functionality works. An example would be “can the user save the document with changes”.

: Ensures that software is installed correctly and works as expected on a new piece of hardware or system. Commonly seen after software has been installed as a post check.

: Internationalisation tests check that localisation for other countries and cultures in the software is correct and inoffensive. Checks can include checking currency conversions, word range checks, font checks, timezone checks and the like.

Non functional
: Non functional tests test the parts of the software that are not covered by functional tests. These include things such as security or scalability which generally determine the quality of the product.

Performance / Load / Stress
: Performance load or stress testing is used to see how a system performance under certain high or low workload conditions. The idea is to see how the system performs under these conditions and can be used to measure scalability and resource usage.

: Regression tests are an extension of sanity checks which aim to ensure that previous defects which had a test written do not re-occur in a given software product.

: Realtime tests are to check systems which have specific timing constraints. For example trading systems or heart monitors. In these case real time tests are used.

Smoke / Sanity
: Smoke testing ensures that the software works for most of the functionality and can be considered a verification or acceptance test. Sanity testing determines if further testing is reasonable having checked a small set of functionality for flaws.

: Security testing concerned with testing that software protects against unauthorised access to confidential data.

: Usability tests are manual tests used to check that the user interface if any is understandable.

Syncing Stash/BitBucket with searchcode server

Recently it came up to perform a slight integration piece between a on premises Stash/BitBucket install and a searchcode server install. Thankfully both have an API and very thankfully there is a nice Python library for talking to Stash/BitBucket.

Below is the code used. It pulls out all of the repositories from every project, checks if it exists in searchcode and if not adds it as a repository to be indexed. You need to install stashy (pip install stashy) and run it whenever you have new repositories. One idea is to set it as a cron task and ensure everything is in sync.

Note that this does not remove repositories that have been indexed, but it would not take much work to achieve it.

import stashy
from hashlib import sha1
from hmac import new as hmac
import urllib2
import json
import urllib

def getstashrepos():
    stash = stashy.connect("https://mystashserver/", "STASH_USERNAME", "STASH_PASSWORD")

    projects = stash.projects.list()
    repos = [stash.projects[x['key']].repos.list() for x in projects]

    stashrepos = []

    for repo in repos:
        stashrepos = stashrepos + [{'name': x['project']['key'] + '-' + x['slug'],
                                    'cloneUrl': x['cloneUrl'],
                                    'browse': x['links']['self'][0]['href']} for x in repo]

    return stashrepos

def addtosearchcode(repo):
    reponame = repo['name']
    repourl = repo['cloneUrl']
    repotype = "git"
    repousername = "STASH_USERNAME"
    repopassword = "STASH_PASSWORD"
    reposource = repo['browse']
    repobranch = "master"

    message = "pub=%s&reponame=%s&repourl=%s&repotype=%s&repousername=%s&repopassword=%s&reposource=%s&repobranch=%s" % (

    sig = hmac(privatekey, message, sha1).hexdigest()

    url = "http://mysearchcodeserver/api/repo/add/?sig=%s&%s" % (urllib.quote_plus(sig), message)

    data = urllib2.urlopen(url)
    data =

    data = json.loads(data)
    print reponame, data['sucessful'], data['message']


message = "pub=%s" % (urllib.quote_plus(publickey))

sig = hmac(privatekey, message, sha1).hexdigest()
url = "http://mysearchcodeserver/api/repo/list/?sig=%s&%s" % (urllib.quote_plus(sig), message)

data = urllib2.urlopen(url)
data =

data = json.loads(data)
existingrepos = [x['name'] for x in data['repoResultList']]

for repo in getstashrepos():
    if repo['name'] not in existingrepos:

Python Fabric: Getting File from Host as String

When using fabric for deployments you will sometimes want check an existing file for the presence of a value before applying an update. A common example I run into is checking if an apt-source has already been added before adding it again. This is a little clunky in fabric, but thankfully you can write a simple helper which takes case of it for you.

def _get_remote(fileloc):
    '''Pulls back a file contents from connection as string'''
    from StringIO import StringIO

    fd = StringIO()
    get(fileloc, fd)
    content = fd.getvalue()
    return content

Usage is fairly simple. Say we want to install the latest version of Varnish Cache on an Ubuntu server. Usage like so works,

if ' trusty varnish-4.0' not in _get_remote('/etc/apt/sources.list.d/varnish-cache.list'):
    sudo('curl | sudo apt-key add -')
    sudo('''sudo sh -c 'echo "deb trusty varnish-4.0" >> /etc/apt/sources.list.d/varnish-cache.list' ''')
    sudo('apt-get -y update')

At this point you should be able to install the latest version of varnish with a simple apt-get. The Architecture – migration 3.0

27th July 2016 at about 9:30pm my local time (GMT +10) I updated the A records for searchcode’s nameservers to point at a new stack that has been several months in the making. As with most posts of this sort of nature a quick recap of where things were and where they are now.

The previous searchcode stack consisted of two dedicated servers hosted by Hetzner. I have previously discussed this about two years ago when discussing searchcode next. The first server was a reasonably powerful machine running pretty much all of code required to deliver itself with the exception of the actual index. searchcode is a Django application and was using nginx to serve results directly out of memcached where possible to avoid consuming a running Gunicorn process. The move to have another server for the index came pretty quickly after searchcode was released as it was just not performant enough with everything on a single box which was the situation for a short time.

searchcode before

This structure worked well for the last 2 years or so but I had noticed that the load average on the frontend was starting to average out at about 3.0+ and would quite often rise to 7.0+ Considering the machine had only 4 real CPU cores this was an issue. Interestingly it also caused the number of requests that searchcode could respond to to max out. In addition the MySQL database had some corruption somewhere in the middle which made to app increasingly unstable. The application was going down where nothing short of a reboot would save it at least once a month. Finally I had grown as a developer over the last two years and it was time to move to something more stable and better performing.

The first thing in this modern cloud world was start looking at moving to something like AWS or DigitalOcean. I firstly created a simple spreadsheet with what I was looking to move towards along with expected costs. In short I wanted to break searchcode apart so that it consisted of the following parts,

Frontend server. Would provide SSL termination to Varnish and reverse proxy back to application servers. Requires, either a lot of RAM or fast disk. If RAM is low but disk is fast can use Varnish with disk cache.

Backend server. Would run the application itself. Scales horizontally. Requires a fast CPU to process the code results.

Indexer server. Would host the Sphinx indexer. Scales horizontally. Requires a fast CPU, reasonably amounts of RAM and about 25 GB of amount of disk space per CPU core, preferably SSD.

Database server. Would host the MySQL database. Does not scale horizontally. Requires lots of RAM (more than 4 GB) and 2 TB of disk space.

With these requirements in mind. I started shopping around.

AWS was ruled out almost immediately due to the cost. This was in spite of the fact that I have a great deal of experience dealing with the AWS API’s. As a rule I run as lean as possible and AWS while brilliant ended up costing more than I would have liked.

I then started looking at DigitalOcean. I have always been a fan of how fast they spin instances up, the simple API and their prices. They also have excellent support and are pretty lenient when it comes to how much CPU and DiskIO you consume. When I started looking however they had not launched Block Storage, which meant I needed the $640 a month plan for the database which was even more expensive then AWS. Later they did release block storage, but the resulting price was still rather high. I still use DigitalOcean for spinning up test stacks.

I also looked at Vultr. It’s pretty safe to say they are a DigitalOcean clone but with more data centers, competitive prices and much worse customer support. They do have one intriguing server option however not mentioned on their public website, which is a storage server. It’s a server backed by a regular spinning rust style disk, but as such is far cheaper than anything offered by any other company. A database server with 1 TB of storage, 4 GB RAM and 2 CPU’s (the largest storage instance they have) is only $40 a month.

A believer in infrastructure as code I was very keen to move to one of the cloud servers and picked Vultr based on the storage instance and cost. I started importing the searchcode database into the storage instance. However about 3/4 the way through I received an email from Vultr support saying that I was exceeding the DiskIO of the storage instance using a sustained 125 IOPS. Fair enough, and I would have been willing to throttle it back if I could gain some assurance about the terms of service. Sadly Vultr lived up to their reputation of having poor support and I am yet to hear back. I terminated the instance and started looking again. They do offer dedicated instances and the prices are actually not too bad, but did not have enough disk space for my needs. I still use them for spinning up test stacks as they have a Sydney data centre which for me has lower network latency and offsets the slower creation time.

I realised that the only real option for something like searchcode (which is fully bootstrapped) is that I need as much power as possible and the lowest price. I also don’t mind getting my hands dirty and can deal without the support. Lastly I need to be able to abuse the machine as I see fit and the only way to do that is go dedicated.

Thankfully Hetzer has a very nice server auction house. You can browse through and pick up used servers for a considerable discount over a new order. In addition you don’t have to pay the setup fee. To avoid the database corruption issue I experienced previously I went shopping for a ECC RAM server for the database and quickly found a tidy machine with 32 GB RAM and enough disk space. For the frontend I just picked up the cheapest 32 GB machine I could find which interestingly was similar to my previous frontend machine but with 32 GB of RAM. Lastly I went looking for something to server as the backend and the indexer. I quickly realized that I could combine both the machines into one and save a few dollars. With this saving I was able to overlook the initial setup fee and went for two machines with 32 GB of RAM and 500 GB SSD’s. Since Hetzner cannot spin servers up and down like a cloud provider I hard-coded the details into my fab file for building the stack but otherwise everything deploys as though I was using a cloud provider. Only it costs considerably less and should be much much faster.

searchcode current

The results?

Well as mentioned the previous instances were sitting around a load average of 3.0+ most of the time. The new backend/indexer boxes are sitting at a load average of 0.1+ which is a massive improvement. The database and frontend are similarly loaded. The DNS at this point is still flipping over for some so its not serving all results yet but I cannot imagine the load rising beyond 1.0+ for everything.

With the hardware decisions discussed lets dive into the software starting with the data storage. MySQL has always been my database of choice simply because I am more familiar with it than any other database. I had previously toyed with migrating to Postgresql but decided against it simply because there are other things I should focus my time on that actually provide real value. As such I have stuck with that for the latest searchcode version however one change I did make was to upgrade to 5.7 so I can leverage the native JSON data type. Otherwise the only change was to modify the MySQL config to take advantage of the power of the box as per best practices.

The backend machines have Nginx set to listen to incoming connections from the frontend. They pass requests back to Gunicorn/Django which performs the appropriate action. Where possible the result using Django/Memcached is served directly, if not a request to Sphinx may be made for search results followed by a lookup to the database, or a direct request to the database. Gunicorn is configured to have four worker processes per box. Sphinx is configured using a distributed index with 4 agents running on each box and one box as the master. Memcached is currently configured with 8 GB of RAM for each backend instance but is not pooled together.

The frontend machine is configured using Varnish with 24 GB of the RAM allocated to cache. The remaining 8 GB is deliberately left over for use by OS caching and Nginx. Nginx is the frontend to the whole system providing SSL termination and proxying back to Varnish. I did briefly consider using HAProxy for this role, but since I was already familiar with Nginx and at the scale searchcode currently operates at Nginx was a better choice. If in time there is much greater load moving to HAProxy is something that will be considered.

Whats next? Well the future of at this point is to leverage the new machines to increase the size of the index and make it refresh the code more quickly. I have a few plans on how to do so and will release details in the next update. If you have read this far you are a beast! Thanks for the support and feel free to email with any further questions.

How to Hide Methods From Fabric Task Listing

Occasionally you may want to hide a method from appearing inside the fabric listing of available tasks. Usually its some sort of helper method you have created that is shared by multiple tasks. So how to hide it? Simply prefix with _

For example,

def _apt_get(packages):
    '''Makes installing packages easier'''
    sudo('apt-get update')
    sudo('apt-get -y --force-yes install %s' % packages)

When listing the fabric tasks this method will no longer appear in the results.

Python Fabric How to Show or List All Available Tasks

Showing or displaying the available tasks inside a fabric fabfile is one of those things that almost everyone wants to do at some point and usually works out you can just request a task you know will not exist (usually found through a typo). However there is a way to list them built into fabric itself.

The below are all methods which can be used to display the currently defined tasks.

fab -l 
fab -list
fab taskthatdoesnotexist

Try any of the above where a fabfile is located and be presented with a list of all the available tasks.

Set Ubuntu Linux Swapfile Using Python Fabric

Annoyingly most cloud providers have an irritating habit of not adding any swap memory to any instance you spin up. Probably because if they added swap to the instance the disk size would appear to be smaller then it is or if they had a dedicated swap partition they would have to bear the cost or again use some of your disk space.

Thankfully adding swap to your Ubuntu linux instance is fairly easy. The following task when run will check if a swapfile already exists on the host and if not create one, mount it and set it to be remounted when the instance is rebooted. It takes in a parameter which specifies the size of the swap in gigabytes.

def setup_swapfile(size=1):
    if fabric.contrib.files.exists('/swapfile') == False:
        sudo('''fallocate -l %sG /swapfile''' % (size))
        sudo('''chmod 600 /swapfile''')
        sudo('''mkswap /swapfile''')
        sudo('''swapon /swapfile''')
        sudo('''echo "/swapfile   none    swap    sw    0   0" >> /etc/fstab''')

BTW I am writing a book about how to Automate your life using Python Fabric click the link and register your interest for when it is released.

Python Fabric Set Host List at Runtime

With the advent of cloud computing where you spin up and tear down servers at will it becomes extremely useful to pick the hosts you want fabric to run on at runtime rather then through the usual env.hosts setting. This allows you to query your servers through your cloud providers API without having to maintain a list. This can be a more powerful and flexible technique then using roles and in a devops world can save you a lot of time.

The trick is to know that when fabric runs any outgoing SSH connection is only made when the first put/get/sudo/run command is made. This means you can change env.hosts before this time and target whatever machines you want.

For the example below we are going to run a command on all of our servers after getting a full IP list from our fictitious cloud provider through an API call using the excellent Python requests module.

Firstly lets define a task which will set our env.hosts to all of the servers in our cloud.

def all():
    req = requests.get('', headers={ 'API-Key': 'MYAPIKEY' });
    serverlist = json.loads(req.text)

    if len(serverlist) == 0:
        evn.hosts = []

    env.hosts = [server['public_ip'] for server in serverlist]

The above makes a HTTPS call using our API Key and loads the JSON response into an object. Then depending on if any servers exist or not we loop through pulling out the public ip address for all our servers and assign that to our environment hosts.

Now we can call it with a simple uname function to get the output for all of the servers inside our cloud.

$ fab all hostname
[box1] run: uname -s
[box1] out: box1
[box2] run: uname -s
[box2] out: box2

Disconnecting from box1... done.
Disconnecting from box2... done.

You can create individual tasks for each group of servers you control using this technique but that’s not very dry (don’t repeat yourself) or neat. Lets modify our all task to accept a parameter so we can filter down our servers at run time.

def all(server_filter=None):
    req = requests.get('', headers={ 'API-Key': 'MYAPIKEY' });
    serverlist = json.loads(req.text)

    if len(serverlist) == 0:
        evn.hosts = []
    if filter:
        env.hosts = [server['public_ip'] for server in serverlist if server['tag'] == server_filter]
        env.hosts = [server['public_ip'] for server in serverlist]

We changed our method to accept a parameter which is by default set to none which we can use to filter down our servers based on a tag. If your cloud providers API is sufficiently powerful you can even change the request itself to handle this use case and save yourself the effort of filtering after the return.

To call our new method you need to pass the filter like the following example.

$ fab all:linux hostname
[box1] run: uname -s
[box1] out: box1
[box2] run: uname -s
[box2] out: box2

Disconnecting from box1... done.
Disconnecting from box2... done.

BTW I am writing a book about how to Automate your life using Python Fabric click the link and register your interest for when it is released.

What is Chaos Testing / Engineering

A blog post by the excellent technical people at Netflix about Chaos Engineering and further posts about the subject by Microsoft in Azure Search prompted me to ask the question, What is chaos engineering and how can chaos testing be applied to help me?

What is Chaos Testing?

First coined by the afore mentioned Netflix blog post, chaos engineering takes the approach that regardless how encompassing your test suite is, once your code is running on enough machines and reaches enough complexity errors are going to happen. Since failure is unavoidable, why not deliberately introduce it to ensure your systems and processes can deal with the failure?

To accomplish this, Netflix created the Netflix Simian Army, which consists of a series of tools known as “monkeys” (AKA Chaos Monkey’s) that deliberately inject failure into their services and systems. Microsoft adopted a similar approach by creating their own monkey’s which were able to inject faults into their test environments.

What are the advantages of Chaos Testing?

The advantage of chaos engineering is that you can quickly smoke out issues that other testing layers cannot easily capture. This can save you a lot of downtime in the future and help design and build fault tolerant systems. For example, Netflix runs in AWS and as a response to a regional failure changed their systems to become region agnostic. The easiest way to confirm this works is to regularly take down important services in separate regions, which is all done through a chaos monkey designed to replicate this failure.

While it is possible to sit down and anticipate some of the issues you can expect when a system fails it knowing what actually happens is another thing.

The result of this is you are forced to design and build highly fault tolerant systems and to withstand massive outages with minimal downtime. Expecting your systems to not have 100% uptime and planning accordingly to avoid this can be a tremendous competitive advantage.

One thing commonly overlooked with chaos engineering is its ability to find issues caused by cascading failure. You may be confident that your application still works when the database goes down, but would you be so sure if it when down along with your caching layer?

Should I be Chaos Testing?

This really depends on what your tolerances for failure are and based on the likely hood of them happening. If you are writing desktop software chaos testing is unlikely to yield any value. Much the same applies if you are running a financial system where failures are acceptable so long as everything reconciles at the end of the day.

If however you are running large distributed systems using cloud computing (think 50 or more instances) with a variety of services and process’s designed to scale up and out injecting some chaos will potentially be very valuable.

How to start Chaos Testing?

Thankfully with cloud computing and the API’s provided it can be relatively easy to begin chaos testing. These tools by allowing you to control the infrastructure through code allow the replication of a host of errors not easily reproducible when running bare hardware. This does not mean that bare hardware systems cannot perform chaos testing, just that some classes of errors will be harder to reproduce.

Lets start by looking at the way Microsoft and Netflix classify their “monkey’s”.

Low chaos
: This refers to failures that our system can recover from gracefully with minimal or no interruption to service availability.

Medium chaos
: Are failures that can also be recovered from gracefully, but may result in degraded service performance or availability.

High chaos
: Are failures that are more catastrophic and will interrupt service availability.

Extreme chaos
: Are operations are failures that cause ungraceful degradation of the service, result in data loss, or that simply fail silently without raising alerts.

Microsoft found that by setting up a testing environment and letting the monkey’s loose that they were able to identify a variety of issues with provisioning instances and services as well as scaling them to suit. They also split the environments into periods of chaos where the monkey’s ran and dormant periods where they did not. Errors found in dormant periods were considered bugs, and flagged to be investigated and fixed. During chaos periods any low issues were also considered bugs and scheduled to be investigated and fixed. Medium issues raised low priority issues to on call staff to investigate along with high level issues. Extreme operations once identified were not run again until a fix had been introduced.

The process for fixing issues identified through this process was the following,

* Discover the issue, identify the impacts if any and determine the root cause.
* Mitigate the issue to prevent data loss or service impact in any customer facing environments
* Reproduce the error through automation
* Fix the error and verify through the previous step it will not reoccur

Once done the monkey created through the automation step could be added the the regular suite of tests ensuring that whatever issue was identified would not occur again.

Netflix uses a similar method for fixing issue, but by contrast run’s their monkey’s in their live environments rather then in a pure testing environment. They also released some information on some of the monkey’s they used to introduce failures.

Latency Monkey
: Induces artificial delays into the client-server communication layer to simulate service degradation and determine how consumers respond in this situation. By making very large delays they are able to simulate a node or even an entire service downtime. This can be useful as bringing an entire instance down can be problematic when an instance hosts multiple services and when it is not possible to do so through API’s.

Conformity Monkey / Security Monkey
: Finds instances that don’t adhere to best-practices and shuts them down. Examples for this would be checking that instances in AWS are launched into permission limited roles and if they are not shutting them down. This forces the owner of the instance to investigate and fix issues. Security monkey as an extension that performs SSL certificate validation / expiry and other security best practice checks.

Doctor Monkey
: Checks existing health checks that run on each instances to detect unhealthy instances. Unhealthy instances are removed from service.

Janitor Monkey
: Checks for unused resources and deletes or removes them.

10-18 Monkey (Localisation monkey)
: Ensures that services continue to work in different international environments by checking that languages other then the base system consisting to work

Chaos Gorilla
: Similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone.

Well hopefully that explains what Chaos Testing / Engineering is for those who were previously unsure. Feel free to contact me over twitter or via the comments for further queries or information!