Collection of my favorite optimization posts and articles

A collection of my favorite posts to read and re-read about optimizing code to an extreme. Unlikely that I will ever need to go to the extremes that these very talented individuals go to but its nice to learn the techniques.

In no particular order.

Will attempt to keep this list up to date as I find other content that really impresses me.

Simply encrypt or decrypt a string using Boto3 Python and AWS KMS

Another one of those things I need to look up every now and then. Below is a snippet of how to encrypt and decrypt a string using Python and KMS in AWS. The interesting thing is that you don’t need to supply the KMS key alias in the decryption portion. So long as whatever role or key you are using can access the key it should work. For the encryption you can either supply the full ARN of the key or the alias so long as you prefix it with alias/

import base64
import boto3

def encrypt(session, secret, alias):
    client = session.client('kms')
    ciphertext = client.encrypt(
    return base64.b64encode(ciphertext["CiphertextBlob"])

def decrypt(session, secret):
    client = session.client('kms')
    plaintext = client.decrypt(
    return plaintext["Plaintext"]

session = boto3.session.Session()
print encrypt(session, 'something', 'alias/MyKeyAlias')
print decrypt(session, 'AQECAINdoimaasydoasidDASD5asd45')

Setup up ConcourseCI 2.6.0 behind Nginx with Self Signed Certificates on Ubuntu 16.04

Concourse CI is a very nice continuous integration server.

However for installs there are a few gotcha’s you need to keep in mind. Mostly these relate to how TLS/SSL works.

The first is that while it is possible to run concourse inside Docker I found this to cause a lot of issues with workers dying and not recovering. I would suggest installing the binarys on bare machines. When I moved from a docker cluser using Amazon’s ECS to a single t2.large instance not only were builds faster but it was a far more reliable solution.

I am also not going to automate this install, and will leave it as an excercise for you the reader to do this yourself. I would suggest using Python Fabric, or something like Puppet, Ansible or Saltstack to achive this.

Also keep in mind that with this install everything is running on a single instance. If you have need to scale out this is not going to work, but as a way to get started quickly it works pretty well.

Prerequisites are that you have a Ubuntu instance running somewhere. If you want to run the fly execute command you will also need a valid domain name to point at your machine. This is an annoying thing caused by GoLang when using SSL certs. Turns out you cannot set a hostfile entry and use it as such. You can in a insecure non SSL mode but otherwise cannot.

If you are using a virual machine from DigitalOcean/AWS/Vultr or other you will need to add some swap space. I noticed a lot of issues where this was missing. You can do so by running the following commands which will configure your server to have 4G of swap space,

sudo fallocate -l 4G /swapfile
 sudo chmod 600 /swapfile
 sudo mkswap /swapfile
 sudo swapon /swapfile
 sudo echo "/swapfile none swap sw 0 0" >> /etc/fstab

We will need to get the concourse binary, and to make it executable. For convenience and to match the concourse documentation lets also rename it to concourse. To do so run the following command.

wget && mv concourse_linux_amd64 concourse && chmod +x concourse

We now need to generate the keys that concourse requires.

mkdir keys
 cd keys

ssh-keygen -t rsa -f tsa_host_key -N '' && ssh-keygen -t rsa -f worker_key -N '' && ssh-keygen -t rsa -f session_signing_key -N '' && cp authorized_worker_keys
 cd ..

The above commands will create a directory called keys and setup all of the keys that concourse 2.6.0 requires.

We can now create some helper scripts which we can use to run concourse easily.


./concourse web \
 --basic-auth-username main \
 --basic-auth-password MySuperPassword \
 --session-signing-key ./keys/session_signing_key \
 --tsa-host-key ./keys/tsa_host_key \
 --tsa-authorized-keys ./keys/authorized_worker_keys \
 --external-url https://YOURDNSHERE/ \
 --postgres-data-source postgres://concourse:concourse@

chmod +x

This script will start running concourse. Keep in mind that the username and password used here are for the main group and as such you should protect them as they have the ability to create additional groups on your concourse instance.


./concourse worker \
 --work-dir /opt/concourse/worker \
 --tsa-host \
 --tsa-public-key ./keys/ \
 --tsa-worker-private-key ./keys/worker_key

chmod +x

This script will spin up a worker which will communicate with the main concourse instance and do all the building. It can be useful to lower the priority of this command using nice and ionice if you are running on a single core machine.

Now we need to install all of the postgresql packages required,

apt-get update && apt-get install -y postgresql postgresql-contrib

Once this is done we can create the database to be used

sudo -u postgres createdb concourse

Then login to postgresql and create a user to connect to the database

sudo -u postgres psql
 CREATE USER concourse WITH PASSWORD 'concourse'
 GRANT ALL PRIVILEGES ON DATABASE "concourse" to concourse

We also need to need to edit the pg_hba file allowing us to make the connection,

sudo pico /etc/postgresql/9.5/main/pg_hba.conf

Scroll down and look for the following line,

host all all md5

and change the md5 on the end to trust

host all all trust

Then save the file and restart postgresql

service postgresql restart

At this point everything we need to run concourse should be there. You will need to setup the concourse scripts we created earlier to run as a service, or just run them in a screen session if you are in a hurry.

What we want to do now is expose it to the big bad internet.

apt-get install nginx

Create a directory using either the domain name you want to use, a desired name or anything if you are going to connect to things using IP addresses.

mkdir -p /etc/nginx/ssl/

Nowe we want to swtich to the directory and setup the self signed TLS/SSL keys.

cd /etc/nginx/ssl/
 openssl genrsa -des3 -out server.key 1024

Enter whatever you want for the passphrase but remember it!

openssl req -new -key server.key -out server.csr

Enter the passhrase entered. The most important thing here is that when asked for the Common Name or FQDN you need to enter in the desired domain name.

With that done we need to sign the key.

cp server.key
 openssl rsa -in -out server.key

Remember to enter the same pass phrase as before. Finally sign they key with an expiry of 9999 days.

openssl x509 -req -days 9999 -in server.csr -signkey server.key -out server.crt

Make a copy of the file server.crt which will be needed for the concourse fly tool to talk to the server if you are using self signed certs.

With that done lets enable the site,

sudo nano /etc/nginx/sites-available/

And enter in the following details,

upstream concourse_app_server {
 server localhost:8080;

server {
 listen 80 default_server;
 rewrite ^ https://MYIPADDRESSORDOAMIN$request_uri? permanent;

server {
 listen 443 default_server ssl http2;

ssl on;
 ssl_certificate /etc/nginx/ssl/;
 ssl_certificate_key /etc/nginx/ssl/;

location / {
 proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
 proxy_set_header Host $http_host;
 proxy_redirect off;
 proxy_pass http://concourse_app_server;

The above nginx config defines an upstream concourse server running on port 8080. It then defines a server listening on port 80 that redirects all traffic to the same server on HTTPS. The last server config defines out site, sets up the keys and forwards everything to the upstream concourse server.

We can now enable it,

sudo ln -s /etc/nginx/sites-available/ /etc/nginx/sites-enabled/
 rm -f /etc/nginx/sites-available/default

service nginx restart

We need to remove the default file above because nginx will not allow you to have two default servers.
At this point everything should be working. You should now be able to connect to your concourse server like so,

fly --target myciserver login --team-name myteamname --concourse-url https://MYIPADDRESSORDOAMIN --ca-cert ca.crt

Where the file ca.crt exists whereever you are running fly. Everything at this point should work and you can browse to your concourse server.

If you are using the IP address to communicate to your concourse server you have the limitation that you will not be able to run fly execute to upload a task. You can get around this by using a real domain, or running your own DNS to resolve it correctly.

The only bit of homework at this point would be to configure the firewall to disable access to port 8080 so that everything must go though nginx. Enjoy!

Sphinx Real Time Index How to Distribute and Hidden Gotcha

I have been working on real time indexes with Sphinx recently for the next version of and ran into a few things that were either difficult to search for or just not covered anywhere.

The first is how to implement a distributed search using real time indexes. It’s actually done the same way you would normally create an index. Say you had a single server with 4 index shards on it and you wanted to run queries against it. You could use the following,

index rt
    type = distributed
    local = rt1
    agent = localhost:9312:rt2
    agent = localhost:9312:rt3
    agent = localhost:9312:rt4

You would need to have each one of your indexes defined (only one is added here to keep the example short)

index rt1
    type = rt1
    path = /usr/local/sphinx/data/rt1
    rt_field = title
    rt_field = content
    rt_attr_uint = gid

Using the above you would be able to search across all of the shards. The trick is knowing that to update you need to update each shard yourself. You cannot pass documents to the distributed index but instead must make a separate update to each shard. Usually I split sphinx shards based on a query like the following,

SELECT cast((select id from table order by 1 desc limit 1)/4 as UNSIGNED)*2, \
         cast((select id from table order by 1 desc limit 1)/4 as UNSIGNED)*3 \
         FROM table limit 1

Where the 4 is the number of shards and the multiplier splits the shards out. It’s performant due to index use. However for RT I suggest a simple modulas operator % against the ID column for each shard as it allows you to continue to scale out to each shard equally.

The second issue I ran into was that when defining the attributes and fields you must define all the fields before the uints. The above examples work fine but the below is incorrect. I couldn’t find this mentioned in the documentation.

index rt
    type = rt
    path = /usr/local/sphinx/data/rt
    rt_attr_uint = gid # this should be below the rt_fields
    rt_field = title
    rt_field = content

Explaining VarnishHist – What Does it Tell Us

The varnishhist tool is one of the most underused varnish tools that come with your standard varnish install. Probably because of how it appears at first glance.

In short, you want as many | symbols as possible and you want everything far toward the left hand side. The closer to the left the faster the responses are regardless if they are cached or not. The more | symbols then more items were served from cache.

A small guide,

'|' is cache HIT
'#' is cache MISS
'n:m' numbers in left top corner is vertical scale
'n = 2000' is number of requests that are being displayed (from 1 to 2000)

The X-axis is logarithmic time between request request from kernel to Varnish and response from Varnish to kernel.

The times on the X-axis are as such,

1e1 = 10 sec
1e0 = 1 sec
1e-1 = 0.1 secs or 100 ms (milliseconds)
1e-2 = 0.01 secs or 10 ms
1e-3 = 0.001 secs or 1 ms or 1000 µs (microseconds)
1e-4 = 0.0001 secs or 0.1 ms or 100 µs
1e-5 = 0.00001 secs or 0.01 ms or 10 µs
1e-6 = 0.000001 secs or 0.001 ms or 1 µs or 1000 ns (nanoseconds)

Below is the varnishhist for showing that while most responses are served in about 100ms not many are cached. This can mean one of a few things.

  • The responses are not cache-able and you need to adjust the back-end responses to have the correct headers (or override the settings with VCL config).
  • The cache timeout for the back-end responses isn’t high enough to ensure that later requests are served from cache.
  • There isn’t a large enough cache to hold all the responses (that’s the problem in this case).
1:20, n = 2000

            |                    ####
            |                    ####
            |                    ####
            |                    #####
            |                    #####
            |                    #####
            |                   #######
            |                   #######
            ||  |    #      #   ##########
|1e-6  |1e-5  |1e-4  |1e-3  |1e-2  |1e-1  |1e0   |1e1   |1e2

MySQL Dump Without Impacting Queries

Posted more for my personal use (I have to look it up every time) but here is how to run a mysqldump without impacting performance on the box. It sets the ionice and nice values to be as low as possible (but still run) and uses a single transaction and ups the max packet size for MySQL.

ionice -c2 -n7 nice -n19 mysqldump -u root -p DATABASE --single-transaction --max_allowed_packet=512M > FILENAME

Exporting Documents from KnowledgeTree

I was recently tasked with exporting a large collection of documents from KnowledgeTree (KT) for a client. The collection was too large to use the download all functionality and too wide to attempt to export each folder individually.

I had played around with the WebDav connection that KT provides but it either didn’t work or was designed deliberately to not allow exporting of the documents.

I looked at where the documents were  stored on disk but KT stores them as numbered files in numbered directories sans extension or folder information.

Long story short I spent some time poking through the database to identify the tables which would contain the correct metadata which would allow me to rebuild the tree using a proper filesystem. For record the tables required are the following,

  • folders – Contains the folder tree. Each entry represents a folder and contains its parent folder id.
  • documents – Contains the documents that each folder contains. Knowing the folders id you can determine what documents live in that folder.
  • document_content_version – Contains the metadata required to get the actual file from disk. A 1 to 1 mapping between document id and this table is all that is required.

That said here is a short Python script which can be used to rebuild the folders and documents on disk. All that is required is to ensure that Python MySQLdb is installed and to set the database details. Depending on your KT install you may need to change the document location. Where  the script is run it will replicate the folder tree containing the documents preserving the structures, names and extensions.

Keep in mind this is a fairly ugly script abusing global variables and such. It is also not incredibly efficient, but did manage to extract 20GB of files in my case in a little under 10 minutes.

import MySQLdb
import os
import shutil

# KnowledgeTree default place to store documents
ktdocument = '/var/www/ktdms/Documents/'

conn = MySQLdb.connect(user='', passwd='',db='', charset="utf8", use_unicode=True)
cursor = conn.cursor()

# global variables FTW
cursor.execute('''select id, parent_id, name from folders;''')
allfolders = cursor.fetchall()

cursor.execute('''select id, folder_id from documents;''')
alldocuments = cursor.fetchall()

cursor.execute('''select document_id, filename, storage_path from document_content_version;''')
document_locations = cursor.fetchall()

# create folder tree which matches whatever the database suggests exists
def create_folder_tree(parent_id, path):
    directories = [x for x in allfolders if x[1] == parent_id]
    for directory in directories:
        d = '.%s/%s/' % (path, directory[2])
        print d
        # get all the files that belong in this directory
        for document in [x for x in alldocuments if x[1] == directory[0]]:
                location = [x for x in document_locations if document[0] == x[0]][0]
                print 'copy %s%s %s%s' % (ktdocument, location[2], d, location[1])
                shutil.copy2('%s%s' % (ktdocument, location[2]), '%s%s' % (d, location[1]))
                 print 'ERROR exporting - Usually due to a linked document.'

        create_folder_tree(parent_id=directory[0], path='%s/%s' % (path, directory[2]))

create_folder_tree(parent_id=1, path='')

C# XML Cleaner Regex

One of the most annoying things I deal with is XML documents with invalid characters inside them. Usually caused by copy pasting from MS Word it ends up with invisible characters that you cannot easily find and cause XML parsers to choke. I have encountered this problem enough that I thought a quick blog post would be worth the effort.

As such here mostly for my own reference is a regular expression for C# .NET that will clean invalid XML characters from any XML file.

const string InvalidXmlChars = @"[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF]";

Simply take the XML document as a string and do a simple regular expression replace over it to remove the characters. You can then import into whatever XML document structure you want and process it as normal.

For note the reason I encountered this was when I was building a series of WSDL web-services over HR data. Since the data had multiple sources merged during an ETL process and some of them were literally CSV files I hit this issue a lot. Attempts to sanitize the data failed as it was overwritten every few hours and it involved multiple teams to change the source generation. In the end I just ran the cleaner over every field before it was returned and everything worked perfectly.

Regular Expressions are Fast, Until they Aren’t

TL/DR: Regular expressions are fast, until they aren’t. How I got a 20x performance by switching to string functions.

With the new version of one of the main things I wanted to address was performance. The previous version had all sorts of performance issues which were not limited to the usual suspects such as the database or search index.

When developing the new version one of the tasks listed in my queue was to profile search operations for anything slowing things down. Sadly I have since lost the profile output but observed that one of the main speed culprits is the format_results function inside the code model. For most queries I tried while it was the slowest operation it wasn’t worth optimising simply because its impact was so low. I did however keep it in the back of my mind that if there were any performance issues it would be the first thing to look at.

The final step in any profiling however is to load some production data and run a load test. My personal preference being to generate a heap of known URL’s and tell a tool like Siege to go crazy with them. The results showed that 95% of pages loaded very quickly, but some took over 30 seconds. These instantly prompted further investigation.

A quick look at the profiler showed that the “back of mind function” was now becoming a serious issue. It seemed that for some regular expressions over some data types the worst case time was very bad indeed.

All of a sudden that function has become a major bottleneck.This needed to be resolved, to fix the worst case without selling out the best case which was very fast indeed. To get an understanding of what the function does you need to understand how searchcode works. When you search for a function of snippet searchcode tries to search for something that matches exactly what you are looking for first, and something containing anything in your query second. This means you end up with two match types, exact matches and fuzzy matches. The results are then processed by firstly trying to match the query exactly, and then going for a looser match.

This was implemented initially though two regular expressions like the below,

exact_match = re.compile(re.escape(search_term), re.IGNORECASE)
loose_match = re.compile('|'.join([re.escape(x) for x in search_term.split(' ')], re.IGNORECASE)

As you can see they are compiled before being handed off to another function which uses them for the actual matching. These are fairly simple regular expressions with the first just looking for any match and the second a large OR match. Certainly you would not expect them to cause any performance issues. Reading the following on stack overflow regarding the differences certainly seems to suggest that unless you are doing thousands of matches the performance should be negligible.

At heart searchcode is just a big string matching engine, and it does many thousands or hundreds of thousands of match operations for even a simple search. Since I don’t actually use of the power of regular expressions the fix is to change the code so that we pass in an array of terms to search for and use a simple Python in operator check.

exact_match = [search_term.lower()]
loose_match = [s.lower().strip()
                for s in search_term.split(' ')]

The results? Well remember we want to improve the worst case without selling out the best case, but the end result was pages that were taking nearly a minute to return were coming back in less than a second. All other queries seemed to come back either in the same time or faster.

So it turns our regular expressions are fast most of the time. Certainly I have never experienced and performance issues with them up till now. However at edge cases like the one in searchcode you can hit a wall and its at that point you need to profile and really re-think your approach.

Implementing C# Linq Distinct on Custom Object List

Ever wanted to implement a distinct over a custom object list in C# before? You quickly discover that it fails to work. Sadly there is a lack of decent documentation about this and a lot of FUD. Since I lost a bit of time hopefully this blog post can be picked up as the answer.

Thankfully its not as difficult as you would image. Assuming you have a simple custom object which contains an Id, and you want to use that Id to get a distinct list all you need to do is add the following to the object.

public override bool Equals(object obj)
	return this.Id == ((CustomObject)obj).Id;

public override int GetHashCode()
	return this.Id.GetHashCode();

You need both due to the way that Linq works. I suspect under the hood its using a hash to work out whats the same hence GetHashCode.