Home Battery Systems – You may de-rate system capacity

A quick post. I was looking at the prices to install a home battery system the other day. During the sales process I was informed that for a modular system, such as the one provided by Ampetus, if you don’t buy 2 or more batteries then

“you may de rate system capacity”

Sounds impressively scary! Slightly worried, I asked what it means. Thankfully what it actually means is that instead of getting 3kW of delivery using two batteries, one battery matched with the inverter may only deliver 1.5kW. This is not a problem if you are still connected to the grid, but it can mean that your inverter isn’t working at its full potential. So in short, nothing to worry about.

How to identify software licenses using Python, Vector Space Search and Ngram Keywords

The below is mostly a log of my thought process while building out some functionality that I wanted to add into searchcode server. I kept a record of progress and thoughts while doing this in the hopes that I get some sort of useful blog post out of it. It has been edited somewhat for clarity.

One of the tickets raised for searchcode server is the ability to filter by licence https://github.com/boyter/searchcode-server/issues/96 (yes, I am the one who raised the ticket, but it is based on requests from customers). It will also put searchcode server closer to feature parity with Krugle, so it seems like a good thing to add.

This means that ideally all I want is a script that, given a project’s directory and a filename, can tell me what licenses it is likely to be under. I can then port that over to searchcode-server and its Java codebase at my leisure.

My first thought was that adding it for, say, the top 20 most popular licenses shouldn’t be too difficult, and it’s something that I could always expand on later if things are working well.

As such the first thing I needed to do was determine the top software licenses in use, which thankfully Blackduck Software has already done https://www.blackducksoftware.com/top-open-source-licenses

Then I needed to get a copy of them in a nice format such as JSON to allow for processing. I decided to source them from SPDX.org since generally they should be considered the source of truth for license information. https://spdx.org/licenses/

I then wrote a simple script download.py to pull all the licenses down locally (yes, using regex to pull information out of HTML, which is BAD, but I am not trying to parse it so I am pretty sure he won’t come) and allow for further processing. Sorry SPDX people for crawling your site in a non-robots-compliant way. I did try to get only what I needed and cache it locally, so I suspect I impacted the site less than Googlebot would have.

I also added a simple filter to pull back the top 20+ licences that we previously identified so we can test things out.

The next step was to turn the HTML into JSON. Again an ugly script, parse.py, which tempts the beast by using regex to pull information out of HTML. One thing I did notice is that the Fair Source licence was missing. Since I was planning on using searchcode server as a test bed I needed it, so I added it myself.
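
To give an idea of the shape of the result, here is a rough sketch of that parse step. The regex pattern and the JSON schema (a list of shortname/text pairs) are my guesses at what parse.py produces, not the actual script.

import json
import re

# Illustrative only: pull the licence text out of each cached SPDX page with
# a (hypothetical) regex and dump everything into a single JSON database.
def build_database(pages):
    # pages is assumed to map a licence short name to its cached HTML
    database = []
    for shortname, html in pages.items():
        match = re.search(r'(?s)<div class="license-text">(.*?)</div>', html)
        if match:
            text = re.sub(r'<[^>]+>', ' ', match.group(1))  # strip leftover tags
            database.append({'shortname': shortname, 'text': text.strip()})
    return database

with open('database.json', 'w') as f:
    json.dump(build_database({}), f)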

Once again, sorry to the SPDX people. If you come to Sydney I’ll buy you a beer and apologise,

  1. For creating my own short-name for the Fair Source License without permission
  2. For the crappy internet you will experience (Seriously search for Turnbull NBN if you want to see how backwards a country can become, copper for life yo)

The result is that we now have a database of about 20 licences that we can use to try and determine what licence a software project is under.

Attempt 1

To start with I made the following assumptions.

A file with a name such as LICENSE or COPYING is likely to exist in the root folder of any project and contain license information. If not, have a look inside the readme (if it exists). If such a file is found and it has a license that we can identify, then it becomes the base license for all files inside the project. However, files inside the project itself may have a header which must also be considered.

One thing that I didn’t consider at first is that there may be another license/copying file somewhere deeper inside the file tree. Something to consider for later.

My first thought was to just use the vector space search algorithm. It really is my hammer for every problem. It works reasonably well for a lot of them, is fast enough in most cases and generally gets the job done. Thankfully I had already written one in Python a while ago. Please don’t hate; I wrote it 10 years ago when I was first learning Python, and yes it’s not Pythonic, but it works. One thing to note is that licenses tend to be of different lengths. This means that licenses that are closer to each other in length will be matched more closely using the vector space, which is a cool side effect of how it works.
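
For the curious, the gist of the approach is small enough to sketch here. This is not that decade-old implementation, just a minimal illustration of the idea: turn each text into a term frequency vector and score the known licenses by cosine similarity.

import math
import re
from collections import Counter

def concordance(text):
    # term frequency vector of the lowercased words in the text
    return Counter(re.findall(r'[a-z0-9]+', text.lower()))

def similarity(text_a, text_b):
    a, b = concordance(text_a), concordance(text_b)
    dot = sum(count * b[term] for term, count in a.items())
    magnitude = math.sqrt(sum(c * c for c in a.values())) * \
                math.sqrt(sum(c * c for c in b.values()))
    return dot / magnitude if magnitude else 0.0

def guess_license(content, licenses):
    # licenses is the JSON database from earlier: a list of shortname/text dicts
    scores = [(similarity(content, l['text']), l['shortname']) for l in licenses]
    return sorted(scores, reverse=True)[:3]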

So the algorithm is,

  1. Find likely candidates for license files.
  2. Compare them to the list of known licenses using the vector space model.
  3. Keep the most likely match.

Then walk through every file in the repository, checking for the special header and, when there are no matches, trying the full header, because things like the entire MIT license tend to get included at the top of the file. The resulting ugly script attempt1.py produced the following.

The result for searchcode-server

Project License
0.929696395964 Fair Source License v0.9
0.802095284986 Mozilla Public License 2.0 (no copyleft exception)
0.802095284986 Mozilla Public License 2.0

0.925341443302 MIT License /searchcode-server/src/main/resources/public/js/cache.js

Wow. That is a very cool result. It actually worked. It not only picked up that we are using the Fair Source Licence, it also picked up the one file that is using the MIT license. Let’s try it out on some other projects.

Against wordpress,

Project License
0.623430083878 GNU General Public License v2.0 only
0.614318516008 GNU General Public License v1.0 only
0.601642491832 GNU Library General Public License v2 only

It did pick up that it’s probably using GNU GPL v2.0, but it wasn’t as confident as it was with the previous project. Let’s try another one.

Against minitwit (a Spark Java example project)

Project License
0.954897366777 MIT License
0.784597744861 Fair Source License v0.9
0.777231345803 Apache License 2.0

Not bad. It’s pretty confident that it is under the MIT license.

Against Armory React

Project License
0.945769202843 BSD 3-clause Clear License
0.937649791859 BSD with attribution
0.927894236317 BSD 2-clause "Simplified" License

Again pretty good.

Ok. So it looks like we could just pop the top license off and call it a day. This works pretty well with the most common licenses, but why limit ourselves? Let’s just use every license that SPDX has listed. It should work just as well in theory. All we need to do is remove the filter in download.py, rebuild the database and try again.

Trying out on searchcode-server again

Project License
0.929696395964 Fair Source License v0.9
0.813818153434 OCLC Research Public License 2.0
0.804549095187 OSET Public License version 2.1

0.977617793941 BSD Zero Clause License /searchcode-server/include/license/database.json
0.939606278132 Attribution Assurance License /searchcode-server/include/license/database.json
0.908192569643 Open CASCADE Technology Public License /searchcode-server/include/license/database.json
0.902275136399 Adaptive Public License 1.0 /searchcode-server/include/license/database.json
0.93217139424 JSON License /searchcode-server/src/main/resources/public/js/cache.js
0.925341443302 MIT License /searchcode-server/src/main/resources/public/js/cache.js
0.914039614281 feh License /searchcode-server/src/main/resources/public/js/cache.js

Ok, so it still picked up Fair Source as the main project license, which is a good result. However our very cool result of MIT being found in cache.js has gone away. Apparently the JSON license looks like the MIT license to the vector space. Indeed a diff between them shows that they are almost the same, the differences being right at the start,

MIT License Copyright (c)
JSON License Copyright (c) 2002 JSON.org

and buried in the middle of the JSON license

The Software shall be used for Good, not Evil.

Hah! I remember reading about that a while ago. Something about Google not being able to use it because apparently their motto “Don’t be evil” is more a guideline than a rule. In all seriousness though, it’s actually due to the license being classified as a non-free license because it imposes conditions which restrict usage; see https://www.cnet.com/news/dont-be-evil-google-spurns-no-evil-software/ for more details.

So what we would normally do about now is add keyword weighting to the terms so that, in this case, “MIT” makes it rank higher for MIT and “JSON” for JSON. In fact I started doing just that with keywords, then realised that with 328 licenses this was going to be a painful process. Perhaps there is a better way.

Thinking about it, what we really want is to find keywords, or collections of multiple keywords, that we know to be unique for each license. Then all we need to do is check the text for the presence of those keywords. The catch is that we need to ensure they are unique to each license. What I think will work is to break each license up into collections of words of length 1-10 and check them for uniqueness against the other licenses. The technical term for these is an ngram.

An example would be,

Lorem ipsum dolor sit amet consetetur sadipscing elitr

bi-grams

[('lorem', 'ipsum'), ('ipsum', 'dolor'), ('dolor', 'sit'), ('sit', 'amet'), ('amet', 'consetetur'), ('consetetur', 'sadipscing'), ('sadipscing', 'elitr')]

tri-grams

[('lorem', 'ipsum', 'dolor'), ('ipsum', 'dolor', 'sit'), ('dolor', 'sit', 'amet'), ('sit', 'amet', 'consetetur'), ('amet', 'consetetur', 'sadipscing'), ('consetetur', 'sadipscing', 'elitr')]

Thankfully this is pretty easy to do in Python so I borrowed an existing bit of code to do it for me, http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/

input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

# Generate the n-grams of input_list as tuples; find_ngrams(input_list, 2)
# yields ('all', 'this'), ('this', 'happened') and so on.
def find_ngrams(input_list, n):
  return zip(*[input_list[i:] for i in range(n)])

For the record I am totally aware that NLTK can also do this, but since I don’t currently have it installed let’s go pure Python. It will be a little slower, but considering how rarely this calculation should run I am not too worried about performance yet. Of course that’s a good idea to live by anyway: only worry about performance when it becomes an issue.

We can then generate the ngrams for each license and check each one for uniqueness against every other license. If no matches are found then woohoo, we have a gram that uniquely identifies the license.
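
Roughly speaking the check looks like the sketch below, reusing the find_ngrams helper from above; it is my approximation of what the script does rather than the script itself, and licenses follows the same shortname/text shape as the earlier sketches.

def unique_ngrams(licenses, min_size=2, max_size=10):
    # generate every ngram for every licence in the given size range
    per_license = {}
    for lic in licenses:
        words = lic['text'].lower().split()
        grams = set()
        for size in range(min_size, max_size + 1):
            grams.update(find_ngrams(words, size))
        per_license[lic['shortname']] = grams

    # keep only the grams that appear in exactly one licence
    unique = {}
    for name, grams in per_license.items():
        others = set()
        for other, other_grams in per_license.items():
            if other != name:
                others.update(other_grams)
        unique[name] = grams - others
    return unique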

Some simple but stupid code was written (parse2.py) which does exactly this. Turns out that the language used in licenses is a lot more distinctive than I first thought,

0BSD 188
AAL 1985
Abstyles 437
Adobe-2006 1234
Adobe-Glyph 1620
ADSL 608
AFL-1.1 555
AFL-1.2 251
AFL-2.0 69
AFL-2.1 67
AFL-3.0 452
Afmparse 959
AGPL-1.0 2212
...truncated...

For the BSD Zero Clause License there are apparently 188 unique ngrams of between 2 and 10 words. For the Affero General Public License v1.0 there are a whopping 2212! The numbers were so high that I changed the ngram range to 5-10. This dropped the numbers found by about 25%, which seems about right as you would expect the most unique combinations of words to exist at the upper end of the ngram range.

One problem I noticed with this is that a lot of the ngrams are based on names that exist within the sample licenses that SPDX has. For example BSD Zero Clause has the name “Rob Landley”, which produces a lot of ngrams containing this name. These are indeed unique, but useless unless we happen to be checking code that Rob has written.

However the performance issue that I was worried about before popped up: it was taking a long time to process. Not surprising, considering the actual implementation consists of multiple nested loops. With the encouraging result that there is a lot of uniqueness for most licenses, I tried just searching for ngrams 4-5 words long to speed things up. Assuming we didn’t find 0 matches for any license then happy days, we could implement using what we have and everything should be fine. A few small tweaks later and I ran it again.

Doh! As it turns out some are not unique enough. The culprits,

Artistic-1.0
BSD-3-Clause
MIT-CMU
MPL-1.1
MPL-2.0-no-copyleft-exception
MPL-2.0

I modified the code to loop over just those ones with a more exhaustive search to find what worked. With ngrams of length 2-10 there was still not enough uniqueness, so I went all out: ngrams of length 2-35.

Artistic-1.0 120
BSD-3-Clause 21

This resolved the issue for Artistic-1.0 and BSD-3-Clause but we still have nothing for the following licenses,

MIT-CMU
MPL-1.1
MPL-2.0-no-copyleft-exception
MPL-2.0

Something is wrong here.

Turns out some of these are particularly troubling. The Mozilla Public License 1.1 (MPL-1.1 in the above), for example, seems to be embedded in the Netscape Public License v1.1, which of course means nothing about it is unique compared to the other. The others have a similar story. I tried searching just between those two licences with ngrams of length 2-100 to see if anything would turn up, which of course took a crazy amount of time to process. The result was that there was nothing that could be used keyword-wise to separate these licenses.

How about a hybrid approach?

If we cannot definitively say using keywords which license a file is under, then we can fall back to the vector space to rank the ones without unique keywords. To do so I flushed the results of the last run into the database file, using ngrams of length 7-8 for everything except Artistic-1.0 and BSD-3-Clause, which were checked with ngrams in the range 2-35. The run produced the following licenses with their number of unique ngrams,

0BSD 25
AAL 246
Abstyles 54
Adobe-2006 163
Adobe-Glyph 210
ADSL 78
...truncated...
Fair-Source-0.9 267

Most seem to have over 100 or so unique ngrams, which means they will probably work pretty well, and as expected the only exceptions are the MPL licenses, which have nothing unique about them compared to every other license.

I ended up cleaning up the code at this point and improving the loops, so thankfully it took only a few minutes to run. Some of the original runs were taking tens of minutes due to inefficient code, so this was a big win. The resulting file was on the heavy side though at 20 megabytes and crashed most editors when I tried to read it. To resolve this I truncated the list down to at most 50 unique ngrams per license to see how this would work in the real world. This file weighed in at a much more realistic 3 megabytes.

I then created attempt2.py, which used the new keyword database to guess which licenses applied. Firstly I tried it against searchcode-server itself, with the result that the LICENSE file found was indeed the Fair Source License. I then tried it against GCC. It’s interesting to note that this resulted in its COPYING file being marked as containing both GPL-2.0 and LGPL-2.1. At first I thought this might have been a bug in my logic, but it seems it was actually correct. The offending ngram used to match was

"street, fifth floor, boston, ma 02110-1301 usa"

which is supposed to be unique to LGPL-2.1 but is included here as well. Since we actually have 813 (but truncated to 50) ngrams for each license, I figured we might as well check whether MOST (70%) of the keywords are present, and if so mark it as a match, otherwise ignore it. This resulted in the following for GCC

COPYING GPL-2.0
COPYING.LIB GPL-2.0
COPYING.RUNTIME GPL-2.0
COPYING3 GPL-3.0
COPYING3.LIB GPL-3.0

Which is correct. In fact further tests on various other repositories worked until I hit react-armory, which is under the BSD Clear License. Turns out most of the ngrams generated for it actually involve the project name, meaning they are useless. Annoyingly the ngram of length 3, “BSD Clear License”, which is unique, was missed because the parser was set to look only at ngrams of length 7-8. Urgh.

Thinking about this for a bit, it totally makes sense to look for ngrams from 2-8 even when we are truncating to the top 50. The smaller ones are usually going to be high-value, and there shouldn’t be too many of them. The only problem is the increased processing time to walk all of the combinations of things like [‘the’, ‘license’] and the like. At the very least we should include trigrams (ngrams of length 3), since that solves this specific issue and should produce unique matches for a lot of licenses. Modifying the parse2.py script to take in a defined range of lengths was easy enough; I then let it run to produce the new database to work with and tried everything again. It took longer than before, but not so much that it was annoying.

Due to how painful this was getting to test manually I started to build a small test harness (test_attempt2.py) which I could eyeball to see if I was getting closer to a result I wanted without regressions. Thankfully it was working perfectly for the tests I was trying.

At this point I tried the same technique across all the files in the repositories. Using the previous example of searchcode-server, I would have expected that the cache.js file would be marked as MIT licensed as before. This however was not the case. Because the keyword match script works on a sliding scale, where it needs a reasonable number of matches to be considered positive, it was missing this one as there were only a few matches.
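
For reference, the keyword check described above boils down to something like this sketch. The 70% threshold comes from earlier, while the shape of the keyword database here (a dict of licence short name to its unique ngrams stored as plain strings) is my assumption.

def keyword_guess(content, keyword_database, threshold=0.70):
    # a licence only counts as a match when most of its unique ngrams appear
    content = content.lower()
    matches = []
    for shortname, ngrams in keyword_database.items():
        if not ngrams:
            continue
        found = sum(1 for gram in ngrams if gram in content)
        if float(found) / len(ngrams) >= threshold:
            matches.append(shortname)
    return matches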

I decided at this point to integrate the vector space back in. If a file was marked with any license, we would then confirm it using the vector space, and if that had a high degree of confidence in the matching license then the file would be marked with it. After adding this logic I managed to get the following output,

Bens-MBP:license boyter$ python attempt2.py
0.988840442861 Fair-Source-0.9 /searchcode-server/LICENSE.txt
0.442713859889 MIT /searchcode-server/src/main/resources/public/js/cache.js

Exactly what I was looking for. All of the tests I had previously written also passed with this logic. With a simple change to check only the top of the file, where the header would be, by cutting the string down to the length of the license, I was able to improve the result considerably,

Bens-MBP:license boyter$ python attempt2.py
0.988840442861 Fair-Source-0.9 /searchcode-server/LICENSE.txt
0.985719098155 MIT /searchcode-server/src/main/resources/public/js/cache.js

That is a seriously cool result. Not only was the code able to identify the licences, it did so with a very high percentage of confidence to boot.
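
Put together, the logic ends up looking something like the sketch below: keyword matches nominate candidate licences, the text is cut down to the length of each candidate so only the header region is compared, and the vector space score decides whether to keep it. It reuses keyword_guess and similarity from the earlier sketches; the 0.85 confidence cut-off and the dict of licence text keyed by short name are assumptions.

def confirm_licenses(content, licenses_by_name, keyword_database, cutoff=0.85):
    confirmed = []
    for shortname in keyword_guess(content, keyword_database):
        license_text = licenses_by_name[shortname]
        header = content[:len(license_text)]   # only compare the header region
        score = similarity(header, license_text)
        if score >= cutoff:
            confirmed.append((score, shortname))
    return sorted(confirmed, reverse=True)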

I tried running it over a collection of projects I had checked out including GCC, Linux and WordPress

 0.993261478239 BSD-3-Clause-Clear /armory-react/LICENSE
 0.999679612780 GPL-3.0 /decodingcaptchas/LICENSE
 0.980796066053 OFL-1.1 /decodingcaptchas/lib/font/source-sans-pro/LICENSE
 0.999692799354 GPL-3.0 /freemoz/LICENSE
 0.999792347852 GPL-2.0 /gcc/COPYING
 0.999922926136 LGPL-2.1 /gcc/COPYING.LIB
 0.999679612780 GPL-3.0 /gcc/COPYING3
 0.999902746595 LGPL-3.0 /gcc/COPYING3.LIB
 0.999792347852 GPL-2.0 /gcc/gcc/COPYING
 0.999932589474 LGPL-2.1 /gcc/gcc/COPYING.LIB
 0.999679612780 GPL-3.0 /gcc/gcc/COPYING3
 0.999902746595 LGPL-3.0 /gcc/gcc/COPYING3.LIB
 0.858896713823 GPL-2.0 /gcc/gcc/df-core.c
 0.999532020106 GFDL-1.3 /gcc/gcc/ada/doc/share/gnu_free_documentation_license.rst
 0.948529328513 bzip2-1.0.6 /gcc/gcc/testsuite/gcc.c-torture/execute/pr20527-1.c
 0.994200247145 bzip2-1.0.6 /gcc/gcc/testsuite/gcc.dg/params/LICENSE
 0.999792347852 GPL-2.0 /gcc/include/COPYING
 0.999679612780 GPL-3.0 /gcc/include/COPYING3
 0.889767272339 bzip2-1.0.6 /gcc/libbacktrace/alloc.c
 0.899838703127 bzip2-1.0.6 /gcc/libbacktrace/atomic.c
 0.883378984629 bzip2-1.0.6 /gcc/libbacktrace/backtrace.c
 0.885232659466 bzip2-1.0.6 /gcc/libbacktrace/btest.c
 0.866517942966 bzip2-1.0.6 /gcc/libbacktrace/dwarf.c
 0.881126480326 bzip2-1.0.6 /gcc/libbacktrace/elf.c
 0.876148037216 bzip2-1.0.6 /gcc/libbacktrace/fileline.c
 0.862314849462 bzip2-1.0.6 /gcc/libbacktrace/mmap.c
 0.876559695369 bzip2-1.0.6 /gcc/libbacktrace/mmapio.c
 0.903997364404 bzip2-1.0.6 /gcc/libbacktrace/nounwind.c
 0.880387358104 bzip2-1.0.6 /gcc/libbacktrace/pecoff.c
 0.872402129061 bzip2-1.0.6 /gcc/libbacktrace/posix.c
 0.881550515104 bzip2-1.0.6 /gcc/libbacktrace/print.c
 0.878971997218 bzip2-1.0.6 /gcc/libbacktrace/read.c
 0.880857671076 bzip2-1.0.6 /gcc/libbacktrace/simple.c
 0.897738871764 bzip2-1.0.6 /gcc/libbacktrace/sort.c
 0.890041393843 bzip2-1.0.6 /gcc/libbacktrace/state.c
 0.874782494001 bzip2-1.0.6 /gcc/libbacktrace/stest.c
 0.897023027383 bzip2-1.0.6 /gcc/libbacktrace/unknown.c
 0.879821572348 MITNFA /gcc/libffi/src/mips/ffi.c
 0.997055163924 LGPL-2.1 /gcc/libiberty/copying-lib.texi
 0.999932589474 LGPL-2.1 /gcc/libiberty/COPYING.LIB
 0.899167675944 Intel /gcc/liboffloadmic/runtime/liboffload_error.c
 0.889543866358 Intel /gcc/liboffloadmic/runtime/liboffload_msg.c
 0.887357763796 Intel /gcc/liboffloadmic/runtime/orsl-lite/lib/orsl-lite.c
 0.999932589474 LGPL-2.1 /gcc/libquadmath/COPYING.LIB
 0.857337014417 NCSA /gcc/libsanitizer/LICENSE.TXT
 1.000000000000 BSL-1.0 /gcc/zlib/contrib/dotzlib/LICENSE_1_0.txt
 0.998470560414 GPL-2.0 /linux/COPYING
 0.999831352903 LGPL-2.0 /linux/arch/sparc/lib/COPYING.LIB
 0.997101938433 GPL-2.0 /linux/Documentation/networking/LICENSE.qlcnic
 0.997101938433 GPL-2.0 /linux/Documentation/networking/LICENSE.qlge
 0.997004626197 GPL-2.0 /linux/Documentation/scsi/LICENSE.qla2xxx
 0.997004626197 GPL-2.0 /linux/Documentation/scsi/LICENSE.qla4xxx
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c3xxx/adf_c3xxx_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c3xxx/adf_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c3xxxvf/adf_c3xxxvf_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c3xxxvf/adf_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c62x/adf_c62x_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c62x/adf_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c62xvf/adf_c62xvf_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c62xvf/adf_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_accel_engine.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_admin.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_aer.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_cfg.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_ctl_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_dev_mgr.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_hw_arbiter.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_init.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_isr.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_pf2vf_msg.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_sriov.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_transport.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_transport_debug.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_vf2pf_msg.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_vf_isr.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/qat_algs.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/qat_asym_algs.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/qat_crypto.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/qat_hal.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/qat_uclo.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_dh895xcc/adf_dh895xcc_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_dh895xcc/adf_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_dh895xccvf/adf_dh895xccvf_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_dh895xccvf/adf_drv.c
 0.998958351972 GPL-2.0 /linux/drivers/staging/rtl8192e/license
 0.999594583136 GPL-2.0 /linux/drivers/staging/rtl8192u/copying
 0.999792347852 GPL-2.0 /linux/tools/usb/usbip/COPYING
 0.952930843666 Sleepycat /pathos/dill-0.2.1/LICENSE
 0.952930843666 Sleepycat /pathos/pathos-master/LICENSE
 0.952930843666 Sleepycat /pathos/pox-0.2/LICENSE
 0.987991559152 MIT /phonecat/LICENSE
 1.000000000000 Apache-2.0 /python-goose/LICENSE.txt
 1.000000000000 Apache-2.0 /searchcode/searchcode/searchcode/static/admin/fonts/LICENSE.txt
 0.987991559152 MIT /searchcode/searchcode/searchcode/static/admin/js/vendor/xregexp/LICENSE-XREGEXP.txt
 0.988840442861 Fair-Source-0.9 /searchcode-server/LICENSE.txt
 0.985719098155 MIT /searchcode-server/src/main/resources/public/js/cache.js
 0.985719098155 MIT /searchcode-server/target/classes/public/js/cache.js
 0.999268167835 LGPL-2.1 /seo/vendor/videlalvaro/php-amqplib/LICENSE
 0.999692799354 GPL-3.0 /SingleBugs/LICENSE
 0.999692799354 GPL-3.0 /testing/LICENSE
 0.997865652460 GPL-2.0 /wordpress/license.txt
 0.999792346238 GPL-2.0 /wordpress/wp-content/themes/twentyfourteen/genericons/LICENSE.txt
 0.999792346238 GPL-2.0 /wordpress/wp-content/themes/twentythirteen/fonts/LICENSE.txt
 0.999792346238 GPL-2.0 /wordpress/wp-includes/js/plupload/license.txt
 0.999932589474 LGPL-2.1 /wordpress/wp-includes/js/tinymce/license.txt

Rather cool. Lots of licenses identified with quite a lot of confidence.

There is however one issue I am overlooking. Take for example a project which includes another project in a sub directory. What license should be displayed for the files that are in the latter? In this case I would expect that they are marked with both. It would be very nice to see this information.

One final issue is what to do when there are multiple license files in the root directory. GCC is an example of this and it has 4 licenses defined in the root. The idea being you take which one is most applicable to your project.

0.999792347852 GPL-2.0 /gcc/COPYING
0.999922926136 LGPL-2.1 /gcc/COPYING.LIB
0.99967961278 GPL-3.0 /gcc/COPYING3
0.999902746595 LGPL-3.0 /gcc/COPYING3.LIB

Looking at the SPDX specification the rule is to define them as being under all. This is the exact text regarding this,

Representing Multiple Licenses

Multiple licenses can be represented using a SPDX license expression as defined in Appendix IV. A set of licenses must be enclosed in parentheses (this is a convention for SPDX expressions). As further described there:

When there is a choice between licenses (“disjunctive license”), they should be separated with “OR”. If presented with a choice between two or more licenses, use the disjunctive binary “OR” operator to construct a new license expression.
Similarly when multiple licenses need to be simultaneously applied (“conjunctive license”), they should be separated with “AND”. If required to simultaneously comply with two or more licenses, use the conjunctive binary “AND” operator to construct a new license expression.
In some cases, a set of license terms apply except under special circumstances, in this case, use the “WITH” operator followed by one of the recognized exception identifiers.
Sometimes a set of license terms apply except under special circumstances. In this case, use the binary “WITH” operator to construct a new license expression to represent the special exception situation.
Examples:

SPDX-License-Identifier: (GPL-2.0 OR MIT)
SPDX-License-Identifier: (LGPL-2.1 AND BSD-2-CLAUSE)
SPDX-License-Identifier: (GPL-2.0+ WITH Bison-exception-2.2)

This was taken from https://spdx.org/spdx-specification-21-web-version Appendix V: Using SPDX short identifiers in Source Files

Fair enough. We then need to write one last version of our license guesser which will use everything done above BUT support multiple licenses, look for license files inside sub-directories, and mark anything below them as belonging to that license, unless of course there is another license file or it is overridden by a header. We are not trying to keep track of the WITH or AND logic described above, as I just want to mark each file with its potential licenses. As such, just marking it so is good enough.

For this I simply copied the existing attempt2.py and created attempt3.py, which should hopefully be the last version. What we now need to do is walk the tree, and every time we encounter a license file keep track of the license it declares, marking any file that is a child of that directory as such. When we leave the directory we pop the license off.
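
In sketch form the walk looks roughly like this, with identify_license standing in for the matching logic built above and the candidate file names again being my assumption:

import os

LICENSE_FILES = ('license', 'licence', 'copying')

def walk_licenses(root, identify_license):
    results = {}
    stack = []  # (directory, [licences]) entries currently in force

    for directory, _, files in os.walk(root):
        # pop any licence whose directory we have now left
        while stack and not (directory == stack[-1][0] or
                             directory.startswith(stack[-1][0] + os.sep)):
            stack.pop()

        # a licence file in this directory applies to everything below it
        for name in files:
            if name.lower().split('.')[0] in LICENSE_FILES:
                with open(os.path.join(directory, name), errors='ignore') as f:
                    stack.append((directory, identify_license(f.read())))

        current = [lic for _, lics in stack for lic in lics]
        for name in files:
            results[os.path.join(directory, name)] = list(current)
    return results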

Thankfully this was a very easy change to implement. I used a project which I knew already had some mixed licenses and ran the script over it.

 >>>>>>> /decodingcaptchas/LICENSE GPL-3.0
 >>>>>>> /decodingcaptchas/LICENSE GPL-3.0
 >>>>>>> /decodingcaptchas/LICENSE GPL-2.0
 >>>>>>> /decodingcaptchas/LICENSE GPL-2.0
 /decodingcaptchas/Gruntfile.js GPL-2.0 GPL-3.0
 /decodingcaptchas/code/crack.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/crack_test.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/geticons.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/step1.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/step2.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/step3.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/step4.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/style.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/reveal.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/print/paper.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/print/pdf.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/beige.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/black.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/blood.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/league.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/moon.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/night.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/serif.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/simple.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/sky.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/solarized.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/white.css GPL-2.0 GPL-3.0
 /decodingcaptchas/js/reveal.js GPL-2.0 GPL-3.0
 >>>>>>> /decodingcaptchas/lib/LICENSE Fair-Source-0.9
 /decodingcaptchas/lib/css/zenburn.css Fair-Source-0.9 GPL-2.0 GPL-3.0
 /decodingcaptchas/lib/font/league-gothic/league-gothic.css Fair-Source-0.9 GPL-2.0 GPL-3.0
 >>>>>>> /decodingcaptchas/lib/font/source-sans-pro/LICENSE OFL-1.1
 >>>>>>> /decodingcaptchas/lib/font/source-sans-pro/LICENSE OFL-1.1
 /decodingcaptchas/lib/font/source-sans-pro/source-sans-pro.css Fair-Source-0.9 GPL-2.0 GPL-3.0 OFL-1.1
 /decodingcaptchas/lib/js/classList.js Fair-Source-0.9 GPL-2.0 GPL-3.0
 /decodingcaptchas/lib/js/head.min.js Fair-Source-0.9 GPL-2.0 GPL-3.0
 /decodingcaptchas/lib/js/html5shiv.js Fair-Source-0.9 GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/highlight/highlight.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/leap/leap.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/markdown/markdown.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/markdown/marked.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/math/math.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/multiplex/client.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/multiplex/index.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/multiplex/master.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/notes/notes.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/notes-server/client.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/notes-server/index.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/print-pdf/print-pdf.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/remotes/remotes.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/search/search.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/zoom-js/zoom.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/qunit-1.12.0.css GPL-2.0 GPL-3.0
 /decodingcaptchas/test/qunit-1.12.0.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/test-markdown-element-attributes.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/test-markdown-slide-attributes.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/test-markdown.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/test-pdf.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/test.js GPL-2.0 GPL-3.0

What this shows is each file along with each license that the program believes it to belong to. Files prefixed with >>>>>>> indicate a license file which changes the licenses of the files below it.

The root of the project is dual licensed under GPL-2.0 and GPL-3.0 which makes everything have at least those two licenses. There are however also some fonts under the SIL Open Font License 1.1 which were also picked up. To make things a little harder I added the Fair Source license under the lib directory (just as a test) to see if this would be reflected correctly.

I am pretty happy with the result. So I took the code and ran it against GCC and the Linux Kernel. The output for both is too large to post on this blog, but you can view the GCC one as a gist https://gist.github.com/boyter/ee534011cf512ba7e2992ecdef87523c

The last part was to add back the file check so that we can identify license headers for each file in the subfolders. Thankfully this was just a short tweak. Of course the generation is now much slower due to the additional processing, but it seemed to work pretty well.

Well that about covers that. Everything seems to work as I would expect and it would be reasonably easy at this point to take the code and produce a full blown SPDX parser. I will leave that as an exercise for the reader (unless it turns out I need one at a later date).

The next step for me is to integrate this into searchcode server, which is just porting the above into Java and then adding this information as a facet for searching. This is something I may write about later. If you are interested in a code search application though please check it out!

I have uploaded all the code into the following github repository released under the MIT License. Note that the database will also be released under Creative Commons inside the searchcode-server project itself and it is also something I will be keeping up to date so it might be a better source if you want to do any license matching.

Why is this GoLang solution faster than the equivalent Java Solution?

The below is a small story about how I took a program from a runtime of hours to minutes, to tens of seconds, to seconds. More for my own personal amusement than anything else.

At work there is a tradition of a Friday quiz being posted by the winner of the previous week. I missed out on the most recent one due to having to duck off early to do my tax but the problem was rather an interesting one.

The challenge itself is not as simple as you would initially think and is taken from a 2015 IBM Ponder This https://www.research.ibm.com/haifa/ponderthis/challenges/May2015.html

Three people are playing the following betting game.
Every five minutes, a turn takes place in which a random player rests and the other two bet
against one another with all of their money.
The player with the smaller amount of money always wins,
doubling his money by taking it from the loser.
For example, if the initial amounts of money are 1, 4, and 6,
then the result of the first turn can be either
2,3,6 (1 wins against 4);
1,8,2 (4 wins against 6); or
2,4,5 (1 wins against 6).
If two players with the same amount of money play against one another,
the game immediately ends for all three players.
Find initial amounts of money for the three players, where none of the three has more than 255,
and in such a way that the game cannot end in less than one hour. (So at least 12 turns)
In the example above (1,4,6), there is no way to end the game in less than 15 minutes.
All numbers must be positive integers.

Only one person managed to find an answer, let’s call him Josh (because that’s his name), having spent a few hours writing up a solution using his favourite programming language Go. Come Monday morning I arrived, looked at the quiz and became intrigued. Could I write a version that would outperform his? After all, Go is a pretty performant language, but I suspected that he may have missed some easy optimisations, and if I picked something equally fast I should be able to at least equal it.

I took a copy of his code https://gist.github.com/walesey/e2427c28a859c4f7bc920c9af2858492 (since modified to be much faster), which had a runtime of 1 minute 40 seconds (on my laptop), and started work.

Looking at the problem itself we have a few pieces of information we can use. One is that we don’t ever need to calculate the 12th turn. If the game makes the 11th turn and still continues then it made 12, so that saves us some calculations. Another is that if any money amounts are the same we can stop the game instantly, and more importantly not even add those starting amounts to the loop.

Given that the existing solution was written in Go it seemed insane to some that I started writing my solution, at least initially, in Python. This however is not as crazy as it seems. Python being highly malleable allows for rapid iteration, trying out a few things before moving over to another language.

The first thing to note is that you need to generate a tree of every possible combination that a game can take. Then you iterate over each one for the initial starting amounts of money to determine if the game ever ends, and if not, mark it as a game that does not finish. To generate the combinations I went with a fairly simple recursive strategy,


DESIRED_TURNS = 11  # the 12th turn never needs to be played out (see above)

# Calculate all the possible sequences for who misses out for each turn
def calc_events(current=[], turn=0):
    if turn == DESIRED_TURNS:
        return [current]

    one = list(current)
    one.append(0)

    two = list(current)
    two.append(1)

    three = list(current)
    three.append(2)

    turn += 1
    path1 = calc_events(current=one, turn=turn)
    path2 = calc_events(current=two, turn=turn)
    path3 = calc_events(current=three, turn=turn)

    return path1 + path2 + path3

The result of running the above is an array of arrays containing each situation where someone sits out a turn. It is important to note that this produces a list that varies from the back (big-endian, so to speak), as this will be a very important consideration later.

The result looks something like this (truncated to just 4 results),


[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

Given the above I coded up a simple solution which simply looped through the combinations. Here it is in pseudocode.


for player1 money:
  for player2 money:
    for player3 money:
      for game in games:
        for turn in game:
          result = play turn using money amounts
        if result:
          print player1, player2, player3

The above is about as bad as it gets for an algorithm as we have 5 nested loops. As expected the runtime performance was horrible. In fact I added a progress bar which simply calculated how far into the first loop the application had reached. Using Python I worked out that it was going to take several hours to get a result using the above. Not good enough.

A few calculations and I realised that I was asking my poor program to calculate something like 50 billion games. Clearly I needed to reduce the number of games and speed up the processing as much as possible. The first change I made was to only calculate the game turn events once and keep the result. I had also written the code to be as readable as possible, which is how you should write anything, but as this can be a performance issue I also removed the following function by inlining it.


# Play a single turn: equal amounts end the game, otherwise the smaller
# amount wins and doubles at the expense of the larger.
def turn(p1, p2):
    if p1 == p2:
        return False, p1, p2

    if p1 > p2:
        p1 = p1 - p2
        p2 = p2 + p2
    else:
        p2 = p2 - p1
        p1 = p1 + p1
    return True, p1, p2

The next thing I did was generate all the permutations of money amounts to play games with into a single list. The last thing I did was switch from Python to PyPy, which with its JIT should speed up the loops considerably. The result of all this can be found here https://gist.github.com/boyter/cab749f4713201f5b409c5b1353fc36c and the runtime using PyPy dropped to ~8 minutes.
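
As a sketch (not the actual gist), generating those starting amounts is a short itertools expression, and since any two equal starting amounts end the game immediately those triples can be skipped up front:

import itertools

# all starting money triples from 1 to 255, skipping any with equal amounts
amounts = [
    (a, b, c)
    for a, b, c in itertools.product(range(1, 256), repeat=3)
    if a != b and b != c and a != c
]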

8 minutes was about 5 times slower than the Go program at this point, which is pretty good for a dynamic language like Python, especially considering that my implementation was single threaded whereas the Go one was using as many cores as it could get. My next step was to implement the same program in a faster language. The only faster languages I have any experience in are C# and Java. Since I already had Java set up I went with that. I will also admit it was about proving a point: Go may be fast, but Java should be able to equal it for most tasks.

At this point however I mentioned to Josh that his Go program had some inefficiencies, the big one being that he was calculating the 12th turn needlessly. I then modified his program and with some other changes reduced the Go program’s runtime down to ~40 seconds. I was starting to get worried at this point, as I was aiming to beat the Go program’s performance by 5%. No idea why I picked this number, but it seemed reasonable if I was smart with the implementation.

At first I ported the Python program in its original readable reusable form to Java and ran it. The runtime was ~7 minutes. I then inlined the same functions and converted it over to use parallel streams. This time the runtime was about 90 seconds. This would have been fast enough had I not mentioned to Josh how he could improve his code. I had shifted the goalposts on myself and had a new target now of ~40 seconds.

After doing a few things such as inlining where possible, changing to an enum and some other small tweaks I had a runtime of ~60 seconds. The big breakthrough came when, looking at the hot function, I realised that storing the game events in a List of arrays meant that we had a loop inside a loop. It was however possible to flatten this into a single array of integers and reset the game every 11 turns with a simple if check.
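
Here is a Python sketch of that flattening idea (the actual win was in the Java version, but the shape is the same): the per-game lists from calc_events are laid out in one flat list and the money is reset whenever the index rolls over to a new game.

TURNS = 11  # surviving 11 turns means the 12th happens, i.e. at least an hour

def cannot_end_within_an_hour(flat_events, start):
    # flat_events is every game's resting-player sequence concatenated together
    money = list(start)
    for i, resting in enumerate(flat_events):
        if i % TURNS == 0:          # a new game sequence starts: reset the money
            money = list(start)
        a, b = ((1, 2), (0, 2), (0, 1))[resting]
        if money[a] == money[b]:    # two equal amounts end the game early
            return False
        if money[a] < money[b]:
            money[a], money[b] = money[a] * 2, money[b] - money[a]
        else:
            money[a], money[b] = money[a] - money[b], money[b] * 2
    return True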

This was the breakthrough I needed. Suddenly the runtime dropped from ~60 seconds to about 23 seconds. I happily posted my success in the Slack channel with a link to the code and sat back.

The smile soon turned to a frown. Josh implemented the same logic in his Go program and it now ran in ~6 seconds. It was suddenly 5x faster. At this point I had a mild panic and started playing with bitwise operations and other micro-optimisations, before realising that no matter what I did he could simply implement the same change and get the same benefit, as we were both using roughly the same algorithm.

Something was clearly wrong. Either Go was suddenly 5x faster than Java at a basic for loop and integer maths, OR I had a bug in my code somewhere which was making it slower. I looked. Josh looked. A few other people looked. Nobody could work out what the issue was. At this point I didn’t care about the runtime, I just wanted to know WHY it appeared to be running slower. I was so desperate for an answer I did what all programmers do at this point and outsourced it to the collective brain known as Stack Overflow http://stackoverflow.com/questions/43082115/why-is-this-golang-solution-faster-then-the-equivalent-java-solution

A few unconstructive comments came back, such as Java is slow (seriously, it’s not 1995 anymore guys, Java is fast) etc… Thankfully one brilliant person managed to point out what I had missed. It was not the loop itself, but the input. Remember how I said that the generation of the events being big-endian was important? Turns out the Go program had done the reverse and implemented it little-endian.

The thing about the core loop is that it has a bail-out condition: if two players have the same money amount we end the game and don’t process any further. The worst possible situation for the loop is to process almost every turn only to find out right at the end that the game ended. That is a lot of processing work. Ideally you want to find the failing conditions as soon as possible. By varying the games from the end I was forcing the Java program to process about 5x as many combinations as the Go program.

It just happened to be that Josh had picked a more optimal path through the games.

A simple reverse of the games (line 126 of the linked solution) and suddenly the Java program was running in ~6 seconds, the same time as the Go program. You can view the code here https://gist.github.com/boyter/42df7f203c0932e37980f7974c017ec5

For fun I tried running it on a 16-core VPS: it ran in ~2 seconds and maxed out all the cores, so it seems parallel streams do what you expect them to.

Interestingly, while Josh’s ordering was more optimal than mine, it’s probably still not the optimal path for this problem. There is bound to be a way to generate the games such that you hit the failing conditions as soon as possible, saving needless processing. There is probably a PhD thesis in there somewhere.

I figure this is probably as far as I want to take this. I did play around with bitwise operations and loop un-rolling but the time didn’t change that much.

I certainly had fun with the implementation and working things out. Some may argue that optimising for a micro benchmark such as this is a waste of time, and generally they would be right. That said there are occasions where you really do need to optimise the hell out of something, say a diff algorithm or some such. In any case what developer does not dream of saving the day with some hand unrolled loop optimisation that saves the company millions and brings the developer the praise and respect of their peers!

EDIT – A rather smart person by the name of dietrichepp pointed out that there is a better algorithm: rather than brute-forcing the states, work backwards. You can read their comment on Hacker News and view their code in C++ on GitHub.

Setting up ConcourseCI 2.6.0 behind Nginx with Self Signed Certificates on Ubuntu 16.04

Concourse CI is a very nice continuous integration server.

However for installs there are a few gotchas you need to keep in mind. Mostly these relate to how TLS/SSL works.

The first is that while it is possible to run Concourse inside Docker, I found this caused a lot of issues with workers dying and not recovering. I would suggest installing the binaries on bare machines. When I moved from a Docker cluster using Amazon’s ECS to a single t2.large instance, not only were builds faster but it was a far more reliable solution.

I am also not going to automate this install, and will leave it as an exercise for you, the reader, to do yourself. I would suggest using Python Fabric, or something like Puppet, Ansible or SaltStack to achieve this.

Also keep in mind that with this install everything is running on a single instance. If you need to scale out this is not going to work, but as a way to get started quickly it works pretty well.

The prerequisites are that you have an Ubuntu instance running somewhere. If you want to run the fly execute command you will also need a valid domain name pointing at your machine. This is an annoying thing caused by Go when using SSL certs: it turns out you cannot set a hosts file entry and use that instead. You can in insecure non-SSL mode, but otherwise you cannot.

If you are using a virtual machine from DigitalOcean/AWS/Vultr or another provider you will need to add some swap space. I noticed a lot of issues when this was missing. You can do so by running the following commands, which will configure your server to have 4G of swap space,

sudo fallocate -l 4G /swapfile
 sudo chmod 600 /swapfile
 sudo mkswap /swapfile
 sudo swapon /swapfile
 echo "/swapfile none swap sw 0 0" | sudo tee -a /etc/fstab

We need to get the concourse binary and make it executable. For convenience, and to match the concourse documentation, let’s also rename it to concourse. To do so run the following command.

wget https://github.com/concourse/concourse/releases/download/v2.6.0/concourse_linux_amd64 && mv concourse_linux_amd64 concourse && chmod +x concourse

We now need to generate the keys that concourse requires.

mkdir keys
 cd keys

ssh-keygen -t rsa -f tsa_host_key -N '' && ssh-keygen -t rsa -f worker_key -N '' && ssh-keygen -t rsa -f session_signing_key -N '' && cp worker_key.pub authorized_worker_keys
 cd ..

The above commands will create a directory called keys and set up all of the keys that concourse 2.6.0 requires.

We can now create some helper scripts which we can use to run concourse easily.

pico concourse.sh

./concourse web \
 --basic-auth-username main \
 --basic-auth-password MySuperPassword \
 --session-signing-key ./keys/session_signing_key \
 --tsa-host-key ./keys/tsa_host_key \
 --tsa-authorized-keys ./keys/authorized_worker_keys \
 --external-url https://YOURDNSHERE/ \
 --postgres-data-source postgres://concourse:concourse@127.0.0.1/concourse

chmod +x concourse.sh

This script will start running concourse. Keep in mind that the username and password used here are for the main team and as such you should protect them, as they have the ability to create additional teams on your concourse instance.

pico worker.sh

./concourse worker \
 --work-dir /opt/concourse/worker \
 --tsa-host 127.0.0.1 \
 --tsa-public-key ./keys/tsa_host_key.pub \
 --tsa-worker-private-key ./keys/worker_key

chmod +x worker.sh

This script will spin up a worker which will communicate with the main concourse instance and do all the building. It can be useful to lower the priority of this command using nice and ionice if you are running on a single core machine.

Now we need to install all of the postgresql packages required,

apt-get update && apt-get install -y postgresql postgresql-contrib

Once this is done we can create the database to be used

sudo -u postgres createdb concourse

Then login to postgresql and create a user to connect to the database

sudo -u postgres psql
 CREATE USER concourse WITH PASSWORD 'concourse';
 GRANT ALL PRIVILEGES ON DATABASE "concourse" TO concourse;
 \du

We also need to edit the pg_hba file to allow us to make the connection,

sudo pico /etc/postgresql/9.5/main/pg_hba.conf

Scroll down and look for the following line,

host all all 127.0.0.1/32 md5

and change the md5 on the end to trust

host all all 127.0.0.1/32 trust

Then save the file and restart postgresql

service postgresql restart

At this point everything we need to run concourse should be there. You will need to set up the concourse scripts we created earlier to run as a service, or just run them in a screen session if you are in a hurry.

What we want to do now is expose it to the big bad internet.

apt-get install nginx

Create a directory using either the domain name you want to use, a desired name or anything if you are going to connect to things using IP addresses.

mkdir -p /etc/nginx/ssl/mydesireddomain.com

Now we want to switch to the directory and set up the self signed TLS/SSL keys.

cd /etc/nginx/ssl/mydesireddomain.com
 openssl genrsa -des3 -out server.key 1024

Enter whatever you want for the passphrase but remember it!

openssl req -new -key server.key -out server.csr

Enter the passphrase you set before. The most important thing here is that when asked for the Common Name or FQDN you need to enter the desired domain name.

With that done we need to sign the key.

cp server.key server.key.org
 openssl rsa -in server.key.org -out server.key

Remember to enter the same passphrase as before. Finally, sign the key with an expiry of 9999 days.

openssl x509 -req -days 9999 -in server.csr -signkey server.key -out server.crt

Make a copy of the file server.crt which will be needed for the concourse fly tool to talk to the server if you are using self signed certs.

With that done let’s enable the site,

sudo nano /etc/nginx/sites-available/mydesireddomain.com

And enter in the following details,

upstream concourse_app_server {
 server localhost:8080;
 }

server {
 listen 80 default_server;
 rewrite ^ https://MYIPADDRESSORDOAMIN$request_uri? permanent;
 }

server {
 listen 443 default_server ssl http2;
 server_name mydesireddomain.com;

ssl on;
 ssl_certificate /etc/nginx/ssl/mydesireddomain.com/server.crt;
 ssl_certificate_key /etc/nginx/ssl/mydesireddomain.com/server.key;

location / {
 proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
 proxy_set_header Host $http_host;
 proxy_redirect off;
 proxy_pass http://concourse_app_server;
 }
 }

The above nginx config defines an upstream concourse server running on port 8080. It then defines a server listening on port 80 that redirects all traffic to the same server on HTTPS. The last server config defines our site, sets up the keys and forwards everything to the upstream concourse server.

We can now enable it,

sudo ln -s /etc/nginx/sites-available/mydesireddomain.com /etc/nginx/sites-enabled/mydesireddomain.com
 rm -f /etc/nginx/sites-available/default

service nginx restart

We need to remove the default file above because nginx will not allow you to have two default servers.
At this point everything should be working. You should now be able to connect to your concourse server like so,

fly --target myciserver login --team-name myteamname --concourse-url https://MYIPADDRESSORDOAMIN --ca-cert ca.crt

Where ca.crt is the copy of server.crt made earlier, placed wherever you are running fly. Everything at this point should work and you can browse to your concourse server.

If you are using the IP address to communicate to your concourse server you have the limitation that you will not be able to run fly execute to upload a task. You can get around this by using a real domain, or running your own DNS to resolve it correctly.

The only bit of homework at this point would be to configure the firewall to disable access to port 8080 so that everything must go through nginx. Enjoy!

Repository overview now in searchcode server

One feature that I have wanted for a long time in searchcode server was a page which would give an overview of a repository. I wanted the overview to give a very high level look at the languages used, the total number of files, estimated cost and who the best people to talk to would be.

One thing that occurred to me when I started work was that it would be nice to calculate a bus factor for the repository as well. After all we all know that project managers do like to know who are the most critical contributors to any project and what the risk is.

Below is what has been added into searchcode server.

The most interesting part of the above, in my opinion, is the overview blurb. It attempts to summarise all of the figures below it and let anyone know in plain English where the repository stands.

An example for searchcode server’s code itself,

“In this repository 2 committers have contributed to 228 files. The most important languages in this repository are Java, CSS and Freemarker Template. The project has a low bus factor of 2 and will be in trouble if Ben Boyter is hit by a bus. The average person who commits this project has ownership of 50% of files. The project relies on the following people; Ben Boyter, =.”

Certainly there is room for improvement on this page and I am hoping to add what I am calling signals logic to it. This would involve scanning the code to determine what languages, features and libraries are being used and add those to the report. The end goal would be to find for instance C# code using MySQL and ReactJS.

The last bit of news is that I am moving searchcode.com over to the same codebase. This should improve things in a few ways. The first being the improved performance from moving from Python to Java. It should also mean that I can focus on a single codebase.

Anyway you will be able to get the new repository overview in the next release of searchcode server 1.3.6 which will be released before the end of January 2017.

Sphinx Real Time Index How to Distribute and Hidden Gotcha

I have been working on real time indexes with Sphinx recently for the next version of searchcode.com and ran into a few things that were either difficult to search for or just not covered anywhere.

The first is how to implement a distributed search using real time indexes. It’s actually done the same way you would normally create an index. Say you had a single server with 4 index shards on it and you wanted to run queries against it. You could use the following,

index rt
{
    type = distributed
    local = rt1
    agent = localhost:9312:rt2
    agent = localhost:9312:rt3
    agent = localhost:9312:rt4
}

You would need to have each one of your indexes defined (only one is added here to keep the example short)

index rt1
{
    type = rt
    path = /usr/local/sphinx/data/rt1
    rt_field = title
    rt_field = content
    rt_attr_uint = gid
}

Using the above you would be able to search across all of the shards. The trick is knowing that to update you need to update each shard yourself. You cannot pass documents to the distributed index, but instead must make a separate update to each shard. Usually I split Sphinx shards based on a query like the following,

SELECT cast((select id from table order by 1 desc limit 1)/4 as UNSIGNED)*2, \
         cast((select id from table order by 1 desc limit 1)/4 as UNSIGNED)*3 \
         FROM table limit 1

Where the 4 is the number of shards and the multiplier selects the range boundary for each shard. It’s performant due to index use. However for RT indexes I suggest a simple modulus operator % against the ID column to pick the shard for each document, as it allows you to continue to scale out each shard equally. A rough sketch of this routing is below.
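As an illustration of that modulus routing, and assuming a SphinxQL listener on the default port 9306 with the rt1 to rt4 shards defined as above, something like the following would push each document to its shard (pymysql is used here only because SphinxQL speaks the MySQL protocol; the standard mysql client works just as well):

# Sketch only: route a document to one of four RT shards using id % 4.
# Assumes a SphinxQL (MySQL protocol) listener on 127.0.0.1:9306 and
# rt1..rt4 defined as in the configuration above.
import pymysql

SHARDS = 4


def insert_document(doc_id, title, content, gid):
    shard = "rt%d" % ((doc_id % SHARDS) + 1)  # rt1, rt2, rt3 or rt4
    # Sphinx ignores credentials, the user value is just to keep pymysql happy.
    connection = pymysql.connect(host="127.0.0.1", port=9306, user="root", autocommit=True)
    try:
        with connection.cursor() as cursor:
            cursor.execute(
                "INSERT INTO %s (id, title, content, gid) VALUES (%%s, %%s, %%s, %%s)" % shard,
                (doc_id, title, content, gid),
            )
    finally:
        connection.close()


if __name__ == "__main__":
    insert_document(42, "example title", "example content", 1)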

The second issue I ran into was that when defining the attributes and fields you must define all the fields before the uint attributes. The above examples work fine but the below is incorrect. I couldn’t find this mentioned anywhere in the documentation.

index rt
{
    type = rt
    path = /usr/local/sphinx/data/rt
    rt_attr_uint = gid # this should be below the rt_fields
    rt_field = title
    rt_field = content
}

Explaining VarnishHist – What Does it Tell Us

The varnishhist tool is one of the most underused Varnish tools that come with a standard Varnish install, probably because of how it appears at first glance.

In short, you want as many | symbols as possible and you want everything as far toward the left hand side as possible. The closer to the left, the faster the responses are, regardless of whether they are cached or not. The more | symbols, the more items were served from cache.

A small guide,

'|' is a cache HIT
'#' is a cache MISS
'n:m' in the top left corner is the vertical scale
'n = 2000' is the number of requests being displayed (from 1 to 2000)

The X-axis is the logarithmic time between the request coming from the kernel to Varnish and the response going from Varnish back to the kernel.

The times on the X-axis are as such,

1e1 = 10 sec
1e0 = 1 sec
1e-1 = 0.1 secs or 100 ms (milliseconds)
1e-2 = 0.01 secs or 10 ms
1e-3 = 0.001 secs or 1 ms or 1000 µs (microseconds)
1e-4 = 0.0001 secs or 0.1 ms or 100 µs
1e-5 = 0.00001 secs or 0.01 ms or 10 µs
1e-6 = 0.000001 secs or 0.001 ms or 1 µs or 1000 ns (nanoseconds)

Below is the varnishhist output for searchcode.com, showing that while most responses are served in about 100ms, not many are cached. This can mean one of a few things.

  • The responses are not cache-able and you need to adjust the back-end responses to have the correct headers (or override the settings with VCL config).
  • The cache timeout for the back-end responses isn’t high enough to ensure that later requests are served from cache.
  • There isn’t a large enough cache to hold all the responses (that’s the problem in this case).
1:20, n = 2000




                                  ##
                                  ##
                                  ##
                                  ##
                                  ##
                                  ##
                                  ##
                                 ###
                                 ####
            |                    ####
            |                    ####
            |                    ####
            |                    #####
            |                    #####
            |                    #####
            |                   #######
            |                   #######
            ||  |    #      #   ##########
+------+------+------+------+------+------+------+------+------
|1e-6  |1e-5  |1e-4  |1e-3  |1e-2  |1e-1  |1e0   |1e1   |1e2

MySQL Dump Without Impacting Queries

Posted more for my personal use (I have to look it up every time), but here is how to run a mysqldump without impacting performance on the box. It sets the ionice and nice values to be as low as possible (while still running), uses a single transaction, and ups the maximum packet size for MySQL.

ionice -c2 -n7 nice -n19 mysqldump -u root -p DATABASE --single-transaction --max_allowed_packet=512M > FILENAME

To all Companies Currently Recruiting

I am writing this on behalf of all developers/engineers out there. Please stop with the take home coding challenge questions. Really. Just stop it. They are a lazy and, frankly, unprofessional way of sorting the wheat from the chaff. Before closing your browser in disgust hear me out on this one and hopefully I can convince you of the error of your ways.

There is an alarming trend these days of companies issuing lengthy coding challenges during the hiring process in order to prove that the individual they are hiring knows their stuff. I totally understand why you might be doing this, but frankly it’s flawed. Let’s go through several reasons why.

1. It shows a lack of regard for the individual’s time. This is a serious red flag. After all, if you fail to respect my time why should I respect yours? Consider it this way: for every hour that your test takes you are taking away 1 hour of that person’s family/rest/sleep/eating/hobby time. What is your commitment in this situation? If you aren’t willing to invest the time in the hiring process then why should we? Sure, you need to review the code (and hopefully give feedback) but that usually takes less than 15 minutes (yes I have done this and you can do it very quickly). The time investment balance is horribly skewed in your favour, which gets the relationship off to a bad start. Let’s also consider those who do consulting or contracting on the side. If you ask them to do a 3 hour coding challenge then you are asking them to forgo somewhere in the region of $300-700. Even worse is that usually at the end of these tests all you get back is “Pass” or “Fail”. Even when you do give some feedback there is no way it pays back the amount of time that went in.

2. Long tests are irrelevant anyway. When you set someone up to solve a problem that takes several hours you are setting them up to fail. The test is usually set up such that you are looking for a specific answer and only that answer. It might be obvious to you that using the strategy pattern is the best way to solve your test, but what about those whose development backgrounds require business logic rules to be user editable and hence live in a database where they can be modified? You are also going to nitpick over every little detail in a classic case of bike shedding. “They didn’t clean up some spaces!” “They didn’t comment this method!” “They did comment this method!” Even being aware of this problem I still find myself falling into the trap of nitpicking over small stuff that usually doesn’t matter. You are going to be investing time in training them how you want things done anyway, so why expect them to know it without even working with you?

3. It’s going to drive away talented individuals. I am not counting myself in this category but I know of many talented devs who simply refuse to do any sort of test. Why? They have huge online portfolios of work they have previously done. Why do you even ask for our Github/Gitlab profiles if not to look at them? Surely 5 minutes of looking through would let you know if this person is a pretender or knows their stuff.

4. The hiring process is skewed towards the hirer. People looking for work are usually in one of two situations: either they have a role and are looking around, or they don’t have one. Let’s consider the risk for both groups.

For those with a job, they will need to resign from their current position to take an opportunity with you. There is usually a probation period following the hiring during which either party can say “No, this isn’t for me, let’s agree to be friends”. The hiree is the one taking on the greater risk in this situation. They have already left what was possibly a stable situation for an unstable one. The company takes on the risk of potentially investing in someone who then leaves. Realistically though, the company is in a position to hire, so remuneration is the least of their problems. For the hiree the potential is that they join and after a month get the boot. They are now without income and back to looking for a job, only now they are in the latter category.

For those without a job, the situation is that they might find employment, and in doing so pass on other opportunities, only to get the boot after a few months. Once again it’s skewed towards the company.

5. They are easily gamed. A quick check on Freelancer or other such sites shows that I can probably buy a solution to any coding challenge you have for less than $300. For a job with a steady pay-check, even if it’s only for a few months (before you find out I can’t code a damn), this is a no brainer, especially if I know my coding chops aren’t up to standard. But why pay at all? A lot of the solutions are posted on online forums and Github already. Heck, just ask your best mate who happens to know their stuff to do it for you. What’s the cost to the organisation here? Well, assuming you do hire me based on my fraudulent test, I could lurk in the company for months, either contributing nothing or perhaps causing all sorts of damage. At the very least you have paid several months’ salary.

A-hah you say! I can defeat what you have written above! I will just have another test during the interview process, I can hear you thinking. DING DING DING WINNER! However, if you are going to do that anyway, why not drop the initial challenge? After all, what are you gaining?

Here’s what I propose. Bring the person you want to interview in and have them actually code on a machine in front of you. Work through a few simple problems together. It doesn’t need to be complex. Ask them to write a function that reverses all the words in a string but not the string itself. Ask them how to find elements that exist in list A but not in list B. Ask them why they implemented certain patterns, how they decided on data types, why they did or didn’t comment a method. Let them know in advance that this is what is going to happen and what will be expected. You will learn far more in this short investment of time than with any coding challenge. If you want to get more in depth then why not offer to pay them for a single day to come in and actually work. The monetary cost will be less than making the wrong hire and you both get to decide if things are going to work out.
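For what it’s worth, both of those warm-up questions have short answers that leave plenty of room for discussion. Here is one way they might look in Python, taking the first question to mean reversing the letters of each word while keeping the word order (any language the candidate is comfortable in works just as well):

# Two of the warm-up questions above, sketched in Python.

def reverse_each_word(sentence):
    # Reverse the letters of every word but keep the word order.
    # "hello world" -> "olleh dlrow"
    return " ".join(word[::-1] for word in sentence.split(" "))


def in_a_not_in_b(a, b):
    # Elements that exist in list A but not in list B, order preserved.
    b_items = set(b)
    return [item for item in a if item not in b_items]


if __name__ == "__main__":
    print(reverse_each_word("hello world"))      # olleh dlrow
    print(in_a_not_in_b([1, 2, 3, 4], [2, 4]))   # [1, 3]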

The best interview process I have been through to date was actually the Microsoft one for an intern role many, many moons ago. It involved several hours of interviews with different individuals discussing different technology roles to try and ascertain the best fit.

When I walked out (I didn’t get the role BTW) I felt like a better person. Not only were the discussions interesting, I learnt a lot from those conducting them and I felt like my time was valued. They even offered to cover my travel expenses, which made me feel like they cared about my time investment. This is how the process is meant to work. It’s as much about the person being interviewed as it is about the company. Consider it an investment in your advertising budget if you are tracking the time (yes, it is an investment!) as a cost. Good interviews stick with people for a long time and GREAT ones make those people want to praise your company.

GPL Time-bomb – an interesting approach to #FOSS licensing

UPDATE: Following some feedback I am going to rename my usage of “Time-Bomb” due to the potential negative connotations of the words. I am going to call it “Eventually Open”. Also, a few other things need mentioning. I am not looking for code submissions back into the source at this time. This was a move to show that there are no back-doors in the code sending source code back to a master server.

About a week ago I released searchcode server under the fair source licence. From day one I had wanted to release it using some form of licence where the code was available, but I wanted to lock it down somewhat because frankly I do want to make some money out of my time investment. That’s not the whole story however. I did not want to create another “look but don’t touch” situation forever, and I certainly didn’t want searchcode to be constrained by a licence in the event that I die, lose interest or stop updating the code.

The result of this is that I have added what I am going to call a GPL Time-Bomb into the licensing of searchcode server. Here is how it works. After a specified period of time the current version of searchcode server can be re-licensed under the GPL v3. This is a shifting date, such that each new release extends its own time-bomb further into the future, but the dates for older releases stay fixed. The time-bomb for versions 1.2.3 and 1.2.4 takes place on the 27th of August 2019, at which point you can take the source under GPL 3.0. Assuming searchcode server 6.1.2 comes out at roughly the same time, its time-bomb will be set to the 27th of August 2022, but the 1.2.3 release will be unaffected.

In short, I have given myself a time limit of 3 years to make money out of the product, and if I am unable to, it is turned over to the world to use as they see fit. Even better, assuming searchcode server becomes a successful product I will be forced to continually improve and upgrade it if I want to keep a for-sale version without there being an equivalent FOSS version around (which in theory could be maintained by the community). Everyone wins from this arrangement, and I am not forced to rely on a support model to pay the bills, which frankly only works when you have a large sales team.

Here’s hoping this sort of licensing catches on, as there are so many products out there that could benefit from it. If they take off, the creators have an incentive to maintain and not milk their creation, and those that become abandoned end up available for public use, which I feel is a really fair way of licensing software.

Agree? Disagree? Email me or hit me up on twitter.