A quick comparison between different Go file walk implementations

What’s the fastest way to get the names of all files in a directory using Go? I had a feeling that the native walk might not be the fastest way to do it. A quick search showed that several projects claimed to be faster. Since the application I am currently working on needs a high performance scanner I thought I would try the main ones out.

Note that I have updated the code and the results based on feedback from Reddit. The first change is that I set it to just count the files rather than print the output, to avoid measuring output buffering. I did do this before but noticed that while running under hyperfine it made no difference; I updated it anyway to avoid it being called into question again. The second was based on feedback from the godirwalk author. Setting the “Unsorted” option to true manages to pull another ~150ms of speed out of the bag, which is perfect for me. Since the goroutine implementations have the same sorting issue (as far as I can see) it seemed fair to turn it on.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	count := 0
	filepath.Walk("./", func(root string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		count++
		return nil
	})
	fmt.Println(count)
}

package main

import (
	"fmt"
	"os"
	"sync/atomic"

	"github.com/MichaelTJones/walk"
)

func main() {
	var count int64
	walk.Walk("./", func(root string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		// walk invokes the callback from multiple goroutines, so count atomically
		atomic.AddInt64(&count, 1)
		return nil
	})
	fmt.Println(count)
}

package main

import (
	"fmt"
	"os"
	"sync/atomic"

	"github.com/iafan/cwalk"
)

func main() {
	var count int64
	cwalk.Walk("./", func(root string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		atomic.AddInt64(&count, 1)
		return nil
	})
	fmt.Println(count)
}

package main

import (
	"fmt"

	"github.com/karrick/godirwalk"
)

func main() {
	count := 0
	godirwalk.Walk("./", &godirwalk.Options{
		Unsorted: true,
		Callback: func(osPathname string, de *godirwalk.Dirent) error {
			count++
			return nil
		},
		ErrorCallback: func(osPathname string, err error) godirwalk.ErrorAction {
			return godirwalk.SkipNode
		},
	})
	fmt.Println(count)
}
And the results. All were run under WSL (Windows Subsystem for Linux) on a Surface Book 2 against a recent checkout of the Linux kernel, which contained 67359 files in the directory.

$ hyperfine './cwalk' && hyperfine './godirwalk' && hyperfine './nativewalk' && hyperfine './walk'
Benchmark #1: ./cwalk

  Time (mean ± σ):      1.812 s ±  0.059 s    [User: 368.4 ms, System: 6545.8 ms]

  Range (min … max):    1.753 s …  1.934 s

Benchmark #1: ./godirwalk

  Time (mean ± σ):     695.9 ms ±  16.7 ms    [User: 73.0 ms, System: 619.2 ms]

  Range (min … max):   671.2 ms … 725.6 ms

Benchmark #1: ./nativewalk

  Time (mean ± σ):      3.896 s ±  0.489 s    [User: 153.0 ms, System: 3757.4 ms]

  Range (min … max):    3.560 s …  5.034 s

Benchmark #1: ./walk

  Time (mean ± σ):      1.674 s ±  0.071 s    [User: 399.7 ms, System: 6383.3 ms]

  Range (min … max):    1.571 s …  1.769 s

For comparison, ripgrep, which is probably the fastest disk scanner around, comes in at ~600ms. That is not a fair comparison though, as it ignores certain directories, but it gives you an idea of the upper bound of useful performance.

Turns out that the native implementation that ships with Go is indeed the slowest. The fastest by a long shot is godirwalk, however. It is at least 2x as fast as the next quickest implementation. So if bleeding-edge performance matters it would seem that using godirwalk is the best option. If however you want a drop-in replacement for some additional speed I would suggest going with cwalk or walk. Of course if you aren’t scanning the Linux kernel it’s hard to go wrong with even the native implementation, which is generally fast enough for most cases.

Collection of my favorite optimization posts and articles

A collection of my favorite posts to read and re-read about optimizing code to an extreme. It is unlikely that I will ever need to go to the extremes that these very talented individuals do, but it’s nice to learn the techniques.

In no particular order.

Will attempt to keep this list up to date as I find other content that really impresses me.

Licensechecker. A command line application which identifies what software license things are under

A simple blog post here to introduce a new command line tool licensechecker (lc), which is similar in purpose to the library Licensee http://ben.balter.com/licensee/ which attempts to identify what software license(s) code is released under. lc itself is dual licensed under the MIT and Unlicense.

Licensechecker (lc) was designed to be run either on your command line or using your CI tool of choice and produce either CSV, JSON, SPDX, Summary or CLI tabular or progress output with the option to write results to disk. It is reasonably fast and is cross platform with binaries for x64 versions of Linux, macOS and Windows. The build process also ensures that it builds on ARM and i386.

You can find both the binaries and code on GitHub https://github.com/boyter/lc/

Some time ago I wrote a collection of Python scripts which, when tweaked, would scan a file directory and tell you what licenses the files within it were under http://www.boyter.org/2017/05/identify-software-licenses-python-vector-space-search-ngram-keywords/

Licensechecker has taken the ideas I toyed around with in Python and produced a much more battle-tested command line application written in Go which is ready for general use. I chose to write it in Go simply because at my current role there is a push to use Go and I had a lack of experience with it.

It is reasonably well tested (always room for improvement), and while there are some easy performance wins still to be had, it is reasonably fast.

Some sample outputs of what it can produce follow.

Example output of licensechecker running against itself in tabular format while ignoring the .git, licenses and vendor directories

$ lc -pbl .git,vendor,licenses -f tabular .
Directory            File                    License                            Confidence  Size
.                    .gitignore              (MIT OR Unlicense)                 100.00%     275B
.                    .travis.yml             (MIT OR Unlicense)                 100.00%     188B
.                    CODE_OF_CONDUCT.md      (MIT OR Unlicense)                 100.00%     3.1K
.                    CONTRIBUTING.md         (MIT OR Unlicense)                 100.00%     1.2K
.                    Gopkg.lock              (MIT OR Unlicense)                 100.00%     1.4K
.                    Gopkg.toml              (MIT OR Unlicense)                 100.00%     972B
.                    LICENSE                 Unlicense AND MIT                  94.83%      1.1K
.                    README.md               (MIT OR Unlicense)                 100.00%     7.5K
.                    UNLICENSE               MIT AND Unlicense                  95.16%      1.2K
.                    database_keywords.json  (MIT OR Unlicense)                 100.00%     3.6M
.                    main.go                 (MIT OR Unlicense)                 100.00%     3.4K
.                    what-we-look-at.md      (MIT OR Unlicense)                 100.00%     3.2K
examples/identifier  LICENSE                 GPL-3.0+ AND MIT                   95.40%      1K
examples/identifier  LICENSE2                MIT AND GPL-3.0+                   99.65%      35K
examples/identifier  has_identifier.py       (MIT OR GPL-3.0+) AND GPL-2.0      100.00%     409B
parsers              constants.go            (MIT OR Unlicense)                 100.00%     4.8M
parsers              formatter.go            (MIT OR Unlicense)                 100.00%     7.8K
parsers              formatter_test.go       (MIT OR Unlicense)                 100.00%     944B
parsers              guesser.go              (MIT OR Unlicense)                 100.00%     9.8K
parsers              guesser_test.go         (MIT OR Unlicense)                 100.00%     3.4K
parsers              helpers.go              (MIT OR Unlicense) AND Apache-2.0  100.00%     2.4K
parsers              helpers_test.go         (MIT OR Unlicense)                 100.00%     1.5K
parsers              structs.go              (MIT OR Unlicense)                 100.00%     679B
scripts              build_database.py       (MIT OR Unlicense)                 100.00%     4.6K
scripts              include.go              (MIT OR Unlicense)                 100.00%     951B

To write out the results to a CSV file

$ lc --format csv -output licences.csv --pathblacklist .git,licenses,vendor .

Or to a valid SPDX 2.1 file

$ lc -f spdx -o spdx_example.spdx --pbl .git,vendor,licenses -dn licensechecker -pn licensechecker .

You can specify multiple directories as additional arguments and all results will be merged into a single output

$ lc -f tabular ./examples/identifier ./scripts

You can also specify files and directories as additional arguments

$ lc -f tabular README.md LICENSE ./examples/identifier
Directory              File               License                        Confidence  Size
                       README.md          NOASSERTION                    100.00%     7.5K
                       LICENSE            MIT                            94.83%      1.1K
./examples/identifier  LICENSE            GPL-3.0+ AND MIT               95.40%      1K
./examples/identifier  LICENSE2           MIT AND GPL-3.0+               99.65%      35K
./examples/identifier  has_identifier.py  (MIT OR GPL-3.0+) AND GPL-2.0  100.00%     428B

What follows is a brief overview of how it currently works.

Licensechecker works by comparing files against the list of licenses supplied by SPDX.org, with the addition of the Fair Source License which I needed for my own purposes.

First the command line arguments are checked to see if they refer to a file or a folder. The difference between the two is that if it is a folder, the whole folder is first scanned to see if it can identify any files which indicate a license. This can be controlled using the argument licensefiles, which by default is set to look for filenames containing the string license, copying or readme. For example the following files would be identified as potentially containing a license.

LICENSE
LICENSE.md
COPYING
README.md

These are then taken as potential root licenses against which all other files will be marked.

If there are multiple root licenses then they are treated using OR, as though the project is dual licensed. The check for a root license happens in every folder.
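
To make that concrete, here is a rough sketch (not the actual lc source; the helper name and structure are mine) of how candidate license files might be picked out of a folder.

import os

LICENSE_HINTS = ('license', 'copying', 'readme')

def find_candidate_license_files(folder):
    # Return files whose names hint that they may contain a license
    candidates = []
    for name in os.listdir(folder):
        full = os.path.join(folder, name)
        if os.path.isfile(full) and any(h in name.lower() for h in LICENSE_HINTS):
            candidates.append(full)
    return candidates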

When a candidate for a license is found its contents are checked against a list of ngrams unique to each license. If any ngrams match then each matching license is checked using a vector space model to obtain a fuzzy match value for that license. If the value is over the confidence setting, which by default is 85%, then it is marked as a match.

If no license is found using the previous step then the file is checked against all licenses using fuzzy matching. This is because some licenses do not have any unique ngrams allowing identification. Fuzzy matching only looks at the top portion of the file, where license headers are expected to exist.
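
Sketched in Python (the real implementation is Go, and vector_space_similarity here is a placeholder for the fuzzy matcher), those two steps look roughly like this.

def guess_licenses(contents, licenses, confidence=0.85):
    # First pass: keep only licenses with at least one unique ngram present
    candidates = [lic for lic in licenses
                  if any(ngram in contents for ngram in lic['ngrams'])]
    # Some licenses have no unique ngrams, so fall back to checking everything
    if not candidates:
        candidates = licenses
    matches = []
    for lic in candidates:
        score = vector_space_similarity(contents, lic['text'])  # placeholder
        if score >= confidence:
            matches.append((score, lic['name']))
    return sorted(matches, reverse=True)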

At the end the matches, if any, are sorted by confidence and the most likely match is returned as the matching license. As such a license file is considered to contain only a single declared license.

Note that only the most recent root licenses are taken into account, so if a project with a root license of MIT has a sub folder with a root license of GPL-2.0+, files in that sub folder will be marked as being GPL-2.0+ and not MIT.

Individual files are scanned in the same way, but in addition are checked for any SPDX indicators such as,

SPDX-License-Identifier: GPL-3.0-only

These indicators take precedence over any fuzzy matching, and must match known SPDX licenses or they will be disregarded.
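
Detecting such an indicator is straightforward. The following is a minimal sketch (my own regex for illustration, not lc’s actual implementation) of pulling identifiers out of a file and validating them against known licenses.

import re

SPDX_PATTERN = re.compile(r'SPDX-License-Identifier:\s*([0-9A-Za-z.+-]+)')

def find_spdx_identifiers(contents, known_licenses):
    # Keep only identifiers that match a known SPDX license
    return [m for m in SPDX_PATTERN.findall(contents) if m in known_licenses]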

When finished, the license determined is based on the SPDX identifiers if present, fuzzy matching if over the confidence value, and then the root license(s). If multiple licenses match they are treated as an AND.

Take for example,

Directory              File               License                        Confidence  Size
./examples/identifier  has_identifier.py  (MIT OR GPL-3.0+) AND GPL-2.0  100.00%     428B

The root licenses were identified as being both MIT and GPL-3.0+, however inside the code itself there is a GPL-2.0 identifier. As such the license of the file is either MIT AND GPL-2.0, or GPL-3.0+ AND GPL-2.0. The indicator for a license like this is based on SPDX examples and is specified in the SPDX specification https://spdx.org/spdx-specification-21-web-version

Currently licensechecker (lc) does not indicate if a project may be in violation of license requirements. Assuming there is some sort of license compatibility chart in the future, this would be something to add at a later point.

There are a few known issues with the current version. Licenses are fuzzy matched, so it’s possible someone could fork an existing license and have it be a false positive match. The second issue is that license matches are based on licenses from SPDX and as such may miss some licenses.

If you have read this far, thanks for your time. I’d love you to try licensechecker and report back any bugs. I would also love to hear how you are using it and, with permission, put a link in the README with your testimonial. If you want to contribute, use GitHub issues. As with anything I put online I am happy to get constructive feedback.

Simply encrypt or decrypt a string using Boto3 Python and AWS KMS

Another one of those things I need to look up every now and then. Below is a snippet showing how to encrypt and decrypt a string using Python and KMS in AWS. The interesting thing is that you don’t need to supply the KMS key alias in the decryption portion. So long as whatever role or key you are using can access the key it should work. For the encryption you can supply either the full ARN of the key or the alias, so long as you prefix the alias with alias/.

import base64
import boto3

def encrypt(session, secret, alias):
    client = session.client('kms')
    ciphertext = client.encrypt(
        KeyId=alias,
        Plaintext=secret,
    )
    return base64.b64encode(ciphertext["CiphertextBlob"])

def decrypt(session, secret):
    client = session.client('kms')
    plaintext = client.decrypt(
        CiphertextBlob=base64.b64decode(secret),
    )
    return plaintext["Plaintext"]

session = boto3.session.Session()
print(encrypt(session, 'something', 'alias/MyKeyAlias'))
print(decrypt(session, 'AQECAINdoimaasydoasidDASD5asd45'))

searchcode plexus

Plexus “A combination of interlaced parts; a network.”

For a while I have been neglecting searchcode.com while working on searchcode server. This is of course not an ideal situation and as such I have started working on it again. The following is just a brief list of things I am considering, problems, issues etc…

So back when I first started working on searchcode I wrote it using PHP. For searchcode next (the current version) I rewrote it using Python and Django. I had intended to continue using this stack for the foreseeable future when I chose it. I quite like writing Python code and at the time the performance was acceptable. When I started writing searchcode server however I wanted to make it as easy to install as possible. The reason being that I considered it a competitive advantage to have users installing searchcode without any pain and be off and running. At the time I considered writing it in Go or Rust but had little experience with either language. I decided to pull on my sensible pants and chose Java, which is a language I had worked with before and which has excellent library support. With the 8th version Java had become considerably less painful to work with.

This of course means I had two codebases which do roughly the same thing: the Python/Django codebase and the Java one. I was porting code from Python into Java and then improving it where possible without backporting. At this point I can confidently say that the Java codebase is much better, with additional performance (not due to Java) and bug fixes. It’s also far more malleable, with the static typing making refactoring a breeze. As such I have started to look at moving searchcode.com over to use the same code that searchcode server uses.

This is not without its faults. There is a lot of code to port over and they both work in very different ways.

Take for example the indexing pipeline.

In searchcode.com the indexing pipeline is a bunch of cobbled-together Python scripts that do some analysis on code but mostly call out to external programs such as cloc and sloccount to determine what language is being used and to get other information. The current pointer to a repository is kept in the database and incremented every time the job is run so it picks up the next repository. There are a few instances of the program running and they run constantly in the background at the lowest nice value.

By contrast searchcode server is built around background jobs. There are multiple jobs that perform different tasks. One fires off every 10 minutes adding repositories to be indexed. Another picks up the repositories, checks them out and processes them before putting the resulting files on a queue. Another picks up items from that queue and then indexes them. They run at low thread priority with checks to ensure that they don’t impact search performance.
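
searchcode server itself is written in Java, but the shape of that pipeline is simple enough to sketch. The following Python is purely illustrative, and the helper functions (repositories_due_for_indexing, checkout_and_process, index) are placeholders rather than real searchcode code.

import queue
import threading
import time

repo_queue = queue.Queue()
file_queue = queue.Queue()

def schedule_repositories():
    # Fires every 10 minutes, queueing repositories that are due for indexing
    while True:
        for repo in repositories_due_for_indexing():  # placeholder
            repo_queue.put(repo)
        time.sleep(600)

def process_repositories():
    # Checks out each repository and puts the resulting files on the next queue
    while True:
        repo = repo_queue.get()
        for f in checkout_and_process(repo):  # placeholder
            file_queue.put(f)

def index_files():
    # Picks up items from the queue and indexes them
    while True:
        index(file_queue.get())  # placeholder

for job in (schedule_repositories, process_repositories, index_files):
    threading.Thread(target=job, daemon=True).start()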

To move from the former to the latter requires effectively a rewrite. This is because on searchcode.com the code is stored in a central database whereas searchcode server keeps a local copy at all times. Also searchcode server performs all the code analysis itself using its own database of languages and other tricks.

Recently I went through a large clean up of the searchcode server codebase. In doing so I kept in the back of my mind that I should consider making things such that I could move searchcode.com over to use it. As such I decoupled where appropriate, merged where it made sense and effectively laid the groundwork for the work to come. With that release pushed out I am now starting to look at once again rewriting a large chunk of searchcode.com such that it shares the same codebase.

So the things I want to improve as a list when doing this are the following.

Indexing speed
Currently the process takes literally weeks for a project to be indexed. I want to cut this down to as short a time as possible. This was due to a few reasons. The first was that I could only run a single process on each server I had. While it was possible to run multiple, I noticed that doing so ran into issues such as overusing resources. With tight integration into the codebase I can have the processing throttle back when more important work, such as a user doing a search, is running. Another issue was that the external programs I was using would occasionally time out or hang. By moving that code internally I can improve searchcode server as a product and have greater control.

Real Time Index
When I first started searchcode.com I was using Sphinx with an index + delta scheme for updates. I moved this over to a periodic full index due to performance issues. This means that anything added to the database can take a week before it is searchable. So even if I had fixed the indexing speed issues I was still going to be hobbled by how long it took to update the index. By using Sphinx real time indexes I can speed this up and have things available to search the moment they are ready. It also means that deletes, which happen when people accidentally publish the wrong thing, can be cleaned up more effectively.

Language Support
The current version of searchcode.com has some hacks on top of the Perl command line application cloc to identify languages beyond what cloc supports. It works well enough but it’s nice to be able to control this. Since language identification is built into searchcode server this is a rather nice bonus.

Search Performance
While the majority of queries and requests are returned fairly quickly, some outliers take multiple seconds and some time out totally. I did some profiling of the queries themselves and it appears that the majority return in under a second. This suggests that most of the slowness comes from the parts outside of Sphinx. One thing I was conscious of when writing searchcode server was to ensure that it was fast to process results. There were a lot of optimisations that went into the code to help with this. This was actually the bulk of the code that I did not want to backport into the existing Python codebase. There is still performance work that needs to go into this, but I feel that by having a unified codebase with large amounts of usage it will be easier to catch these edge cases and resolve them more quickly.

Additional Filters
I want to add additional filter types to the frontend. One that would be nice is the file owner, i.e. who was the last person to touch the file. Other things such as the cyclomatic complexity of the file would also be useful.

With all the above in mind the work has already started. You can view the progress on https://github.com/boyter/searchcode-server/tree/searchcode and while there is still a long way to go, at this point I already have the indexing working along with search and a few other things. Still a long journey ahead but it’s nice to see it coming along. I will be posting more updates in the future, so watch this space!

BBQ with a Dutch ICT/start up Delegation

So a few weeks back I received an interesting email that at first glance looked like a scam.

In short I was being invited to a BBQ at the Consul General’s residence in Sydney to meet an ICT/start-up delegation from the Netherlands. Let me say straight up that I have no idea why I would receive something like this. I am not Dutch, have no Dutch ancestry (that I am aware of) and while searchcode.com is somehow listed among the top startups in Sydney, it’s not as though it has lots of press coverage or is a real start-up (it’s more a hobby side business at the moment).

You can view the email above. I think it’s the use of _ in the title that triggered my personal spam filter. However it was not flagged as such by Gmail, the headers all looked fine and when I followed the various links it all seemed above board. I posted it in the company Slack channel asking what others thought. There was a bit of debate, but I figured I would follow it up on Twitter with the person who sent the email.

They actually responded that yes, it was real, and could I please reply to the email. Which I did.

A short while later the details were sent through. Monday 16th October 2017 at an address in Seaforth (swanky!). There was some discussion about whether this was indeed a real event and whether I should have an escape plan.

Assuming it was real I had two goals. The first was to find out why I was invited. The second was to find out why they were interested in Australia.

The day came around and I took a taxi driven by a racist who hates Uber (the driver and his stories are worth their own blog post) and I arrived.

I had no expectations going in. I have never attended anything like this in my life. My plan was to just roll with the punches and see what I could learn and perhaps meet some interesting people.

Have you ever had that moment where you realised not only are you the most poorly dressed but you have also just dragged down the average age by being an outlier? It was one of those meetings. I quickly noticed that I, in jeans and a hoodie, was really underdressed compared to most, who were wearing suits. I guess I could have played it off as though I was the next Zuckerberg though. I was welcomed by Willem (the Consul), a very nice chap who greeted me in what I presume was Dutch but frankly could have been English as I was so bewildered. He swiftly invited me in and offered me a beer. At the time I was more thirsty and requested water, pointing out that tap water was fine. Probably a mistake on my part. Here I was in a fancy house/mansion asking for a glass of tap water.

Anyway after some brief chatting I noticed Nicoline, who was the one who invited me. Not being familiar with Dutch culture, and frankly not being good at reading people, it was hard to tell what she thought of me. I suspect surprise at my appearance and age perhaps? I can only apologise for not being more impressive. I didn’t get the chance to say hello at this point as I was ushered outside.

The view from the house was excellent and while the backyard was small due to the presence of a pool in the middle of it, it was quite pleasant. After walking outside I noticed that Paul Budde just happened to meander outside. I did a double take for a while before realising it was indeed him, especially when he pronounced his last name. Spotting some individuals more my own age I went over and introduced myself. It turns out they were interns working at the consulate who had been dragged into cooking duties. I ended up chatting to them for a while, learning that Holland exports quite a lot of natural gas but is also starting to wind this back due to it causing the land to sink further.

Around this time Willem gave a quick speech. I got the impression that the Dutch look at Australia as a missed opportunity, considering that they found it 100+ years before the English but never did anything with it. They were quick to point out that it was called New Holland for a time. Highlight of the speech was the arrival of a Kookaburra who was eyeing off the BBQ goods and the surprise of the visitors to Australia at the native fauna.

Following the speech there was a “round table” introduction of who you were and why you were there.

Awkward. That’s the word that best describes me at this point. After Paul Budde, everyone else was a director of a company, CEO, founder, co-founder etc… When my turn came to say who I was I decided to go with the technique I believe is called a purple elephant. I mentioned who I work for (Kablamo), what I do (programmer working with AWS) and that I run searchcode.com on the side as well. I finished by saying “And I don’t like to eat raw tomato”. No idea why I use that but it tends to be easy to remember and makes you stand out. No longer was I the uncomfortable poorly dressed guy, I was the uncomfortable poorly dressed guy who doesn’t like to eat raw tomato. It did prove a talking point when the salads were being served, with a few individuals pointing out I should steer clear of the middle one due to the presence of chopped tomato.

Following the round table there was a call for a group photo. As such there is now some photo of an ICT startup delegation in the backyard of the Dutch Consul General’s house with a scruffy Australian in the back. I made sure to place myself in it such that it would be easy to photoshop me out. I’d love to get a copy of the photo though.

I figured at this point I would attempt to mingle, so I grabbed a beer and with that in hand started chatting away. Paul Budde of https://www.budde.com.au/ is someone whose work I have read a lot of, particularly regarding one of my pet hates in Australia, the mismanagement of the NBN. I decided there and then to have a conversation with him. After a swig of liquid courage I walked up, offered to shake his hand and mentioned I was a big fan of his work, especially his comments about the NBN. I managed to hold a 15 minute conversation with him about the NBN without feeling like an intellectual numpty. Interesting takeaways from the conversation were that he thinks it will be 2030 before anything is done to fix the current situation, and that he has talked to the Prime Minister Malcolm Turnbull about it and believes that Turnbull 100% knows he is doing the wrong thing but has to continue for political reasons. I managed to suggest a beer to him from the selection available which he agreed was a good choice after tasting it.

Following this the party moved inside around the kitchen. I ended up chatting to a few people at this point, the main one being Ruud Alaerds, the director of the Dutch Hosting Provider Association DHPA. He asked me how hosting companies are faring in Australia and how they are going with being acquired. He was interested to know if the local companies are merging into a super-company, pointing out that companies such as OVH are probably waiting for most of the local companies to gobble each other up before acquiring the winner. I believe he is looking to get in before this happens. He was also able to answer my second question of why Australia? It turns out Australia is seen as a good stepping stone into Asia, as Singapore is already “full” and other places such as Taiwan and Hong Kong are not worth investigating at this point.

Other conversations I had were interesting. Things like discussions about houses in Double Bay (very swanky) led to some asking where I live (near Berowra) and after explaining where it is there were comments such as “Oh that must be nice for you”. Which in fairness, by comparison, it is a very humble place to live.

Eventually dessert was offered (small cheesecakes), a small thank you speech was given and people started leaving. At this point I went for the direct approach. I walked up to Nicoline, thanked her for the invitation and again asked why I was invited. Her response was not really encouraging, just that she was looking around for local people involved in the start-up community in Australia. I guess that since I have worked for start-ups and have what is considered one, that might be why I popped onto her list, but I suspect ultimately it was more a mistake. I did have an enjoyable evening regardless!

All in all, Boyter’s uncomfortable night out with the Dutch ICT/start-up delegation was quite entertaining. While I didn’t get any actionable items out of it I did enjoy meeting Paul in particular and seeing how people who are considerably more successful than I am live.

8/10 would do again. Much thanks to the very kind individual who gave me a lift out of there as well, which got me home before 11pm. I still believe I was invited by mistake and that while there I was more of a dancing monkey than someone able to contribute, but nonetheless it was very interesting for me.

Working with Rust

For a while I have been wanting to play with a new programming language. Professionally I work with Java, C#, Python and JavaScript and so I went looking for something complementary to them.

My first stop was Go. I started by implementing a vector space search in it which you can see here https://github.com/boyter/golangvectorspace

While I liked the syntax (I still cannot think in Go), the libraries and the performance, I realized that Go is close to a hybrid of Java and Python for me. It has the performance of Java with Python-style syntax. This is fine but I really wanted to push beyond what I am already familiar with.

I decided I would have a go at a much lower level language. This meant either C, C++, D or Rust. I had a quick tinker with D and while I liked it, it felt a little like C#. Rather than go with C or C++ I decided to try Rust.

Not being familiar at all with Rust I decided rather than start with building a Vector Space in it I would run through a few of the Project Euler problems.

// If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
// Find the sum of all the multiples of 3 or 5 below 1000.
// Answer 233168
fn euler_1() -> i32 {
	let mut sum = 0;

	for i in 1..1000 {
		if i % 3 == 0 {
			sum += i;
		} else if i % 5 == 0 {
			sum += i;
		}
	}

	sum
}
The above is my answer to problem 1 using Rust. Pretty simple really and I only had to look up a few things from the guide. The below answer to question two was however a little more complex. I do like the functional approach to iteration. While the below works, I suspect that it could be more idiomatic if, rather than looping for the answer, I could chain a filter in there somehow.

// Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 1 and 2, the first 10 terms will be:
// 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
// By considering the terms in the Fibonacci sequence whose values do not exceed four million, find the sum of the even-valued terms.
// Answer 4613732
fn euler_2() -> i32 {
	let mut v: Vec<i32> = vec![1, 2];

	let mut sum: i32 = v.iter().rev().take(2).sum();
	while sum < 4000000 {
		v.push(sum);
		sum = v.iter().rev().take(2).sum();
	}

	let mut answer: i32 = 0;
	for val in v.into_iter() {
		if val % 2 == 0 {
			answer += val;
		}
	}

	answer
}

EDIT – Made it more idiomatic

// Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 1 and 2, the first 10 terms will be:
// 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
// By considering the terms in the Fibonacci sequence whose values do not exceed four million, find the sum of the even-valued terms.
// Answer 4613732
fn euler_2() -> i32 {
	let mut v = vec![1, 2];

	let mut sum: i32 = v.iter().rev().take(2).sum();
	while sum < 4000000 {
		v.push(sum);
		sum = v.iter().rev().take(2).sum();
	}

	let answer: i32 = v.into_iter().filter(|x| x % 2 == 0).sum();

	answer
}

My plan is to eventually work my way up to solving Problem 35, which was posted as one of the weekly quiz questions at work the other day. I am hoping that with some free time I will be able to dip into Rust over the next few weeks. You can watch my progress if so inclined on GitHub https://github.com/boyter/working-with-rust

Design for searchcode server

A very brief update about the progress of searchcode server. Currently I am in the middle of reworking how the index is built and maintained. The idea being that I want to add zero downtime index rebuilds, which requires a blue/green index strategy. It is still very much in flux but the current design is to merge the indexer and searcher, which should allow this to happen. I have been playing around with using an iPad as a production device these days and produced the following document.

Edit. People keep asking me what app I used to create this. It was made using Pureflow for iOS and then exported.

Home Battery Systems – You may de-rate system capacity

A quick post. I was looking at the prices to install a home battery system the other day. During the sales process I was informed that for a modular system, such as the one provided by Ampetus, if you don’t buy 2 or more batteries then

“you may de rate system capacity”

Sound’s impressively scary! Slightly worried I asked that it means. Thankfully what it actually means is that instead of getting 3kW delivery using two batteries that one battery matched with the inverter may only deliver 1.5kW. This is not a problem if you are still connected to the grid but it can mean that your inverter isn’t working at its full potential. So in short nothing to worry about.

How to identify software licenses using Python, Vector Space Search and Ngram Keywords

EDIT – I have since taken the ideas below improved them and released a command line application you can use to build software license reports https://github.com/boyter/lc/

The below is mostly a log of my thought process while building out some functionality that I wanted to add into searchcode server. I kept a record of progress and thoughts while doing this in the hopes that I get some sort of useful blog post out of it. It has been edited somewhat for clarity.

One of the tickets raised for searchcode server is the ability to filter by licence https://github.com/boyter/searchcode-server/issues/96 Yes, I am the one who raised the ticket, but it is based on requests from customers. It will also put searchcode server closer to feature parity with Krugle, so it seems like a good thing to add.

This means that ideally all I want is a script that, given a project’s directory and a filename, can tell me what licenses it is likely to be under. I can then port that over to searchcode-server and its Java codebase at my leisure.

My first thought was that adding it for, say, the top 20 most popular licenses shouldn’t be too difficult, and it’s something that I could always expand on later if things are working well.

As such the first thing I needed was to determine the top software licenses in use which thankfully Blackduck Software has already done https://www.blackducksoftware.com/top-open-source-licenses

Then I needed to get a copy of them in a nice format such as JSON to allow for processing. I decided to source them from SPDX.org since generally they should be considered the source of truth for license information. https://spdx.org/licenses/

I then wrote a simple script download.py to pull all the licenses down locally and allow for further processing (yes, using regex to pull information out of HTML, which is BAD, but I am not trying to parse it so I am pretty sure he won’t come). Sorry SPDX people for crawling your site in a non-robots-compliant way. I did try to get only what I needed and cache it locally, so I suspect I impacted the site less than Googlebot would have.

I also added a simple filter to pull back the top 20+ licences that we previously identified so we can test things out.

The next step was to turn the HTML into JSON. Again an ugly script, parse.py, which calls the beast by using regex to pull information out of HTML. One thing I did notice is that the Fair Source licence was missing. Since I was planning on using searchcode server as a test bed I needed that, so I added it myself.

Once again sorry to the SPDX people. If you come to Sydney I’ll buy you a beer and apologize for,

  1. For creating my own short-name for the Fair Source License without permission
  2. For the crappy internet you will experience (Seriously search for Turnbull NBN if you want to see how backwards a country can become, copper for life yo)

The result is that we now have a database of about 20 licences that we can use to try to determine what licence a software project is under.

Attempt 1

To start with I made the following assumptions.

A file with a name such as LICENSE or COPYING is likely to exist in the root folder of any project and contain license information. If not, have a look inside the readme (if it exists). If this file is found and it has a license that we can identify, it becomes the base license for all files inside the project. However files inside the project itself may have a header which must also be considered.

One thing that I didn’t consider at first is that there may be another license/copying file somewhere deeper inside the file tree. Something to consider for later.

First thought was to just use the vector space search algorithm. It really is my hammer for every problem. It works reasonably well for a lot of them, is fast enough in most cases and generally gets the job done. Thankfully I had already written one for Python a while ago. Don’t hate please, I wrote it 10 years ago when I was first learning the language, and yes it’s not Pythonic but it works. One thing to note is that licenses tend to be of different lengths. This means that licenses that are closer to each other in length will be matched more closely using the vector space, which is a cool side effect of how it works.
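
The heart of such a vector space comparison is small enough to show. Below is a minimal sketch (not my actual decade-old script) that treats each text as a bag of words and scores the pair with cosine similarity.

import math
from collections import Counter

def vector_space_similarity(text_a, text_b):
    # Term frequency vectors; the cosine of the angle between them is the score
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    magnitude = math.sqrt(sum(v * v for v in a.values())) * \
                math.sqrt(sum(v * v for v in b.values()))
    return dot / magnitude if magnitude else 0.0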

So the algorithm is,

  1. Find likely candidates for license files.
  2. Compare them to the list of known licenses using the vector space model.
  3. Keep the most likely match.

Then walk through every file in the repository, checking for the special header, and when there are no matches try the full text because things like the entire MIT license tend to get included at the top of files. The resulting ugly script attempt1.py produced the following.

The result for searchcode-server

Project License
0.929696395964 Fair Source License v0.9
0.802095284986 Mozilla Public License 2.0 (no copyleft exception)
0.802095284986 Mozilla Public License 2.0

0.925341443302 MIT License /searchcode-server/src/main/resources/public/js/cache.js

Wow. That is a very cool result. It actually worked. It not only picked up that we are using the Fair Source Licence, it also picked up that one file is using the MIT license. Let’s try it out on some other projects.

Against wordpress,

Project License
0.623430083878 GNU General Public License v2.0 only
0.614318516008 GNU General Public License v1.0 only
0.601642491832 GNU Library General Public License v2 only

It did pick up that it’s probably using GNU GPL v2.0, but it wasn’t as confident as it was with the previous project. Let’s try another one.

Against minitwit (a Spark Java example project)

Project License
0.954897366777 MIT License
0.784597744861 Fair Source License v0.9
0.777231345803 Apache License 2.0

Not bad. It’s pretty confident that it is under the MIT license.

Against Armory React

Project License
0.945769202843 BSD 3-clause Clear License
0.937649791859 BSD with attribution
0.927894236317 BSD 2-clause "Simplified" License

Again pretty good.

Ok. So it looks like we could just pop the top license off and call it a day. This works pretty well with the most common licenses, but why limit ourselves? Let’s just do every license that SPDX has listed. It should work just as well in theory. All we need to do is remove the filter in download.py, rebuild the database and try again.

Trying out on searchcode-server again

Project License
0.929696395964 Fair Source License v0.9
0.813818153434 OCLC Research Public License 2.0
0.804549095187 OSET Public License version 2.1

0.977617793941 BSD Zero Clause License /searchcode-server/include/license/database.json
0.939606278132 Attribution Assurance License /searchcode-server/include/license/database.json
0.908192569643 Open CASCADE Technology Public License /searchcode-server/include/license/database.json
0.902275136399 Adaptive Public License 1.0 /searchcode-server/include/license/database.json
0.93217139424 JSON License /searchcode-server/src/main/resources/public/js/cache.js
0.925341443302 MIT License /searchcode-server/src/main/resources/public/js/cache.js
0.914039614281 feh License /searchcode-server/src/main/resources/public/js/cache.js

Ok so it still picked up Fair Source as the main project license, which is a good result. However our very cool result of MIT being found in cache.js has gone away. Apparently the JSON license looks like the MIT license to the vector space. Indeed a diff between them shows that they are almost the same, the differences being right at the start,

MIT License Copyright (c)
JSON License Copyright (c) 2002 JSON.org

and buried in the middle of the JSON license

The Software shall be used for Good, not Evil.

Hah! I remember reading about that a while ago. Something about Google not being able to use it because apparently their motto “Don’t be evil” is more a guideline than a rule. In all seriousness though it’s actually due to the license being classified as a non-free license, because it imposes conditions which restrict usage, see https://www.cnet.com/news/dont-be-evil-google-spurns-no-evil-software/ for more details.

So what we would normally do about now is add keyword weighting to the terms so that in this case MIT makes it rank higher for MIT and JSON for JSON. In fact I started doing just that, then realised that with 328 licenses this was going to be a painful process. Perhaps there is a better way.

Thinking about it, what we really want is to find keywords, or collections of multiple keywords, that we know to be unique for each license. Then all we need to do is check the text for the presence of those keywords. The catch being we need to ensure that they are unique to each license. To do so what I think will work is to break each license up into collections of words of length 1-10 and check for uniqueness against the other licenses. The technical term for these is ngram.

An example would be,

Lorem ipsum dolor sit amet consetetur sadipscing elitr

As 2-grams (bigrams):

[('lorem', 'ipsum'), ('ipsum', 'dolor'), ('dolor', 'sit'), ('sit', 'amet'), ('amet', 'consetetur'), ('consetetur', 'sadipscing'), ('sadipscing', 'elitr')]

As 3-grams (trigrams):

[('lorem', 'ipsum', 'dolor'), ('ipsum', 'dolor', 'sit'), ('dolor', 'sit', 'amet'), ('sit', 'amet', 'consetetur'), ('amet', 'consetetur', 'sadipscing'), ('consetetur', 'sadipscing', 'elitr')]

Thankfully this is pretty easy to do in Python so I borrowed an existing bit of code to do it for me, http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/

input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_ngrams(input_list, n):
  # Zip n copies of the list, each offset by one more word, yielding n-length tuples
  return zip(*[input_list[i:] for i in range(n)])

For the record I am totally aware that NLTK can also do this, but since I don’t currently have that installed let’s go pure Python. It will be a little slower, but considering this calculation should rarely run I am not too worried about performance yet. Of course that’s a good idea to live by anyway. Only worry about performance when it becomes an issue.

We can then generate ngrams for each license and check each one for uniqueness against every other license. If no matches are found then woohoo, we have a gram that uniquely identifies the license.
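
Conceptually the check is tiny. Here is a sketch of it (parse2.py in spirit rather than the actual code), assuming a dict mapping each license name to the set of ngrams generated from its text. Whatever it returns can act as a fingerprint for that license.

def unique_ngrams(target, license_ngrams):
    # Collect every ngram belonging to any other license
    others = set()
    for name, ngrams in license_ngrams.items():
        if name != target:
            others.update(ngrams)
    # Whatever remains is unique to the target license
    return [n for n in license_ngrams[target] if n not in others]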

Some simple but stupid code was written, parse2.py, which does exactly this. Turns out that language is a lot more distinctive for licenses than I first thought,

0BSD 188
AAL 1985
Abstyles 437
Adobe-2006 1234
Adobe-Glyph 1620
ADSL 608
AFL-1.1 555
AFL-1.2 251
AFL-2.0 69
AFL-2.1 67
AFL-3.0 452
Afmparse 959
AGPL-1.0 2212

For the BSD Zero Clause License there are apparently 188 unique ngrams of between 2 and 10 words in it. For the Affero General Public License v1.0 there are a whopping 2212! The numbers were so high that I changed the ngrams to start at 5 and go to 10. This dropped the numbers found by about 25%, which seems about right as you would expect the most unique combinations of words to exist at the upper range of ngrams.

One problem I noticed with this is that a lot of the ngrams are based on names that exist within the sample licenses that SPDX has. For example BSD Zero Clause has the name “Rob Landley”, which produces a lot of ngrams with this name in it, and while the name is indeed unique it is useless unless we happen to be checking code that Rob has written.

However the performance issue that I was worried about before popped up. It was taking a long time to process. Not surprising considering the actual implementation consists of multiple nested loops. With the encouraging result that there is a lot of uniqueness for most licenses, I tried just searching for ngrams of 4-5 words long to speed things up. Assuming we didn’t find 0 matches then happy days, we can try implementing using what we have and everything should be fine. A few small tweaks later and I ran it again.

Doh! As it turns out some are not unique enough. The culprits,


I modified the code to loop just those ones with a more exhaustive search to find what worked. With ngrams of length 2-10 there was still not enough uniqueness. So I went all out, ngrams from length 2-35.

Artistic-1.0 120
BSD-3-Clause 21

This resolved the issue for Artistic-1.0 and BSD-3-Clause but we still have nothing for the following licenses,


Something is wrong here.

Turns out some of these are particularly troubling. The Mozilla Public License 1.1 (MPL-1.1 in the above) for example seems to be embedded in the Netscape Public License v1.1, which of course means nothing about it is unique compared to the other. The others have a similar story. I tried searching just between those two licences with ngrams of length 2-100 to see if anything would turn up, which of course took a crazy amount of time to process. The result was that there was nothing that could be used keyword-wise to separate these licenses.

How about a hybrid approach?

If we cannot definitively say using keywords what license a file is under, then we can fall back to the vector space to rank those without keywords. To do so I flushed the results of the last run into the database file using 7-8 word ngrams for everything except Artistic-1.0 and BSD-3-Clause, which were checked for ngrams with a range of 2-35. The results of this produced the following licenses with their number of unique ngrams,

0BSD 25
AAL 246
Abstyles 54
Adobe-2006 163
Adobe-Glyph 210
Fair-Source-0.9 267

Most seem to have over 100 or so unique ngrams, which means they will probably work pretty well, and as expected the only exceptions are the MPL licenses, which have nothing unique about them compared to every other license.

I ended up cleaning up the code at this point and improving the loops, so thankfully it took only a few minutes to run. Some of the original runs were taking tens of minutes due to inefficient code so this was a big win. The resulting file was on the heavy side though, at 20 megabytes, and crashed most editors when I tried to read it. To resolve this I truncated down to at most 50 unique ngrams per license to see how this worked in the real world. This file weighed in at a much more realistic 3 megabytes.

I then created attempt2.py which used the new database of keywords to guess which licenses applied. Firstly I tried it against searchcode-server itself, with the result that the LICENSE file found was indeed the Fair Source License. I then tried it against GCC. It’s interesting to note that this resulted in its COPYING file being marked as containing both GPL-2.0 and LGPL-2.1. At first I thought this might have been a bug in my logic but it seems it was actually correct. The offending ngram used to match was

"street, fifth floor, boston, ma 02110-1301 usa"

This is supposed to be unique to LGPL-2.1 but is included here as well. Since we actually have 813 ngrams (truncated to 50) for each license, I figured we might as well check whether MOST (70%) of the keywords are there, and if so mark it as a match, otherwise ignore it. This resulted in the following for GCC


Which is correct. In fact further tests on various other repositories worked until I hit react-armory, which is under the BSD 3-clause Clear License. Turns out most of the ngrams generated for it actually involve the project name, meaning they are useless. Annoyingly the ngram of length 3 “BSD Clear License”, which is unique, was missed because the parser was set to look for ngrams of 7-8 words. Urgh.

Thinking about this for a bit, it totally makes sense to look for ngrams from 2-8 words even when we are truncating to the top 50. The smaller ones are usually going to be high value and there shouldn’t be too many of them. The only problem is the increased processing time to walk all of the combinations of things like [‘the’, ‘license’] and the like. At the very least we should include trigrams (3 length ngrams) since that solves this specific issue and should be unique for a lot of licenses. Modifying the parse2.py script to take in a defined range of numbers was easy enough, then I just let it run to produce the new database to work with and tried everything again. It took longer than before but not so much that it was annoying.

Due to how painful this was getting to test manually I started to build a small test harness (test_attempt2.py) which I could eyeball to see if I was getting closer to a result I wanted without regressions. Thankfully it was working perfectly for the tests I was trying.

At this point I tried the same technique across all the files in the repositories. Using the previous example of searchcode-server I would have expected the cache.js file to be marked as MIT licensed as before. This however was not the case. Because I had the keyword match script on a sliding scale where it needed a few matches to be considered positive, it was missing this one as there were only a few matches.

I decided at this point to integrate the vector space back in. If a file was marked with any license via keywords, we would then confirm using the vector space, and if it matched the license with a high degree of confidence it would be marked as that license. After adding this logic I managed to get the following output,

Bens-MBP:license boyter$ python attempt2.py
0.988840442861 Fair-Source-0.9 /searchcode-server/LICENSE.txt
0.442713859889 MIT /searchcode-server/src/main/resources/public/js/cache.js

Exactly what I was looking for. All of the tests I had previously written also passed with this logic. With a simple change to only check the top of the file, where the header would be, by cutting the string down to the length of the license, I was able to improve the result considerably,

Bens-MBP:license boyter$ python attempt2.py
0.988840442861 Fair-Source-0.9 /searchcode-server/LICENSE.txt
0.985719098155 MIT /searchcode-server/src/main/resources/public/js/cache.js

That is a seriously cool result. Not only was the code able to identify the licences, it did so with a very high percentage of confidence to boot.
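
Putting the pieces from the last few steps together, the matching logic at this point is roughly the following sketch (hypothetical structure and names, with vector_space_similarity standing in for my vector space code, rather than the actual attempt2.py internals).

def hybrid_match(contents, licenses, confidence=0.85):
    matches = []
    for lic in licenses:
        # Any unique ngram hit marks the license as a candidate
        if any(ngram in contents for ngram in lic['ngrams']):
            # Only compare the top of the file, where a header would live,
            # by cutting the contents down to the length of the license
            header = contents[:len(lic['text'])]
            score = vector_space_similarity(header, lic['text'])
            if score >= confidence:
                matches.append((score, lic['name']))
    return sorted(matches, reverse=True)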

I tried running it over a collection of projects I had checked out including GCC, Linux and WordPress

 0.993261478239 BSD-3-Clause-Clear /armory-react/LICENSE
 0.999679612780 GPL-3.0 /decodingcaptchas/LICENSE
 0.980796066053 OFL-1.1 /decodingcaptchas/lib/font/source-sans-pro/LICENSE
 0.999692799354 GPL-3.0 /freemoz/LICENSE
 0.999792347852 GPL-2.0 /gcc/COPYING
 0.999922926136 LGPL-2.1 /gcc/COPYING.LIB
 0.999679612780 GPL-3.0 /gcc/COPYING3
 0.999902746595 LGPL-3.0 /gcc/COPYING3.LIB
 0.999792347852 GPL-2.0 /gcc/gcc/COPYING
 0.999932589474 LGPL-2.1 /gcc/gcc/COPYING.LIB
 0.999679612780 GPL-3.0 /gcc/gcc/COPYING3
 0.999902746595 LGPL-3.0 /gcc/gcc/COPYING3.LIB
 0.858896713823 GPL-2.0 /gcc/gcc/df-core.c
 0.999532020106 GFDL-1.3 /gcc/gcc/ada/doc/share/gnu_free_documentation_license.rst
 0.948529328513 bzip2-1.0.6 /gcc/gcc/testsuite/gcc.c-torture/execute/pr20527-1.c
 0.994200247145 bzip2-1.0.6 /gcc/gcc/testsuite/gcc.dg/params/LICENSE
 0.999792347852 GPL-2.0 /gcc/include/COPYING
 0.999679612780 GPL-3.0 /gcc/include/COPYING3
 0.889767272339 bzip2-1.0.6 /gcc/libbacktrace/alloc.c
 0.899838703127 bzip2-1.0.6 /gcc/libbacktrace/atomic.c
 0.883378984629 bzip2-1.0.6 /gcc/libbacktrace/backtrace.c
 0.885232659466 bzip2-1.0.6 /gcc/libbacktrace/btest.c
 0.866517942966 bzip2-1.0.6 /gcc/libbacktrace/dwarf.c
 0.881126480326 bzip2-1.0.6 /gcc/libbacktrace/elf.c
 0.876148037216 bzip2-1.0.6 /gcc/libbacktrace/fileline.c
 0.862314849462 bzip2-1.0.6 /gcc/libbacktrace/mmap.c
 0.876559695369 bzip2-1.0.6 /gcc/libbacktrace/mmapio.c
 0.903997364404 bzip2-1.0.6 /gcc/libbacktrace/nounwind.c
 0.880387358104 bzip2-1.0.6 /gcc/libbacktrace/pecoff.c
 0.872402129061 bzip2-1.0.6 /gcc/libbacktrace/posix.c
 0.881550515104 bzip2-1.0.6 /gcc/libbacktrace/print.c
 0.878971997218 bzip2-1.0.6 /gcc/libbacktrace/read.c
 0.880857671076 bzip2-1.0.6 /gcc/libbacktrace/simple.c
 0.897738871764 bzip2-1.0.6 /gcc/libbacktrace/sort.c
 0.890041393843 bzip2-1.0.6 /gcc/libbacktrace/state.c
 0.874782494001 bzip2-1.0.6 /gcc/libbacktrace/stest.c
 0.897023027383 bzip2-1.0.6 /gcc/libbacktrace/unknown.c
 0.879821572348 MITNFA /gcc/libffi/src/mips/ffi.c
 0.997055163924 LGPL-2.1 /gcc/libiberty/copying-lib.texi
 0.999932589474 LGPL-2.1 /gcc/libiberty/COPYING.LIB
 0.899167675944 Intel /gcc/liboffloadmic/runtime/liboffload_error.c
 0.889543866358 Intel /gcc/liboffloadmic/runtime/liboffload_msg.c
 0.887357763796 Intel /gcc/liboffloadmic/runtime/orsl-lite/lib/orsl-lite.c
 0.999932589474 LGPL-2.1 /gcc/libquadmath/COPYING.LIB
 0.857337014417 NCSA /gcc/libsanitizer/LICENSE.TXT
 1.000000000000 BSL-1.0 /gcc/zlib/contrib/dotzlib/LICENSE_1_0.txt
 0.998470560414 GPL-2.0 /linux/COPYING
 0.999831352903 LGPL-2.0 /linux/arch/sparc/lib/COPYING.LIB
 0.997101938433 GPL-2.0 /linux/Documentation/networking/LICENSE.qlcnic
 0.997101938433 GPL-2.0 /linux/Documentation/networking/LICENSE.qlge
 0.997004626197 GPL-2.0 /linux/Documentation/scsi/LICENSE.qla2xxx
 0.997004626197 GPL-2.0 /linux/Documentation/scsi/LICENSE.qla4xxx
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c3xxx/adf_c3xxx_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c3xxx/adf_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c3xxxvf/adf_c3xxxvf_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c3xxxvf/adf_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c62x/adf_c62x_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c62x/adf_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c62xvf/adf_c62xvf_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_c62xvf/adf_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_accel_engine.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_admin.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_aer.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_cfg.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_ctl_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_dev_mgr.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_hw_arbiter.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_init.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_isr.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_pf2vf_msg.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_sriov.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_transport.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_transport_debug.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_vf2pf_msg.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/adf_vf_isr.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/qat_algs.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/qat_asym_algs.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/qat_crypto.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/qat_hal.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_common/qat_uclo.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_dh895xcc/adf_dh895xcc_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_dh895xcc/adf_drv.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_dh895xccvf/adf_dh895xccvf_hw_data.c
 0.900564102581 Intel /linux/drivers/crypto/qat/qat_dh895xccvf/adf_drv.c
 0.998958351972 GPL-2.0 /linux/drivers/staging/rtl8192e/license
 0.999594583136 GPL-2.0 /linux/drivers/staging/rtl8192u/copying
 0.999792347852 GPL-2.0 /linux/tools/usb/usbip/COPYING
 0.952930843666 Sleepycat /pathos/dill-0.2.1/LICENSE
 0.952930843666 Sleepycat /pathos/pathos-master/LICENSE
 0.952930843666 Sleepycat /pathos/pox-0.2/LICENSE
 0.987991559152 MIT /phonecat/LICENSE
 1.000000000000 Apache-2.0 /python-goose/LICENSE.txt
 1.000000000000 Apache-2.0 /searchcode/searchcode/searchcode/static/admin/fonts/LICENSE.txt
 0.987991559152 MIT /searchcode/searchcode/searchcode/static/admin/js/vendor/xregexp/LICENSE-XREGEXP.txt
 0.988840442861 Fair-Source-0.9 /searchcode-server/LICENSE.txt
 0.985719098155 MIT /searchcode-server/src/main/resources/public/js/cache.js
 0.985719098155 MIT /searchcode-server/target/classes/public/js/cache.js
 0.999268167835 LGPL-2.1 /seo/vendor/videlalvaro/php-amqplib/LICENSE
 0.999692799354 GPL-3.0 /SingleBugs/LICENSE
 0.999692799354 GPL-3.0 /testing/LICENSE
 0.997865652460 GPL-2.0 /wordpress/license.txt
 0.999792346238 GPL-2.0 /wordpress/wp-content/themes/twentyfourteen/genericons/LICENSE.txt
 0.999792346238 GPL-2.0 /wordpress/wp-content/themes/twentythirteen/fonts/LICENSE.txt
 0.999792346238 GPL-2.0 /wordpress/wp-includes/js/plupload/license.txt
 0.999932589474 LGPL-2.1 /wordpress/wp-includes/js/tinymce/license.txt

Rather cool. Lots of licenses identified with quite a lot of confidence.

There is however one issue I am overlooking. Take for example a project which includes another project in a subdirectory. What license should be displayed for the files in that subproject? In this case I would expect them to be marked with both licenses, and it would be very nice to see this information.

One final issue is what to do when there are multiple license files in the root directory. GCC is an example of this, with four licenses defined in the root; the idea being that you take whichever one is most applicable to your project.

0.999792347852 GPL-2.0 /gcc/COPYING
0.999922926136 LGPL-2.1 /gcc/COPYING.LIB
0.99967961278 GPL-3.0 /gcc/COPYING3
0.999902746595 LGPL-3.0 /gcc/COPYING3.LIB

Looking at the SPDX specification, the rule is to define such files as being under all of the licenses. This is the exact text regarding it:

Representing Multiple Licenses

Multiple licenses can be represented using a SPDX license expression as defined in Appendix IV. A set of licenses must be enclosed in parentheses (this is a convention for SPDX expressions). As further described there:

When there is a choice between licenses (“disjunctive license”), they should be separated with “OR”. If presented with a choice between two or more licenses, use the disjunctive binary “OR” operator to construct a new license expression.
Similarly when multiple licenses need to be simultaneously applied (“conjunctive license”), they should be separated with “AND”. If required to simultaneously comply with two or more licenses, use the conjunctive binary “AND” operator to construct a new license expression.
In some cases, a set of license terms apply except under special circumstances, in this case, use the “WITH” operator followed by one of the recognized exception identifiers.
Sometimes a set of license terms apply except under special circumstances. In this case, use the binary “WITH” operator to construct a new license expression to represent the special exception situation.

SPDX-License-Identifier: (GPL-2.0 OR MIT)
SPDX-License-Identifier: (LGPL-2.1 AND BSD-2-CLAUSE)
SPDX-License-Identifier: (GPL-2.0+ WITH Bison-exception-2.2)

This was taken from https://spdx.org/spdx-specification-21-web-version, Appendix V: Using SPDX short identifiers in Source Files.
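To illustrate the format, a naive helper for building such an expression from a set of detected identifiers might look something like the sketch below. This is just my own illustration of the parenthesised form from the spec, not part of the tool described here.

def spdx_expression(identifiers, operator='OR'):
    """Join a set of SPDX identifiers into a single expression.
    Per the spec, a set of licenses must be enclosed in parentheses."""
    identifiers = sorted(set(identifiers))
    if len(identifiers) == 1:
        return identifiers[0]
    return '(' + (' ' + operator + ' ').join(identifiers) + ')'

print(spdx_expression(['GPL-2.0', 'GPL-3.0']))  # prints (GPL-2.0 OR GPL-3.0)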

Fair enough. We then need to write one last version of our license guesser which uses everything done above but also supports multiple licenses. It will look for license files in each directory and mark everything below as belonging to those licenses, unless of course there is another license file further down or the result is overridden by a header in the file itself. I am not trying to keep track of the WITH or AND logic described above; I just want to mark each file with every potential license, and simply marking it so is good enough.

For this I simply copied the existing attempt2.py and created attempt3.py, which should hopefully be the last version. What we now need to do is walk the tree, and every time we encounter a license file, push the license it represents onto a stack; any file below that point is marked with every license currently on the stack. When we leave the directory we pop the license off again.
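To make that concrete, here is a minimal sketch of the recursive walk. The names LICENSE_FILENAMES and guess_license are my own placeholders standing in for the keyword/vector matching built up in the earlier attempts; this is not the exact code in attempt3.py.

import os

# Hypothetical list of filenames treated as license files
LICENSE_FILENAMES = {'license', 'license.txt', 'license.md', 'copying', 'copying.lib'}

def guess_license(path):
    """Placeholder for the matcher from the earlier attempts; the real
    version returns a list of SPDX identifiers guessed for the file."""
    return []

def walk_with_licenses(directory, licenses=None):
    licenses = licenses or []
    entries = os.listdir(directory)

    # First pass: collect any licenses declared in this directory so
    # they apply to every file at or below this level
    found = []
    for entry in entries:
        full = os.path.join(directory, entry)
        if os.path.isfile(full) and entry.lower() in LICENSE_FILENAMES:
            found.extend(guess_license(full))

    current = licenses + found

    for entry in entries:
        full = os.path.join(directory, entry)
        if os.path.isfile(full):
            print(full + ' ' + ' '.join(current))
        elif os.path.isdir(full):
            # Passing a copy down means the licenses found here are
            # implicitly popped when the recursive call returns
            walk_with_licenses(full, current)

walk_with_licenses('./')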

Thankfully this was a very easy change to implement. I used a project which I knew already had some mixed licenses and ran the tool over it.

 >>>>>>> /decodingcaptchas/LICENSE GPL-3.0
 >>>>>>> /decodingcaptchas/LICENSE GPL-3.0
 >>>>>>> /decodingcaptchas/LICENSE GPL-2.0
 >>>>>>> /decodingcaptchas/LICENSE GPL-2.0
 /decodingcaptchas/Gruntfile.js GPL-2.0 GPL-3.0
 /decodingcaptchas/code/crack.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/crack_test.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/geticons.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/step1.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/step2.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/step3.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/step4.py GPL-2.0 GPL-3.0
 /decodingcaptchas/code/style.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/reveal.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/print/paper.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/print/pdf.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/beige.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/black.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/blood.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/league.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/moon.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/night.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/serif.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/simple.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/sky.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/solarized.css GPL-2.0 GPL-3.0
 /decodingcaptchas/css/theme/white.css GPL-2.0 GPL-3.0
 /decodingcaptchas/js/reveal.js GPL-2.0 GPL-3.0
 >>>>>>> /decodingcaptchas/lib/LICENSE Fair-Source-0.9
 /decodingcaptchas/lib/css/zenburn.css Fair-Source-0.9 GPL-2.0 GPL-3.0
 /decodingcaptchas/lib/font/league-gothic/league-gothic.css Fair-Source-0.9 GPL-2.0 GPL-3.0
 >>>>>>> /decodingcaptchas/lib/font/source-sans-pro/LICENSE OFL-1.1
 >>>>>>> /decodingcaptchas/lib/font/source-sans-pro/LICENSE OFL-1.1
 /decodingcaptchas/lib/font/source-sans-pro/source-sans-pro.css Fair-Source-0.9 GPL-2.0 GPL-3.0 OFL-1.1
 /decodingcaptchas/lib/js/classList.js Fair-Source-0.9 GPL-2.0 GPL-3.0
 /decodingcaptchas/lib/js/head.min.js Fair-Source-0.9 GPL-2.0 GPL-3.0
 /decodingcaptchas/lib/js/html5shiv.js Fair-Source-0.9 GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/highlight/highlight.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/leap/leap.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/markdown/markdown.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/markdown/marked.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/math/math.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/multiplex/client.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/multiplex/index.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/multiplex/master.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/notes/notes.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/notes-server/client.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/notes-server/index.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/print-pdf/print-pdf.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/remotes/remotes.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/search/search.js GPL-2.0 GPL-3.0
 /decodingcaptchas/plugin/zoom-js/zoom.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/qunit-1.12.0.css GPL-2.0 GPL-3.0
 /decodingcaptchas/test/qunit-1.12.0.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/test-markdown-element-attributes.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/test-markdown-slide-attributes.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/test-markdown.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/test-pdf.js GPL-2.0 GPL-3.0
 /decodingcaptchas/test/test.js GPL-2.0 GPL-3.0

What this shows is each file along with each license that the program believes it belongs to. Files prefixed with >>>>>>> indicate a license file which changes the licenses of the files below it.

The root of the project is dual licensed under GPL-2.0 and GPL-3.0, which means everything has at least those two licenses. Some fonts under the SIL Open Font License 1.1 were also picked up. To make things a little harder I added the Fair Source license under the lib directory (just as a test) to see if it would be reflected correctly.

I am pretty happy with the result. So I took the code and ran it against GCC and the Linux kernel. The output for both is too large to post on this blog, but you can view the GCC one as a gist: https://gist.github.com/boyter/ee534011cf512ba7e2992ecdef87523c

The last part was to add back the per-file check so that we can identify license headers for each file in the subfolders. Thankfully this was just a short tweak. Of course the generation is now much slower due to the additional processing, but it seemed to work pretty well.
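For anyone curious, the header check can be as simple as reading the first few kilobytes of each file and looking for license keywords there. The sketch below is my own rough version under that assumption; keywords_by_license stands in for the keyword data built in the earlier attempts and is not the actual code.

def check_header(path, keywords_by_license, chunk_size=4096):
    """Scan the start of a file for license keywords. The
    keywords_by_license mapping (SPDX id -> list of keywords) is
    assumed to come from the earlier keyword extraction step."""
    try:
        with open(path, 'rb') as f:
            header = f.read(chunk_size).decode('utf-8', 'ignore').lower()
    except IOError:
        return []
    return [lic for lic, words in keywords_by_license.items()
            if any(word in header for word in words)]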

Well, that about covers it. Everything seems to work as I would expect, and it would be reasonably easy at this point to take the code and produce a full blown SPDX parser. I will leave that as an exercise for the reader (unless it turns out I need one at a later date).

The next step for me is to integrate this into searchcode server, which just means porting the above into Java and then adding this information as a facet for searching. This is something I may write about later. If you are interested in a code search application though, please check it out!

I have uploaded all the code into the following github repository, released under the MIT License. Note that the database will also be released under Creative Commons inside the searchcode-server project itself; it is something I will be keeping up to date, so it might be a better source if you want to do any license matching.