Sphinx Real Time Index How to Distribute and Hidden Gotcha

I have been working on real time indexes with Sphinx recently for the next version of searchcode.com and ran into a few things that were either difficult to search for or just not covered anywhere.

The first is how to implement a distributed search using real time indexes. It’s actually done the same way you would normally create an index. Say you had a single server with 4 index shards on it and you wanted to run queries against it. You could use the following,

index rt
{
    type = distributed
    local = rt1
    agent = localhost:9312:rt2
    agent = localhost:9312:rt3
    agent = localhost:9312:rt4
}

You would need to have each one of your indexes defined (only one is added here to keep the example short)

index rt1
{
    type = rt1
    path = /usr/local/sphinx/data/rt1
    rt_field = title
    rt_field = content
    rt_attr_uint = gid
}

Using the above you would be able to search across all of the shards. The trick is knowing that to update you need to update each shard yourself. You cannot pass documents to the distributed index but instead must make a separate update to each shard. Usually I split sphinx shards based on a query like the following,

SELECT cast((select id from table order by 1 desc limit 1)/4 as UNSIGNED)*2, \
         cast((select id from table order by 1 desc limit 1)/4 as UNSIGNED)*3 \
         FROM table limit 1

Where the 4 is the number of shards and the multiplier splits the shards out. It’s performant due to index use. However for RT I suggest a simple modulas operator % against the ID column for each shard as it allows you to continue to scale out to each shard equally.

The second issue I ran into was that when defining the attributes and fields you must define all the fields before the uints. The above examples work fine but the below is incorrect. I couldn’t find this mentioned in the documentation.

index rt
{
    type = rt
    path = /usr/local/sphinx/data/rt
    rt_attr_uint = gid # this should be below the rt_fields
    rt_field = title
    rt_field = content
}

GPL Time-bomb an interesting approach to #FOSS licensing

UPDATES Following some feedback I am going to rename my usage of “Time-Bomb” due to potential negative connotation on the words. I am going to call it “Eventually Open”. Also a few other things need mentioning. I am not looking for code submissions back into the source at this time. This was a move to show that there are no back-doors in the code sending source code back to a master server.

About a week ago I released searchcode server under the fair source licence. From day one I had wanted to release it using some form of licence where the code was available but I wanted to lock it somewhat because frankly I do want to make some money out of my time investment. That’s not the whole story however. I did not want to create another “Look but don’t touch” situation forever and I certainly didn’t want searchcode to be constrained by a licence in the event that I die, lose interest or stop updating the code.

The result of this was that I have added what I am going to call a GPL Time-Bomb into into the licencing of searchcode server. Here is how it works. After a specified period of time the current version of searchcode server can be re-licensed under the GPL v3. This is a shifting date such that each new release extends its own time-bomb further into the future. However the older releases time is still fixed. The time-bomb for version 1.2.3 and 1.2.4 takes place on the 27th August 2019 at which point you can take the source using GPL 3.0. Assuming searchcode server 6.1.2 comes out at roughly the same time its time-bomb will be set to the 27th of August 2022 but the 1.2.3 release will be unaffected.

In short I have put a time limit of 3 years to make money out of the product and if I am unable it is turned over to the world to use as they see fit. Even better, assuming searchcode server becomes a successful product I will be forced to continually improve it and upgrade if I want to keep a for sale version without there being an equivalent FOSS version around (which in theory could be maintained by the community). In short everyone wins from this arrangement, and I am not forced to rely on a support model to pay the bills which frankly only works when you have a large sales team.

Here’s hoping this sort of licencing catches on as there are so many products out there that could benefit from it. If they take off the creators have an incentive to maintain and not milk their creation and those that become abandoned even up available for public use which I feel is a really fair way of licencing software.

Agree? Disagree? Email me or hit me up on twitter.

searchcode server under fair source

A very quick blog today. I have released searchcode server under the fair source licence. This means that as of a few days ago you can view the source, change it modify it and run it as you see fit so long as you have less than 5 users.

The source is hosted on github (I may move this to GitLab sometime in the future) and you can view it here.

So what does this mean? Well the community edition still exists (run searchcode with as many users as you want) as do the paid versions with support and all the full features. The real advantage however is that you can now vet the source code to ensure that searchcode server is not secretly sending your most valuable asset to some hidden server somewhere. In addition it means I can now talk about the source openly and will be writing some posts about how I ran into some CPU branching issues which slowed down some code.

Good news all around then. Be sure to check out the source and let me know what you think.

Syncing Stash/BitBucket with searchcode server

Recently it came up to perform a slight integration piece between a on premises Stash/BitBucket install and a searchcode server install. Thankfully both have an API and very thankfully there is a nice Python library for talking to Stash/BitBucket.

Below is the code used. It pulls out all of the repositories from every project, checks if it exists in searchcode and if not adds it as a repository to be indexed. You need to install stashy (pip install stashy) and run it whenever you have new repositories. One idea is to set it as a cron task and ensure everything is in sync.

Note that this does not remove repositories that have been indexed, but it would not take much work to achieve it.

import stashy
from hashlib import sha1
from hmac import new as hmac
import urllib2
import json
import urllib

def getstashrepos():
    stash = stashy.connect("https://mystashserver/", "STASH_USERNAME", "STASH_PASSWORD")

    projects = stash.projects.list()
    repos = [stash.projects[x['key']].repos.list() for x in projects]

    stashrepos = []

    for repo in repos:
        stashrepos = stashrepos + [{'name': x['project']['key'] + '-' + x['slug'],
                                    'cloneUrl': x['cloneUrl'],
                                    'browse': x['links']['self'][0]['href']} for x in repo]

    return stashrepos

def addtosearchcode(repo):
    reponame = repo['name']
    repourl = repo['cloneUrl']
    repotype = "git"
    repousername = "STASH_USERNAME"
    repopassword = "STASH_PASSWORD"
    reposource = repo['browse']
    repobranch = "master"

    message = "pub=%s&reponame=%s&repourl=%s&repotype=%s&repousername=%s&repopassword=%s&reposource=%s&repobranch=%s" % (
            urllib.quote_plus(publickey),
            urllib.quote_plus(reponame),
            urllib.quote_plus(repourl),
            urllib.quote_plus(repotype),
            urllib.quote_plus(repousername),
            urllib.quote_plus(repopassword),
            urllib.quote_plus(reposource),
            urllib.quote_plus(repobranch)
        )

    sig = hmac(privatekey, message, sha1).hexdigest()

    url = "http://mysearchcodeserver/api/repo/add/?sig=%s&%s" % (urllib.quote_plus(sig), message)

    data = urllib2.urlopen(url)
    data = data.read()

    data = json.loads(data)
    print reponame, data['sucessful'], data['message']


publickey = "MY_SEARCHCODE_PUBLIC_KEY"
privatekey = "MY_SEARCHCODE_PRIVATE_KEY"

message = "pub=%s" % (urllib.quote_plus(publickey))

sig = hmac(privatekey, message, sha1).hexdigest()
url = "http://mysearchcodeserver/api/repo/list/?sig=%s&%s" % (urllib.quote_plus(sig), message)

data = urllib2.urlopen(url)
data = data.read()

data = json.loads(data)
existingrepos = [x['name'] for x in data['repoResultList']]

for repo in getstashrepos():
    if repo['name'] not in existingrepos:
        addtosearchcode(repo)

searchcode.com: The Architecture – migration 3.0

27th July 2016 at about 9:30pm my local time (GMT +10) I updated the A records for searchcode’s nameservers to point at a new stack that has been several months in the making. As with most posts of this sort of nature a quick recap of where things were and where they are now.

The previous searchcode stack consisted of two dedicated servers hosted by Hetzner. I have previously discussed this about two years ago when discussing searchcode next. The first server was a reasonably powerful machine running pretty much all of code required to deliver searchcode.com itself with the exception of the actual index. searchcode is a Django application and was using nginx to serve results directly out of memcached where possible to avoid consuming a running Gunicorn process. The move to have another server for the index came pretty quickly after searchcode was released as it was just not performant enough with everything on a single box which was the situation for a short time.

searchcode before

This structure worked well for the last 2 years or so but I had noticed that the load average on the frontend was starting to average out at about 3.0+ and would quite often rise to 7.0+ Considering the machine had only 4 real CPU cores this was an issue. Interestingly it also caused the number of requests that searchcode could respond to to max out. In addition the MySQL database had some corruption somewhere in the middle which made to app increasingly unstable. The application was going down where nothing short of a reboot would save it at least once a month. Finally I had grown as a developer over the last two years and it was time to move to something more stable and better performing.

The first thing in this modern cloud world was start looking at moving to something like AWS or DigitalOcean. I firstly created a simple spreadsheet with what I was looking to move towards along with expected costs. In short I wanted to break searchcode apart so that it consisted of the following parts,

Frontend server. Would provide SSL termination to Varnish and reverse proxy back to application servers. Requires, either a lot of RAM or fast disk. If RAM is low but disk is fast can use Varnish with disk cache.

Backend server. Would run the application itself. Scales horizontally. Requires a fast CPU to process the code results.

Indexer server. Would host the Sphinx indexer. Scales horizontally. Requires a fast CPU, reasonably amounts of RAM and about 25 GB of amount of disk space per CPU core, preferably SSD.

Database server. Would host the MySQL database. Does not scale horizontally. Requires lots of RAM (more than 4 GB) and 2 TB of disk space.

With these requirements in mind. I started shopping around.

AWS was ruled out almost immediately due to the cost. This was in spite of the fact that I have a great deal of experience dealing with the AWS API’s. As a rule I run searchcode.com as lean as possible and AWS while brilliant ended up costing more than I would have liked.

I then started looking at DigitalOcean. I have always been a fan of how fast they spin instances up, the simple API and their prices. They also have excellent support and are pretty lenient when it comes to how much CPU and DiskIO you consume. When I started looking however they had not launched Block Storage, which meant I needed the $640 a month plan for the database which was even more expensive then AWS. Later they did release block storage, but the resulting price was still rather high. I still use DigitalOcean for spinning up test stacks.

I also looked at Vultr. It’s pretty safe to say they are a DigitalOcean clone but with more data centers, competitive prices and much worse customer support. They do have one intriguing server option however not mentioned on their public website, which is a storage server. It’s a server backed by a regular spinning rust style disk, but as such is far cheaper than anything offered by any other company. A database server with 1 TB of storage, 4 GB RAM and 2 CPU’s (the largest storage instance they have) is only $40 a month.

A believer in infrastructure as code I was very keen to move to one of the cloud servers and picked Vultr based on the storage instance and cost. I started importing the searchcode database into the storage instance. However about 3/4 the way through I received an email from Vultr support saying that I was exceeding the DiskIO of the storage instance using a sustained 125 IOPS. Fair enough, and I would have been willing to throttle it back if I could gain some assurance about the terms of service. Sadly Vultr lived up to their reputation of having poor support and I am yet to hear back. I terminated the instance and started looking again. They do offer dedicated instances and the prices are actually not too bad, but did not have enough disk space for my needs. I still use them for spinning up test stacks as they have a Sydney data centre which for me has lower network latency and offsets the slower creation time.

I realised that the only real option for something like searchcode (which is fully bootstrapped) is that I need as much power as possible and the lowest price. I also don’t mind getting my hands dirty and can deal without the support. Lastly I need to be able to abuse the machine as I see fit and the only way to do that is go dedicated.

Thankfully Hetzer has a very nice server auction house. You can browse through and pick up used servers for a considerable discount over a new order. In addition you don’t have to pay the setup fee. To avoid the database corruption issue I experienced previously I went shopping for a ECC RAM server for the database and quickly found a tidy machine with 32 GB RAM and enough disk space. For the frontend I just picked up the cheapest 32 GB machine I could find which interestingly was similar to my previous frontend machine but with 32 GB of RAM. Lastly I went looking for something to server as the backend and the indexer. I quickly realized that I could combine both the machines into one and save a few dollars. With this saving I was able to overlook the initial setup fee and went for two machines with 32 GB of RAM and 500 GB SSD’s. Since Hetzner cannot spin servers up and down like a cloud provider I hard-coded the details into my fab file for building the stack but otherwise everything deploys as though I was using a cloud provider. Only it costs considerably less and should be much much faster.

searchcode current

The results?

Well as mentioned the previous instances were sitting around a load average of 3.0+ most of the time. The new backend/indexer boxes are sitting at a load average of 0.1+ which is a massive improvement. The database and frontend are similarly loaded. The DNS at this point is still flipping over for some so its not serving all results yet but I cannot imagine the load rising beyond 1.0+ for everything.

With the hardware decisions discussed lets dive into the software starting with the data storage. MySQL has always been my database of choice simply because I am more familiar with it than any other database. I had previously toyed with migrating to Postgresql but decided against it simply because there are other things I should focus my time on that actually provide real value. As such I have stuck with that for the latest searchcode version however one change I did make was to upgrade to 5.7 so I can leverage the native JSON data type. Otherwise the only change was to modify the MySQL config to take advantage of the power of the box as per best practices.

The backend machines have Nginx set to listen to incoming connections from the frontend. They pass requests back to Gunicorn/Django which performs the appropriate action. Where possible the result using Django/Memcached is served directly, if not a request to Sphinx may be made for search results followed by a lookup to the database, or a direct request to the database. Gunicorn is configured to have four worker processes per box. Sphinx is configured using a distributed index with 4 agents running on each box and one box as the master. Memcached is currently configured with 8 GB of RAM for each backend instance but is not pooled together.

The frontend machine is configured using Varnish with 24 GB of the RAM allocated to cache. The remaining 8 GB is deliberately left over for use by OS caching and Nginx. Nginx is the frontend to the whole system providing SSL termination and proxying back to Varnish. I did briefly consider using HAProxy for this role, but since I was already familiar with Nginx and at the scale searchcode currently operates at Nginx was a better choice. If in time there is much greater load moving to HAProxy is something that will be considered.

Whats next? Well the future of searchcode.com at this point is to leverage the new machines to increase the size of the index and make it refresh the code more quickly. I have a few plans on how to do so and will release details in the next update. If you have read this far you are a beast! Thanks for the support and feel free to email with any further questions.

searchcode server released

searchcode server the downloadable self hosted version of searchcode.com is now available. A large amount of work went into the release with a variety of improvements based on feedback from the general beta releases.

searchcode server

searchcode server has a number of advantages over searchcode.com that will eventually be back-ported in. The full list of things to check out is included below,

  • New Single Page Application UI for smooth search experience
  • Ability to split on terms so a search for “url signer” will match “UrlSigner”
  • Massively improved performance 3x in the worst case and 20x in the best
  • Configurable through UI and configuration
  • Spelling suggestion that learns from your code

A few things of note,

  • Java 8 application built using Lucene and Spark Framework
  • Designed to work any server. The test bench server is a netbook using an Intel Atom CPU and searches return in under a second
  • Scales to Gigabytes of code and thousands of repositories
  • Works on Linux, OSX and Windows

Be sure to check it out!

searchcode server
searchcode server
searchcode server

searchcode server

A month or so ago I started collection emails on searchcode.com to determine if there was enough interest in a downloadable version of searchcode. The results were overwhelmingly positive. The email list grew far beyond what I would have expected, and this was in the first month. As such I have been working in this downloadable version of searchcode which will probably be called searchcode server.

Progress has been reasonably straight forward consider that searchcode.com is written using mostly Python and searchcode server is mostly Java. The main reason for choosing Java is that I really wanted searchcode server to be a self contained application which could be downloaded and run without the configuration and setup of additional services.

At present it is surprisingly workable. You can input repositories (git only at this point) to be indexed and after a short amount of time they will be searchable via the main interface. A few screenshots are included at the end of this post for those curious.

There is still time to sign up and be one of the first to receive access. Being on the sign up list will also give you a discount when it is actually released if you need something greater than the community edition. To register your interest use the form below, or visit the searchcode server product page.

This has been released and you can now get the actual product.

Screen Shot 2015-12-29 at 1.04.54 pm

Screen Shot 2015-12-29 at 1.11.30 pm

searchcode local

I am going to copy the searchcode pitch itself below quickly before explaining it a bit further.

“searchcode offers powerful code search over billions of lines of open source code. Imagine what it could do with your private repositories.

There have been requests to offer a downloadable version of searchcode. Given enough interest a downloadable hostable version of searchcode will be offered. Register your email below to register your interest.

Note that there would be a free Community version available for all users as well as paid version offering support. Functionality would remain the same across all versions. This would be similar to how Octopus Deploy is offered.”

In short I am considering writing a hostable version of searchcode. Most likely it would consist of a Java application one could download and use to get similar results to searchcode.com itself (probably at smaller scale however).

Rather then actually commit to several months worth of work however I have put a message on searchcode asking for those interested to register their interest. If it sounds like something you would like please register.

I have no signup target numbers in mind or product costs etc… but I suspect given over 100 sign-ups I will actually go forth and implement.

I should note that this is something I have been highly resistant towards for a long time as I do not really want to get into enterprise sales cycles.

Anyway in a months time if there are enough signups I will push forward and release an initial version to those who have signed up. Any one who does will get free access on the beta list and discounts on the final version (should they need something more powerful then the community edition).

This has been released and you can now get the actual product.

Go Forth and Search

A very fast update. At the request of the excellent Lars Brinkhoff via GitHub I have added in the language Forth to be one of the supported languages inside searchcode.

An example search which shows this working would be the following https://searchcode.com/?q=forth&loc=0&loc2=10000&lan=181

I had to solve a number of interesting problems inside searchcode to support this change. For pragmatic reasons the way searchcode identifies what language any piece of code is written in is to run it though CLOC (Count Lines Of Code). Written in perl it does a reasonably good job of pulling out metadata for any given piece of code. However since my perl ability is poor at best submitting a patch to support forth was not going to be an option.

Instead I ended up adding an additional few checks at the end of the indexing pipeline to identify code that probably should have been categorised as forth and if so change the classification. It has been designed to be extensible so if other languages come up that are not currently identified it should be possible to add them as well.

The only other change of note for searchcode is that I fixed the SSL certificate chain and now you can curl the API again. This was an issue caused by Google throwing its weight around and outlawing SHA1 certificates. When updating to fix this I neglected to fix the chain as well. Oddly browsers worked without issue whereas curl and Python requests broke.

searchcode the path to profitability

One of the things that has always bothered me about searchcode.com was that it never generated any money. Not a huge problem in itself as a side project, but the costs to run it are not insignificant due to the server requirements. I had looked into soliciting donations but I considered this highly unlikely to produce enough revenue to cover costs considering that sites such as gwern.net was unable to make enough to cover even basic costs through patreon (although since a recent HN post this has jumped from around $20 a month to over $150).

This had caused me back in the early days to use buysellad’s to attempt to cover some costs. While this certainly helped there was usually not enough revenue due to the way the ads are sold. The issue with buysellads is that you have to pitch your website as a good place to sell ads against. This is not something I had any great desire to do. Simply put if I am going to spend my time marketing something its going to be something that is more directly marketable then an advertising funded website. The other issue is that buysellads does not work with HTTPS which became a deal breaker for myself once Google announced that they were going to use it as a signal for ranking.

This lead me to just a few months ago considering shutting down the site. However while doing some poking around I noticed a newish advertising platform called Carbonads. Invite only I decided to email them with my pitch. Being a developer/designer focused ad platform it seemed like a natural fit. After a bit of back and forth I can now happily say that searchcode.com is running carbonad’s. I have no numbers to report at this time but I am very hopeful based on some estimates made with carbon ads that I should be able to push searchcode to cover costs and hopefully produce some profit, all of which will be used to improve the service.