searchcode server

A month or so ago I started collection emails on searchcode.com to determine if there was enough interest in a downloadable version of searchcode. The results were overwhelmingly positive. The email list grew far beyond what I would have expected, and this was in the first month. As such I have been working in this downloadable version of searchcode which will probably be called searchcode server.

Progress has been reasonably straight forward consider that searchcode.com is written using mostly Python and searchcode server is mostly Java. The main reason for choosing Java is that I really wanted searchcode server to be a self contained application which could be downloaded and run without the configuration and setup of additional services.

At present it is surprisingly workable. You can input repositories (git only at this point) to be indexed and after a short amount of time they will be searchable via the main interface. A few screenshots are included at the end of this post for those curious.

There is still time to sign up and be one of the first to receive access. Being on the sign up list will also give you a discount when it is actually released if you need something greater than the community edition. To register your interest use the form below, or visit the searchcode server product page.

 



 

Screen Shot 2015-12-29 at 1.04.54 pm

Screen Shot 2015-12-29 at 1.11.30 pm

Pi-Hole for Ubuntu 14.04

Because of the fact that I personally work for an ad supported company and that searchcode.com is currently supported via third party advertising I tend to keep an eye on the state of ad blockers on the web.

Most people probably know about adblockplus and other browser extensions however there are other ways to block ad’s on ones network. One that I had previously read about was setting up your own Bind9 server on a server and adding custom rules to block them at a DNS level. Other the last week I had been playing around with this but since I am not a bind expert I was unable to get it working in a satisfactory way.

However the following article about blocking all ads using a Raspberry Pi appeared on my radar. I don’t have a Raspberry Pi, but I did have an old netbook (Asus Eee 1000HA) lying around that I was trying to find some use for. I had previously set it up with Ubuntu 14.04 and had it running under the house running OwnCloud as a test. I thought it might be a good candidate for this sort of thing.

The install was pretty easy and as simple as following the guide on http://pi-hole.net/ It says that you need to be using Raspbian but works perfectly for me. Thankfully I have a reasonably good router (D7000 which I can highly recommend) and once the setup was done I pointed its DNS at the new server and sat back for things to start working. It did. Flawlessly.

I think the advertising industry is in for a rude shock. When these devices are as cheap as this and as simple to install its only a matter of time before they become a built in to the router itself or a plug and play.

searchcode local

I am going to copy the searchcode pitch itself below quickly before explaining it a bit further.

“searchcode offers powerful code search over billions of lines of open source code. Imagine what it could do with your private repositories.

There have been requests to offer a downloadable version of searchcode. Given enough interest a downloadable hostable version of searchcode will be offered. Register your email below to register your interest.

Note that there would be a free Community version available for all users as well as paid version offering support. Functionality would remain the same across all versions. This would be similar to how Octopus Deploy is offered.”

In short I am considering writing a hostable version of searchcode. Most likely it would consist of a Java application one could download and use to get similar results to searchcode.com itself (probably at smaller scale however).

Rather then actually commit to several months worth of work however I have put a message on searchcode asking for those interested to register their interest. If it sounds like something you would like please register.

I have no signup target numbers in mind or product costs etc… but I suspect given over 100 sign-ups I will actually go forth and implement.

I should note that this is something I have been highly resistant towards for a long time as I do not really want to get into enterprise sales cycles.

Anyway in a months time if there are enough signups I will push forward and release an initial version to those who have signed up. Any one who does will get free access on the beta list and discounts on the final version (should they need something more powerful then the community edition).

 



Go Forth and Search

A very fast update. At the request of the excellent Lars Brinkhoff via GitHub I have added in the language Forth to be one of the supported languages inside searchcode.

An example search which shows this working would be the following https://searchcode.com/?q=forth&loc=0&loc2=10000&lan=181

I had to solve a number of interesting problems inside searchcode to support this change. For pragmatic reasons the way searchcode identifies what language any piece of code is written in is to run it though CLOC (Count Lines Of Code). Written in perl it does a reasonably good job of pulling out metadata for any given piece of code. However since my perl ability is poor at best submitting a patch to support forth was not going to be an option.

Instead I ended up adding an additional few checks at the end of the indexing pipeline to identify code that probably should have been categorised as forth and if so change the classification. It has been designed to be extensible so if other languages come up that are not currently identified it should be possible to add them as well.

The only other change of note for searchcode is that I fixed the SSL certificate chain and now you can curl the API again. This was an issue caused by Google throwing its weight around and outlawing SHA1 certificates. When updating to fix this I neglected to fix the chain as well. Oddly browsers worked without issue whereas curl and Python requests broke.

Exporting Documents from KnowledgeTree 3.7.0.2

I was recently tasked with exporting a large collection of documents from KnowledgeTree (KT) for a client. The collection was too large to use the download all functionality and too wide to attempt to export each folder individually.

I had played around with the WebDav connection that KT provides but it either didn’t work or was designed deliberately to not allow exporting of the documents.

I looked at where the documents were  stored on disk but KT stores them as numbered files in numbered directories sans extension or folder information.

Long story short I spent some time poking through the database to identify the tables which would contain the correct metadata which would allow me to rebuild the tree using a proper filesystem. For record the tables required are the following,

  • folders – Contains the folder tree. Each entry represents a folder and contains its parent folder id.
  • documents – Contains the documents that each folder contains. Knowing the folders id you can determine what documents live in that folder.
  • document_content_version – Contains the metadata required to get the actual file from disk. A 1 to 1 mapping between document id and this table is all that is required.

That said here is a short Python script which can be used to rebuild the folders and documents on disk. All that is required is to ensure that Python MySQLdb is installed and to set the database details. Depending on your KT install you may need to change the document location. Where  the script is run it will replicate the folder tree containing the documents preserving the structures, names and extensions.

Keep in mind this is a fairly ugly script abusing global variables and such. It is also not incredibly efficient, but did manage to extract 20GB of files in my case in a little under 10 minutes.

import MySQLdb
import os
import shutil

# KnowledgeTree default place to store documents
ktdocument = '/var/www/ktdms/Documents/'

conn = MySQLdb.connect(user='', passwd='',db='', charset="utf8", use_unicode=True)
cursor = conn.cursor()

# global variables FTW
cursor.execute('''select id, parent_id, name from folders;''')
allfolders = cursor.fetchall()

cursor.execute('''select id, folder_id from documents;''')
alldocuments = cursor.fetchall()

cursor.execute('''select document_id, filename, storage_path from document_content_version;''')
document_locations = cursor.fetchall()


# create folder tree which matches whatever the database suggests exists
def create_folder_tree(parent_id, path):
    directories = [x for x in allfolders if x[1] == parent_id]
    for directory in directories:
        d = '.%s/%s/' % (path, directory[2])
        
        print d
        os.makedirs(d)
        # get all the files that belong in this directory
        for document in [x for x in alldocuments if x[1] == directory[0]]:
            try:
                location = [x for x in document_locations if document[0] == x[0]][0]
                print 'copy %s%s %s%s' % (ktdocument, location[2], d, location[1])
                shutil.copy2('%s%s' % (ktdocument, location[2]), '%s%s' % (d, location[1]))
            except:
                 print 'ERROR exporting - Usually due to a linked document.'

        create_folder_tree(parent_id=directory[0], path='%s/%s' % (path, directory[2]))

create_folder_tree(parent_id=1, path='')

Decoding CAPTCHA’s Handbook

Some time ago I wrote an article about Decoding CAPTCHA’s which has become what appears to be the first resource most people encounter when searching for information in the decoding CAPTCHA space.

I had continued to write about CAPTCHA’s over the years with posts scattered around the web. A while ago I started to consolidate all of my content on this blog and realised that I had considerably more CAPTCHA related articles then I thought. Some were in an unfinished or unpublished state. I had considered posting them all online but instead decided to polish it all up into a much better resource and publish it as a book.

The book is now online and available for sale. Its not what you would call a top seller but has produced enough sales to offset some hosting costs for the blog which was one goal. I also wanted to test the waters when trying to sell an info product. Info products according to Amy Hoy and Rob Walling are an excellent platform to start learning how to build recurring revenue online. Certainly one thing I have learnt is that something borderline unethical such as the book will never become a great seller. Simply put the audience is unlikely to convert to be a high sales business because of the shady nature of the content.

Anyway for those interested in purchasing a copy you can either visit the Decoding CAPTCHA’s page or use the below.


Decoding CAPTCHA's Book
Looking for a practical guide to CAPTCHA decoding? All About CAPTCHA’s. This eBook will teach you how to identify weaknesses and exploit CAPTCHA’s from beginning to end.

Buy now using LeanPub

Buy now using Gumroad

C# as a Language from old Google+ Post

The more I use C# as a language for writing things the more I am convinced that its approach really is the best language approach out there.

The unit test support is excellent which allows development speed to be just as fast as any dynamic language (Python, PHP, Perl).

The static typing catches so many issues before you get to runtime and allows sweeping changes without breaking things.

Unlike Java it has the var keyword (saves time and improves readability) and so many more useful functions which yes you can replicate but are just built in and work correctly.

Then you get to the really good stuff. LINQ is awesome. The lazy loading allows you to implement a repository pattern over your database which is just awesome. Set up the basic select * from then add extension methods allowing you to chain whatever you need, EG

from person in _dbContext.GetPerson().ByUserName(username).ByPassword(password);

100% elegant, easy to test, easy to write, easy to read and understand and generally works exactly as you would expect without any hidden gotchas. And because its lazy it doesn’t chew resources sucking back everything from the database.

You can use functional programming techniques if you wish, and with the new async decorators you can work in a node.js style if you with, with static typing and all existing library support.

Or you can continue to work in a C like manner, or mix it up with objects, procedural code and functional.

I switched back to Java not that long ago to write a simple server using Jetty and even with things like Guice (best DI implementation I have used so far) and Guava it was still painful. Less painful, but I really felt that the compiler was fighting me from doing things in an elegant manner most of the time. Even adding the “var” keyword would improve Java in a massive way. Add some functional programming in there and I would be pretty happy.

I just wish C# would run on the JVM as I would use it for pretty much everything in a heartbeat. As it is the Mono support is missing the stuff I really want and isn’t as seamless as the experience should be. A pity really as C# really is in my experience the nicest language to work today that’s production ready.

searchcode the path to profitability

One of the things that has always bothered me about searchcode.com was that it never generated any money. Not a huge problem in itself as a side project, but the costs to run it are not insignificant due to the server requirements. I had looked into soliciting donations but I considered this highly unlikely to produce enough revenue to cover costs considering that sites such as gwern.net was unable to make enough to cover even basic costs through patreon (although since a recent HN post this has jumped from around $20 a month to over $150).

This had caused me back in the early days to use buysellad’s to attempt to cover some costs. While this certainly helped there was usually not enough revenue due to the way the ads are sold. The issue with buysellads is that you have to pitch your website as a good place to sell ads against. This is not something I had any great desire to do. Simply put if I am going to spend my time marketing something its going to be something that is more directly marketable then an advertising funded website. The other issue is that buysellads does not work with HTTPS which became a deal breaker for myself once Google announced that they were going to use it as a signal for ranking.

This lead me to just a few months ago considering shutting down the site. However while doing some poking around I noticed a newish advertising platform called Carbonads. Invite only I decided to email them with my pitch. Being a developer/designer focused ad platform it seemed like a natural fit. After a bit of back and forth I can now happily say that searchcode.com is running carbonad’s. I have no numbers to report at this time but I am very hopeful based on some estimates made with carbon ads that I should be able to push searchcode to cover costs and hopefully produce some profit, all of which will be used to improve the service.

Filetree Listing

Just a quick update to searchcode. A few small tweaks here and there, but the largest is that there is now a file tree listing option which will show the file tree for any project. An example would be going to this file and then clicking the “View File Tree” button on the top left.

An example screenshot of the result of this is included below.

filetree

Updates to searchcode.com

Just a quick post to list some updates to searchcode.com The first is a slight modification to the home page. A while ago I received an email from the excellent Christian Moore who provided some mock-ups of how he felt it should look. I loved the designs, but was busy working on other issues. Thankfully however in the last week or so I found the time to implement his ideas and the result is far more professional to me.

searchcode_update1

It certainly is a large change from the old view but one that I really like as it is very clean. The second update was based on some observations I had. I was watching a colleague use searchcode to try finding some logic inside the Thumbor image resizer project. I noticed that once he had the the file open he was trying to navigate to other files in the same project. Since the only way to do was was to perform a new search (perhaps with the repo option) I decided to add in a faster way to do this. For some projects there is now an instant search box above the code result which allows you to quickly search over all the code inside that repository. It uses the existing searchcode API’s (which you can use as well!) to do so. Other ways of doing this would include the project tree (another piece of functionality I would like to add) but this was done using already existing code so was very easy to implement. An example would be going to this result in pypy and searching for import.

searchcode_update2

As always I would love some feedback on this, but as always expecting none (par for the course).