March 20, 2011

A distributed pipeline for processing text

Usually, Hadoop is the way to go.

However, I have joined a project that has been underway for more than a year, and its processes have been written in a mostly ad-hoc way - shell scripts, Python, and standalone Java programs. Converting each of these to mappers and reducers would have been an arduous task.

I decided to rewrite the pipeline in SCons. In many ways, this pipeline resembles a conventional build: there are dependencies between stages, and newer functionality usually gets added to the later stages of the pipeline. Luckily, SCons accepts regular Python functions as "Builders", which I hooked into XML-RPC calls, and we soon had SCons running the pipeline on multiple servers (just five, actually - that's all we'd get for our pipeline). The file system is an NFS share, which simplifies things a great deal.
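A minimal sketch of the arrangement, assuming each worker exposes a process(src, dst) method over XML-RPC; the worker URLs, the Tokenize stage, and the directory layout here are made up for illustration, not our actual setup:

    # SConstruct (sketch)
    import xmlrpclib

    # Hypothetical worker hosts; in practice these are our five servers.
    WORKERS = ["http://worker%d:8000" % i for i in range(1, 6)]

    def remote_tokenize(target, source, env):
        # All paths live on the shared NFS mount, so only path names travel
        # over the wire; the worker reads and writes the files itself.
        worker = WORKERS[hash(str(target[0])) % len(WORKERS)]
        proxy = xmlrpclib.ServerProxy(worker)
        ok = proxy.process(str(source[0]), str(target[0]))
        return 0 if ok else 1  # non-zero tells SCons the step failed

    env = Environment()
    env.Append(BUILDERS={"Tokenize": Builder(action=remote_tokenize)})

    # One stage of the pipeline: each output depends on its input, so SCons
    # reruns a step only when its source has changed.
    for doc in Glob("corpus/raw/*.txt"):
        env.Tokenize("corpus/tokenized/" + doc.name, doc)

Because SCons tracks the dependencies, re-running the pipeline after adding a later stage only rebuilds what is actually out of date.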

Python, however, has been on the slower side, and invoking the Java VM every time a file needs processing feels like too much overhead. So while the pipeline is functional and processes the corpus much faster than before (5-6 hours versus 20+ earlier), we are considering rewriting the XML-RPC server in Java. The standalone programs can be ported to the server implementation easily, and invoking shell scripts from Java shouldn't be very different from invoking them from Python - things should only improve. I wonder, however, if I should have written this in Hadoop to start with.
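For concreteness, this is roughly the shape of the Python worker each server runs today; the port, method name, and tokenizer jar are illustrative:

    # worker.py (sketch)
    import subprocess
    from SimpleXMLRPCServer import SimpleXMLRPCServer

    def process(src, dst):
        # Every call pays for a fresh JVM start-up; a long-running Java
        # server could keep the processing code loaded and skip this cost.
        ret = subprocess.call(["java", "-jar", "tokenizer.jar", src, dst])
        return ret == 0

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(process, "process")
    server.serve_forever()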