September 11, 2011

Authenticating people over the telephone

I had never seen this done before.

I contacted a Bank of America representative over the telephone, and they needed to authenticate me before we could get down to business. I was asked my name, and was then asked to confirm a series of details. I was not asked to state these facts myself - the representative stated them and asked me to confirm each one.

A silly way to verify a person's identity, one would think. Anyone could pass themselves off as me if they had my name and my account number - and if they simply confirmed every detail that the CSR read out. However, the CSR did make one minor mistake while stating the details - my phone number was off by one digit, and I promptly corrected them. Looking back, it is quite clear that the mistake was deliberate. This way, I did not say any of my personal details out loud (which would have been terrible in public), and I pretty much authenticated myself by correcting the one random mistake that they chose to make.
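
In effect, this is a small challenge-response protocol: the representative injects a known error, and only the genuine customer is likely to catch it. Here is a minimal sketch of the idea in Python - the record fields, the digit-perturbation helper, and the verification step are all illustrative assumptions on my part, not the bank's actual procedure.

    import random

    # Hypothetical customer record; the field names are illustrative only.
    RECORD = {"account": "123456789", "phone": "5551234567"}

    def perturb(value):
        """Change one random digit in value - the deliberate mistake."""
        positions = [i for i, ch in enumerate(value) if ch.isdigit()]
        i = random.choice(positions)
        wrong = random.choice([d for d in "0123456789" if d != value[i]])
        return value[:i] + wrong + value[i + 1:]

    def challenge(record):
        """State the details with exactly one field deliberately wrong.
        Returns the stated details and the field that was altered."""
        field = random.choice(list(record))
        stated = dict(record)
        stated[field] = perturb(stated[field])
        return stated, field

    def verify(record, altered_field, flagged_field, correction):
        """A genuine customer flags the altered field and gives the true value."""
        return flagged_field == altered_field and correction == record[altered_field]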

March 20, 2011

A distributed pipeline for processing text

Usually, Hadoop is the way to go.

However, I have joined a project that has been underway for more than a year, and the processing has been written in a mostly ad-hoc way - shell scripts, Python scripts, and standalone Java programs. Converting each of these into mappers and reducers would have been an arduous task.

I decided to rewrite the pipeline in SCons. In many ways, this pipeline resembles a conventional build: there are dependencies between stages, and new functionality is usually added to the later stages of the pipeline. Luckily, SCons accepts regular Python functions as "Builders", which I hooked up to XML-RPC calls, and we soon had SCons running the pipeline on multiple servers (just five, actually - that's all we could get for our pipeline). The file system is an NFS share, which simplifies things a great deal.
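
As a minimal sketch of the setup (the worker hostnames, the tokenize_file RPC, and the round-robin dispatch are assumptions for illustration, not our actual pipeline), the SConstruct defines builder actions that hand the real work to remote workers over XML-RPC:

    # SConstruct - sketch of an SCons Builder whose action is an XML-RPC call.
    # (Python 2 standard library, as SCons used at the time.)
    import itertools
    import xmlrpclib

    WORKERS = ["http://worker%d:8000" % i for i in range(1, 6)]  # our five servers
    proxies = itertools.cycle([xmlrpclib.ServerProxy(url) for url in WORKERS])

    def tokenize(target, source, env):
        """Builder action: ask the next worker to process one file.
        All paths live on the NFS share, so every worker sees the same files."""
        ok = next(proxies).tokenize_file(str(source[0]), str(target[0]))
        return 0 if ok else 1  # non-zero tells SCons the step failed

    env = Environment(BUILDERS={"Tokenize": Builder(action=tokenize)})
    env.Tokenize("tokens/doc1.txt", "raw/doc1.txt")

Since SCons only reruns a builder when its sources are out of date, a tweak to a late stage no longer forces the whole corpus back through the early stages.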

Python, however, has been a bit on the slower side, and invoking the Java VM every time a file needs processing feels like too much overhead. So while the pipeline is functional and processes the corpus much faster than before (5-6 hours versus 20+ earlier), we are considering rewriting the XML-RPC server in Java. The standalone programs can easily be ported to the server implementation, and invoking shell scripts from Java shouldn't be very different from invoking them from Python - things should only improve. I wonder, however, whether I should have written this on Hadoop to begin with.
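
For reference, the Python side that would be ported is tiny - something along these lines, where the function name, port, and shell command are again illustrative assumptions. The subprocess call to java is exactly the per-file JVM start-up cost mentioned above:

    # worker.py - sketch of one XML-RPC worker (the part we may port to Java).
    import subprocess
    from SimpleXMLRPCServer import SimpleXMLRPCServer  # xmlrpc.server in Python 3

    def tokenize_file(source, target):
        """Run an existing standalone program over one file on the NFS share."""
        ret = subprocess.call(["java", "-jar", "tokenizer.jar", source, target])
        return ret == 0

    server = SimpleXMLRPCServer(("0.0.0.0", 8000))
    server.register_function(tokenize_file)
    server.serve_forever()

A long-running Java server would keep one JVM alive per machine instead of starting one per file, which is where most of the remaining overhead should disappear.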