Showing posts with label python. Show all posts

November 18, 2012

Cheat at Letterpress

March 20, 2011

A distributed pipeline for processing text

Usually, Hadoop is the way to go.

However, I have joined a project that has been underway for more than a year, and its processing steps have mostly been written ad hoc, as standalone shell, Python, and Java programs. Converting each of these to mappers and reducers would have been an arduous task.

I decided to re-write the pipeline in SCons. In many ways this pipeline resembles a conventional build: there are dependencies between stages, and newer functionality/processing is usually added to the later stages of the pipeline. Luckily, SCons takes in regular Python functions as "Builders", which I hooked into XML-RPC functions, and we soon had SCons running the pipeline on multiple servers (just five, actually - that's all we'd get for our pipeline). The file-system is an NFS share, which simplifies things a great deal, since every server sees the same paths.
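A minimal sketch of the idea, not the actual pipeline code. It assumes Python 3's xmlrpc modules (a 2011 setup would have used the Python 2 equivalents), and the names count_tokens, serve, make_remote_action, and port 8000 are all made up for illustration. Each server exposes a processing step over XML-RPC, and the build side wraps the remote call in a function with the (target, source, env) signature that SCons expects of a Python action. Because the files live on a shared NFS mount, only paths travel over the wire.

```python
import itertools
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def count_tokens(source_path, target_path):
    """Hypothetical processing step: write the whitespace-token count of
    source_path into target_path. Returns 0 on success, as SCons expects."""
    with open(source_path) as src:
        n_tokens = len(src.read().split())
    with open(target_path, "w") as dst:
        dst.write(f"{n_tokens}\n")
    return 0


def serve(host="0.0.0.0", port=8000):
    """Run on each worker machine: expose the step over XML-RPC (blocks forever)."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(count_tokens)
    server.serve_forever()


def make_remote_action(server_urls, method_name):
    """Return a SCons-style action that forwards each build step to the
    next server in round-robin order."""
    urls = itertools.cycle(server_urls)

    def action(target, source, env):
        proxy = xmlrpc.client.ServerProxy(next(urls))
        # Paths are valid on every server because of the shared file system.
        return getattr(proxy, method_name)(str(source[0]), str(target[0]))

    return action
```

In an SConstruct this could be wired up with something like env.Append(BUILDERS={"CountTokens": Builder(action=make_remote_action(urls, "count_tokens"))}), where urls is the list of worker addresses, and then run with scons -j 5 so that several steps are in flight at once. Round-robin is the simplest possible scheduling choice here, not necessarily what the real pipeline did.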

Python, however, has been a bit on the slower side. Also, invoking the Java VM every time you need to process a file feels like too much overhead. So while the pipeline is functional, and processes the corpus much faster than before (5-6 hours vs 20+ earlier), we are considering re-writing the XML-RPC server in Java. The standalone programs can be easily ported to the server implementation, and invoking shell scripts from Java shouldn't be very different from invoking them from Python - things should only improve. I wonder, however, if I should have written this in Hadoop to start with.

March 19, 2009

On git, gitosis, and python issues on Windows Vista

Some browsing, debugging, and IRC chats later, I have managed to set up a git repository on Windows Vista using cygwin, with a few unexpected hiccups. I will try to repeat the process on a virgin setup to come up with a more authoritative flowchart of how to go about things. For now, I'll just list down the issues I faced.

I used gitosis to host git repositories over ssh. It's pretty elegant, really. Administering gitosis is limited to managing a configuration file and user keys, which are themselves kept in a gitosis-hosted git repository. Neat.

These two articles

cover everything you need to do to host git repos. The first link should be straightforward. However, since the second link is for Linux users, there are a few deviations for Windows. Try to follow the link; if you face difficulties, refer to the tips below.

Installing gitosis

Log in to an administrator account. Do a

cd ~/src
git clone git://eagain.net/gitosis.git
cd gitosis
python setup.py install
If the last command fails thusly
Traceback (most recent call last):
  File "setup.py", line 2, in ?
    from setuptools import setup, find_packages
ImportError: No module named setuptools
you may need setuptools from here. Scroll to the bottom, download an egg, and do
sh setuptools-0.6c9-py2.5.egg
Then run python setup.py install again. After you have installed gitosis successfully, run the following command
chmod +r /usr/lib/python2.5/ -R
This was the first wtf. The egg we just downloaded gets installed with administrator ACLs; the above command ensures that everyone can read the installed eggs.

Setting up gitosis

You need to add the git repository user `git' the Windows way (the adduser command in the second link will not work). Once you've done that, make sure you've run the following commands

# in the new 'git' user's account
ssh-user-config

# from the admin account
# (domain users need to add '-d domain_name' to the mk* commands)
mkpasswd.exe -l > /etc/passwd
mkgroup.exe -l > /etc/group
Also, the command
sudo -H -u git gitosis-init < /tmp/id_rsa.pub
will need to be run as
# git user's account
gitosis-init < /tmp/id_rsa.pub
Of course, this assumes that you've copied your key to the /tmp folder.

The rest of the write-up should be fine, except for one thing. When you try to `git push' your repo configuration, you may see an error like:
Counting objects: 5, done.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 307 bytes, done.
Total 3 (delta 1), reused 0 (delta 0)
To git@localhost:gitosis-admin.git
989f371..0616ebb master -> master
: invalid optione: line 2: set: -
set: usage: set [--abefhkmnptuvxBCHP] [-o option] [arg ...]
Second wtf. Damn those pesky windows newlines. Do a
# `git' user's account
dos2unix ~/repositories/gitosis-admin.git/hooks/post-update
and you should be good to go.
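For the curious, here is roughly what that dos2unix call does, sketched in Python. Judging by the garbled error above, the hook's second line is a `set -e' that ends in a Windows carriage return, so the shell sees the option as -e followed by a stray \r and rejects it. The function name strip_carriage_returns is made up for this example; it rewrites the file in place and leaves the file's permissions alone, so the hook stays executable.

```python
from pathlib import Path


def strip_carriage_returns(path):
    """Convert CRLF line endings to LF in place.

    Returns the number of CRLF sequences that were replaced.
    """
    hook = Path(path)
    data = hook.read_bytes()
    fixed = data.replace(b"\r\n", b"\n")
    if fixed != data:
        # Overwriting an existing file keeps its mode bits.
        hook.write_bytes(fixed)
    return data.count(b"\r\n")
```

Calling it on ~/repositories/gitosis-admin.git/hooks/post-update (after expanding the ~ yourself, for example with os.path.expanduser) should have the same effect as the command above.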

This should do it. If you still run into problems, let me know. If you find solutions to them, post them up so that I may link to you. And when you do have a git repository up, give me the url to fork from :)

Edit: fixed grammar