Showing posts with label code. Show all posts
Showing posts with label code. Show all posts
November 18, 2012
March 20, 2011
A distributed pipeline for processing text
Usually, Hadoop is the way to go.
However, I have joined a project that has been underway for more than a year, and the processes have been written in mostly an ad-hoc way - shell, python, and Java standalone programs. Converting each of these to mappers and reducers would have been an arduous task.
I decided to re-write the pipeline in SCons. There are many things about this pipeline that represent a conventional build. There are dependencies, and usually newer functionality/processing is added to the later stages of the pipeline. Luckily, SCons takes in regular python functions as "Builders", which I hooked into xml-rpc functions, and we soon had SCons running the pipeline on multiple servers (just five, actually - that's all we'd get for our pipeline). The file-system is an NFS share, which simplifies things a great deal.
Python, however, has been a bit on the slower side. Also, invoking the Java VM every time you need to process a file feels like too much of an overhead. So while the pipeline is functional, and processes the corpus much faster than before (5-6 hours vs 20+ earlier), we are considering re-writing the XML-RPC server in Java. The standalone programs can be easily ported to the server implementation, and invoking shell scripts from Java shouldn't be very different from invoking them from python - things should only improve. I wonder, however, if I should have written this in Hadoop to start with.
However, I have joined a project that has been underway for more than a year, and the processes have been written in mostly an ad-hoc way - shell, python, and Java standalone programs. Converting each of these to mappers and reducers would have been an arduous task.
I decided to re-write the pipeline in SCons. There are many things about this pipeline that represent a conventional build. There are dependencies, and usually newer functionality/processing is added to the later stages of the pipeline. Luckily, SCons takes in regular python functions as "Builders", which I hooked into xml-rpc functions, and we soon had SCons running the pipeline on multiple servers (just five, actually - that's all we'd get for our pipeline). The file-system is an NFS share, which simplifies things a great deal.
Python, however, has been a bit on the slower side. Also, invoking the Java VM every time you need to process a file feels like too much of an overhead. So while the pipeline is functional, and processes the corpus much faster than before (5-6 hours vs 20+ earlier), we are considering re-writing the XML-RPC server in Java. The standalone programs can be easily ported to the server implementation, and invoking shell scripts from Java shouldn't be very different from invoking them from python - things should only improve. I wonder, however, if I should have written this in Hadoop to start with.
June 07, 2009
Using javascript to avoid the mouse and page scrolling
Here's the problem. It is not very easy to scroll a document when you're inside an input element. Arrow keys don't work, and Page Up/Page Down jump in big increments. What if you want to see just a few lines below the current element? Our clients hate to scroll. And they hate having to use the mouse. This just brings the two together.
FScroll is a JQuery plug-in which makes a page scroll to the currently focussed element, keeping it's position centered with respect to the document. This helps keep a bit of "context" around the currently focussed element - since it is centered, you can see a few elements both above and below the currently focussed element.
Here are the sources. And here's a page explaining it's usage in some detail. And oh, it does nested centering too. But it requires that the 'nesting' container have a css styling of
You may report issues here.
FScroll is a JQuery plug-in which makes a page scroll to the currently focussed element, keeping it's position centered with respect to the document. This helps keep a bit of "context" around the currently focussed element - since it is centered, you can see a few elements both above and below the currently focussed element.
Here are the sources. And here's a page explaining it's usage in some detail. And oh, it does nested centering too. But it requires that the 'nesting' container have a css styling of
position : relative
(in the demo page, the div enclosing the table is positioned relative). This was not strictly necessary, but it made the coding a bit easier. If you can't live with the styling restriction, let me know. I'll try to do what I can.You may report issues here.
May 17, 2009
Tuning LINQ performance with Mr. P and Mr. S
I thought I'd take a second look at the Mr. P and Mr. S problem, which I'd posted more than a couple of years ago. The last time I tried it, I wasn't successful. I had a strategy to solve it, but somehow I just couldn't translate it into code.
I've been programming a lot with C# lately, and decided to use LINQ to solve the puzzle. Although not very concise, compared to the Python and Haskell solutions out there, it does print out the right answer. After you've tried to solve it yourself, you can have a look at my solution here.
There's something special about LINQ queries. All LINQ queries are deferred, which means that they aren't executed until they are accessed. Also, they are re-executed when the execution context changes. Say we have a list of numbers, and a query on it like so :
The query hasn't been executed yet. We add a few numbers to the list, and compare the counts of the list and the query.
The test passes. LINQ queries are "live", very much like functions. Usually, this is a good thing, as no operation is performed until it is actually needed. However, there are exceptions. For example, I used these three ranges -
To define the
I posted a query on stackoverflow, and it did not disappoint. It is quite obvious in hindsight. This statement -
in deferred mode, turns out to be pretty expensive indeed. It contains a call to
A simple O(n) operation is now O(n3). We can do better than O(n), however, by using a Hashset.
It doesn't get much better than this.
I've been programming a lot with C# lately, and decided to use LINQ to solve the puzzle. Although not very concise, compared to the Python and Haskell solutions out there, it does print out the right answer. After you've tried to solve it yourself, you can have a look at my solution here.
There's something special about LINQ queries. All LINQ queries are deferred, which means that they aren't executed until they are accessed. Also, they are re-executed when the execution context changes. Say we have a list of numbers, and a query on it like so :
var numbers = new List<int>();
var query =
from i in numbers
select i;
The query hasn't been executed yet. We add a few numbers to the list, and compare the counts of the list and the query.
numbers.Add(0);
numbers.Add(1);
numbers.Add(2);
// 3 elements in list, 3 in the query
Assert.AreEqual(numbers.Count, localDeferredQuery.Count());
The test passes. LINQ queries are "live", very much like functions. Usually, this is a good thing, as no operation is performed until it is actually needed. However, there are exceptions. For example, I used these three ranges -
public static IEnumerableOddRange(int stop) // returns odd numbers upto "stop"
{
for (int i = 1; i < stop; i+=2) yield return i;
}
public static IEnumerableEvenRange(int stop) // returns even numbers upto "stop"
{
for (int i = 2; i < stop; i+=2) yield return i;
}
public static IEnumerableRange(int stop) // returns all numbers upto "stop"
{
for (int i = 0; i < stop; ++i) yield return i;
}
To define the
Deferred()
and Immediate()
functions below:
public void Deferred()
{
var all = Range(limit);
var even = from e in EvenRange(limit) where all.Contains(e) select e;
var odd = from o in OddRange(limit) where !even.Contains(o) select o;
var query = from q in odd select q;
foreach(var i in query) { var j = i+1; }
}
public void Immediate()
{
var all = Range(limit);
var even = (from e in EvenRange(limit) where all.Contains(e) select e) .ToArray();
var odd = (from o in OddRange(limit) where !even.Contains(o) select o) .ToArray();
var query = (from q in odd select q).ToArray();
foreach(var i in query) { var j = i+1; }
}
all
, even
and odd
are three sub queries, each using the previous one. The Immediate()
function only differs from Differed()
due it's forced execution of the subqueries with ToArray(). However, Immediate()
performs much better than Deferred()
. I knew LINQ operators are actually euphemism for functions, and that iterator blocks are actually exploded by the compiler into a lot of code. But Deferred()
was waaaayy slower than Immediate()
, and the time taken would increase exponentially with the value of limit
. This couldn't be just some extra code.I posted a query on stackoverflow, and it did not disappoint. It is quite obvious in hindsight. This statement -
var odd = (from o in OddRange(limit) where !even.Contains(o) select o).ToArray();
in deferred mode, turns out to be pretty expensive indeed. It contains a call to
even.Contains(o)
. While in the immediate mode this is an O(n) operation, in deferred mode, the sequence of calls looks like this -
odd --> even -+-> EvenRange()
|
+-> all --> Range()
A simple O(n) operation is now O(n3). We can do better than O(n), however, by using a Hashset.
var evenSet = new HashSet(even);
var odd = from o in OddRange(limit)
where !evenSet.Contains(o) select o; // Contains() is now O(1)
It doesn't get much better than this.
February 23, 2009
`calculable' textboxes using jquery
Here's some code to make your textboxes calculable. Demonstration here.
Yes, it uses eval. Doesn't evaluate on initial load. Provides no way to post the expression to the backend.
Expect improvements, and submit patches, once this goes into source control.
edit: Now available in a github repository.
(function($) { $.fn.calculable = function() { return this.each(function() { $(this).focus(function(e) { var expr = $(this).attr('calcExpr'); if (expr) { $(this).val(expr); } }); // end focus $(this).blur(function(e) { try { var expr = $(this).val(); var calculated = eval('with(Math){'+expr+'}'); $(this).attr('calcExpr', expr); $(this).val(calculated); } catch (e) { $(this).attr('calcExpr', ''); } }); // end blur }); // end each }; })(jQuery);
Yes, it uses eval. Doesn't evaluate on initial load. Provides no way to post the expression to the backend.
Expect improvements, and submit patches, once this goes into source control.
edit: Now available in a github repository.
Subscribe to:
Posts (Atom)