May 17, 2009

Tuning LINQ performance with Mr. P and Mr. S

I thought I'd take a second look at the Mr. P and Mr. S problem, which I'd posted more than a couple of years ago. The last time I tried it, I wasn't successful. I had a strategy to solve it, but somehow I just couldn't translate it into code.

I've been programming a lot with C# lately, and decided to use LINQ to solve the puzzle. Although not very concise, compared to the Python and Haskell solutions out there, it does print out the right answer. After you've tried to solve it yourself, you can have a look at my solution here.

There's something special about LINQ queries. All LINQ queries are deferred, which means that they aren't executed until they are accessed. Also, they are re-executed when the execution context changes. Say we have a list of numbers, and a query on it like so :


var numbers = new List<int>();
var query =
from i in numbers
select i;


The query hasn't been executed yet. We add a few numbers to the list, and compare the counts of the list and the query.


numbers.Add(0);
numbers.Add(1);
numbers.Add(2);

// 3 elements in list, 3 in the query
Assert.AreEqual(numbers.Count, localDeferredQuery.Count());


The test passes. LINQ queries are "live", very much like functions. Usually, this is a good thing, as no operation is performed until it is actually needed. However, there are exceptions. For example, I used these three ranges -


public static IEnumerable OddRange(int stop) // returns odd numbers upto "stop"
{
for (int i = 1; i < stop; i+=2) yield return i;
}

public static IEnumerable EvenRange(int stop) // returns even numbers upto "stop"
{
for (int i = 2; i < stop; i+=2) yield return i;
}

public static IEnumerable Range(int stop) // returns all numbers upto "stop"
{
for (int i = 0; i < stop; ++i) yield return i;
}


To define the Deferred() and Immediate() functions below:


public void Deferred()
{
var all = Range(limit);
var even = from e in EvenRange(limit) where all.Contains(e) select e;
var odd = from o in OddRange(limit) where !even.Contains(o) select o;

var query = from q in odd select q;

foreach(var i in query) { var j = i+1; }
}

public void Immediate()
{
var all = Range(limit);
var even = (from e in EvenRange(limit) where all.Contains(e) select e) .ToArray();
var odd = (from o in OddRange(limit) where !even.Contains(o) select o) .ToArray();

var query = (from q in odd select q).ToArray();

foreach(var i in query) { var j = i+1; }
}


all, even and odd are three sub queries, each using the previous one. The Immediate() function only differs from Differed() due it's forced execution of the subqueries with ToArray(). However, Immediate() performs much better than Deferred(). I knew LINQ operators are actually euphemism for functions, and that iterator blocks are actually exploded by the compiler into a lot of code. But Deferred() was waaaayy slower than Immediate(), and the time taken would increase exponentially with the value of limit. This couldn't be just some extra code.

I posted a query on stackoverflow, and it did not disappoint. It is quite obvious in hindsight. This statement -

var odd = (from o in OddRange(limit) where !even.Contains(o) select o).ToArray();


in deferred mode, turns out to be pretty expensive indeed. It contains a call to even.Contains(o). While in the immediate mode this is an O(n) operation, in deferred mode, the sequence of calls looks like this -


odd --> even -+-> EvenRange()
|
+-> all --> Range()


A simple O(n) operation is now O(n3). We can do better than O(n), however, by using a Hashset.

var evenSet = new HashSet(even);
var odd = from o in OddRange(limit)
where !evenSet.Contains(o) select o; // Contains() is now O(1)


It doesn't get much better than this.