Tuesday 30 April 2013

Lexical Closures

Today, I found the first confusing bug in the software I've been working on. It had me going for pretty much the whole day, not helped by the fact that my MonoDevelop also had a bug causing it to crash when the IKVM runtime threw a meaningless exception, meaning I couldn't actually get to see any of what was happening.

I've been using a fairly out of date MonoDevelop, as it's the one available from the ubuntu repos. Strangely enough, recent builds on the website are only available for Windows, OSX and Suse.
Luckily, there doesn't seem to be any proper reason for this, so I basically did what this page told me to and built MD 3 from source. So far, it works a treat and doesn't seem to feature the annoying bug I mentioned. Nice :)

Anyway, the bug. It seems really stupid now, so in order to feel less stupid I'm going to turn this into a post where I explain lexical closures in the context of C#.

So, what is a lexical closure? Why, it's a function object which captures the bindings of free variables (variables which aren't defined in the function or it's parameter list).
I'll elaborate:

Delegates

If you've written C# before, you've probably used function objects or delegates, as they call them:

// Declare the delegate type outside a function somewhere:
delegate int OneAdder(int a);

// Instantiate the delegate
OneAdder addOne = delegate(int a) { return a + 1; };

// Use the delegate to print a "4"
Console.WriteLine(addOne(3));


You can also use the lambda syntax to more succinctly create delegates:

OneAdder addOne = (a) => a + 1;

There's more; From .NET 3.5 there are some standard generic delegate types already predeclared:

Action<int> : void-returning function with an int parameter
Action<int, int> : void-returning function with two int parameters
Func<int, string> : string-returning function with an int parameter
Func<int, int, string> : string-returning function with two int parameters

As you might intuit, delegates are reference types just like classes and amenable to the same assignment/passing-around semantics. For instance, one could write a function:

static int CallMyDelegate(Func<int, int> func)
{
 return func(5);
}

Lexical Closure

Delegates in C# have 'Lexical closure'. That means that when you call this, passing it 5:

static Func<int, int> MakeAddingFunction(int number)
{
 return delegate(int a) { return a + number; };
}


What you get back is a function which adds 5 to things.
It's called a lexical closure, because the function 'closes over' the 'number' variable (which is in its lexical scope), so it can use it later. One can also say the delegate 'captures the lexical environment'.
Think on that for a while until you understand it. Try it out, even.

This whole delegate thing is really useful as it gives us a nice, succinct way to abstract some operation in terms of it's inputs and outputs, so it can be passed to code which can then apply it to some data without understanding the operation. C#'s "Linq" is based on this kind of thing. If you haven't heard of Linq go and read up on it once you've finished reading this.

Peculiarities

Lexical closure is about capturing the *bindings* of variables in scope, not the *values* of them.
The binding is the invisible thing which is given a name and associated with different values over the variable's lifetime.
This means that the delegate has full access to the variable, rather than just receiving some copy of its value.
Consider this code:

static void Main(string[] args)
{
 int someNumber = 3;
 Action<int> addThis = delegate(int a) { someNumber = someNumber + a; };
  
 addThis(10);
 
 Console.WriteLine(someNumber); // This prints "13"
}


That's right - it prints 13! The someNumber variable was modified by addThis.
Whenever you open up a function, a loop, or an if block, or anything with curly braces, that conceptually creates a new 'lexical environment' and a new set of bindings for all variables declared within those braces. In the case of loops, a new environment is created for each iteration, so you can bind to different variables each time round, even though they have the same name:

var setters = new List<Action<int>>();
var getters = new List<Func<int>>();

for(int i = 0; i < 3; i++)
{
 int closedOverVariable = 0;
 setters.Add(delegate(int newValue) { closedOverVariable = newValue; });
 getters.Add(() => closedOverVariable);
}

// Set the first int to 5
setters[0](5);

// Set the second int to 100
setters[1](100);

// Print the value of the first int : "5"
Console.WriteLine(getters[0]);

// Print the value of the second int : "100"
Console.WriteLine(getters[1]);

We just created three ints with no way to access them apart from via the appropriate getter and setter delegates. I always found that kinda spooky; It's like the ghost of a variable...

The Bug

So where did it all go wrong today? Here's a piece of code which demonstrates what I got wrong:

// A nice array of ints
int[] ints = new[] { 1, 2, 3 };

// A list of delegates
var delegates = new List<Func<int>>();

// Lets go through the ints, and make a 
// delegate for each one which returns the corresponding int.
foreach(int i in ints)
{
 delegates.Add (() => i);
}

// Now go through all three delegates, printing the return values.
// It should go "1, 2, 3", right?
Console.WriteLine (delegates[0]());
Console.WriteLine (delegates[1]());
Console.WriteLine (delegates[2]());


WRONG! It will go "3, 3, 3"!

That's because the foreach statement in C# doesn't create a new binding for 'i' at each iteration. It behaves more like 'i' is declared outside the loop and the value reassigned 3 times. Each new delegate we're creating is getting the same binding - accessing the same variable. I find this a little counter-intuitive, as I would expect there to be 3 different bindings created in 3 different lexical environments as we iterate over the ints, just as there would if 'i' had been declared inside the loop.
That last part is actually the way to fix this mess:

int[] ints = new[] { 1, 2, 3 };

var delegates = new List<Func<int>>();

foreach(int i in ints)
{
 // 3 different bindings will be created for 'tmp'
 // in 3 different lexical environments.
 // We're initializing it with the *value* of 'i'
 int tmp = i;

 // Close over the binding of 'tmp', rather than that of 'i'
 delegates.Add (() => tmp);
}

Console.WriteLine (delegates[0]());
Console.WriteLine (delegates[1]());
Console.WriteLine (delegates[2]());



And that's all, folks. Hopefully I've helped someone understand lexical closures better. Maybe I've even managed to steer someone away from falling into this trap.

Happy coding :)

No comments:

Post a Comment