Wednesday, August 13, 2008 10:06 AM bart

Expression Tree Normalization – How Many Ways To Say “String == String”?

Recently I gave a talk at TechEd South Africa about Custom LINQ Providers. Unfortunately the session was only 60 minutes, so I had to cut down on the number of topics covered and focus on the bare essentials of query expression tree translation, including a primer to IQueryable<T>. However, the hidden slides – kept as reference material – contain a bunch of helpful tips to consider when implementing custom LINQ query providers and just one of those things that deserves some attention is the act of expression tree normalization.

But before we dig in, let me point out that this technique might or might not be applicable for particular implementations of query providers. There’s no silver bullet and the reason is very simple: when translating stuff (being expression trees) from one language into another, one core concern is matching up the semantics on both sides. How far this matching goes depends on a variety of factors such as the expressiveness of the target language (SQL, CAML, LDAP, …) but also how much you really care. Let me give you an example: do you care about the case sensitivity of string comparison? In other words, even though using C#’s == operator does a case-sensitive string compare for strings, does it mean that something like where p.ProductName == “Chai” also performs a case-sensitive string compare in SQL when using LINQ to SQL?

The thing I want to point out in this post is the fact that different people have different habits, mostly depending on personal background. To illustrate this, consider someone with a Java-background doing a string compare versus someone with a C++-background doing the same. It’s very likely that the former person will go for something of the shape s1.Equals(s2) while the latter person looks for a strcmp equivalent and might find s1.Compare(s2) == 0 a more natural way of thinking. Similarly, native C# and VB speakers will be so used to operator overloading (whether or not they really know this is a case of operator overloading is another point of discussion) and will write s1 == s2 or s1 = s2 respectively. But not only humans have habits, compilers have them too (and it’s far more difficult to change a compiler’s mind). The last two mentioned forms, s1 == s2 and s1 = s2, might be very similar, yet the generated code will be very different since VB uses a runtime library (Microsoft.VisualBasic.dll) that implements operations like string equality compare in such a way to preserve backwards compatibility with pre-.NET versions of the language.

All of this is quite abstract, a sample will clarify things:

using System;
using System.Linq.Expressions;

class S
{
    static void Main()
    {
        Test((s1, s2) => s1 == s2);
        Test((s1, s2) => s1.Equals(s2));
        Test((s1, s2) => String.Equals(s1, s2));
        Test((s1, s2) => String.Compare(s1, s2) == 0);
    }

    static void Test(Expression<Func<string, string, bool>> e)
    {
        Console.WriteLine(e);
        Delegate d = e.Compile();
        Console.WriteLine(d.DynamicInvoke("Bart", "John"));
        Console.WriteLine(d.DynamicInvoke("Bart", "Bart")); 
    }
}

Question is what this will print to the screen. Here is the answer:

image

What you’re seeing here is four different expression trees. Obviously all are LambdaExpression instances, but their Body is fundamentally different. The first one is a BinaryExpression with NodeType set to ExpressionType.Equal while the second and the third are MethodCallExpressions and the last one is a mix of the two (the top node will be a BinaryExpression with the left-hand side being a MethodCallExpression to String.Compare and the right-hand side a ConstantExpression with Value set to 0). And there are even more forms to consider (e.g. what about the last form with the == “arguments” swapped?).

To illustrate that VB has a different opinion about things, let’s do the same in VB 9.0:

Imports System
Imports System.Linq.Expressions

Module T

    Sub Main
        Test(Function(s1, s2) s1 = s2)
        Test(Function(s1, s2) s1.Equals(s2))
        Test(Function(s1, s2) String.Equals(s1, s2))
        Test(Function(s1, s2) String.Compare(s1, s2) = 0)
    End Sub

    Sub Test(e As Expression(Of Func(Of String, String, Boolean)))
        Console.WriteLine(e)
        Dim d = e.Compile()
        Console.WriteLine(d.DynamicInvoke("Bart", "John"))
        Console.WriteLine(d.DynamicInvoke("Bart", "Bart"))
    End Sub

End Module

producing the following result:

image

where the first expression is a MethodCallExpression with three arguments (the third one indicating case-sensitivity for the compare) to a method called CompareString defined in Microsoft.VisualBasic.CompilerServices.Operators. I’ve covered this VB Runtime method extensively in my post on runtime agility, a well-kept secret VB 9.0 feature.

So, how to deal with this flexibility (to use a gentle word)? The answer really is that your parser needs to know about all these possible ways of comparing strings – users can and therefore will use different ways to accomplish the same. Now, if you’re just implementing one provider and there’s only one place where you will encounter such checks in your provider (typically when parsing a Where operator’s predicate), it might be very doable to implement everything in place. But if you need to recognize all of those patterns in a variety of places, either intra- or inter-provider (or both), it might be useful to generalize the mechanism of recognizing those patterns and “normalizing” the different expression trees into one common form. Since expression trees are immutable this would mean using a visitor pattern to create a new, normalized expression tree, or you could just normalize expressions “as you go” while parsing certain parts of the giant query expression tree.

Anyway, here’s a (quick-and-dirty) sample implementation of a normalizer for string equality checks:

static Expression Normalize(Expression e)
{
    Expression left, right;
    left = right = null;

    MethodCallExpression mce;
    BinaryExpression be;

    if ((mce = e as MethodCallExpression) != null)
    {
        if (mce.Method.DeclaringType == typeof(string) && mce.Method.Name == "Equals")
        {
            if (mce.Method.IsStatic)
            {
                left = mce.Arguments[0];
                right = mce.Arguments[1];
            }
            else
            {
                left = mce.Object;
                right = mce.Arguments[0];
            }
        }
    }
    else if ((be = e as BinaryExpression) != null && be.NodeType == ExpressionType.Equal)
    {
        ConstantExpression ce;
        if (   ((mce = be.Left as MethodCallExpression) != null && (ce = be.Right as ConstantExpression) != null)
            || ((mce = be.Right as MethodCallExpression) != null && (ce = be.Left as ConstantExpression) != null))
        {
            if (ce.Value as int? == 0 && mce.Method.DeclaringType == typeof(string) && mce.Method.Name == "Compare")
            {
                left = mce.Arguments[0];
                right = mce.Arguments[1];
            }
        }
    }

    if (left != null && right != null)
    {
        e = Expression.Equal(left, right);
    }

    return e;
}

Notice the VB case isn’t covered here and is left as a trivial exercise for the reader (tip: the “== 0” check emitted by the VB compiler is a BinaryExpression where CompareString can either be left or right…) but more importantly the semantics of string comparison will be weakened by doing this normalization. Why? One thing, as mentioned before, is case sensitivity but there are others as well. What about the behavior of comparing null values (e.g. s1.Equals(s2) versus s1 == s2 where s1 = null)? What about culture-invariant compares versus culture-aware compares? And so on.

Moral of the story: know to what extent you want to preserve semantics (i.e. understand what you’re translating to) before you can even start to think about techniques like normalization.

Del.icio.us | Digg It | Technorati | Blinklist | Furl | reddit | DotNetKicks

Filed under:

Comments

# Dew Drop - August 14, 2008 | Alvin Ashcraft's Morning Dew

Thursday, August 14, 2008 6:40 AM by Dew Drop - August 14, 2008 | Alvin Ashcraft's Morning Dew

Pingback from  Dew Drop - August 14, 2008 | Alvin Ashcraft's Morning Dew

# Arjan`s World &raquo; LINKBLOG for August 14, 2008

Thursday, August 14, 2008 12:04 PM by Arjan`s World » LINKBLOG for August 14, 2008

Pingback from  Arjan`s World    &raquo; LINKBLOG for August 14, 2008

# re: Expression Tree Normalization – How Many Ways To Say “String == String”?

Friday, August 15, 2008 7:23 PM by Aaron Erickson

I ran into this the hard way in i4o.... thanks for blazing a trail here :)

# WMOC#16 - Continuations, Currying and Recursion - Service Endpoint

Pingback from  WMOC#16 - Continuations, Currying and Recursion - Service Endpoint