March 2009 - Posts

Previously on “The M Programming Language Tales”, we introduced MSchema and MGraph.

Today, we continue our journey through the wonderful world of M by taking a look at MGrammar.

 

Why custom grammars?

Let me tell you a story first. About 10 years ago I came into contact with the early traces of the .NET Framework. One of its key goals was to make deployment of applications easier (so-called xcopy deployment and the elimination of DLL Hell), and a key part of that mission was to make it easier for developers to configure their applications without relying on the registry. Hence, System.Configuration was introduced, together with the concept of XML-driven configuration files like app.config and web.config. At the same time, static typing was believed to be the ultimate solution to every problem (a slight exaggeration, but still), so together with those files came an effort to schematize them as much as possible, while still allowing developers to have their custom settings in there too: <appSettings> was born. A sample from MSDN:

<?xml version="1.0" encoding="utf-8" ?>
<appSettings>
    <add key="Application1" value="MyApplication1" />
    <add key="Setting1" value="MySetting" />
</appSettings>
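
In code, those settings surface as nothing more than loosely typed string key-value pairs through System.Configuration – a quick illustration:

using System;
using System.Configuration; // requires a reference to System.Configuration.dll

class Program
{
    static void Main()
    {
        // Every setting comes back as a plain string; any further structure or typing is up to you.
        string application = ConfigurationManager.AppSettings["Application1"]; // "MyApplication1"
        string setting = ConfigurationManager.AppSettings["Setting1"];         // "MySetting"
        Console.WriteLine("{0} - {1}", application, setting);
    }
}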

Why am I telling you all of this? To draw your attention to a couple of things. First of all, a highly specialized (on a per-application basis) thing like configuration has been generalized by piggybacking on a general data representation technology like XML. And although the whole thing is schematized, the only real way to give developers some storage of their own is this “appSettings” thing with very generic child elements like <add>, <remove> and <clear>.

In other words, though we have achieved the goal of making configuration easier in the light of deployment, this approach fails to give developers enough flexibility to specialize the configuration store to their application’s needs. Next you want keys that contain collections or hierarchically organized settings, and all you have is key-value pairs. Adding syntax highlighting doesn’t help to “highlight” what’s important and what’s not (the way keywords in a general-purpose language indicate special meaning); everything looks flat:

<?xml version="1.0" encoding="utf-8" ?>
<appSettings>
    <add key="Application1" value="MyApplication1" />
    <add key="Setting1" value="MySetting" />
</appSettings>

There’s a positive thing to it though: our application is now metadata-driven. You could go one step further and say the application is model-driven, at least where its configuration is concerned. And that’s a good thing: no need to recompile to reconfigure the application. Gigantic applications like IIS and SharePoint take this to extremes: extending a SharePoint site is “simply” a matter of adding lots of configuration entries (most of which are automated through configuration and management UIs) to the SQL database that drives it. And that’s what modeling is all about. By the way, this approach to modeling applications has gradually expanded over the years: COM+ had attribute-driven configuration (e.g. to make an operation transactional), .NET introduced custom attributes as a generalized mechanism for such metadata, and XAML is also about capturing the “intent” of the developer in a structured way (as opposed to burying it under a load of (generated) code).

So, where are we with regards to XML as a means to make an application play better in the world of modeling? For one, we have a democratized means of storing and retrieving data in a (not-so-much-)human-readable and (again-not-so-much-)human-writeable way. Why am I using an expensive word like “democratized” in this context? Simply because XML is extremely handy to interact with from a programmatic point of view: we have lots of XML libraries out there – MSXML, System.Xml and LINQ to XML come to mind – that make it incredibly easy to consume and produce XML data. Nobody has to think about lexing and parsing (to be explained further on) anymore; XML API + XML data = power. But there are still gradations to the ease of using XML. When there’s an XML schema (XSD) at hand, a mapping can be established to make the XML look like regular strongly typed objects (xsd.exe), but it turns out it’s far more common not to have an XML schema (think of REST services as opposed to WSDL-based web services).
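
To give an idea of how low that bar is, here’s a quick sketch of consuming the appSettings fragment from before with LINQ to XML – no lexing or parsing in sight:

using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        var config = XElement.Parse(
            "<appSettings>" +
            "  <add key='Application1' value='MyApplication1' />" +
            "  <add key='Setting1' value='MySetting' />" +
            "</appSettings>");

        // Turn the generic <add> elements into key-value pairs.
        var settings = from add in config.Elements("add")
                       select new
                       {
                           Key = (string)add.Attribute("key"),
                           Value = (string)add.Attribute("value")
                       };

        foreach (var setting in settings)
            Console.WriteLine("{0} = {1}", setting.Key, setting.Value);
    }
}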

And that’s where “M” comes in again, with a technology called MGrammar. There are quite a few parallels to be drawn with XML:

  • MGraph ~ XML: using MGraph we declare data (like { X = 1, Y = 2 }), just like we do with XML (e.g. <Point><X>1</X><Y>2</Y></Point>). Another comparison you could make here is with JSON; it’s no coincidence that MGraph shares JSON’s philosophy of reducing syntactic noise.
  • MSchema ~ XSD: to capture the shape of MGraph data, one can use MSchema, just like XML data can be constrained by means of an XSD schema. This opens the door to validation, tooling support and typing (structural or otherwise). E.g. type Point { X : Integer, Y : Integer }.
  • MGrammar ~ XSLT^-1: although MGraph is less noisy than comparable XML data (no sharp brackets and closing tags anymore, long live curly braces), it’s still way too generic a tool for humans to interact with the data in a natural way. What if we want to take a textual representation of our data (like (1, 2)) and turn it into data that follows a certain structure? That’s what MGrammar is for.

So, what we end up with is the concept of Textual DSLs (Domain Specific Languages):

image

Just like XML democratized the consumption and production of data (without the burden of configuring databases or even thinking about schematization if you’re lazy), MGrammar democratizes the use of a fit-and-finish textual representation of your data by means of a textual DSL.

An overview picture tells it all:

image

The MSchema and MGraph portions should be familiar, as we’ve seen those in previous posts. Today we’ll focus on the MGrammar portion of the picture.

 

MGrammar from a 10,000-foot view

One way to introduce MGrammar is from a tooling chain perspective: where does MGrammar fit in the end-to-end tooling story around Oslo? In the previous posts we’ve seen how to use M.exe and MX.exe (and you can envision much more built-in tooling support to hide those low-level details) to take in a model definition, compile and deploy it to the Repository. This part of the tool-chain is shown below:

image

Now the question is: where do those input M files come from? One answer is that we could have written them by hand. That’s certainly a viable option, but as pointed out before, we also want a way to have more specialized input formats in the form of custom DSLs. The output of those DSLs should eventually lead to M files that can be fed into the compile-and-deploy chain, so we can put the following MGrammar chain in front of it:

image 

The output of this chain is in the form of MGraph data, which can be fed into the M/MX tool-chain above. The left part of the picture is the development-time part of it: we write a custom grammar and compile it to an executable grammar (.mgx file) using the mg.exe tool. The right part of the picture is concerned with the execution of the grammar, given the executable grammar and an input file, resulting in an MGraph output, ready to be deployed to the Repository using the M/MX chain, or to be consumed by other means (as there’s a whole managed framework around MGrammar).

What can we compare MGrammar to? To a limited extent, you can compare it with XSLT, or better, the inverse of it. XSLT is typically employed to take an XML file in, and out comes a file in any (textual) format you see fit (HTML, text files, other XML files, C# files). This is a simpler transformation than the opposite, where any textual representation of data comes in (a C# file, a custom DSL, what have you) that we want to transform into structured data. This is actually the very well-understood problem of lexing and parsing.

 

Lexing and parsing

Over the years many ways to turn text into more structured forms of data have been invented. Compiler writers use those techniques on a day-to-day basis, but it’s hard to say they have been available to the masses (Mort, to name one). I won’t go into much detail about this; the interested reader should grab the Dragon Book from the shelf (or buy it if it isn’t there already – it’s worth a thousand fiction books as you can infer from the cover picture). Nevertheless, here’s a short overview of the steps involved in turning text into something more actionable for further consumption:

  • Lexical analysis (or “lexing”) is the act of tokenizing the input text into individual tokens, which you can see as words in the source language. E.g. x => x + 1 could be turned into a sequence of the following tokens: ID(“x”), LAMBDA_ARROW, ID(“x”), PLUS, NUM(1). By itself, lexing can be divided into two parts: scanning and tokenizing. The first part takes in the source text as a stream of characters that drive a state machine in order to reach final states that denote lexemes, which are strings of characters that compose into a known kind of token (e.g. 123 would be a lexeme that can be turned into a NUM token). Well-known lexer generators include lex, flex, ANTLR, fslex, etc.
  • Syntactical analysis (or “parsing”) takes in a sequence of tokens in order to recognize well-formed sentences as described by a grammar. For example, the sequence of tokens for “x => x + 1” as seen above would be recognized (in C#) as a lambda function, which is a well-formed kind of expression in that language. However, a sentence like “x => x 1” would not be recognized as valid input, as the grammar has no production that states that a sequence of an ID and a NUM can be turned into a valid expression (in this case to become the body of the lambda expression). There are various forms of parsers, such as recursive descent, LL and LR, each with several variants. Well-known parser generators include yacc, bison, fsyacc, etc.

Typically both the lexer and the parser are chained together to form the front-end of a language compiler. Part of the parsing phase is concerned with executing certain actions when a valid sentence is detected. For example, a sequence of EXPR, PLUS and EXPR could be turned into an object that represents the binary addition operator with the first EXPR as the left-hand side and the second EXPR as the right-hand side. Carrying out this process for an entire input ultimately results in a syntax tree that can then be consumed by the next phases of the compiler, to do type checking, optimization and code generation.
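
To make the lexing part a bit more tangible, here’s a deliberately naive, hand-rolled tokenizer for the x => x + 1 example above – exactly the kind of plumbing that lexer generators (and, as we’ll see, MGrammar) take off your hands:

using System;
using System.Collections.Generic;

enum TokenKind { Id, Num, Plus, LambdaArrow }

class Token
{
    public TokenKind Kind;
    public string Text;
    public override string ToString() { return Kind + "(" + Text + ")"; }
}

static class Lexer
{
    // Turns "x => x + 1" into Id(x), LambdaArrow(=>), Id(x), Plus(+), Num(1).
    public static IEnumerable<Token> Tokenize(string input)
    {
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (char.IsWhiteSpace(c)) { i++; }
            else if (char.IsLetter(c))
            {
                int start = i;
                while (i < input.Length && char.IsLetter(input[i])) i++;
                yield return new Token { Kind = TokenKind.Id, Text = input.Substring(start, i - start) };
            }
            else if (char.IsDigit(c))
            {
                int start = i;
                while (i < input.Length && char.IsDigit(input[i])) i++;
                yield return new Token { Kind = TokenKind.Num, Text = input.Substring(start, i - start) };
            }
            else if (c == '+') { yield return new Token { Kind = TokenKind.Plus, Text = "+" }; i++; }
            else if (c == '=' && i + 1 < input.Length && input[i + 1] == '>')
            {
                yield return new Token { Kind = TokenKind.LambdaArrow, Text = "=>" };
                i += 2;
            }
            else throw new InvalidOperationException("Unexpected character '" + c + "' at position " + i);
        }
    }
}

class Program
{
    static void Main()
    {
        foreach (var token in Lexer.Tokenize("x => x + 1"))
            Console.WriteLine(token); // Id(x), LambdaArrow(=>), Id(x), Plus(+), Num(1)
    }
}

A parser would then consume this token stream to recognize the lambda production; writing that by hand is where things really get tedious.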

Time permitting, I’ll blog about F#’s fslex and fsyacc tools too, to show an example of the fairly low-level plumbing that’s involved in dealing with lexers and parsers. However, rest assured: MGrammar is much easier to deal with :-).

 

Your first grammar

Okay, time to get the ball rolling. Let’s open up Intellipad in “Samples Enabled” mode (see the dedicated shortcut in the Start Menu for this mode), as MGrammar is currently only available in that mode. Switch to MGrammar mode in the top-right corner:

image

Notice the new menu entry that appears:

image

Don’t click any of this yet. Let’s write some MGrammar code first. As usual, the top-level container for M-stuff is a “module”, which controls visibility, import/export mechanisms, units of deployment, etc. Let’s call ours “Demo” to stick with a cliché and add a “language” declaration to it. This is where we enter the MGrammar part of the picture. As I didn’t want to settle for the boring “Hello World”, we’ll go for a FriendList language:

image

Notice the syntax highlighting for MGrammar; to Intellipad it’s just another DSL (this time a DSL to build DSLs in). Now the real work: inside the language definition we can write two things:

  • Tokens, to recognize words in the language.
  • Syntax rules (or productions), to recognize sentences in the language.

In the lex/yacc world, the starting production is indicated with a %start directive; MGrammar follows a pattern developers already know: Main is the “entry point” of the grammar. In passing, let’s reveal what Hello World would look like:

image

This language would only accept the literal string “Hello World” as its input. Notice this makes the language whitespace sensitive too: exactly one space is permitted between the words “Hello” and “World”. This can be remedied too, but let’s keep that for later. First, let’s focus on what valid sentences in our language should look like. Our DSL is going to be used to keep track of friend names and ages (feel free to party on the result to go for a more complicated language that accepts birthdays, food preferences, shoe sizes, and everything you find interesting to store about your friends). We want users to be able to specify this information using natural language, like this:

Bart is 26
John is 60

where “is” is recognized as a DSL keyword. What we’re seeing here are two sentences, each of which has one keyword and two tokens: one for the name and one for the age. Let’s start by defining what a name and age token look like.

First the name. A name consists of a sequence of letters, the first of which is capitalized, and let’s assume a minimum of one character. Regular expressions can be used to state those requirements, and MGrammar uses similar syntax. One way to write this is as follows:

token Name = 'A'..'Z' 'a'..'z'*;

Notice a couple of things here: “..” is used to indicate a range. Concatenation is simply done by juxtaposition: in this case a capital letter followed by any number of lower-case letters. And finally Kleene operators like + and * can be used to indicate repetition. Although this is a good first attempt, we could split up matters a bit more: what about recognizing a lower-case letter as a separate token, and similarly for an upper-case one:

token Upper = 'A'..'Z';
token Lower = 'a'..'z';
token Name = Upper Lower*;

Next token to recognize is an age. For fun, let’s rule out the chance to have babies as friends (no meaningful conversation possible anyway), so let’s only recognize ages 1 to 99 (feel free to relax or constrain those rules further):

token Digit = '0'..'9';
token Age = '1'..'9' Digit?;

Here we state that an age is 1 to 9 optionally (? operator) followed by a digit from 0 to 9. So far, we have this:

image

Next we need a syntax rule (a fancy word for sentence) that recognizes a single friendship declaration of the form Name “is” Age. Simple once more:

syntax Friend = Name "is" Age;

And finally, we need a way to express a list of friends for the main syntax rule:

syntax Main = Friend*;

The result is as follows:

image

All of this is nice, but how do we test the damn thing? Actually I’ve held back something all this time: the use of the “MGrammar Mode” menu. One option in there is the Tree Preview, which we’ll turn on now:

image

Intellipad will now ask for an input file to run against the grammar. Just create a new empty text file somewhere and point to it:

image 

The editor will now switch to a three-pane view that looks like this:

image

In the left pane, we can write input, resulting in an output MGraph on the right and possible errors at the bottom. Let’s give it a try and write a sentence for myself (I adore myself as a friend, ahem):

image

Something isn’t quite right. The parser is telling us that the space around the “is” keyword is invalid. We need some way to accept spaces there, and preferably one or more whitespace characters, as we want some flexibility. An interleave rule is what we need:

image

That’s better (although there’s a remaining problem when dealing with multiple friend sentences; try to figure out what whitespace characters we need to add in addition to ' ' and '\t'), but take a look at our output: it’s not quite what we want. All we want is a piece of data stating there’s a friend called “Bart” with age “26”, but we don’t want to see the “is” keyword anymore. To make this work, we can extend the syntax rule with a production, stating how the parts of the input (Name and Age) need to be turned into a data representation of our choice:

syntax Friend = n:Name "is" a:Age => Friend { Name = n, Age = a };

Think of the => as “produces” or “is turned into”, where the right-hand side is in whatever shape we want the data to be in. Notice how the parts of the sentence are prefixed with names so they can be referred to in the production’s output (this is far easier to understand than the numeric reference schemes often used in YACC and the like). The output is much nicer now:

image

Finally we could clean up the Main rule too (to showcase another capability of MGrammar), like this:

syntax Main = l:Friend* =>
    Friends { valuesof(l) };

Here we use the valuesof keyword to extract the values of a list. This is handy to flatten lists produced by certain rules, as will become apparent if you go for more complicated grammars (maybe next time).

 

Compiling and executing the grammar

We’ve already been interacting with the grammar through Intellipad above, but to make it really useful we’d like to use it in an executable format. There are various approaches to be taken here:

  • Compile the grammar using mg.exe and execute it using mgx.exe.
  • Embed the grammar in a .NET application and consume output of a parse-run in there (as rich .NET objects).

For the purpose of this introductory post, we’ll go for the former option and defer discussion of .NET integration till a later post. So, save the grammar in an .mg file and open a command prompt with the PATH set to include the location of mg.exe (%programfiles%\Microsoft Oslo SDK 1.0\Bin). First thing to do is to compile the grammar, as follows:

mg grammar.mg

This will produce an mgx file, which is the executable grammar format:

image

Actually an MGX file is (once more) an OPC file, which is a ZIP file containing XML definitions of the grammar. You can verify this by making a copy of the file with a .zip extension, opening it, and inspecting its contents:

image
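
If you’d rather peek inside the package programmatically, the OPC support in System.IO.Packaging (which requires a reference to WindowsBase.dll) can open it directly, without the rename trick – a quick sketch:

using System;
using System.IO;
using System.IO.Packaging; // OPC support, lives in WindowsBase.dll

class Program
{
    static void Main()
    {
        using (var package = Package.Open("grammar.mgx", FileMode.Open, FileAccess.Read))
        {
            // List every part in the OPC container together with its content type.
            foreach (PackagePart part in package.GetParts())
                Console.WriteLine("{0} ({1})", part.Uri, part.ContentType);
        }
    }
}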

The Demo.FriendList file is an XML file representing the grammar of our language. In there you’ll find some information about the productions and a lexer table (in compressed format). But that would lead us too far. All we need is something that knows how to execute the grammar, and mgx.exe is one such tool:

mgx test.txt /r:grammar.mgx

This produces an .m file with the output of the parse run:

image

 

Conclusion

MGrammar makes it very easy to produce M content based on a custom DSL specification. In this post you saw how to use Intellipad to author an MGrammar specification, test it against some sample input, and shape the output. To finish off this post, we looked at the MGrammar tool-chain used to compile and execute the grammar.

In subsequent posts we’ll take a look at MSBuild support to compile MGrammar specifications and how to leverage the power of MGrammar from within a .NET application using the Oslo SDK.

Have fun!


Introduction

On my last trip I had the opportunity to talk on the subject of LINQ once more. Geeky as we are, this time’s session title was “LINQ in breadth”, an orthogonal view on LINQ compared to my last year’s LINQ in depth talk.

But what makes LINQ have a certain breadth? The simple answer: its spectrum of possible LINQ “provider” implementations. Far too often people think LINQ is only extensible by means of the IQueryable<T> interface, which makes you opt in to every imaginable query operator. That’s one useful side of the LINQ spectrum (albeit most of the time an overkill approach). The opposite end of the spectrum is, ironically, IEnumerable<T>. I say ironically because theoretically both interfaces are identical (I should really go into monads once more, but let’s suppress that urge). It’s just that making something LINQable is trivial if one can expose the data as an IEnumerable<T>, because from that point on one can fall back to LINQ to Objects (and that’s precisely what LINQ to XML does).

The rest of the breadth comes from all intermediate implementations that are possible. I’ve mentioned before that, from a language perspective, LINQ is a tiny syntactical layer on top of the rest of the language. To be convinced about that, look at my C# 3.0 Query Expression Translation Cheat Sheet. All you’re seeing there is some kind of fixpoint algorithm that gradually compiles away the C# 3.0 language’s knowledge of query comprehension syntax till nothing but methods and operands remain.
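
As a one-minute reminder of what that translation boils down to, the two queries below compile to exactly the same thing:

using System;
using System.Linq;

class Program
{
    static void Main()
    {
        // Query comprehension syntax...
        var squaresOfEvens1 = from n in Enumerable.Range(0, 10)
                              where n % 2 == 0
                              select n * n;

        // ...is compiled away into nothing but chained method calls and lambda expressions.
        var squaresOfEvens2 = Enumerable.Range(0, 10).Where(n => n % 2 == 0).Select(n => n * n);

        Console.WriteLine(squaresOfEvens1.SequenceEqual(squaresOfEvens2)); // True
    }
}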

All of this means we can party along on the syntactical sugar that LINQ provides us, as shown in my “Who ever said LINQ predicates need to be Boolean-valued?” post. It turns out I’ve talked about this spectrum theory over there too. All of this can be used for samples that are only of technical value, like LINQ to Simpsons (see below), but also to create useful stuff.

image

(Exercise for the reader: what’s the type of x in the fragment above?)

 

The genesis of ExceLINQ

So, while preparing for my demos, I was thinking about potential samples of non-obvious but nevertheless useful LINQ implementations. Rather by accident I came up with the idea of bridging Excel to LINQ. Maybe a short story on how that idea came along is in order. One evening (night, according to the post’s publication time) I was writing up a blog entry on “The M Programming Language – Part 0 – Intellipad, MrEPL, VS2008 project support and Excel integration” (more posts on that topic to follow soon). As I’m not the kind of guy that spontaneously opens Excel – which doesn’t mean I don’t like the product, I just don’t have a job that requires overdoses of spreadsheet interaction on a day-to-day basis – I took the opportunity to browse a bit more through the product in its current form. The last screenshot in that post is where the fascination kicked in:

Clicking the little arrows on the column headers reveals a whole new world (I admit I was exposed to it in the distant past, but it was a pleasant déjà vu as you can tell, otherwise this post wouldn’t exist):

image 

Query operators, anyone? As I already had a C# project open to prepare C# 4.0 demos on improved COM interop (posts on that to follow too), I decided to explore the Excel automation APIs a bit more. Here are a few of the interesting methods I came across:

  • Range.AutoFilter – filters based on one or more predicates
  • SortFields.Add – and much more plumbing to do sorting
  • _Application.Union – usable for simple column projections; a COM interop beauty with no fewer than 30 parameters, 28 of which are optional (luckily C# 4.0 will help out here – yes, you can say “finally” if you want)

Great, lots of stuff that’s applicable to querying. But let’s not settle for the obvious: creating a LINQ binding that results in yet another IEnumerable<T> where T is an entity type representing a set of strongly typed columns (with an ExcelMetal.exe export tool – sounds familiar, doesn’t it?). To illustrate the flexibility of LINQ I wanted the query to return something non-obvious. What about an Excel chart? In preparing my C# 4.0 talk I’d been exposed to Office interop code to create a chart in Excel, so that was certainly doable too.

Ah, beautiful – it was the first time I didn’t curse myself for going on a speaking tour with 7 distinct talks. After all, my M talk led me to dive into Excel and link it to C# 4.0 COM interop, to end up with something applicable to my LINQ talk. And obviously I added PowerShell to the ingredients by showing off LINQ through PowerShell during that talk too :-).

 

The mission statement for ExceLINQ

Another great distinction between developers and managers: the latter need to open up Word to come up with a mission statement, while the former group opens up their favorite IDE to cook up a prototype. So in the true sense of “say it with code”, here’s what I want to accomplish:

using (var src = new ExcelSource<Proc>(_file))
{
    var res = from proc in src
              where proc.WorkingSet > 20 * 1024 * 1024
              where proc.Name.StartsWith("D")
              orderby proc.WorkingSet
              select new Column[] { proc.Name, proc.WorkingSet };
    pictureBox1.Image = res.ToChart(XlChartType.xlColumnClustered);
}

Or more graphically, a press of the button results in a chart being rendered:

image image

We could go even further and allow any IEnumerable<T> to be imported into Excel transparently in order to create a chart, but in the sample above I’m simply using an existing Excel file (which I populated manually using Excel interop code). But hey, it’s a proof-of-concept.

 

The construction of ExceLINQ

There’s a magical touch to ExceLINQ. The query shown above does look like a regular LINQ query, yet there’s a bunch of stuff going on even in just declaring the query. What if I told you that the > operator in the first where clause is not the one you’d expect? And what if I told you that none of the query operators you’re seeing is defined in the usual spot (like Enumerable or Queryable)? Indeed, everything is custom in ExceLINQ: à la carte cuisine, no microwave food.

Regular readers of this blog will be bored by what follows in this paragraph (hopefully not by the things coming in the subsequent ones), but the query fragment above translates into:

src.Where(proc => proc.WorkingSet > 20 * 1024 * 1024).Where(proc => proc.Name.StartsWith("D")).OrderBy(proc => proc.WorkingSet).Select(proc => new Column[] { proc.Name, proc.WorkingSet })

I kept everything on one line deliberately to emphasize what a fluent interface LINQ really is. At this point there’s no query comprehension syntax left and the compiler falls back to the well-known rules for method overload resolution, processing from left to right (duh). In other words: given src, which is of type ExcelSource<Proc>, can we find a method called Where that takes one argument with type …:

src.Where(proc => proc.WorkingSet > 20 * 1024 * 1024)

Right, but what’s the type of the argument? We can only tell once we know what the type of proc is, so we can compute the type of the lambda body. One thing’s sure: the “shape” of the argument is required to be a Func<A, B>, a function from some type to another type (type theorists will see a relationship to the concept of kinds).

All we need to see from this discussion is that we can provide our own Where method (possibly an instance method, or an extension method, but there’s no reason to choose the latter option in this case) and as long as it has the right shape of signature, the compiler will be happy. Let’s reveal its signature now:

public Query<T> Where(Func<T, Filter> filter)

We see the generic parameter T comes from the surrounding type, ExcelSource<T>. Obviously T is meant to be substituted for an entity type, a strongly-typed representation of the columns in the worksheet. This mapping is nothing fancy:

[Sheet("Sheet1")]
class Proc
{
    [Column("A")]
    public StringColumn Name { get; private set; }

    [Column("B")]
    public IntColumn WorkingSet { get; private set; }
}

Ignore the custom attributes for now; those are just details required for the interop plumbing to correlate entity types to sheets and properties to columns in that sheet. Trivial stuff to define and to use. What’s more important is the typing we’re looking at. Name is not a string, it’s a StringColumn. Similarly, WorkingSet isn’t an int, but rather an IntColumn.

So, looking back at our Where method, “proc” ought to be an instance of Proc, hence the question becomes what the type of the lambda body is:

proc.WorkingSet > 20 * 1024 * 1024

We know WorkingSet is of type IntColumn, so the > comparison operator can only come from an operator overload defined on that type:

class IntColumn : Column
{
    public static Filter operator >(IntColumn column, int value)
    {
        return new Filter(column, FilterType.GreaterThan, value);
    }

    …
}

As you can see, the return type of this operator is Filter, so the entire lambda expression is assignable to Func<T, Filter> and the call to Where works out fine:

src.Where(proc => proc.WorkingSet > 20 * 1024 * 1024)

This returns a Query<T> instance, as seen from the Where method signature. Next in the chain of method calls, we find another call to Where:

src.Where(proc => proc.WorkingSet > 20 * 1024 * 1024).Where(proc => proc.Name.StartsWith("D"))

So Query<T> should again have a Where method defined, and indeed it does:

public Query<T> Where(Func<T, Filter> filter)

Now proc.Name is of type StringColumn, so the StartsWith method should be the guy returning the Filter instance required to make the Where method happy:

class StringColumn : Column
{
    public Filter StartsWith(string value)
    {
        return new Filter(this, FilterType.StartsWith, value);
    }
    …
}
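
As an aside, the Filter type itself isn’t shown in this post; judging from the constructor calls above, a minimal sketch could look roughly like this (the exact members are an assumption):

enum FilterType { GreaterThan, StartsWith /* ...and whatever other operators we support */ }

class Filter
{
    public Filter(Column column, FilterType type, object value)
    {
        Column = column;
        Type = type;
        Value = value;
    }

    // The query operators merely build up this little domain-specific "expression tree";
    // turning it into Excel AutoFilter calls happens later, in the plumbing layer.
    public Column Column { get; private set; }
    public FilterType Type { get; private set; }
    public object Value { get; private set; }
}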

You get the idea. In a very similar way the OrderBy method is defined on Query<T>. This time it doesn’t take in a lambda producing a Filter (as ordering is not filtering, right?), but rather a lambda that produces a Column. After all, the function required to carry out an ordering is a so-called key extractor (and we don’t bother with things like composite keys here):

public Query<T> OrderBy(Func<T, Column> column)

As both properties on Proc are typed to be *Column, each of which derives from the Column base class, the OrderBy call succeeds:

src.Where(proc => proc.WorkingSet > 20 * 1024 * 1024).Where(proc => proc.Name.StartsWith("D")).OrderBy(proc => proc.WorkingSet)

The final call is a Select call, which is typed as follows:

public Result<T> Select(Func<T, Column[]> project)

Notice two things here:

  1. The return type of the projection lambda is not generic (unlike in other LINQ implementations, where the projection function would be a Func<T, R> where R can be anything, e.g. an anonymous type). This way we make sure the user just returns a set of columns with no fancy decoration over it (like new { Bar = x.Foo }, which results in an anonymous type).
  2. The return type of the whole Select query operator is Result<T>, no longer Query<T>. This means we can’t continue our query beyond the Select call (assuming we don’t provide methods like Where, OrderBy or Select on Result<T>, which we don’t). By doing this, we can rest assured the user doesn’t “keep going” and our query is terminated properly.

The last point is quite important. In fact we’ve created a state machine of types. Starting from an ExcelSource<T>, we transition into Query<T> to end up with Result<T>. You could go much further in controlling the order of operators and the number of operator calls that are allowed (e.g. a maximum of two Where calls), at the cost of creating many more types:

image

Finally we have a Result<T>, which has a method to turn the result into a chart:

public Image ToChart(XlChartType chartType)

where Image is the built-in System.Drawing.Image class from the .NET Framework.
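
Putting the signatures from this section together, the skeleton of the whole chain looks roughly like this (only the signatures come from the discussion above; the bodies are stubbed out, and the sketch assumes the Column, Filter and entity types shown elsewhere in this post, plus references to System.Drawing and the Excel PIA):

using System;
using System.Drawing; // Image

class ExcelSource<T> : IDisposable
{
    public ExcelSource(string file) { /* open the workbook through COM interop */ }

    public Query<T> Where(Func<T, Filter> filter)
    {
        return new Query<T>(); // record the filter and move to the Query<T> state
    }

    public void Dispose() { /* release the COM objects */ }
}

class Query<T>
{
    public Query<T> Where(Func<T, Filter> filter) { /* record an extra filter */ return this; }

    public Query<T> OrderBy(Func<T, Column> column) { /* record the sort key */ return this; }

    public Result<T> Select(Func<T, Column[]> project)
    {
        return new Result<T>(); // terminal transition: no further query operators after this
    }
}

class Result<T>
{
    public Image ToChart(Microsoft.Office.Interop.Excel.XlChartType chartType)
    {
        /* drive Excel, render the chart and hand it back as an Image */
        return null;
    }
}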

 

The plumbing of ExceLINQ

We’re done with the conceptual part of our mission. The user can write queries that result in calls being made to our own methods on types like Query<T>. What’s more, we’ve gained control over the lambda expressions that are passed to those query operators too. So we have all the information about the query the user intended to write, meaning we can turn it into execution.

That’s where the plumbing part comes in. The implementation of our query operator methods needs to collect all the information about the query. Once we have aggregated all of this knowledge into a Result<T> object, we need to turn it into instructions that Excel understands. And finally, we ask Excel to generate the chart upon calling Result<T>.ToChart, producing an Image instance somehow. All of this is largely orthogonal to the LINQ topic of this post.

I’ll omit all of the COM interop magic as it doesn’t contribute much to the core idea, but there’s one thing I should dive into a bit more here. We’ve seen that, purely based on the type information available through ExcelSource<T> – more specifically because of the T part in there – we can write strongly typed queries over some entity type like the one below:

[Sheet("Sheet1")]
class Proc
{
    [Column("A")]
    public StringColumn Name { get; private set; }

    [Column("B")]
    public IntColumn WorkingSet { get; private set; }
}

We’ve also seen that we defined operators like Where based on such an entity type, e.g. taking in a Func<T, Filter> where T is substituted for the particular entity type we’re dealing with:

where proc.WorkingSet > 20 * 1024 * 1024

The reason we did that is to be able to call the (in this case predicate-valued) delegate to get the corresponding filter object (i.e. the runtime representation of the > condition applied in the predicate) without further runtime hoops. Those hoops would come from the IQueryable style of query provider implementation, where the predicate would not be a Func<T, Filter> but an Expression<Func<T, bool>>. There the user could write all sorts of expressions that ultimately produce a Boolean value, and although the compiler would be fine with that, we might have to give up at runtime because the generated expression tree is too complex (e.g. when the user wrote something like “where CallGreenManOnMars(proc).Replied(“Okay”)”, which doesn’t make sense but type-checks fine). In some sense, we pushed down the Expression<…> part into the function signature to constrain the lambda body to something of type Filter, which is our mini (domain-specific) expression tree.
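
The contrast in a nutshell – the first signature below is the standard Queryable.Where from the framework, the second is the one we defined above (signatures only, side by side):

// IQueryable<T> style: any Boolean-valued lambda type-checks, and the provider has to
// analyze (and possibly reject) the resulting expression tree at runtime.
public static IQueryable<TSource> Where<TSource>(this IQueryable<TSource> source, Expression<Func<TSource, bool>> predicate);

// ExceLINQ style: the lambda body can only be something that produces a Filter,
// so unsupported predicates are rejected by the compiler instead.
public Query<T> Where(Func<T, Filter> filter);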

There’s something missing though: in order to get the result of the predicate delegate, we need to feed it an instance of the entity type T, in this case a Proc instance. Moreover, we need to make sure all properties are properly instantiated because we’re about to call into them during the execution of the lambda body, e.g. by doing “proc.WorkingSet”.

The way we accomplish this is by new’ing up the entity type at runtime and filling in the properties dynamically:

internal static T CreateInstance<T>()
{
    T res = Activator.CreateInstance<T>();

    var mappings = from prop in typeof(T).GetProperties()
                   select new { Property = prop, Attribute = prop.GetCustomAttributes(typeof(ColumnAttribute), false).Cast<ColumnAttribute>().Single() };

    var columns = mappings.OrderBy(column => column.Attribute.Column);
    var first = columns.First().Attribute.Column[0];

    foreach (var col in columns)
    {
        object colObj = col.Property.PropertyType.GetConstructor(BindingFlags.Instance | BindingFlags.NonPublic, null, new [] { typeof(string), typeof(int) }, null)
            .Invoke(new object[] { col.Attribute.Column, (int)(1 + col.Attribute.Column[0] - first) });
        col.Property.SetValue(res, colObj, null);
    }

    return res;
}

Let’s step through this a bit. On the first line we create an instance of T, nothing fancy (an alternative would have been to use a generic parameter constraint for a default constructor; it just turned out this way for no particularly good reason and, hey, this is just a sample). Next we use LINQ to Objects (don’t you love it?) to get all the properties and their ColumnAttribute mapping. I’m using the Single operator here, so foreign properties (which don’t have a mapping attribute) will cause our CreateInstance method to fail. I could be more polite and ignore those properties, or define my own exception type here (which I should really have done), but you get the idea.

Finally, the loop going over the columns is required to instantiate each individual property. Again, we’re a bit lazy here and expect the properties to be of a Column-derived type we know about, as we assume a two-parameter private constructor is available to new those up. To make this code production quality you’d obviously go for more thorough checking code, with proper exception types being used to signal failures. The tricky bit of code here is ordering the columns so that we can determine the column index in the spreadsheet, which is required for other COM plumbing code where we select Excel ranges. This is just an implementation detail and could be eliminated with some more careful study of the Office automation libraries (but I’m not an Office programming expert at all). Here’s a sample column type with the constructor used:

class Column
{
    protected Column(string name, int index)
    {
        Name = name;
        Index = index;
    }

    public string Name { get; private set; }
    public int Index { get; private set; }
}

class StringColumn : Column
{
    internal StringColumn(string name, int index)
       : base(name, index)
    {
    } 
    …
}
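
For completeness – the post doesn’t show them – here’s roughly what the mapping attributes used on the entity type and consumed by CreateInstance<T> could look like (a sketch; the property names are inferred from the code above):

using System;

[AttributeUsage(AttributeTargets.Class)]
class SheetAttribute : Attribute
{
    public SheetAttribute(string name) { Name = name; }

    // The worksheet the entity type maps to, e.g. "Sheet1".
    public string Name { get; private set; }
}

[AttributeUsage(AttributeTargets.Property)]
class ColumnAttribute : Attribute
{
    public ColumnAttribute(string column) { Column = column; }

    // The spreadsheet column the property maps to, e.g. "A"; CreateInstance<T> reads this
    // (col.Attribute.Column) to compute the one-based column index.
    public string Column { get; private set; }
}

With those in place, the Where implementation presumably just evaluates the predicate against a freshly created dummy entity – something along the lines of var filter = predicate(CreateInstance<T>()); – to obtain the Filter object it needs to record (again, an assumption about the actual implementation).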

 

The state of ExceLINQ

As mentioned repeatedly during the write-up of this post, ExceLINQ is more of a conceptual, rapidly prototyped piece of demo code, so there’s lots of room for improvement:

  • General code clean-up, proper error handling and use of C# 4.0 dynamic to reduce COM interop plumbing.
  • Query<T> mutates its own state on every query operator call; it would be better to return new instances on every call as queries might act as the base for other queries (a common pattern).
  • A better design would be to go for the “type state machine approach” as outlined in this post, to restrict the order and multiplicity of query operators.

Anyway, if you’re curious to see how all of this works, I’ve uploaded the prototype over here. Usual warnings about sample code apply. Below is a nice view of stepping through ExceLINQ in the debugger:

image

Now pressing F10 applies the first Where clause, causing the filter to be applied in Excel instantaneously.

image

Notice the filtering icon in the first row now. This live filtering makes ExceLINQ an ideal visualization of query execution, which was a big part of its appeal for my “LINQ in Breadth” demo. Next we apply ordering:

image

and again, continuing execution causes the ordering to be applied immediately (notice the little sorting arrow in cell B1):

image 

Enjoy!

