Saturday, March 26, 2005 8:11 PM bart

Adventures in Comega - part 3 (Content classes)

Introduction

One of the goals of the Comega language is to build a bridge between semi-structured data (read: XML) and objects. In future posts I'll describe how Comega fills the gap between relational data (read: SQL) and objects. But in this post, let's concentrate on the former case.

 

About DTD and XSD

By itself, XML is nothing more than a large text string or text file containing semi-structured data, separated and ordered by means of a tagging mechanism. Although the different fields can be distinguished, there's a strong need to give fields a meaning by using types. There's another need too: being able to express certain constraints on the usage of fields (for example: has to occur, is optional, can occur one or more times, etc). That's where DTD and XSD, both XML schema languages, come into play. Of course, the .NET Framework supports this kind of stuff by default (using the System.Xml namespace), but Cw wants to integrate these things deeper into the language itself.

 

Content classes - a first view

Let's take a simple example of a book (library) collection. As you know, a book has a title, one or more authors, an ISBN code and optionally you can categorize it in one or more categories. In DTD, this looks as follows:

<!ELEMENT Book (Title, Authors, ISBN, Categories)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Authors (Author+)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT Categories (Category*)>
<!ELEMENT Author (#PCDATA)>
<!ELEMENT Category (#PCDATA)>

As you can see, the symbols + and * are used to indicate respectively "one or more" and "zero or more". There's also the symbol ? that can be used to indicate "zero or one" (= optional). What we don't have here is strong typing.

As an alternative we can use XSD to describe the same structure:

<element name="Book">
  <complexType>
    <sequence>
      <element name="Title" type="string"/>
      <element name="Authors">
        <complexType>
          <sequence>
            <element name="Author" type="string" minOccurs="1" maxOccurs="unbounded"/>
          </sequence>
        </complexType>
      </element>
      <element name="ISBN" type="string"/>
      <element name="Categories">
        <complexType>
          <sequence>
            <element name="Category" type="string" minOccurs="0" maxOccurs="unbounded"/>
          </sequence>
        </complexType>
      </element>
    </sequence>
  </complexType>
</element>

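Instead of *, + and ?, XSD uses the minOccurs and maxOccurs attributes on the element declarations ("one or more" is minOccurs="1" with maxOccurs="unbounded", "zero or more" is minOccurs="0" with maxOccurs="unbounded"). Functionally, it's the same, and here we have strong typing of the elements.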
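To make the correspondence between the DTD symbols and the XSD attributes concrete, here's a small Python sketch (Python is used only because it's easy to run; it has nothing to do with Cw itself). The DTD_TO_XSD table and the xsd_element helper are hypothetical names made up for this illustration.

```python
# Hypothetical lookup table: DTD occurrence indicator -> (minOccurs, maxOccurs).
# The empty string stands for "no indicator", meaning exactly one occurrence.
DTD_TO_XSD = {
    "": (1, "1"),
    "?": (0, "1"),
    "+": (1, "unbounded"),
    "*": (0, "unbounded"),
}


def xsd_element(name, indicator="", xsd_type="string"):
    """Render a DTD-style child as an XSD element declaration string."""
    min_occurs, max_occurs = DTD_TO_XSD[indicator]
    return (
        f'<element name="{name}" type="{xsd_type}" '
        f'minOccurs="{min_occurs}" maxOccurs="{max_occurs}"/>'
    )


print(xsd_element("Author", "+"))
print(xsd_element("Category", "*"))
```

The table only covers the three symbols and the "exactly one" default; XSD can also express arbitrary bounds such as "between two and five", which the DTD symbols can't.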

Both structures can be used to define a book like this:

<Book>
  <Title>Title goes here</Title>
  <Authors>
    <Author>First Author</Author>
  </Authors>
  <ISBN>0123456789</ISBN>
  <Categories>
    <Category>One</Category>
    <Category>Two</Category>
  </Categories>
</Book>

However, it's far from cool to use this kind of data definition inside code. Did you ever use XmlDocument (DOM) or other XML processing APIs like SAX? Constructing this kind of data object is far from easy and looks rather clumsy when viewed inside code. Luckily, there are a couple of ways to get around this, most notably the use of a strongly typed DataSet in the .NET Framework (created by using xsd.exe). But in the end, the internal representation of elements marked with ?, +, * is based on collection types and you get to see this directly, e.g. through the DataTable's Rows collection. Okay, you can iterate over it, but the translation battle going on to map both data representations is pretty visible.
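To give an idea of how much ceremony a DOM-style API needs for the tiny book document above, here's a sketch using Python's xml.dom.minidom, which follows the same W3C DOM model as XmlDocument. The helper names add_text_element and build_book_dom are made up for this illustration.

```python
from xml.dom.minidom import Document


def add_text_element(doc, parent, tag, text):
    """Create <tag>text</tag> under parent and return the new element."""
    element = doc.createElement(tag)
    element.appendChild(doc.createTextNode(text))
    parent.appendChild(element)
    return element


def build_book_dom():
    """Build the sample Book document node by node."""
    doc = Document()
    book = doc.createElement("Book")
    doc.appendChild(book)

    add_text_element(doc, book, "Title", "Title goes here")

    authors = doc.createElement("Authors")
    book.appendChild(authors)
    add_text_element(doc, authors, "Author", "First Author")

    add_text_element(doc, book, "ISBN", "0123456789")

    categories = doc.createElement("Categories")
    book.appendChild(categories)
    for name in ("One", "Two"):
        add_text_element(doc, categories, "Category", name)
    return doc


doc = build_book_dom()
print(doc.documentElement.toprettyxml(indent="  "))
```

Even with a helper to cut down on repetition, the shape of the document is buried in create/append calls, and nothing checks that a Book really has at least one Author.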

So, how can Cw help us accomplish a better model to cope with this semi-structured data in a more object-oriented fashion? The answer is content classes, which are based on the DTD syntax but have strong typing aboard, using the type model of the language and runtime (therefore every object can be used in the structure). Here's the book sample as a content class:

public class Book {
  struct {
    string Title;
    struct {
      string Author;
    }+ Authors;
    string ISBN;
    struct {
      string Category;
    }* Categories;
  }
}

Optional fields can be declared in a similar fashion using the ? symbol. For example, a book can have an optional URL with additional information and/or errata:

public class Book {
  struct {
    string Title;
    struct {
      string Author;
    }+ Authors;
    string ISBN;
    struct {
      string Category;
    }* Categories;
    string? URL;
  }
}
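For readers who don't have the Cw compiler at hand, the constraints expressed by this content class can be approximated in Python. This is only a loose analogue and not how Cw works: Cw checks the +, * and ? constraints in its type system, while the sketch below has to check "one or more authors" at run time. The field names are my own lower-case versions of the ones above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Book:
    title: str
    authors: List[str]                                    # "+" : one or more
    isbn: str
    categories: List[str] = field(default_factory=list)  # "*" : zero or more
    url: Optional[str] = None                             # "?" : zero or one

    def __post_init__(self):
        # Python's type hints can't express "non-empty list", so check it here.
        if not self.authors:
            raise ValueError("a Book needs at least one author")


book = Book("Title goes here", ["First Author"], "0123456789", ["One", "Two"])
print(book)
```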

Nice, isn't it? Now, how do you use this? The answer is again pretty simple and understandable: use XML inside the code, like this:

public Book GetSomeBook()
{
  return <Book>
           <Title>Title goes here</Title>
           <Authors>
             <Author>First Author</Author>
           </Authors>
           <ISBN>0123456789</ISBN>
           <Categories>
             <Category>One</Category>
             <Category>Two</Category>
           </Categories>
         </Book>;

}

In an analogous fashion one can declare and assign a variable using XML syntax, like this (you don't need to mention the type):

b = <Book>
      <Title>Title goes here</Title>
      <Authors>
        <Author>First Author</Author>
      </Authors>
      <ISBN>0123456789</ISBN>
      <Categories>
        <Category>One</Category>
        <Category>Two</Category>
      </Categories>
    </Book>;

Okay, this looks pretty static right now, doesn't it? How can we make it somewhat more dynamic, so that we can construct a book with a given title and ISBN, for example?

public Book GetSomeSpecificBook(string title, string ISBN)
{
  return <Book>
           <Title>{title}</Title>
           <Authors>
             <Author>First Author</Author>
           </Authors>
           <ISBN>{ISBN}</ISBN>
           <Categories>
             <Category>One</Category>
             <Category>Two</Category>
           </Categories>
         </Book>;
}

This will construct a book using the given data. Notice that in between the curly braces one can specify a full expression too (e.g. to create a sum of certain values). Notice you can still use the default constructor approach too.

Now, assume you have a Book instance, how to grab the data from it in order to display it, transfer it, or something else? Look at the following example:

public void ProcessBook()
{
  Book b = GetSomeBook();

  foreach (it in b.Categories.Category) { Console.WriteLine(it); }
}

Notice the use of the it iterator variable again (which automatically gets the right type). As you can see, b.Categories.Category is in fact comparable to the XPath expression Categories/Category, evaluated relative to the Book element, that you'd use in classic XML processing to obtain the values. Queries (which will be explained in a later post) can be applied as well, including transitive queries (get all values associated with a certain "label" in nested structures, using the ... notation) and member selection to obtain a stream of values (see the previous post for more information about streams), which can be combined with the filter [...] syntax. So, as you can see, this technology is already very broad.
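For comparison, this is roughly what the same two lookups look like with a classic path-based API. The sketch uses Python's xml.etree.ElementTree; category_names mirrors b.Categories.Category, and all_text_of is a rough stand-in for a transitive query because it searches at every nesting depth. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """
<Book>
  <Title>Title goes here</Title>
  <Authors><Author>First Author</Author></Authors>
  <ISBN>0123456789</ISBN>
  <Categories>
    <Category>One</Category>
    <Category>Two</Category>
  </Categories>
</Book>
"""


def category_names(book_xml):
    """Follow the fixed path Categories/Category below the Book element."""
    book = ET.fromstring(book_xml)
    return [c.text for c in book.findall("Categories/Category")]


def all_text_of(book_xml, tag):
    """Collect the text of every element called tag, however deeply nested."""
    book = ET.fromstring(book_xml)
    return [e.text for e in book.iter(tag)]


for name in category_names(BOOK_XML):
    print(name)
print(all_text_of(BOOK_XML, "Author"))
```

The path is just a string here, so a typo like "Categories/Categroy" only shows up at run time as an empty result, whereas the Cw member access is checked by the compiler.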

 

Extending the content class

As a content class is a class, it can also contain other members, such as methods. In fact, the struct defines the data structure the class represents, in a different way than standard private attributes would. Logically, these methods have access to the data "attributes" of the class too, in order to manipulate or query the data. To do this, declare a method inside the class definition.

Now, assume that categories have a structure like "maincat-subcat-subcat" and you want to determine whether a book is in a certain main category. A book can have multiple categories, so one approach would be to use the foreach (it in ...) syntax to iterate over all the categories associated with the book instance. As an alternative, let's use a so-called transitive query. By using this...Category we obtain a stream of all the categories associated with the current book instance. Then we can use the :: operator to refine the result by applying a filter, which in turn returns a filtered stream. Together, this looks like this:

this...Category::*[SomeFilter(it)]

So, inside the filter we're using a method that gets the current value of the iterator that is doing the filtering (called it, as explained earlier). Last but not least, you need to define the "SomeFilter" method. As we only want to use it locally in our "main category boolean method", we can use something called nested methods in Cw. The total implementation is this:

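public virtual bool HasMainCategory(string category)
{
  // Nested method: true when candidate looks like "<sel>-subcat-...".
  bool IsOfMainCategory(string candidate, string sel)
  {
    return candidate.StartsWith(sel + "-");
  };

  // The transitive query yields all categories; the filter keeps the matches.
  return this...Category::*[IsOfMainCategory(it, category)] != null;
}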

So, if we find a category in the list of categories with the given main category, we'll return true, otherwise false.
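The same filter logic, written as a plain Python function for readers who want to experiment with it. This is an analogue under my own assumptions: the categories are passed in as a list instead of being queried from the content class, and any() plays the role of the "filtered stream is not null" test.

```python
def has_main_category(categories, main_category):
    """Return True if any category has the form '<main_category>-...'."""

    def is_of_main_category(candidate, sel):
        # Nested helper, like the nested method in the Cw version.
        return candidate.startswith(sel + "-")

    return any(is_of_main_category(c, main_category) for c in categories)


categories = ["dotnet-languages-cw", "xml-schemas"]
print(has_main_category(categories, "xml"))
print(has_main_category(categories, "sql"))
```

Note that, just like the Cw version, a category that consists of the main category alone (without a "-subcat" part) doesn't count as a match.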

 

What's the IL :-)

Time for the nerdy stuff: what is a content class translated to upon compilation? Again, let's investigate this incrementally. We'll kick off with a very simple sample:

class Test
{
  struct {
    string val;
  }
}

This is likely not that useful, but fairly interesting for demonstration purposes. Compile it, and ildasm will show you this:

.field public valuetype StructuralTypes.Tuple_String_val sequence
.custom instance void [System.Compiler.Runtime]System.Compiler.AnonymousAttribute::.ctor() = ( 01 00 00 00 )

So, the compiler defines a "structural type" called Tuple_String_val and stores it in a public field named sequence. Further examination of that helper type results in this:

.class public auto ansi sealed Tuple_String_val
       extends [mscorlib]System.ValueType
       implements [System.Compiler.Runtime]StructuralTypes.ITupleType,
                  System.Collections.Generic.'IEnumerable'

{
} // end of class Tuple_String_val

As you can see, the type is a value type (it extends System.ValueType), implements the ITupleType interface and is a generic IEnumerable collection of strings too. Furthermore, there is a public field val (that we declared explicitly):

.field public string val

And the expected method GetEnumerator to get the enumerator:

.method public virtual instance class System.Collections.Generic.'IEnumerator'
        GetEnumerator() cil managed
{
  // Code size       12 (0xc)
  .maxstack  8
  IL_0000:  ldarg.0
  IL_0001:  ldobj      StructuralTypes.Tuple_String_val
  IL_0006:  newobj     instance void System.Collections.Generic.EnumeratorTuple_String_val::.ctor(valuetype StructuralTypes.Tuple_String_val)
  IL_000b:  ret
} // end of method Tuple_String_val::GetEnumerator

This explains the possibility to use the foreach construct to iterate over the object.

Okay, time for something more. What about the ?, + and * symbols? Consider the following sample:

class Test
{
  struct {
    string* val1;
    string+ val2;
    string? val3;
  }
}

This is far heavier when you look at the IL. For the *, not that much changes. The basic difference in the StructuralType is the fact that you end up with a collection instead of a simple string as the attribute:

.field public class System.Collections.Generic.'IEnumerable' val1

For the +, the situation is far more complex. First of all, there should be a val2 field in the Tuple_IEnumerable that looks like this:

.field public valuetype StructuralTypes.'NonEmptyIEnumerable' val2

Again it's a generic type created using the System.String type, but this time it's of the type "NonEmptyIEnumerable". That is exactly what + is supposed to do ("one or more"). So inside the StructuralTypes section you'll find this type declared. Inside it, you'll find mainly enumerator logic and quite a few conversion functions (implicit/explicit) to convert to various helper types. The helper types (also in StructuralTypes) include NonNull and Boxed, both with a generic nature (in our case, typed with the System.String type). I won't cover these in much more detail right now.

And finally we have the ? operator that leads by itself to a Boxed type:

.field public valuetype StructuralTypes.'Boxed' val3

This type again implements the generic IEnumerable for System.String.
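To get a feel for what such helper types do, here's a toy Python version of the two ideas. To be clear, this is my own sketch and not a translation of the real StructuralTypes code: NonEmpty is an iterable that can't be empty (the + case) and Boxed is an iterable holding zero or one value (the ? case). In this sketch None means "no value", which is an assumption of mine.

```python
class NonEmpty:
    """Iterable wrapper that refuses to hold zero items (the '+' case)."""

    def __init__(self, items):
        self._items = list(items)
        if not self._items:
            raise ValueError("expected at least one item")

    def __iter__(self):
        return iter(self._items)


class Boxed:
    """Holds zero or one value and iterates over it (the '?' case)."""

    def __init__(self, value=None):
        self._value = value

    def __iter__(self):
        # Yield the value only when there is one; otherwise the loop is empty.
        if self._value is not None:
            yield self._value


for author in NonEmpty(["First Author"]):
    print(author)
for url in Boxed():
    print(url)  # never reached: the box is empty
```

Because both wrappers are iterable, the same foreach-style loop works for *, + and ? fields alike, which matches what the IL above suggests.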

All together, you'll see a fairly complicated set of helper types popping up after compilation. Our Book sample, for instance, results in 10 helper types being created. The nesting of the structs in our content type can be examined in that case and looks like this:

.field public valuetype StructuralTypes.'NonEmptyIEnumerable' Authors
.field public class System.Collections.Generic.'IEnumerable' Categories
.field public string ISBN
.field public string Title

So there are two other Tuple types for the nested structs. And on the class level the following declaration can be found:

.field public valuetype StructuralTypes.'Tuple_String_Title_NonEmptyIEnumerable_Authors_String_ISBN_IEnumerable_Categories' sequence

So, in the end two StructuralTypes are referred to in the declaration of the type: one for the authors and one for the categories.

Time to examine the constructor logic that is spit out by the compiler when it finds the XML declaration. In order to keep things (a bit) simple, let's use the following content class:

class Test
{
  struct {
    string* val;
  }

  public static Test GetTest()
  {
    return <Test><val>blah</val></Test>;
  }
}

This is the result:

.method public hidebysig static class Test
        GetTest() cil managed
{
  // Code size       56 (0x38)
  .maxstack  5
  .locals init (class Test V_0,
           string V_1,
           class System.Collections.Generic.'List' V_2,
           valuetype StructuralTypes.'Tuple_IEnumerable_val' V_3,
           class Test V_4,
           class Test V_5)
  IL_0000:  newobj     instance void Test::.ctor()
  IL_0005:  stloc.0
  IL_0006:  ldstr      "blah"
  IL_000b:  stloc.1
  IL_000c:  newobj     instance void System.Collections.Generic.'List'::.ctor()
  IL_0011:  stloc.2
  IL_0012:  ldloc.2
  IL_0013:  ldloc.1
  IL_0014:  call       instance int32 System.Collections.Generic.'List'::Add(string)
  IL_0019:  pop
  IL_001a:  ldloca.s   V_3
  IL_001c:  ldloc.2
  IL_001d:  stfld      class System.Collections.Generic.'IEnumerable' StructuralTypes.'Tuple_IEnumerable_val'::val
  IL_0022:  ldloc.0
  IL_0023:  ldloc.3
  IL_0024:  stfld      valuetype StructuralTypes.'Tuple_IEnumerable_val' Test::sequence
  IL_0029:  ldloc.0
  IL_002a:  stloc.s    V_4
  IL_002c:  br         IL_0031
  IL_0031:  ldloc.s    V_4
  IL_0033:  stloc.s    V_5
  IL_0035:  ldloc.s    V_4
  IL_0037:  ret
} // end of method Test::GetTest

So, there's a call to add the "blah" string to a List, which is then wrapped into a Tuple_IEnumerable and stored in the sequence field of the new Test instance that gets returned.

 

Question for you guys

There is some mistake in the previous sample. When you try to do this:

  public static Test GetTest()
  {
    return <Test><val>blah</val><val>bla</val></Test>;
  }

you'll end up with this error message from the compiler:

test.cw(12,34): error CS2518: Invalid content 'val' in element 'Test', the content for this element is already complete.

Make a fix to the code in order to get rid of this problem. Tip: it's just a one-character fix. In the end, I want to be able to write this:

  public static void Main()
  {
    Test t = GetTest();
    foreach(it in t.val)
      Console.WriteLine(it);
  }

which should print

blah
bla

on the screen. Enjoy!
