Some friendly folks have been making suggestions and asking questions. I'm still in the mode of figuring out what to do and how to do it. For those who are interested ...

From My Email

Chris Wheeler pointed me to the MemoryStream class in .NET. He points out that it has high performance, and is basically an addressable raft of bytes. I’m grateful for that lead, and will keep it in mind for some experimentation if I go with C#.

Chris went on to ask about two possible directions one could go with this stuff. Is the right thing to build some XST collections that understand set operations like restrict, or should one get down and dirty with the bytes? Here’s how I replied to him:

I understand. Yes. The whole project, if it were to become one, would need to involve both domain-level set operations, and then the low-level storage ones. A really interesting opportunity arises when you get to actually teaching the system to do limited "smart" set theory on its own.
For example, there is an operation that I call "uparrow", which can be seen as a domain restriction: given a list of values, return all the records whose domain (superscript) is in that list. If the records happen to be in an ordered set, which they always are, then the domain restriction means "return all the records whose record numbers are in this list".
So imagine that we had some set and wanted to restrict on a set that contained, e.g., LastName values. Instead of doing a regular restrict, we might do an operation that would search the main set and return the LastName values, and the domain value for the record in question. That would produce a new set with just LastName and the record number containing that LastName.
Then we do an operation called "image", which is just like restrict except that it returns only the rest of the input record, not the whole thing. The "rest" in this case is the record numbers. Then we do uparrow, which gives us the result we want. If the number of records selected by that operation is small relative to the size of the original data set, it won't cost much more that way, and we have built an index on LastName as a side effect. The next inquiry for LastName processes the index, which is smaller than the original data, and is therefore faster.
Now it's easy enough to do that informally, as just discussed here. What's cool is that you can also write the equations for that: they represent an identity equation: this expression produces the same set as that one. So the "optimizer" in an XST system just does a little math, transforms the original query formally (and therefore more likely correctly) and executes it. Result: even more speed.
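The informal walk-through above can be sketched in plain Ruby. Everything here is invented for illustration (the record layout, the variable names, treating record numbers as hash keys); it is not an XST library, just the shape of the idea:

```ruby
# The "database": records keyed by record number (the domain/superscript).
records = {
  1 => { first: "John", last: "Smith" },
  2 => { first: "Ann",  last: "Jones" },
  3 => { first: "Bob",  last: "Smith" },
}

# Step 1: scan once, keeping only the LastName value paired with the
# record number that contains it.
last_name_pairs = records.map { |num, rec| [rec[:last], num] }

# Step 2: "image" -- restrict on the names we want, but keep only the
# rest of each pair, i.e. the record numbers.
wanted = ["Smith"]
matching_numbers = last_name_pairs
  .select { |last, _num| wanted.include?(last) }
  .map    { |_last, num| num }

# Step 3: "uparrow" -- domain restriction by record number.
result = records.select { |num, _rec| matching_numbers.include?(num) }

# Side effect: last_name_pairs is now an index on LastName, ready to
# be reused for the next LastName inquiry.
result.keys  # => [1, 3]
```

The point of the sketch is the side effect: step 1's output is exactly the LastName index the text describes, and the next query can start from it instead of the full records.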

So where am I going with this Extended Set notion? Am I thinking that XST would be a good way to represent objects, collections, and searches, at the level of domain objects? Or is this a technology for implementing a database representation of objects? Either is certainly possible.

Domain, Database, or What?

Should we, could we, improve the domain language?

When we look at how relational databases are used today in languages like C# and Java, it seems as if there’s more of a divide than there should be. This shows up particularly in objects like C#’s DataSet, DataTable and such. These objects seem to be all about the database, and not much at all about my application. A DataTable is a collection of DataRows. A DataRow has a bunch of DataColumns. If I want to get something out of a DataRow, I use a message like myRow.Item(“LastName”). That’s not very pretty, and I’d rather deal with a class of my own, that has properties, so I can say, instead, something like myPerson.LastName.

Searching and finding things in these objects is weak and ugly. Here’s an example of Find on the .NET DataRowCollection:

private void FindInMultiPKey(DataTable myTable)
{
    DataRow foundRow;
    // Create an array for the key values to find.
    object[] findTheseVals = new object[3];
    // Set the values of the keys to find.
    findTheseVals[0] = "John";
    findTheseVals[1] = "Smith";
    findTheseVals[2] = "5 Main St.";
    foundRow = myTable.Rows.Find(findTheseVals);
    // Display column 1 of the found row.
    if (foundRow != null)
        Console.WriteLine(foundRow[1]);
}

What are those values? Well, they are the primary key columns of the DataRows inside the DataRowCollection of the DataTable. We have to provide them in order.

Compare this with something we might say to a collection of objects in a Ruby collection:

myPeople.select { |each|
  each.firstName == "John" &&
  each.lastName == "Smith" &&
  each.address == "5 Main St."
}

Here, we’re talking directly to the objects, and we can ask them questions. We could even say something like this, to find everyone under 35:

myPeople.select { |each| each.age < 35 }

That’s just about out of the question with the data access objects of C# (and, as far as I know, Java).

In C#, would XST offer a better alternative? For the John Smith example, it might. We might say something like this:

  Person searchPerson = new Person("John", "Smith");
  XstSet foundPeople = myXstPeople.Restrict(searchPerson);

We could even imagine something using the field names, taking perhaps a Hashtable as input. But it still wouldn’t be a pretty sight.
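In Ruby, where the document's experiments are headed, that field-names idea falls out naturally with a Hash. This is a hypothetical sketch, not an XST API; the `restrict` method, the field names, and the sample data are all invented here:

```ruby
# A restrict that takes field-name => value pairs, in the spirit of
# the Hashtable idea: keep the records whose named fields all match.
def restrict(people, criteria)
  people.select do |person|
    criteria.all? { |field, value| person[field] == value }
  end
end

people = [
  { firstName: "John", lastName: "Smith", address: "5 Main St." },
  { firstName: "Jane", lastName: "Doe",   address: "9 Elm St."  },
]

restrict(people, firstName: "John", lastName: "Smith")
# => the John Smith record, in a one-element array
```

In Ruby the hash literal reads almost like a query, which is part of why the language side of this problem feels mostly solved there.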

Equally interesting, Microsoft has announced a capability called LINQ that will allow “lambda expressions” (what Smalltalk and Ruby call closures or blocks) that will be able to be used uniformly on collections, XML, and databases. The examples they have provided look very interesting and powerful. Does this argue that there’s not much benefit to trying to improve the language aspects of things, at least in C#?

Should we, could we, focus on performance?

XST also offers some fascinating prospects for high performance. I mentioned that Dave Childs’ implementation from years ago can outperform Oracle and other nominally high-performance systems on at least some very large and complex queries. Even when we look at the simple restrict operation defined last time, there are some interesting performance opportunities.

If we process a collection using method calls to pull out values, we run a fair amount of code on each record. If we process it with restrict, instead of fetching each member variable, we just compare a sequence of bytes for equality. This can be done in the tightest inner loop of the search. This can be quite a bit faster: one thing I should do in this implementation is test that possibility early on.
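A first stab at that byte-comparison idea, in Ruby: fixed-width records packed into one String, restricted by comparing byte ranges directly. The field widths and layout here are invented for illustration, just to show the shape of the inner loop:

```ruby
# Two fixed-width fields per record: firstName in bytes 0..9,
# lastName in bytes 10..19.
FIELD_WIDTH  = 10
RECORD_WIDTH = 20

def pack_record(first, last)
  first.ljust(FIELD_WIDTH) + last.ljust(FIELD_WIDTH)
end

data = pack_record("John", "Smith") +
       pack_record("Jane", "Doe") +
       pack_record("Bob",  "Smith")

# Restrict on lastName == "Smith" by raw byte comparison: the inner
# loop does one substring compare per record, with no accessor calls
# and no objects built for non-matching records.
target  = "Smith".ljust(FIELD_WIDTH)
matches = []
(0...data.length).step(RECORD_WIDTH) do |offset|
  matches << offset if data[offset + FIELD_WIDTH, FIELD_WIDTH] == target
end
matches  # => byte offsets of the matching records: [0, 40]
```

Whether this actually beats per-object accessor calls is exactly the kind of thing the early experiment should measure, but the inner loop is about as tight as Ruby-level code gets.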

And if the operation optimizes as hinted at in the quote above to Chris Wheeler, it can go even faster.

So we might build an XST library that, while it didn’t improve the language side of things all that much, provided high performance in a lightweight, easy-to-use package.

It's a Dilemma

At this writing, I’m not sure which way to go. I’m not at all convinced that we can improve the lot of the collection user much, and with LINQ in the offing, any benefit might be short-lived. Faster collections, though, might pay off in C#, so that could still be worth doing. Alternatively, it would be easier to do things, and easier to learn where the benefits lie, if I were to do the work in Ruby. And it would be a lot more fun, in many ways.

Such are the concerns when you try to invent a product based on a technical idea. There are lots of prospects, and ultimately you have to come up with stories that you can accomplish, and that offer a good chance of delivering value.

We’re not there yet, but it’s time for an experiment.

Next Steps: Learn Something

My inclination is to work on the high-performance end of the XST ideas. The language issues are fairly well-covered by existing features in Ruby, and the C# 3.0 capability also appears to cover much of what XST might offer, moving it closer to Ruby, though also, unfortunately, closer to the SQL style of thinking. If C# does it, Java can’t be far behind.

So the next step will be to experiment in Ruby to see what kind of byte structure can be built, and how to search it. At this writing, it looks like the object of choice will be the string type.

Next stop: restrict operator, on strings, in Ruby.