XST: Considering Some Exotic Transformations

We're walking a narrow path bounded by theory and practice. This time we'll do some work on those records with holes in them, like the first-name set. In so doing, we'll try not just to hack in the code, but to keep it consistent with our understanding of the math. We don't really understand the requirements, or the theory. Can we possibly succeed?

Handling Fields

As our set processor grows up, we’ll want to be handling named fields such as “LastName”, “FirstName”, and the like. I anticipate using Ruby’s “hash” capability to express the records of such sets, something like the examples in this test:

    def test_hash
      h = { :LastName=>"Jeffries", :FirstName=>"Ron" }
      assert_equal("Jeffries", h[:LastName])
      s = [ { :LastName=>"Jeffries", :FirstName=>"Ron" },
            { :LastName=>"Hendrickson", :FirstName=>"Chet" } ]
      assert_equal("Chet", s[1][:FirstName])
    end

For the Ruby-deprived, the curly brackets enclose a literal hash, and the :LastName things are “symbols”, which are like strings except that whenever you refer to the same symbol anywhere in the program, you get the same identical instance. Two uses of the same string (not symbol) are not guaranteed to be identical. I’m not thinking that the identity thing will be significant but am playing with the idea of suggesting that the field names of a set should be symbols.

The square brackets bound an array literal, so the second example, s, is an array of two hashes, which I mean here to represent a set of two records.

But for now, the point is that some more or less handy notation needs to be devised to let people create records and sets fairly conveniently in the Ruby language, and those examples might represent one way to do it. What’s on my mind tonight is what those things might mean and how they relate to our flat, string-based set.

I want to use the flat sets because they offer prospects of very fast processing for very large collections of data. And of course I’ll want some reasonably easy form for referring to sets, creating them taking them apart, and the like. This means that I’ll need some way of defining the mapping from one of those hash-style sets above, and the flat record format. The issue isn’t too deep, you just need to know, for each set, the origin and lenth of each field.

At least that’s true if the records are all the same. What if we had a set like this one:

    def test_mixed_set
      s = [ { :LastName=>"Jeffries", :FirstName=>"Ron" },
            { :Age=>35 } ]
      assert_equal( 35, s[1][:Age])
    end

Set-theoretically, this is a perfectly good set, with two records (which are also sets), one with two elements, a LastName element “Jeffries”, and a FirstName “Ron”. The other record has only one element, with Age = 35. While this would not be a valid relation, it is certainly a valid set in Extended Set Theory. We may wish to represent this set in a flattish format, and I was thinking about how we might do so.

Code ... and Sets

In code, we can surely figure out ways to represent a set like this. We could even pack up the names and values in an XML kind of format:

    <XSET>
      <XSET LastName="Jeffries" FirstName="Ron"/>
      <XSET Age="35"/>
    </XSET>

But if we’re going to represent things in flat byte sets, we will more than likely want to keep the mapping between field name (:LastName) and field position and length (say, 0, 20), separately, so that the byte operations can just run on numbers without needing to look things up. We have already seen how the elementary operations where everything is flat and lined up can work by just comparing corresponding bytes. But what might “corresponding bytes” mean if the records are not all uniform?

One way is suggested by Childs’s Scope Transform. Recall that that’s a set of scoped values, where the element represents the old scope and the scope represents a new one. Applying a Scope Transform to a set essentially renames the fields. If the field names are compact integers, the Scope Transform can be seen as shuffling the bytes around.

Let’s look at an example. Suppose we have our well-known name set, something like this:

<<Jeffries   Ron >
 <HendricksonChet>>

And we want to select on first name. We might have a selection set that looks like this:

<<Chet>>

The problem with this is that we really need a set that’s more like this:

<<...........Chet>>

So that the first name will “line up” with the first name bytes in the input set. Mathematically, things look like this:

<Jeffries   Ron > = { H⁰, e¹, n², d³, r⁴, i⁵, c⁶, k⁷, s⁸, o⁹, n¹⁰, C¹¹, h¹², e¹³, t¹⁴}
<Chet> = { C⁰, h¹, e², t³}
<...........Chet> = { C¹¹, h¹², e¹³, t¹⁴}

But in flat bytes, we have as yet, no way to indicate that the string “Chet” has its “C” character in position 11, not position 0. Taking a lead from the Scope Transform idea, we might prefix the bytes of each field of a record with little pairs of integers, saying what byte positions are really intended. We might have this for the offset Chet record:

<4, 11, C, h, e, t>

This would mean something like “The next four bytes start at 11, not 0”. Thus, the above set would “mean”:

{ C¹¹, h¹², e¹³, t¹⁴ }

So the idea is that we would have a particular kind of set … quite likely represented by a class, but perhaps by some encoding in memory, that represents an ordered collection of elements, but offset. Let’s try it.

ShiftedRecord -- a Short Spike

The probable use of a ShiftedRecord, if we had one, would be to use it in operations like :restrict. The :restrict operation’s essence comes down to the subset operation: we want to know if our selecting record is a subset of the record being considered for selection. So let’s just create one XSet record, and a couple of ShiftedRecords, and see what we can make happen:

    def test_shifted_record
      r = XSet.new(1, "Hendrickson Chet", 1)
      chet = ShiftedRecord.new(12,4,"Chet")
      ron = ShiftedRecord.new(12,4,"Ron ")
      assert(chet.subset?(r), "Chet not found")
      assert(!ron.subset?(r), "Ron found")
    end

Simple enough. I decided to include the offset and length explicitly in the constructor, though before we’re done, we might well pack them into the first bytes as shown in the example set at the end of the previous section. For now, this should be hard enough: a slightly larger bite than I might usually take, but sitting in the Seattle airport waiting for the red-eye, I’m feeling strong. It takes a few tries to get the details down, but rather quickly I come up with this new class:

  class ShiftedRecord
    def initialize(offset, length, string)
      @offset = offset
      @length = length
      @string = string
    end

    def subset? set
      each do  | se |
        if ( set.element(se.scope) != se.element ) 
          return false
        end
      end
      return true
    end

    def each
      for index in 0...@length
        yield ScopedElement.new(@string[index,1], index+@offset)
      end
    end
  end

Take a quick look. To do :subset?, I needed to iterate over each element of the ShiftedRecord, which meant that I had to build the :each method. That turns out to be easy enough. We look at the raw string’s elements from 0 up to and not including @length. We produce a scoped element, but instead of using the index as its scope, we add the @offset.

If you’re not a Ruby expert, the expression

      @string[index,1]

Might throw you. It turns out – and I either never knew this or had forgotten it myself – that when we access a string, that’s how we get the character back. If we had said:

@string[index]

the code would have returned an integer whose value is that of the string element. Since the element-selecting code in XSet always uses the length (even if it’s 1), we need to use it here as well.

The implementation of subset? is straightforward enough. Note that it’s not the same, however, as in XSet, which has:

    def subset?(larger_set)
      element_range.all? { | scope |
        larger_set.contains?(element(scope), scope)
      }
    end

instead of

    def subset? set
      each do  | se |
        if ( set.element(se.scope) != se.element ) 
          return false
        end
      end
      return true
    end

I am inclined to think of this use of two obviously different pieces of code to do the same thing as a kind of duplication in disguise. If the code were physically the same, we would be thinking of some way to remove it. Since it looks different, we may imagine that it’s different for a reason and accept it as it stands. Because this ShiftedRecord is just an experiment, I don’t really have to do anything, but just to keep sharp, we’ll check it out.

First thing I try is to replace the short version in XSet with the longer version from ShiftedRecord. The tests all run – the ShiftedRecord version is generic enough to work on all set types that we have so far. How about the other way? That won’t work, at least now, because ShiftedRecord doesn’t implement :element_range, nor does it implement :element. It certainly could implement those. In XSet, :element_range just returns 0…cardinality, and :cardinality in ShiftedRecord is equal to @length. And :element is basically what the :each method of ShiftedRecord is computing, along the way to producing the ScopedElement, so it should be quite possible to make this work. But :element_range for a shifted set isn’t 0…cardinality, but instead is @offset…@offset+@cardinality.

I’m not sure whether I should bring these two implementations into line or not. Whether I do or not, though, it’s clear to me that I can, and that is encouraging both from a software design / simplicity viewpoint, and from the theoretical viewpoint as well. The XSet implementation adheres pretty close to pure set theory, and since I’m sure that I could make ShiftedRecord adhere also, I feel that I haven’t gone too far into the pragmatic side of things to keep the code aligned with the math.

For now, I’ll leave things as they are, because the ShiftedRecord is just an experiment to tell me whether there’s a way to field alignment, shifting the byte comparisons around as needed. This isn’t obviously the way to do it, but it is at least one way to do it, and it was simple enough. I’ll pause for now, and sum up.

Theory and Practice Still Close Enough

I was concerned with whether it would be possible to use the Scope Transform idea to let me do the first name restrict. It turns out that it is possible to do that. The implementation I spiked was ultra-simple, in that it just shifted a record from 0 to 11, while the unlimited Scope Transform set operation can map any scope to any other. But the code does transform the scopes, and gives me a bit of insight into what’s possible.

It also raises some questions. The original XSet is really flat: there are no control fields in the records. They are just data. And that’s part of what will make the base-level set operations efficient: they just move bytes around quite simply. While it might be a good idea to include a packet of offset-length information with every field, it seems like it could get pretty costly.

But if we do not include that information in the data, how will we include it? If we wanted to store our ShiftedRecord sets as bytes, how could the software tell the difference between that kind of data, and the purely flat kind? Frankly, at this point, I don’t know. What I have, though, is confidence that I can represent these shifted records, and more complex versions of them, either as classes or as encoded bytes. So I feel safe to proceed.

The Bigger Picture -- Mostly Unknown!

My purpose in writing these articles is to figure out, and share with you, how I would approach a project where both the required features, and the underlying technology, are almost completely unknown. I was raised to believe that we had to understand the requirements, understand the architecture, understand the design, before we could do any coding.

XP and the Agile methods, of course, call for coding from the very beginning. We might be able to stretch our minds to imagine that in an application area where we understood both the application pretty well and the underlying technology pretty well, we could do this. But that’s not our case here.

In this application, I only know that I want to do something with Extended Set Theory, that I want to preserve mathematical purity, and that I want to proceed by developing the program, not just building an overall design. I have to say that so far, so good. I’ve built one class, XSet, that works against the bytes of a set in memory. I’ve built another class, ShiftedRecord, that is interoperable with the XSet but that represents a very exotic concept, namely a set with “a hole” in it.

I really doubt whether ShiftedRecord will make it into the later versions of this program, but it certainly has proven its worth in helping us understand what may be possible.

And I’m certainly not ready to write down the patterns of thinking and implementation that let me do what I’m doing. Perhaps you’ll spot them before I do (and if you do, please write and let me know). The main thing is that so far, at least, I’ve made a bit of progress on the theory, and a bit of progress on the code, and so far neither one seems to be suffering.

I’m inclined to keep pushing to see what happens. I hope you’re still along for the ride. Keep those cards and letters coming in!