FAFO on GitHub

You probably do not want to read this. I’m writing down my thoughts, in the vain hope of figuring out what they are and what they should be. Dave Childs was kind enough to send me some links to articles and other materials about XST. I include them at the end of this article, for reasons. The materials gave me some things to think about.

Thoughts

Several of the articles listed below got me thinking about performance. There have been a few instances of XST being used to produce very high performance on real benchmarks. Without knowing just how XST was implemented, or how the benchmark programs were expressed, I have no real insight into how that was done. Some indications suggest that part of its performance was due to very fast operations manipulating arrays of characters.

It’s clear enough, I think, that structures like grouping and indexing can be created using XST, and I plan to do a bit of that here myself. But Python is pretty far away from the bytes, and I’m not sure how we might handle that. I really don’t think that string slicing is likely to give us what we want: I think it’ll probably create mass quantities of string fragments and violent thrashing.

It’s an interesting limitation, as there are some fancy things one can do with moving bytes around and slicing and dicing, and they may be difficult or even impossible to emulate in Python, much less actually do. But there are some limited arrays and a buffer protocol that may be useful.

More Thoughts

Although set theory thinks that { a, b, c } equals { a, b, b, c, c, c }, humans generally prefer their reports to contain no duplication. Summary and statistical information, in particular, want to add in values just once, not two or three times for the “same” element. As implemented, some algorithms for set operations may produce duplicates, and others never can.

If a set’s storage representation contains no duplicates, then no operation that removes or selects elements from that set can produce duplicates. The intersection of two sets that do not contain duplicates will not contain duplicates.

However, the union of two sets that do not contain duplicates must be done carefully if its machine representation is not to contain duplicates.
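A minimal sketch of what “done carefully” might mean. This is not the actual XSet code; it assumes a representation where elements are kept in a plain Python list with no duplicates, so the union has to check membership as it merges:

```python
# Sketch, not the real implementation: elements live in a plain list
# that is assumed to contain no duplicates.
def careful_union(a, b):
    # Start from a copy of a; it is duplicate-free by assumption.
    result = list(a)
    for element in b:
        # Naively extending with b could introduce duplicates;
        # the membership test keeps the representation clean.
        if element not in result:
            result.append(element)
    return result

careful_union(["a", "b"], ["b", "c"])  # -> ['a', 'b', 'c']
```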

A projection of a set can produce duplicates, depending on what we remove with the projection. Consider:

{ { kitty^name, cat^species }, { oscar^name, cat^species } }

If we were to naively project off the name field, we might wind up with:

{ { cat^species }, { cat^species } },

an apparent duplication. Formally, our set operations might still work, but if we were to print that set some day, the physical duplication would probably need to be removed.
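The example can be played out in Python. This is a sketch, writing each element as a (value, scope) pair rather than the superscript notation, with a hypothetical `project_off` helper:

```python
# The two records from the example, as sets of (value, scope) pairs.
kitty = frozenset([("kitty", "name"), ("cat", "species")])
oscar = frozenset([("oscar", "name"), ("cat", "species")])

def project_off(record, scope):
    # Drop every pair carrying the given scope; keep the rest.
    return frozenset(pair for pair in record if pair[1] != scope)

projected = [project_off(r, "name") for r in (kitty, oscar)]
# Both records collapse to the same set once name is gone...
assert projected[0] == projected[1]
# ...and wrapping the results in a set crunches out the duplicate.
assert len(frozenset(projected)) == 1
```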

A Mistake?

We presently have a few different storage schemes for our sets: XFrozen, XTuple, XFlat, and XFlatFile. What can we say about duplicates in each of these schemes?

XFrozen cannot contain duplicates, because the frozenset implementation removes them.

XTuple cannot contain duplicates, because it has one and only one element at each scope from 1 to its length. The individual elements might be the same, but because the scopes are different, the set contains no duplicates.
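A minimal sketch of that argument, not the actual XTuple class: a Python tuple viewed as a set of (value, scope) pairs, where the scope is the 1-based position:

```python
# Sketch: a tuple's elements, each paired with its 1-based scope.
def as_scoped_set(values):
    return frozenset((v, i) for i, v in enumerate(values, start=1))

s = as_scoped_set(("a", "a", "b"))
# The two "a" values survive as distinct elements, because their
# scopes (1 and 2) differ; the set contains no duplicates.
assert s == {("a", 1), ("a", 2), ("b", 3)}
assert len(s) == 3
```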

XFlatFile maps a set of names onto fixed-layout bytes of a file. If the file has duplicate records, its XFlatFile representation can contain duplicate elements. However, XFlatFile presents a tuple: each record has a unique scope. So, as defined, it can never contain duplicates. In practice, our eyes might tell us otherwise.

XFlat maps a set of names onto an “array” of characters. In principle, the name set could contain the same name twice, mapping to the same region, or to two regions with the same characters, so the record could contain duplicates.
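A sketch of how that could happen, under assumed names (this is not the real XFlat): the record is a flat string of characters, and a list of (name, start, length) entries maps field names to regions. A list, unlike a dict, can hold the same name twice, which is how duplicates can sneak in:

```python
# A flat record of characters: two 10-character regions.
record = "kitty     cat       "

# The same name mapped twice to one region holding the same characters.
names = [("pet", 0, 10), ("species", 10, 10), ("species", 10, 10)]

def elements(record, names):
    # Each entry yields a (value, scope) pair.
    return [(record[s:s + n].strip(), name) for name, s, n in names]

pairs = elements(record, names)
# The pair ("cat", "species") appears twice: a physical duplicate.
assert pairs.count(("cat", "species")) == 2
```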

Generally, systems like this one can sort sets and weed out duplicates when they occur. At this writing, I do not know a general way to do that.

I mentioned a mistake. My XSet class implements XST faithfully, as far as it goes, I believe. And, as written up until now, it will not produce any physical duplicates in its data. That is primarily true because our code rather quickly brings everything down to tuples in instances of frozenset, which crunches out the duplicates.

Because our XFlatFile treats its underlying data as representing a tuple, any duplicate records are “ok”, because they have different scopes.

However, our storage schemes are not described set theoretically, and the mapping between a set at the data level and one of our storage schemes is not described set theoretically. The mapping is ad hoc. I think it is also “correct”, because I was careful, but it’s certainly not formalized.

That may be a mistake. Let me try to explain why.

Suppose that we have a flat file which might contain duplicate records. As a file, its proper set definition probably is that it is a tuple of fixed-length records. Suppose that we want those duplicates to be gone.

One way to express that would be to map all the scopes to Null:

{ 1 → Null, 2 → Null, … }

That might produce a set that was physically like { a, b, b, c, c, c }, but formally speaking, the duplicates are gone. We still have the issue of physical duplicates.
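A sketch of that move, again assuming (value, scope) pairs: replacing every scope with None flattens the tuple into a plain set, and here frozenset removes the duplicates physically as a side effect, which a file-backed representation would not do:

```python
# A tuple of records: values at scopes 1 through 6.
records = [("a", 1), ("b", 2), ("b", 3), ("c", 4), ("c", 5), ("c", 6)]

# Map every scope to None: formally, the duplicates are now gone.
unscoped = frozenset((value, None) for value, scope in records)

# frozenset also removes them physically -- three elements remain.
assert len(unscoped) == 3
```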

Informally, we can sort the file and spin through it removing duplicates. Formally speaking, the initial storage set and the resulting set might be considered to be equal at the data level, but at the storage level, they are clearly different. Our code can reflect that difference … but our theory does not.
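The informal sort-and-spin idea might look like this. A sketch, not code from the project: after sorting, equal records are adjacent, so comparing each record with the last one kept is enough:

```python
def dedup(records):
    # Sort, then keep each record only when it differs from its
    # predecessor; equal records are adjacent after sorting.
    result = []
    for rec in sorted(records):
        if not result or rec != result[-1]:
            result.append(rec)
    return result

dedup(["c", "a", "b", "a", "c", "c"])  # -> ['a', 'b', 'c']
```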

At this writing, I do not know how to deal with this.

Bibliography

I claimed to have reasons for including these, and I have. First, I want a place to refer to the next time this inclination comes over me. Second, it is barely possible that some reader might like to peruse some of the source material.