diff --git a/src/mike/2024/12/ordered-maps.html b/src/mike/2024/12/ordered-maps.html new file mode 100644 index 0000000..d994437 --- /dev/null +++ b/src/mike/2024/12/ordered-maps.html @@ -0,0 +1,177 @@ + + + + + Ordered Maps + + + + + + +

Ordered Maps

+ +

There's been a lot of discussion recently in the QT4 community group about introducing + ordered maps: that is, maps in which there is a defined and predictable ordering of entries, + typically "last-in, last-out" where the most recently added entry always appears at the end. + The main motivation for this is that JSON is designed (like XML) to be human-readable, but + JSON content in which the entries appear in random order is anything but: if a phone bill + contains a name, address, account number, a summary of charges, and an itemized list of calls, + then you don't want the phone number appearing in the middle, sandwiched between the + list of calls and the list of charges. Currently when data is serialized as JSON, we provide + an option to indent it for readability, but indentation isn't going to make the data readable if + it's in random order.

+ +

Retaining order is particularly useful for visual inspection of changes: if you write code that + modifies one entry in a JSON document, you want to satisfy yourself that the transformation + did exactly what you expected, and the best way to convince yourself is by placing the + input and output side-by-side and comparing them visually.

+ +

There seems to be consensus that support for ordered maps, at least in some circumstances, + is desirable. There is debate about whether all maps should be ordered, and about whether + ordering should be the default, and about whether ordering should be supported if a map + is built incrementally using map:put operations. The answer to those questions + depends at least in part on understanding how great the overhead is in retaining order + in maps: if the overhead is negligible, then we might as well make all maps ordered.

+ +

Normally I'm the first to argue that the language specification should not be driven + by performance concerns: we should design a clean language and leave implementors to + worry about how to implement it efficiently. But in this case, if we're making a change + to the language semantics that affects users whether they want the feature or not, I think + we need to understand clearly whether we are asking users to pay a performance price.

+ +

Both JavaScript (from ES2015) and Python (from 3.7) have moved + in the direction of making all maps (objects/dictionaries) ordered, so we wouldn't + be on our own in doing this. However, JavaScript objects and Python dictionaries + are mutable, whereas XDM maps are functionally persistent (adding an entry + creates a new map, leaving the original unchanged), so the performance + constraints are somewhat different.

+ +

So let's look now at how Saxon implements maps.

+ +

In SaxonJ 12.x there are two main implementations (ignoring special cases such as empty + maps and singleton maps). The default implementation is in the class + net.sf.saxon.ma.map.HashTrieMap, and this is built using an open source + implementation of immutable hash tries written by Michael Froh; it has been in the + product since 9.6. In SaxonCS 12.x we replace this with the functionally equivalent Microsoft + class System.Collections.Immutable.ImmutableDictionary. Both these library + implementations are unordered.

+ +

There is a minor tweak that complicates the implementation. In an ideal world, + we would create an underlying map of type Map<AtomicValue, GroundedValue>, + where AtomicValue is the Saxon class used to hold all atomic values, + and GroundedValue is the Saxon class used to hold all sequences other than + those that are lazily-evaluated. However, AtomicValue.equals() does + not implement the equality semantics defined by XDM for comparing map keys. This + is because XPath has different rules for equality comparisons in different circumstances. + The Microsoft ImmutableDictionary can take a custom KeyComparer + parameter, which would solve this problem, but there is no equivalent in the Froh + library that we use in SaxonJ. So instead we implement an underlying map of type + Map<AtomicMatchKey, Tuple<AtomicValue, GroundedValue>>, where + AtomicMatchKey is a value derived from the AtomicValue + that has the correct equality semantics. We need to hold the AtomicValue + because in general two atomic values can have the same AtomicMatchKey + (for example this is the case when the keys are a mix of different numeric types): + and the XPath functionality for maps requires the original key value (including + its type annotation) to be retained.
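
To make the match-key idea concrete, here is a minimal sketch in Java (16 or later). It is not Saxon's code: `TypedValue`, `KeyValue` and `MatchKeyMapSketch` are hypothetical stand-ins for `AtomicValue`, the tuple, and the underlying map, and the numeric normalisation is a simplifying assumption that ignores NaN and infinities.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an atomic value: a type annotation plus a Java value. */
record TypedValue(String typeName, Object value) {}

/** Hypothetical stand-in for the stored tuple: the original key plus the entry's value. */
record KeyValue(TypedValue originalKey, Object value) {}

final class MatchKeyMapSketch {
    // Mutable HashMap for brevity; the structure described above is immutable.
    private final Map<Object, KeyValue> entries = new HashMap<>();

    /** Derives a key whose equals() ignores the numeric type (assumption: finite numbers only). */
    static Object matchKey(TypedValue key) {
        if (key.value() instanceof Number n) {
            // 1 (integer) and 1.0 (double) both normalise to the same BigDecimal.
            return new BigDecimal(n.toString()).stripTrailingZeros();
        }
        return key.value();
    }

    /** Indexes the entry by match key, but keeps the original typed key alongside the value. */
    void put(TypedValue key, Object value) {
        entries.put(matchKey(key), new KeyValue(key, value));
    }

    /** Returns the stored tuple, or null if no key with an equal match key is present. */
    KeyValue get(TypedValue key) {
        return entries.get(matchKey(key));
    }
}
```

Looking up with a double-typed 1.0 finds the entry stored under an integer-typed 1, and the returned tuple still reports the original key's type annotation, which is why the key has to be stored alongside the value.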

+ +

The second implementation of maps found in SaxonJ and SaxonCS is the class + net.sf.saxon.ma.map.DictionaryMap. This is implemented over a standard + mutable java.util.HashMap<String, GroundedValue> on Java, or + System.Collections.Generic.Dictionary<string, GroundedValue> + on .NET. It is suitable only where the keys are all instances of xs:string + (which means we don't need to retain the type annotation), and where no in-situ + modification takes place. As soon as an operation such as map:put + or map:remove is applied to the map, we make a copy using the + more general HashTrieMap implementation. But for many maps, + especially those derived from JSON parsing, incremental modification is rare, + and the lower-overhead DictionaryMap is perfectly satisfactory.

+ +

In Saxon 13 (not yet released), a third map implementation has been introduced: + the ShapedMap. This is described in the article + Maps and Records, + and it is particularly useful in cases where many maps have exactly the same structure. + This often happens when parsing CSV or JSON files. A ShapedMap is in two + parts: a Shape object which holds a mapping from keys to integer slot numbers, + and a simple array of slots holding the values of the fields. The Shape + object can be shared between all map instances having a common structure. As with the + DictionaryMap, if a ShapedMap is subjected to map:put + or map:remove operations, it is immediately copied to a HashTrieMap.
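
The two-part structure can be sketched roughly as follows. This is an illustration under assumptions, not the Saxon 13 implementation: the class names `Shape` and `ShapedMapSketch`, the string-only keys, and the method names are all invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical shape: maps each key to a slot number and can be shared between maps. */
final class Shape {
    private final Map<String, Integer> slots = new LinkedHashMap<>();

    Shape(List<String> keys) {
        for (String key : keys) {
            // The slot number is the key's position in the declared order.
            if (slots.putIfAbsent(key, slots.size()) != null) {
                throw new IllegalArgumentException("duplicate key: " + key);
            }
        }
    }

    int slotOf(String key) {
        Integer slot = slots.get(key);
        return slot == null ? -1 : slot;
    }

    int size() {
        return slots.size();
    }

    List<String> keysInOrder() {
        return new ArrayList<>(slots.keySet());
    }
}

/** Hypothetical map instance: a shared Shape plus this map's own array of slot values. */
final class ShapedMapSketch {
    private final Shape shape;
    private final Object[] values;

    ShapedMapSketch(Shape shape, Object... values) {
        if (values.length != shape.size()) {
            throw new IllegalArgumentException("expected " + shape.size() + " values");
        }
        this.shape = shape;
        this.values = values.clone();
    }

    /** Two steps: key to slot number via the shared shape, then a plain array access. */
    Object get(String key) {
        int slot = shape.slotOf(key);
        return slot < 0 ? null : values[slot];
    }

    List<String> keysInOrder() {
        return shape.keysInOrder();
    }
}
```

In this sketch a thousand rows parsed from the same CSV file would share one `Shape` and each hold only an array of values, and the key order comes for free from the slot numbering.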

+ +

How are these map implementations affected by the requirement to maintain order + of entries?

+ +

For the ShapedMap, order is already maintained, so it isn't a problem. + The only impact is that two maps can only share the same Shape object + if their keys are in the same order. There isn't going to be any observable performance + regression.

+ +

For the DictionaryMap, on the Java platform we can replace the + underlying HashMap<String, GroundedValue> by a + LinkedHashMap<String, GroundedValue>. That's easily done, + because it supports the same interface. I don't yet know how much overhead + it imposes (in space or time); that requires some measurements.
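
As a small illustration of why the swap is easy, the sketch below fills a map through the `java.util.Map` interface only, so the choice between `HashMap` and `LinkedHashMap` is confined to the construction site. The class name `InsertionOrderDemo` and the phone-bill values are made up for the example; it says nothing about the relative cost of the two classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class InsertionOrderDemo {
    /** Fills any Map implementation with the phone-bill fields, in document order. */
    static Map<String, Object> fillPhoneBill(Map<String, Object> target) {
        target.put("name", "A. Customer");
        target.put("address", "1 Example Street");
        target.put("account", 12345);
        target.put("charges", 42.50);
        target.put("calls", List.of("call 1", "call 2"));
        return target;
    }

    /** Returns the keys in whatever order the map iterates over them. */
    static List<String> keysInIterationOrder(Map<String, Object> map) {
        return new ArrayList<>(map.keySet());
    }

    /** The only line that changes when moving from HashMap to LinkedHashMap. */
    static Map<String, Object> newOrderedMap() {
        return new LinkedHashMap<>();
    }
}
```

With `LinkedHashMap` the keys come back in insertion order (name, address, account, charges, calls); with a plain `HashMap` the iteration order is unspecified.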

+ +

On .NET, unfortunately, there is no equivalent to Java's LinkedHashMap. + I have therefore implemented my own: this comprises a Dictionary<string, int> + that maps string-valued keys to integer positions in the sequence, and two lists: + a list of AtomicValue for the keys and a list of GroundedValue + for the values.
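
The shape of that structure, a dictionary from key to position plus two parallel lists, can be sketched as below. The sketch is written in Java rather than C# to keep the examples in one language, it uses plain strings for the keys, and the class name `IndexedOrderedMapSketch` and the rule that replacing a value keeps the key's original position are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexedOrderedMapSketch<V> {
    // Key -> position in the two parallel lists below.
    private final Map<String, Integer> positions = new HashMap<>();
    private final List<String> keys = new ArrayList<>();
    private final List<V> values = new ArrayList<>();

    /** Appends a new entry, or replaces the value of an existing key in place. */
    void add(String key, V value) {
        Integer pos = positions.get(key);
        if (pos == null) {
            positions.put(key, keys.size());
            keys.add(key);
            values.add(value);
        } else {
            // Assumption: a replaced entry keeps its original position.
            values.set(pos, value);
        }
    }

    /** Lookup costs one hash probe plus one list access. */
    V get(String key) {
        Integer pos = positions.get(key);
        return pos == null ? null : values.get(pos);
    }

    /** Keys in the order in which they were first added. */
    List<String> keysInOrder() {
        return List.copyOf(keys);
    }
}
```

The space overhead relative to a plain dictionary is visible in the field list: every key is held twice, once in the index and once in the ordered list.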

+ +

For the HashTrieMap on Java, my plan is to scrap the immutable map implemented + by Michael Froh, and substitute it with the io.vavr.collection.LinkedHashMap + from the VAVR library, which appears to have the required semantics. Again, there appears + to be no direct equivalent on .NET, so a home-grown solution is called for. My + current implementation uses the same approach as for the DictionaryMap: + an immutable unordered map from atomic keys to integers, supplemented by ordered + immutable lists of AtomicValue for the keys and GroundedValue + for the values.
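
The semantics required here (a put returns a new ordered map and leaves the original untouched) can be shown with a deliberately naive sketch. It copies the whole map on every put, so it is O(n) per update and shares no structure; a hash trie or the VAVR class would avoid that cost. The class name `PersistentOrderedMapSketch` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PersistentOrderedMapSketch<K, V> {
    // Never mutated after construction, so instances can be shared freely.
    private final Map<K, V> entries;

    private PersistentOrderedMapSketch(Map<K, V> entries) {
        this.entries = entries;
    }

    static <K, V> PersistentOrderedMapSketch<K, V> empty() {
        return new PersistentOrderedMapSketch<K, V>(new LinkedHashMap<K, V>());
    }

    /** Returns a new map containing the extra entry; this map is left unchanged. */
    PersistentOrderedMapSketch<K, V> put(K key, V value) {
        Map<K, V> copy = new LinkedHashMap<K, V>(entries); // naive full copy, for illustration only
        copy.put(key, value);
        return new PersistentOrderedMapSketch<K, V>(copy);
    }

    V get(K key) {
        return entries.get(key);
    }

    /** Keys in insertion order. */
    List<K> keysInOrder() {
        return List.copyOf(entries.keySet());
    }
}
```

After `m2 = m1.put("b", 2)`, `m1` still has only its original keys while `m2` has the new key at the end; that is the behaviour any replacement for the hash trie has to provide efficiently.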

+ +

Which brings us to the question, what are the overheads? Answering that question + means making some assumptions about the workloads we want to measure. For example, + how important are map:put and map:remove operations? + Anecdotal evidence suggests these are rather rare, and that most maps are read-only + once built. But they might be important to some use cases.

+ +

The other complication is that we might be able to mitigate the overheads of making + maps ordered by introducing new optimisations. We've already introduced the + ShapedMap idea, where ordering hopefully imposes very little overhead. + On .NET we could consider taking advantage of the ability to use a custom + KeyComparer to avoid the overhead of effectively storing the keys twice.

+ +

We could also get smarter about choosing which implementation of maps to use under + which circumstances. One change that I'm making is to introduce a MapBuilder + class: during the initial construction of a map (for example during JSON parsing or + during processing of map:merge or map:build, or during + evaluation of a map constructor) we can add entries to a mutable builder object, and + this then gives us the opportunity to choose the final map implementation when we + know what all the keys and values are. For example, if all the keys have the same + type annotation, then in principle we don't need to save the type annotations with + every key value. We also know the size of the map at this stage.
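
A builder along these lines might look like the sketch below. Everything in it is an assumption made for illustration: the name `MapBuilderSketch`, the `Choice` enum, and the selection rule are invented, and duplicate-key handling is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class MapBuilderSketch {
    /** Hypothetical labels for the kinds of final map the builder could produce. */
    enum Choice { EMPTY, SINGLETON, STRING_KEYED, GENERAL }

    private final List<Object> keys = new ArrayList<>();
    private final List<Object> values = new ArrayList<>();
    private boolean allStringKeys = true;

    /** Appends an entry to the mutable builder; duplicate keys are not checked here. */
    MapBuilderSketch add(Object key, Object value) {
        keys.add(key);
        values.add(value);
        // Track what we learn about the keys while the map is still under construction.
        allStringKeys &= key instanceof String;
        return this;
    }

    /** The final size is known before any map structure is allocated. */
    int size() {
        return keys.size();
    }

    /** Decides on an implementation only once all keys have been seen. */
    Choice chooseImplementation() {
        if (keys.isEmpty()) {
            return Choice.EMPTY;
        }
        if (keys.size() == 1) {
            return Choice.SINGLETON;
        }
        return allStringKeys ? Choice.STRING_KEYED : Choice.GENERAL;
    }
}
```

The point of the design is that the decision is deferred: nothing is committed to a particular data structure until the full set of keys, their types, and the size are known.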

+ +

We can even go further and avoid indexing the map until the first lookup + (or map:get) operation. It might seem surprising, but there + are many maps that are never used for lookup. For example, a JSON document + might contain thousands of maps that are simply copied unchanged to the output, + or that are discarded because they are irrelevant to the particular query. + Perhaps the map builder should simply maintain a list of keys and values, + and do nothing else until the first map:get? The only complication + here is the need to detect duplicate keys, but that could be done cheaply + using a Bloom filter.
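
Deferring the index could be sketched as follows. The class name `LazyIndexedMapSketch` is hypothetical, keys are plain strings, the sketch is not thread-safe, and it assumes the caller has already ensured the keys are distinct, so the duplicate-detection step (Bloom filter or otherwise) is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LazyIndexedMapSketch {
    private final List<String> keys;
    private final List<Object> values;
    private Map<String, Integer> index; // stays null until the first lookup

    /** Assumes keys are distinct and that keys and values have the same length. */
    LazyIndexedMapSketch(List<String> keys, List<?> values) {
        if (keys.size() != values.size()) {
            throw new IllegalArgumentException("keys and values differ in length");
        }
        this.keys = new ArrayList<>(keys);
        this.values = new ArrayList<>(values);
    }

    /** The first call pays for building the hash index; later calls reuse it. */
    Object get(String key) {
        if (index == null) {
            index = new HashMap<>();
            for (int i = 0; i < keys.size(); i++) {
                index.putIfAbsent(keys.get(i), i);
            }
        }
        Integer pos = index.get(key);
        return pos == null ? null : values.get(pos);
    }

    /** Serialization or copying only needs the ordered lists, never the index. */
    List<String> keysInOrder() {
        return List.copyOf(keys);
    }

    boolean isIndexed() {
        return index != null;
    }
}
```

A map that is only serialized or discarded never builds the index at all, which is the saving this paragraph is after.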

+ +

So we need to do some measurements. But there's a good chance that if + it does turn out that ordered maps impose an overhead, we can find compensating + optimisations that mean there's no regression on the bottom line.

+ +

My first experiments looking at the cost of parsing and re-serializing + JSON actually suggest that most of the cost is in the parsing and serializing, + and that the choice of data structure for the XDM maps has very little impact + on the bottom line. But that's provisional and subject to confirmation.

+ + + + + + + + \ No newline at end of file