diff --git a/src/mike/2024/12/ordered-maps.html b/src/mike/2024/12/ordered-maps.html new file mode 100644 index 0000000..d994437 --- /dev/null +++ b/src/mike/2024/12/ordered-maps.html @@ -0,0 +1,177 @@ + + + +
+There's been a lot of discussion recently in the QT4 community group about introducing + ordered maps: that is, maps in which there is a defined and predictable ordering of entries, + typically "last-in, last-out" where the most recently added entry always appears at the end. + The main motivation for this is that JSON is designed (like XML) to be human-readable, but + JSON content in which the entries appear in random order is anything but: if a phone bill + contains a name, address, account number, a summary of charges, and an itemized list of calls, + then you don't want the phone number appearing in the middle, sandwiched between the + list of calls and the list of charges. Currently when data is serialized as JSON, we provide + an option to indent it for readability, but indentation isn't going to make the data readable if + it's in random order.
+ +Retaining order is particularly useful for visual inspection of changes: if you write code that + modifies one entry in a JSON document, you want to satisfy yourself that the transformation + did exactly what you expected, and the best way to convince yourself is by placing the + input and output side-by-side and comparing them visually.
+ +There seems to be consensus that support for ordered maps, at least in some circumstances,
+ is desirable. There is debate about whether all maps should be ordered, and about whether
+ ordering should be the default, and about whether ordering should be supported if a map
+ is built incrementally using map:put
operations. The answer to those questions
+ depends at least in part on understanding how great the overhead is in retaining order
+ in maps: if the overhead is negligible, then we might as well make all maps ordered.
Normally I'm the first to argue that the language specification should not be driven + by performance concerns: we should design a clean language and leave implementors to + worry about how to implement it efficiently. But in this case, if we're making a change + to the language semantics that affects users whether they want the feature or not, I think + we need to understand clearly whether we are asking users to pay a performance price.
+ +Both JavaScript (from ES2015) and Python (from 3.7) have moved + in the direction of making all maps (objects/dictionaries) ordered, so we wouldn't + be on our own in doing this. However, JavaScript objects and Python dictionaries + are mutable, whereas XDM maps are functionally persistent (adding an entry + creates a new map, leaving the original unchanged), so the performance + constraints are somewhat different.
+ +So let's look now at how Saxon implements maps.
+ +In SaxonJ 12.x there are two main implementations (ignoring special cases such as empty
+ maps and singleton maps). The default implementation is in the class
+ net.sf.saxon.ma.map.HashTrieMap
, and this is built using an open source
+ implementation of immutable hash tries written by Michael Froh; it has been in the
+ product since 9.6. In SaxonCS 12.x we replace this with the functionally equivalent Microsoft
+ class System.Collections.Immutable.ImmutableDictionary
. Both these library
+ implementations are unordered.
There is a minor tweak that complicates the implementation. In an ideal world,
+ we would create an underlying map of type Map<AtomicValue, GroundedValue>
,
+ where AtomicValue
is the Saxon class used to hold all atomic values,
+ and GroundedValue
is the Saxon class used to hold all sequences other than
+ those that are lazily-evaluated. However, AtomicValue.equals()
does
+ not implement the equality semantics defined by XDM for comparing map keys. This
+ is because XPath has different rules for equality comparisons in different circumstances.
+ The Microsoft ImmutableDictionary
can take a custom KeyComparer
+ parameter, which would solve this problem, but there is no equivalent in the Froh
+ library that we use in SaxonJ. So instead we implement an underlying map of type
+ Map<AtomicMatchKey, Tuple<AtomicValue, GroundedValue>>
, where
+ AtomicMatchKey
is a value derived from the AtomicValue
+ that has the correct equality semantics. We need to hold the AtomicValue
+ because in general two atomic values can have the same AtomicMatchKey
+ (for example this is the case when the keys are a mix of different numeric types):
+ and the XPath functionality for maps requires the original key value (including
+ its type annotation) to be retained.
The second implementation of maps found in SaxonJ and SaxonCS is the class
+ net.sf.saxon.ma.map.DictionaryMap
. This is implemented over a standard
+ mutable java.util.HashMap<String, GroundedValue>>
on Java, or
+ System.Collections.Generic.Dictionary<string, GroundedValue>
+ on .NET. It is suitable only where the keys are all instances of xs:string
+ (which means we don't need to retain the type annotation), and where no in-situ
+ modification takes place. As soon as an operation such as map:put
+ or map:remove
is applied to the map, we make a copy using the
+ more general HashTrieMap
implementation. But for many maps,
+ especially those derived from JSON parsing, incremental modification is rare,
+ and the lower-overhead DictionaryMap
is perfectly satisfactory.
In Saxon 13 (not yet released), a third map implementation has been introduced:
+ the ShapedMap
. This is described in the article
+ Maps and Records,
+ and it is particularly useful in cases where many maps have exactly the same structure.
+ This often happens when parsing CSV or JSON files. A ShapedMap
is in two
+ parts: a Shape
object which holds a mapping from keys to integer slot numbers,
+ and a simple array of slots holding the values of the fields. The Shape
+ object can be shared between all map instances having a common structure. As with the
+ DictionaryMap
, if a ShapedMap
is subjected to map:put
+ or map:remove
operations, it is immediately copied to a HashTrieMap
.
How are these map implementations affected by the requirement to maintain order + of entries?
+ +For the ShapedMap
, order is already maintained, so it isn't a problem.
+ The only impact is that two maps can only share the same Shape
object
+ if their keys are in the same order. There isn't going to be any observable performance
+ regression.
For the DictionaryMap
, on the Java platform we can replace the
+ underlying HashMap<String, GroundedValue>
by a
+ LinkedHashMap<String, GroundedValue>
. That's easily done,
+ because it supports the same interface. I don't yet know how much overhead
+ it imposes (in space or time); that requires some measurements.
On .NET, unfortunately, there is no equivalent to Java's LinkedHashMap
.
+ I have therefore implemented my own: this comprises a Dictionary<string, int>
+ that maps string-valued keys to integer positions in the sequence, and two lists:
+ a list of AtomicValue
for the keys and a list of GroundedValue
+ for the values.
For the HashTrieMap
on Java, my plan is to scrap the immutable map implemented
+ by Michael Froh, and substitute it with the io.vavr.collection.LinkedHashMap
+ from the VAVR library, which appears to have the required semantics. Again, there appears
+ to be no direct equivalent on .NET, so a home grown solution is again called for. My
+ current implementation uses the same apprach as for the DictionaryMap
:
+ an immutable unordered map from atomic keys to integers, supplemented by ordered
+ immutable lists of AtomicValue
for the keys and GroundedValue
+ for the values.
Which brings us to the question, what are the overheads? Answering that question
+ means making some assumptions about the workloads we want to measure. For example,
+ how important are map:put
and map:remove
operations?
+ Anecdotal evidence suggests these are rather rare, and that most maps are read-only
+ once built. But they might be important to some use cases.
The other complication is that we might be able to mitigate the overheads of making
+ maps ordered by introducing new optimisations. We've already introduced the
+ ShapedMap
idea, where ordering hopefully imposes very little overhead.
+ On .NET we could consider taking advantage of the ability to use a custom
+ KeyComparer
to avoid the overhead of effectively storing the keys twice.
We could also get smarter about choosing which implementation of maps to use under
+ which circumstances. One change that I'm making is to introduce a MapBuilder
+ class: during the initial construction of a map (for example during JSON parsing or
+ during processing of map:merge
or map:build
, or during
+ evaluation of a map constructor) we can add entries to a mutable builder object, and
+ this then gives us the opportunity to choose the final map implementation when we
+ know what all the keys and values are. For example, if all the keys have the same
+ type annotation, then in principle we don't need to save the type annotations with
+ every key value. We also know the size of the map at this stage.
We can even go further and avoid indexing the map until the first lookup
+ (or map:get
) operation. It might seem surprising, but there
+ are many maps that are never used for lookup. For example, a JSON document
+ might contain thousands of maps that are simply copied unchanged to the output,
+ or that are discarded because they are irrelevant to the particular query.
+ Perhaps the map builder should simply maintain a list of keys and values,
+ and do nothing else until the first map:get
? The only complication
+ here is the need to detect duplicate keys, but that could be done cheaply
+ using a Bloom filter.
So we need to do some measurements. But there's a good chance that if + it does turn out that ordered maps impose an overhead, we can find compensating + optimisations that mean there's no regression on the bottom line.
+ +My first experiments looking at the cost of parsing and re-serializing + JSON actually suggest that most of the cost is in the parsing and serializing, + and that the choice of data structure for the XDM maps has very little impact + on the bottom line. But that's provisional and subject to confirmation.
+ + + + + + + + \ No newline at end of file