From 4266f3dd5f37f597339594d4e7dd0a6e4e6f5987 Mon Sep 17 00:00:00 2001
From: Frank McSherry
Date: Tue, 10 Sep 2024 20:42:10 -0400
Subject: [PATCH] succinct

---
 posts/2024-09-10.md | 278 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 278 insertions(+)
 create mode 100644 posts/2024-09-10.md

diff --git a/posts/2024-09-10.md b/posts/2024-09-10.md
new file mode 100644
index 0000000..1926ee2
--- /dev/null
+++ b/posts/2024-09-10.md
@@ -0,0 +1,278 @@

## A Fascinating Diversion into Compression

I've recently been working on columnar data representations in Rust.
The intent is that if you have lists of some complex type `T`, say a `struct` or an `enum` or a list itself (`Vec`), you might have better options than storing the types themselves in a list.
The [`columnar` repository](https://github.com/frankmcsherry/columnar) contains the work, and a [previous post](https://github.com/frankmcsherry/blog/blob/master/posts/2024-08-25.md) goes into detail about the process.

What I want to talk about today is the curious case of `Result`.

We'll start with a pretty standard intro to how you might represent this type in columnar form, but by the end of the post we'll have developed a way to compress JSON objects.
I was surprised and delighted by the connection, which I stumbled upon more than discovered, and I would love to hear more if you understand where it comes from.

### Re-orienting `Result` to columnar form

The `Result<S, T>` type is a [sum type](https://en.wikipedia.org/wiki/Tagged_union), each instance of which is either `Ok(S)` or `Err(T)`.
That is, it is one type or the other, but not both. This is different from a pair `(S, T)`, which always has both types.
In addition to either `S` or `T`, a `Result` also has to communicate *which* of the two it holds (the two types might be the same, or at least the same size).

The size of a `Result<S, T>` is determined by the sizes of `S` and `T`: it clearly needs to be able to hold either of the two, plus a spare bit to indicate which it is.
For example, the size of a `Result<u64, u64>` is 16 bytes, but the size of a `Result<u64, u8>` is also 16 bytes.
These might be larger than you expect because of [alignment](https://doc.rust-lang.org/reference/type-layout.html): types in Rust are sized to an integral multiple of their alignment, which is generally the largest alignment among their members.
If you have a sequence of `Result<u64, u8>` you'll need 16 bytes for each element, even if many of the elements are the `u8` variant.

An alternate way to represent the same information as a sequence of results is to [demultiplex](https://en.wikipedia.org/wiki/Multiplexer) the two variants into two sequences, leaving behind enough information to re-interleave them.
```rust
/// Replacement for `Vec<Result<S, T>>`.
struct ResultColumns<S, T> {
    /// Variant, and corresponding offset into its store.
    indexes: Vec<Result<usize, usize>>,
    /// The `S` variants in order.
    s_store: Vec<S>,
    /// The `T` variants in order.
    t_store: Vec<T>,
}
```
In this example, the backing data are stored in `s_store` and `t_store`, and elements can be retrieved using the information in `indexes`:
```rust
impl<S, T> ResultColumns<S, T> {
    /// Not quite `&Result<S, T>`, but pretty close.
    fn get(&self, index: usize) -> Result<&S, &T> {
        match self.indexes[index] {
            Ok(pos) => Ok(&self.s_store[pos]),
            Err(pos) => Err(&self.t_store[pos]),
        }
    }
}
```
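Getting data in is just the demultiplexing step: each result goes to the store for its variant, and `indexes` records which store it went to and where it landed. A minimal sketch of insertion (illustrative, not necessarily the crate's exact code) might look like:
```rust
impl<S, T> ResultColumns<S, T> {
    /// Demultiplexes one result into the store for its variant,
    /// recording the variant and the offset at which it landed.
    fn push(&mut self, item: Result<S, T>) {
        match item {
            Ok(s) => {
                self.indexes.push(Ok(self.s_store.len()));
                self.s_store.push(s);
            }
            Err(t) => {
                self.indexes.push(Err(self.t_store.len()));
                self.t_store.push(t);
            }
        }
    }
}
```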
This pattern allows us to store the `S` and `T` records separately, but with some overhead.
Each element also requires a `Result<usize, usize>` in `indexes`, which is the same size as the hypothetical `Result<u64, u8>` we started with.
So that's not great.

### Succincter data structures

It turns out that `indexes` stores way more information than we strictly need.
We need to be able to see which variant, `Ok` or `Err`, an element is, and to figure out where to find it in the corresponding store.

We could alternately use a [bit vector](https://en.wikipedia.org/wiki/Bit_array), a bit like a `Vec<bool>`, to record the information in `indexes`.
To implement `get(index)` we would look in the bit vector to determine which variant is at that location.
Then, to determine where to find it .. ah .. we could count the number of occurrences of that same bit value preceding the one we found.
```rust
impl<S, T> ResultColumns<S, T> {
    /// Not quite `&Result<S, T>`, but pretty close.
    fn get(&self, index: usize) -> Result<&S, &T> {
        let bit = self.indexes[index];
        // Count how many preceding elements landed in the same store.
        let pos = self.indexes
            .iter()
            .take(index)
            .filter(|&&x| x == bit)
            .count();
        match bit {
            1 => Ok(&self.s_store[pos]),
            0 => Err(&self.t_store[pos]),
            _ => unreachable!(),
        }
    }
}
```

And you are probably already screaming about how inefficient this is.

This implementation is quite tight for space.
Beyond the `S` and `T` data itself, we store only one additional bit for every record.
It's hard to imagine using any less, but it's also hard to imagine going any slower than scanning all of `self.indexes`.
The good news is that if we allow the memory to creep up just a .. bit .. we can recover random access.

In the field of [succinct data structures](https://en.wikipedia.org/wiki/Succinct_data_structure) folks study, among other things, the problem of succinct indexable dictionaries.
We will think of these as bit sequences, with additional support for a `rank` function that reports, for each position, how many set bits precede it.
It turns out this is exactly what we want: the `rank` function (combined with `index`) tells us exactly where to find our data!
If we make the arbitrary decision that `1` corresponds to `S`, then it looks like this:
```rust
impl<S, T> ResultColumns<S, T> {
    /// Not quite `&Result<S, T>`, but pretty close.
    fn get(&self, index: usize) -> Result<&S, &T> {
        let bit = self.indexes[index];
        // Number of set bits strictly before `index`.
        let pos = self.indexes.rank(index);
        match bit {
            1 => Ok(&self.s_store[pos]),
            0 => Err(&self.t_store[index - pos]),
            _ => unreachable!(),
        }
    }
}
```

The cost of these succinct indexable dictionaries is "barely more than a bit" per element.
Formally, it needs to be `1 + o(1)` bits for each element, so in the limit basically just a bit.

They are pretty complicated, so I implemented one that uses exactly two bits for each element.
This is *not* "succinct" in the technical sense, but instead "compact", according to Wikipedia.
Whatever it is, you can draw the overhead down close to one bit, at some cost (Guy Jacobson's PhD thesis is ~$40 from ProQuest).

### Compacter data structures

I thought I'd talk through my implementation, which is not very smart but is very easy.
```rust
struct CompactBits {
    counts: Vec<u64>, // running counts of ones in the words of `values`.
    values: Vec<u64>, // the bits, packed 64 to a word.
    last_word: u64,   // in-progress bits, not yet 64 of them.
    last_bits: u8,    // number of in-progress bits.
}
```
Without going into great detail, we pack the bits 64 at a time into a `u64`, and put those words in a list (`values`).
At the same time, we maintain the running count of the number of ones in these words (`counts`).
There may be some bits that aren't yet cleanly packed into a full `u64`, and we hold on to those separately in `last_word` and `last_bits`.

To randomly access a bit we look it up either in `values` or in `last_word`.
To determine the `rank`, we take the running count from `counts`, and then count the relevant bits in the word holding our bit.
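In case that was too terse, here is a rough sketch of how `push`, `get`, and `rank` might look for this type (illustrative only, not lifted from the repository):
```rust
impl CompactBits {
    /// Appends one bit, spilling completed 64-bit words into `values`.
    fn push(&mut self, bit: bool) {
        self.last_word |= (bit as u64) << self.last_bits;
        self.last_bits += 1;
        if self.last_bits == 64 {
            // `counts[i]` records the number of ones in `values[..= i]`.
            let total = self.counts.last().copied().unwrap_or(0)
                + self.last_word.count_ones() as u64;
            self.counts.push(total);
            self.values.push(self.last_word);
            self.last_word = 0;
            self.last_bits = 0;
        }
    }
    /// The bit at position `index`.
    fn get(&self, index: usize) -> bool {
        let (word, bit) = (index / 64, index % 64);
        let w = if word < self.values.len() { self.values[word] } else { self.last_word };
        (w >> bit) & 1 == 1
    }
    /// The number of set bits strictly before `index`.
    fn rank(&self, index: usize) -> usize {
        let (word, bit) = (index / 64, index % 64);
        let prior = if word == 0 { 0 } else { self.counts[word - 1] as usize };
        let w = if word < self.values.len() { self.values[word] } else { self.last_word };
        prior + (w & ((1u64 << bit) - 1)).count_ones() as usize
    }
}
```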
As you can see, it's pretty simple, and it costs only two bits for each element, because the counts keep pace with the bits themselves.
In fact, we could drop down to 1.5 bits by just using `u32` for the running counts, because .. maybe we don't plan to hold more than 4B elements?
Let's not do that now, but perhaps by the end of the post you'll have a better idea.

### Adaptive Dictionary Compression

We are about to escalate things, but it's actually a very easy and pleasant path to take.
Don't get stressed by the intimidating section heading!

When we implemented our `get(index)` function we were able to find our record in the storage for its variant.
We did this by looking up its position, and then looking in that storage.
You might have noticed that we used either `pos` or `index - pos`, depending on which variant it was.
The one that corresponds to our variant .. was the answer we wanted! Tada!

What is at the other location, though?

For any `index` that results in `pos`, both `pos` and `index - pos` name positions, in `s_store` and `t_store` respectively.
One of the two is the position of the data we indexed to find.
The other is .. one past the most recent element of the other variant, at the time our element was inserted.
What could you possibly want to do with that?

Let's imagine we have a sequence of `S` that we thought might have repetitions in it.
There are probably some interesting ways to encode this, but here's a really easy one using `ResultColumns<S, ()>`:
1. If the item is not a repeat of the item before it, insert `Ok(item)`.
2. If the item is a repeat of the item before it, insert `Err(())`.

We inserted repeats using the almost meaningless `Err(())`, an error variant containing the empty tuple (which takes no space and stores no information).
We inserted non-repeats using `Ok(item)`, which will land `item` in `s_store`.
If we go and look things up the normal way, we'll find the non-repeats, and find nothing meaningful for the repeats.

Instead, let's look things up a non-normal way.
1. If we find an `Ok(item)` we will produce `item`.
2. If we find an `Err(())` we will .. instead look up the most recent `Ok(item)` at the time of the index's insertion, and return that `item`.

When we find the `Err` variant we actually find the data we are looking for in the other store.
The *encoding* used the most recent value to lead us to choose the error variant, and we can decode with the same context.

The `ResultColumns<S, ()>` ends up storing two bits per element, plus only the deduplicated `S` values (although only removing adjacent duplicates).
That's potentially pretty handy compression, and we didn't even have to invent anything to do it.
It just sort of happens.
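Concretely, the scheme might look something like the following wrapper around `ResultColumns<S, ()>`; this is a sketch for exposition (the name `DedupColumns` is mine), not code from the crate.
```rust
/// Wrapper around `ResultColumns<S, ()>` that encodes adjacent repeats.
struct DedupColumns<S> {
    inner: ResultColumns<S, ()>,
}

impl<S: PartialEq> DedupColumns<S> {
    /// Records `item` either as new data or as a repeat of its predecessor.
    fn push(&mut self, item: S) {
        // The most recently stored `S` is the value of the previous logical item.
        if self.inner.s_store.last() == Some(&item) {
            self.inner.push(Err(()));
        } else {
            self.inner.push(Ok(item));
        }
    }
    /// Returns the item at `index`, chasing repeats back to their source.
    fn get(&self, index: usize) -> &S {
        match self.inner.get(index) {
            Ok(item) => item,
            // The most recent `Ok` at insertion time sits just before `rank(index)`,
            // assuming `indexes` is the rank-able bit vector with ones for `Ok`.
            Err(_) => &self.inner.s_store[self.inner.indexes.rank(index) - 1],
        }
    }
}
```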
### Adaptive Dictionary Compression, part 2

Deduplication isn't exactly dictionary compression, so let's fix that.

```rust
ResultColumns<S, u8>
```

Boom.
Ok.
Done here.

Ah, in more detail, then.
Rather than use `Err(())` to encode "direct repetition", we'll use `Err(offset)` to indicate "recent repetition".
The offset will tell us how far back to go in `s_store` to find our value.
An offset of zero indicates "the previous value", and an offset of ten indicates "go back ten values further".

```rust
/// Insert an item by first checking the previous 256 distinct values.
fn push(&mut self, item: S) {
    // Look backwards through recent values for a match.
    let offset = self.inner
        .s_store
        .iter()
        .rev()
        .take(256)
        .position(|x| x == &item);

    if let Some(back) = offset {
        self.inner.push(Err(back as u8));
    } else {
        self.inner.push(Ok(item));
    }
}
```

Although it may look like we are just consulting the current contents of `s_store` for values, we'll be able to return to this exact point in the store with `pos` when we decode.
```rust
/// Retrieve an item reference by index.
fn get(&self, index: usize) -> &S {
    match self.inner.get(index) {
        Ok(item) => item,
        Err(back) => {
            // Position of the most recent `Ok` in `s_store` at the time of insertion.
            let pos = self.inner.indexes.rank(index) - 1;
            &self.inner.s_store[pos - (*back as usize)]
        },
    }
}
```

We've encoded each element of a sequence of `S` as (ideally) a `u8` back-reference to a matching recent element, or failing that, as the element itself.
Ideally this is often just a `u8` rather than whatever `S` is.
Informally, we are using the previous 256 distinct `S` values as a dictionary (yes, distinct, because how could they repeat?).
As the sequence moves along, the dictionary we use adapts, admittedly in a primitive but not unhelpful way.

### Trying it out

I have a bunch of JSON that I downloaded from the internet, and am trying to get it to work with `columnar`.
One thing you might know about JSON is that a JSON value can be an "object": a map from strings to other JSON values.
Commonly, these strings are field or attribute names, and they do not always demonstrate a rich heterogeneity.

Without wanting to overwhelm you with detail (JSON is structurally recursive, and columnarization is complicated), let's look at how we might store objects.
Our columnar JSON container has a member that initially looks like:

```rust
/// Columnar representation of `Vec<(String, Value)>`.
objects: VecColumns<(StringColumns, Vec<Value>)>,
```

There are a few details here, but the important ones are:
1. `StringColumns` stores as many strings as you like, by concatenating their bytes and recording offsets (sketched just below).
2. `VecColumns` stores as many lists of things as you like, by concatenating the lists and recording offsets.
3. There is a sneaky `( , )` combinator in there that stores pairs of things in pairs of storage.

The tl;dr is that all of the `String`s across all of the objects will be packed in sequence in the `StringColumns`.
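To make that first item concrete, here is a sketch of the idea behind `StringColumns`; the crate's actual definition may well differ, but the concatenate-and-record-offsets shape is the point.
```rust
/// Sketch of a string container: all bytes concatenated, with `offsets[i]`
/// recording where the `i`-th string ends in `bytes`.
#[derive(Default)]
struct StringColumns {
    bytes: Vec<u8>,
    offsets: Vec<usize>,
}

impl StringColumns {
    /// Appends a string's bytes and records where they end.
    fn push(&mut self, text: &str) {
        self.bytes.extend_from_slice(text.as_bytes());
        self.offsets.push(self.bytes.len());
    }
    /// Recovers the `index`-th string as a slice of `bytes`.
    fn get(&self, index: usize) -> &str {
        let lower = if index == 0 { 0 } else { self.offsets[index - 1] };
        let upper = self.offsets[index];
        // The bytes were valid UTF-8 when pushed, so this cannot fail.
        std::str::from_utf8(&self.bytes[lower..upper]).unwrap()
    }
}
```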
If you load up the 11,351 JSON records (26.1MB), you end up with `9,460,926` bytes in the `StringColumns`.
These are unsurprisingly largely repetitions of things like `"id"`, `"login"`, and `"gravatar_id"`.
Ideally these will compress up nicely.

We'll make a minor modification to the type, wrapping `StringColumns` with a `LookbackColumns` wrapper that employs the techniques above.
```rust
/// Columnar representation of `Vec<(String, Value)>`.
objects: VecColumns<(LookbackColumns<StringColumns>, Vec<Value>)>,
```
This ends up with `772,022` bytes of data in the `LookbackColumns`, down from `9,460,926`: roughly a twelve-fold reduction.
Of those bytes, almost none are text.
They break down as:
* `indexes`: 153,840 bytes (two bits per element)
* `s_store`: 2,958 bytes of text (all non-repeated elements)
* `t_store`: 615,224 bytes (one byte per repeated element)

The bytes in `t_store` are surprisingly varied.
They go as large as 168, for which there are 10,162 entries (out of the at least 615,224 strings used as names in objects).
Clearly this will depend on your data, as will the efficacy of the approach generally.

### Rounding up

I basically stumbled on the approach above while poking around learning about succinct data structures (which the thing I implemented is not).
I was impressed by how little overhead there can be to recording variants of different sizes, given how accustomed I have become to the bloat in Rust, and all without having to give up random access.

But I was more flabbergasted by the ability to use the *other variants* as a way of compressing information.
Just recording offsets backwards results in a natural dictionary encoding that requires no auxiliary structures or code or what have you.
It just kind of works.

I have a few other things planned, to try and fit other techniques into the same framework.
For example, many of the string and vector offsets are (or are nearly) linear functions of `index`.
This happens when you are actually depositing fixed lengths that are not known *a priori*.
It seems entirely reasonable to encode the offsets as a `Result` indicating either a fixed position, or an edit to a linear interpolation from the previous fixed position.
Or (goodness) some other more complex interpolation from the previous several points.