
Add a delta store (similar to DedupStore) #901

Open
barrbrain opened this issue May 7, 2024 · 15 comments
@barrbrain

barrbrain commented May 7, 2024

Inspiration is taken from how git packs achieve high compression rates with good random access performance. The 2 key components are (1) a clustering strategy to store similar objects together and (2) a delta algorithm to encode objects predicted by an existing object. The initial proposed algorithms for analysis are:

  1. TLSH (debian package)
  2. BSDIFF (debian package)

There are crates that implement these algorithms:

  1. tlsh2
  2. qbsdiff

For background reading on how git achieves high density in a content-addressed store, see this blog post:
Git’s database internals I: packed object store

The combination of a DeltaStore and DedupStore is similar to the architecture of bup.
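
For a concrete feel of the delta component, here is a minimal sketch using the qbsdiff crate (function names are mine; the clustering side and the store plumbing are out of scope here):

```rust
use qbsdiff::{Bsdiff, Bspatch};

/// Encode `target` as a patch against a similar, already-stored `base`.
fn encode_delta(base: &[u8], target: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut patch = Vec::new();
    Bsdiff::new(base, target).compare(std::io::Cursor::new(&mut patch))?;
    Ok(patch)
}

/// Reconstruct `target` from `base` plus the stored patch.
fn apply_delta(base: &[u8], patch: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut target = Vec::new();
    Bspatch::new(patch)?.apply(base, std::io::Cursor::new(&mut target))?;
    Ok(target)
}
```

Decoding needs the base object, so the store has to keep base references alive, much like delta chains in a git pack.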

@allada
Member

allada commented May 7, 2024

Hey @barrbrain, thanks for the feedback.

Have you by chance looked at DedupStore?

I believe it is extremely close to what you are suggesting.

@barrbrain barrbrain changed the title Add a delta store (similar to dedupe store) Add a delta store (similar to DedupStore) May 8, 2024
@barrbrain
Author

Hi @allada, DedupStore is structurally similar to how I envision a delta store. However, the behavior is somewhat different. While DedupStore is effective for large extents repeated verbatim between objects, a delta store would be effective for large blocks containing many small changes with a consistent pattern. This fits source code changes quite well and binary object changes extremely well. This can boost compression efficiency by 2x~130x for applicable content.

@aaronmondal
Member

This type of compression is new to me, but would it look something like this?

  1. Compute TLSH hashes of inputs, store them.
  2. Assume that "similar inputs likely produce similar outputs" and group objects accordingly.
  3. When an action runs on inputs similar to a previous action (maybe it's the same command, just a file was edited), instead of storing outputs as an entirely new file, generate the (binary) diff between the new output and the old one and only store the diff (see the sketch below).
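
In other words, something like this (a rough, self-contained sketch: `fingerprint` and `distance` are placeholders for a real TLSH digest and TLSH distance, and the empty patch stands in for a real binary diff such as bsdiff):

```rust
/// Placeholder locality-sensitive fingerprint; a real version would be a TLSH digest.
fn fingerprint(data: &[u8]) -> [u8; 32] {
    let mut fp = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        fp[i % 32] ^= b.rotate_left((i % 7) as u32);
    }
    fp
}

/// Placeholder similarity metric; a real version would be TLSH distance.
fn distance(a: &[u8; 32], b: &[u8; 32]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

enum Stored {
    /// Whole object (baseline-compressed in practice).
    Raw(Vec<u8>),
    /// A patch against a previously stored object.
    Delta { base: usize, patch: Vec<u8> },
}

struct DeltaStore {
    /// (stored form, fingerprint) per object; a real store would persist this index.
    objects: Vec<(Stored, [u8; 32])>,
    /// Only delta-encode against sufficiently similar neighbours.
    max_distance: u32,
}

impl DeltaStore {
    fn put(&mut self, data: Vec<u8>) {
        // 1. Compute and keep the similarity hash.
        let fp = fingerprint(&data);
        // 2. Find the most similar previously stored object.
        let nearest = self
            .objects
            .iter()
            .enumerate()
            .map(|(i, (_, other_fp))| (i, distance(&fp, other_fp)))
            .min_by_key(|&(_, d)| d);
        let stored = match nearest {
            // 3. Similar enough: store only the binary diff against the base.
            Some((base, d)) if d <= self.max_distance => Stored::Delta {
                base,
                patch: Vec::new(), // e.g. a bsdiff patch of `data` against the base object
            },
            // Otherwise fall back to storing the object whole.
            _ => Stored::Raw(data),
        };
        self.objects.push((stored, fp));
    }
}
```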

This seems like an interesting problem that could potentially reduce stored data significantly or at least improve the time it takes to pass incremental changes around workers (if a worker has an artifact in a local memory store already, it might only need to fetch a diff instead of an entirely new file).

Am I understanding correctly that the dedup store does something very similar, but delta compression is more "specialized" towards incremental builds? If so, it seems intuitive that this kind of compression could outperform the more general dedup approach.

cc @MarcusSorealheis It might be a very far reach, but would it theoretically be possible to offload similarity computations to a vector database like https://qdrant.tech/ or am I completely looking at the wrong thing here? 😅

@barrbrain
Author

barrbrain commented May 9, 2024

@aaronmondal Thank you for the insightful link to Qdrant. This highlights the missing component in my proposal, an approximate nearest neighbors (ANN) algorithm. I note that there is an active Rust crate that implements hierarchical navigable small worlds (HNSW), hnsw_rs.

@MarcusSorealheis
Collaborator

MarcusSorealheis commented May 9, 2024

Only a library is needed in this case, @aaronmondal, not a DB. It is a good thought.

@allada
Member

allada commented May 12, 2024

This is interesting. I've been thinking a bit about a very similar CS problem we have: I want to be better at finding workers that already have most of the assets and running jobs on them, instead of the LRU/MRU selection we use now.

I started playing around with SimHash algorithms to see if they could solve my problem, but found them not quite what I was looking for in this case. I then looked into KD-trees, but found them to scale horribly, and then went on to ANN (approximate nearest neighbor); these worked great, but it felt like "bringing a tank to a knife fight". I eventually settled on a Bloom filter as the best fit for this problem.

In the problem you are describing, I feel SimHash and/or ANN might do very well.
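
For what it's worth, the Bloom-filter idea is roughly this (a minimal std-only sketch; the structure and names are made up for illustration):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<bool>,
    hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, hashes: u64) -> Self {
        Self { bits: vec![false; num_bits], hashes }
    }

    /// Derive the bit position for `item` under hash function `seed`.
    fn index(&self, item: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    /// Record a digest the worker holds.
    fn insert(&mut self, item: &str) {
        for seed in 0..self.hashes {
            let i = self.index(item, seed);
            self.bits[i] = true;
        }
    }

    /// May return false positives, never false negatives.
    fn probably_contains(&self, item: &str) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.index(item, seed)])
    }
}

/// Score a worker by how many of the job's input digests it probably has.
fn overlap_score(worker_assets: &BloomFilter, inputs: &[&str]) -> usize {
    inputs.iter().filter(|d| worker_assets.probably_contains(d)).count()
}
```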

@barrbrain
Author

barrbrain commented May 14, 2024

Notes from some offline analysis

I created a corpus to explore this issue by collecting a snapshot of a CAS with DedupStore defaults and no other filters. I built 4 consecutive tags of Chromium for Linux, so that there was a mix of objects shared verbatim between builds, unique objects, and similar objects.

To test how well hash distance predicts delta efficiency, the sizes of the compressed delta and the compressed raw object were computed for each approximate nearest neighbour. Separating text and binary objects gives a clearer picture, but overall they have similar characteristics. A relatively low distance threshold is sufficient for most of the available compression gains. For this corpus, 9.5% of objects had neighbours within a TLSH distance of 14 and with efficient delta encodings. On average, these deltas were 92% smaller than baseline compression.

Although this only represents an 11.9% improvement in storage density overall, there is a 12-times improvement for incremental changes in content. Edit: updated numbers with BSDIFF4+LZ4 deltas.
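
For reference, the per-pair measurement is roughly the following (a sketch with the qbsdiff and lz4_flex crates, assuming plain LZ4 of the object as the baseline; the actual analysis tooling differs in detail):

```rust
use qbsdiff::Bsdiff;

/// Returns (LZ4-compressed delta size, LZ4-compressed raw size) for one pair.
fn delta_vs_baseline(neighbour: &[u8], object: &[u8]) -> std::io::Result<(usize, usize)> {
    // Encode the object as a bsdiff patch against its approximate nearest neighbour.
    let mut patch = Vec::new();
    Bsdiff::new(neighbour, object).compare(std::io::Cursor::new(&mut patch))?;
    // Compare the compressed patch against baseline compression of the raw object.
    let delta_size = lz4_flex::compress_prepend_size(&patch).len();
    let baseline_size = lz4_flex::compress_prepend_size(object).len();
    Ok((delta_size, baseline_size))
}
```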

@barrbrain
Author

barrbrain commented May 23, 2024

Here is an overview of the relationship between TLSH distance and delta efficiency over baseline compression, for this corpus. The y-axis is log2(delta/baseline). The x-axis is the TLSH distance between object pairs.
TLSH-256-QBSDIFF-LZ4

Edit: Using the above model, I refined the delta selection constraints. For this corpus, 17.8% of objects had neighbours within a TLSH-256-1 distance of 104 and delta encodings at least 4 times as efficient as baseline compression. On average, these deltas were 94.1% smaller than baseline compression. Although this only represents a 15.6% improvement in storage density overall, there is a 16.8-times improvement for incremental changes in content.
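
Expressed as a predicate, the refined selection constraint is simply (thresholds as measured on this corpus, not tuned defaults):

```rust
/// Delta-encode only when the neighbour is close and the delta clearly wins.
fn use_delta(tlsh_distance: u32, delta_size: usize, baseline_size: usize) -> bool {
    tlsh_distance <= 104 && delta_size * 4 <= baseline_size
}
```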

@allada
Member

allada commented May 23, 2024

This is very interesting. The RBE working group is actively discussing implementing something very similar to DedupStore as part of the protocol itself:
bazelbuild/remote-apis#282

I'll discuss with some of the internal team to see what kind of complexities and gotchas this might introduce.

@barrbrain
Author

barrbrain commented Jun 11, 2024

In the problem you are describing, I feel SimHash and/or ANN might do very well.

In prior attempts at this issue, I have seen good results with a carefully constructed MinHash.

There is a straightforward reduction of TLSH to a cluster fingerprint:
Take a suffix of the code part and for each 2-bit column, reduce to 1-bit predicated on whether the value was in the top quartile.

This yields similar recall performance to HNSW with just a range query on the composite key.
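
Roughly, in code (a sketch: `code_part` is the raw TLSH body, four 2-bit bucket values per byte; the suffix length and packing order here are illustrative):

```rust
/// Reduce the trailing `suffix_bytes` of a TLSH code part to one bit per
/// bucket; each pair of code-part bytes collapses into one fingerprint byte.
fn cluster_fingerprint(code_part: &[u8], suffix_bytes: usize) -> Vec<u8> {
    let start = code_part.len().saturating_sub(suffix_bytes);
    code_part[start..]
        .chunks(2)
        .map(|pair| {
            let mut out = 0u8;
            for &byte in pair {
                // Each byte holds four 2-bit bucket values; emit one bit per
                // bucket, set iff the value is 3 (count in the top quartile).
                for shift in [6, 4, 2, 0] {
                    out = (out << 1) | u8::from(((byte >> shift) & 0b11) == 0b11);
                }
            }
            out
        })
        .collect()
}
```

The packed bytes sort lexicographically, so a range scan over keys sharing a fingerprint prefix approximates the cluster query.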

An example from my corpus:

EC408104A021-T165B48E174216BCE0C8A859FE4627C6F005B0C957ACD378FEF6341AD6275ABADE808D47:
5f6e7fcec33b16a636b829d3a4408cf2d01c0404bea9457d7046b5eb22ade7f9-524288
EC408104A021-T12FB48E174216BCE0C8A859FE4627C6F005B0C957ACD378FEF5341AD6275ABADE808E47:
0052291bdaec9ba6429cde4ccfd52f0ba9c1b844ff69f53d861007b1dc770467-524288

Edit: This clustering method has a significant impact on the distribution:
TLSH-128-QBSDIFF-LZ4

Applying a similar approach within DedupStore and running the same reference builds, 13.5% of objects in the content store were delta-encoded and there was a 13.3% improvement in storage density overall.

@barrbrain
Author

I built a prototype and tested it with daily Chromium builds. It achieved an overall ratio of 1:4 on compressible content. Ratios were 1:15 at best for small objects and 1:213 at best for large objects.
I will propose a set of stores that may be composed to mimic git packing. In combination with DedupStore, they can mimic bup.

@MarcusSorealheis
Collaborator

Thank you for all the hard work here.

@barrbrain
Author

barrbrain commented Oct 27, 2024

There is a straightforward reduction of TLSH to a cluster fingerprint:
Take a suffix of the code part and for each 2-bit column, reduce to 1-bit predicated on whether the value was in the top quartile.

This yields similar recall performance to HNSW with just a range query on the composite key.

An enhancement to this method is to further reduce the prefix by taking the logical-or of bit pairs.
From the TLSH code part, 4-bit columns are reduced to 1-bit, with a 0.4375 probability of being set.
This bias may be addressed by inverting every other bit, which yields the following probabilities for 2 bits:
0.24609375, 0.31640625, 0.19140625, 0.24609375
This final transformation has no impact on exact prefix matches but provides the most uniform distribution function for every suffix. This appears to improve the locality of ordered keys.
Sorting the corpus by prefix and considering nearest-neighbor pairs by TLSH distance, the absolute difference in indices had quartiles of 5, 52 and 998. As a fraction of corpus size: 1/183137, 1/17609, 1/918.
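
Roughly, in code (again a sketch; the suffix length, packing order, and which alternate bits are inverted are illustrative):

```rust
/// Reduce each 4-bit column of the TLSH code part (two 2-bit bucket values)
/// to one bit, then invert every other bit to flatten the bias.
fn refined_fingerprint(code_part: &[u8], suffix_bytes: usize) -> Vec<u8> {
    let start = code_part.len().saturating_sub(suffix_bytes);
    let mut bits = Vec::new();
    for &byte in &code_part[start..] {
        for shift in [4, 0] {
            let column = (byte >> shift) & 0b1111;
            // Logical-or of the two "top quartile" bits: set with
            // probability 1 - (3/4)^2 = 0.4375 for uniform bucket values.
            let set = (column >> 2) == 0b11 || (column & 0b11) == 0b11;
            // Invert every other bit so the key distribution is more uniform.
            bits.push(set ^ (bits.len() % 2 == 1));
        }
    }
    // Pack the bits into bytes, most significant bit first.
    bits.chunks(8)
        .map(|chunk| chunk.iter().fold(0u8, |acc, &b| (acc << 1) | u8::from(b)))
        .collect()
}
```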

simple-prefix-comb

This clustering method has a more significant impact on the distribution:

TLSH-128-QBSDIFF-LZ4-SIMPLE

I will eventually benchmark this with DedupStore integration. This method is a simplification of what I used in the prototype. (#901 (comment)) There I range-coded the "is in 4th quartile" symbols and truncated the output to 32 bits. The method above is equivalent to range coding 2 symbols at a time and truncating the output to 1 bit.

@barrbrain
Author

barrbrain commented Dec 5, 2024

I will eventually benchmark this with DedupStore integration. This method is a simplification of what I used in the prototype. (#901 (comment)) There I range-coded the "is in 4th quartile" symbols and truncated the output to 32 bits. The method above is equivalent to range coding 2 symbols at a time and truncating the output to 1 bit.

With the aid of the store driver drafted in #1468, I measured an additional 3.8% improvement in storage density with this change in clustering method. Mere equivalence would have satisfied me. 😄

@barrbrain
Copy link
Author

barrbrain commented Dec 24, 2024

An update with the same clustering method and aehobak-encoded deltas, which improve upon the efficiency of bsdiff.
The y-axis is log2(delta/baseline). The x-axis is TLSH distance. Note the significant change in scale of the y-axis.
Edit: Updated with results from revised format.

TLSH-128-AEHOBAK-LZ4-SIMPLE
