
Add a delta store (similar to DedupStore) #901

Open
barrbrain opened this issue May 7, 2024 · 15 comments
@barrbrain

barrbrain commented May 7, 2024

Inspiration is taken from how git packs achieve high compression rates with good random access performance. The 2 key components are (1) a clustering strategy to store similar objects together and (2) a delta algorithm to encode objects predicted by an existing object. The initial proposed algorithms for analysis are:

  1. TLSH (debian package)
  2. BSDIFF (debian package)

There are crates that implement these algorithms:

  1. tlsh2
  2. qbsdiff

For background reading on how git achieves high density in a content-addressed store, see this blog post:
Git’s database internals I: packed object store

The combination of a DeltaStore and DedupStore is similar to the architecture of bup.
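
For a concrete feel of the delta component, here is a minimal sketch using the qbsdiff crate (function names are mine; the clustering side and the store plumbing are out of scope here):

```rust
use qbsdiff::{Bsdiff, Bspatch};

/// Encode `target` as a patch against a similar, already-stored `base`.
fn encode_delta(base: &[u8], target: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut patch = Vec::new();
    Bsdiff::new(base, target).compare(std::io::Cursor::new(&mut patch))?;
    Ok(patch)
}

/// Reconstruct `target` from `base` plus the stored patch.
fn apply_delta(base: &[u8], patch: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut target = Vec::new();
    Bspatch::new(patch)?.apply(base, std::io::Cursor::new(&mut target))?;
    Ok(target)
}
```

Decoding needs the base object, so the store has to keep base references alive, much like delta chains in a git pack.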

@allada
Member

allada commented May 7, 2024

Hey @barrbrain, thanks for the feedback.

Have you by chance looked at DedupStore?

I believe it is extremely close to what you are suggesting.

@barrbrain barrbrain changed the title Add a delta store (similar to dedupe store) Add a delta store (similar to DedupStore) May 8, 2024
@barrbrain
Author

Hi @allada, DedupStore is structurally similar to how I envision a delta store. However, the behavior is somewhat different. While DedupStore is effective for large extents repeated verbatim between objects, a delta store would be effective for large blocks containing many small changes with a consistent pattern. This fits source code changes quite well and binary object changes extremely well. This can boost compression efficiency by 2x~130x for applicable content.

@aaronmondal
Member

This type of compression is new to me, but would it look something like this?

  1. Compute TLSH hashes of inputs, store them.
  2. Assume that "similar inputs likely produce similar outputs" and group objects accordingly.
  3. When an action runs on inputs similar to a previous action (maybe it's the same command, just a file was edited), instead of storing outputs as an entirely new file, generate the (binary) diff between the new output and the old one and only store the diff (see the sketch below).
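
In other words, something like this (a rough, self-contained sketch: `fingerprint` and `distance` are placeholders for a real TLSH digest and TLSH distance, and the empty patch stands in for a real binary diff such as bsdiff):

```rust
/// Placeholder locality-sensitive fingerprint; a real version would be a TLSH digest.
fn fingerprint(data: &[u8]) -> [u8; 32] {
    let mut fp = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        fp[i % 32] ^= b.rotate_left((i % 7) as u32);
    }
    fp
}

/// Placeholder similarity metric; a real version would be TLSH distance.
fn distance(a: &[u8; 32], b: &[u8; 32]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

enum Stored {
    /// Whole object (baseline-compressed in practice).
    Raw(Vec<u8>),
    /// A patch against a previously stored object.
    Delta { base: usize, patch: Vec<u8> },
}

struct DeltaStore {
    /// (stored form, fingerprint) per object; a real store would persist this index.
    objects: Vec<(Stored, [u8; 32])>,
    /// Only delta-encode against sufficiently similar neighbours.
    max_distance: u32,
}

impl DeltaStore {
    fn put(&mut self, data: Vec<u8>) {
        // 1. Compute and keep the similarity hash.
        let fp = fingerprint(&data);
        // 2. Find the most similar previously stored object.
        let nearest = self
            .objects
            .iter()
            .enumerate()
            .map(|(i, (_, other_fp))| (i, distance(&fp, other_fp)))
            .min_by_key(|&(_, d)| d);
        let stored = match nearest {
            // 3. Similar enough: store only the binary diff against the base.
            Some((base, d)) if d <= self.max_distance => Stored::Delta {
                base,
                patch: Vec::new(), // e.g. a bsdiff patch of `data` against the base object
            },
            // Otherwise fall back to storing the object whole.
            _ => Stored::Raw(data),
        };
        self.objects.push((stored, fp));
    }
}
```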

This seems like an interesting problem that could potentially reduce stored data significantly or at least improve the time it takes to pass incremental changes around workers (if a worker has an artifact in a local memory store already, it might only need to fetch a diff instead of an entirely new file).

Am I understanding correctly that the dedup store does something very similar, but delta compression is more "specialized" towards incremental builds? If so, it seems intuitive that this kind of compression could outperform the more general dedup approach.

cc @MarcusSorealheis It might be a very far reach, but would it theoretically be possible to offload similarity computations to a vector database like https://qdrant.tech/ or am I completely looking at the wrong thing here? 😅

@barrbrain
Author

barrbrain commented May 9, 2024

@aaronmondal Thank you for the insightful link to Qdrant. This highlights the missing component in my proposal, an approximate nearest neighbors (ANN) algorithm. I note that there is an active Rust crate that implements hierarchical navigable small worlds (HNSW), hnsw_rs.

@MarcusSorealheis
Collaborator

MarcusSorealheis commented May 9, 2024

Only a library is needed in this case, @aaronmondal, not a DB. It is a good thought.

@allada
Member

allada commented May 12, 2024

This is interesting. I've been thinking a bit about a very similar CS problem we have: I want to be better at finding workers that already have most of the assets and running jobs on them, instead of the LRU/MRU selection we use now.

I started playing around with SimHash algorithms to see if they could solve my problem, but found them not quite what I was looking for in this case. I then looked into KD-trees, but found them to scale horribly, and then went on to ANN (approximate nearest neighbor); these worked great, but it felt like "bringing a tank to a knife fight". I eventually settled on a Bloom filter as the best fit for this problem.

In the problem you are describing, I feel SimHash and/or ANN might do very well.
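
For what it's worth, the Bloom-filter idea is roughly this (a minimal std-only sketch; the structure and names are made up for illustration):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<bool>,
    hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, hashes: u64) -> Self {
        Self { bits: vec![false; num_bits], hashes }
    }

    /// Derive the bit position for `item` under hash function `seed`.
    fn index(&self, item: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    /// Record a digest the worker holds.
    fn insert(&mut self, item: &str) {
        for seed in 0..self.hashes {
            let i = self.index(item, seed);
            self.bits[i] = true;
        }
    }

    /// May return false positives, never false negatives.
    fn probably_contains(&self, item: &str) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.index(item, seed)])
    }
}

/// Score a worker by how many of the job's input digests it probably has.
fn overlap_score(worker_assets: &BloomFilter, inputs: &[&str]) -> usize {
    inputs.iter().filter(|d| worker_assets.probably_contains(d)).count()
}
```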

@barrbrain
Author

barrbrain commented May 14, 2024

Notes from some offline analysis

I created a corpus to explore this issue by collecting a snapshot of a CAS with DedupStore defaults and no other filters. I built 4 consecutive tags of Chromium for Linux, so that there was a mix of objects shared verbatim between builds, unique objects, and similar objects.

To test how well hash distance predicts delta efficiency, the sizes of the compressed delta and the compressed raw object were computed for each approximate nearest neighbour. Separating text and binary objects gives a clearer picture, but overall they have similar characteristics. A relatively low distance threshold is sufficient for most of the available compression gains. For this corpus, 9.5% of objects had neighbours within a TLSH distance of 14 and with efficient delta encodings. On average, these deltas were 92% smaller than baseline compression.

Although this only represents an 11.9% improvement in storage density overall, there is a 12-times improvement for incremental changes in content. Edit: updated numbers with BSDIFF4+LZ4 deltas.
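
For reference, the per-pair measurement is roughly the following (a sketch with the qbsdiff and lz4_flex crates, assuming plain LZ4 of the object as the baseline; the actual analysis tooling differs in detail):

```rust
use qbsdiff::Bsdiff;

/// Returns (LZ4-compressed delta size, LZ4-compressed raw size) for one pair.
fn delta_vs_baseline(neighbour: &[u8], object: &[u8]) -> std::io::Result<(usize, usize)> {
    // Encode the object as a bsdiff patch against its approximate nearest neighbour.
    let mut patch = Vec::new();
    Bsdiff::new(neighbour, object).compare(std::io::Cursor::new(&mut patch))?;
    // Compare the compressed patch against baseline compression of the raw object.
    let delta_size = lz4_flex::compress_prepend_size(&patch).len();
    let baseline_size = lz4_flex::compress_prepend_size(object).len();
    Ok((delta_size, baseline_size))
}
```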

@barrbrain
Author

barrbrain commented May 23, 2024

Here is an overview of the relationship between TLSH distance and delta efficiency over baseline compression, for this corpus. The y-axis is log2(delta/baseline). The x-axis is the TLSH distance between object pairs.
TLSH-256-QBSDIFF-LZ4

Edit: Using the above model, I refined the delta selection constraints. For this corpus, 17.8% of objects had neighbours within a TLSH-256-1 distance of 104 and delta encodings at least 4 times as efficient as baseline compression. On average, these deltas were 94.1% smaller than baseline compression. Although this only represents a 15.6% improvement in storage density overall, there is a 16.8-times improvement for incremental changes in content.
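
Expressed as a predicate, the refined selection constraint is simply (thresholds as measured on this corpus, not tuned defaults):

```rust
/// Delta-encode only when the neighbour is close and the delta clearly wins.
fn use_delta(tlsh_distance: u32, delta_size: usize, baseline_size: usize) -> bool {
    tlsh_distance <= 104 && delta_size * 4 <= baseline_size
}
```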

@allada
Member

allada commented May 23, 2024

This is very interesting. The RBE working group is actively discussing implementing something very similar to DedupStore as part of the protocol itself:
bazelbuild/remote-apis#282

I'll discuss with some of the internal team to see what kind of complexities and gotchas this might introduce.

@barrbrain
Author

barrbrain commented Jun 11, 2024

In the problem you are describing, I feel SimHash and/or ANN might do very well.

In prior attempts at this issue, I have seen good results with a carefully constructed MinHash.

There is a straightforward reduction of TLSH to a cluster fingerprint:
Take a suffix of the code part and for each 2-bit column, reduce to 1-bit predicated on whether the value was in the top quartile.

This yields similar recall performance to HNSW with just a range query on the composite key.
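
Roughly, in code (a sketch: `code_part` is the raw TLSH body, four 2-bit bucket values per byte; the suffix length and packing order here are illustrative):

```rust
/// Reduce the trailing `suffix_bytes` of a TLSH code part to one bit per
/// bucket; each pair of code-part bytes collapses into one fingerprint byte.
fn cluster_fingerprint(code_part: &[u8], suffix_bytes: usize) -> Vec<u8> {
    let start = code_part.len().saturating_sub(suffix_bytes);
    code_part[start..]
        .chunks(2)
        .map(|pair| {
            let mut out = 0u8;
            for &byte in pair {
                // Each byte holds four 2-bit bucket values; emit one bit per
                // bucket, set iff the value is 3 (count in the top quartile).
                for shift in [6, 4, 2, 0] {
                    out = (out << 1) | u8::from(((byte >> shift) & 0b11) == 0b11);
                }
            }
            out
        })
        .collect()
}
```

The packed bytes sort lexicographically, so a range scan over keys sharing a fingerprint prefix approximates the cluster query.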

An example from my corpus:

EC408104A021-T165B48E174216BCE0C8A859FE4627C6F005B0C957ACD378FEF6341AD6275ABADE808D47:
5f6e7fcec33b16a636b829d3a4408cf2d01c0404bea9457d7046b5eb22ade7f9-524288
EC408104A021-T12FB48E174216BCE0C8A859FE4627C6F005B0C957ACD378FEF5341AD6275ABADE808E47:
0052291bdaec9ba6429cde4ccfd52f0ba9c1b844ff69f53d861007b1dc770467-524288

Edit: This clustering method has a significant impact on the distribution:
TLSH-128-QBSDIFF-LZ4

Applying a similar approach within DedupStore and running the same reference builds, 13.5% of objects in the content store were delta-encoded and there was a 13.3% improvement in storage density overall.

@barrbrain
Author

I built a prototype and tested it with daily Chromium builds. It achieved an overall ratio of 1:4 on compressible content. Ratios were 1:15 at best for small objects and 1:213 at best for large objects.
I will propose a set of stores that may be composed to mimic git packing. In combination with DedupStore, they can mimic bup.

@MarcusSorealheis
Collaborator

Thank you for all the hard work here.

@barrbrain
Author

barrbrain commented Oct 27, 2024

There is a straightforward reduction of TLSH to a cluster fingerprint:
Take a suffix of the code part and for each 2-bit column, reduce to 1-bit predicated on whether the value was in the top quartile.

This yields similar recall performance to HNSW with just a range query on the composite key.

An enhancement to this method is to further reduce the prefix by taking the logical-or of bit pairs.
From the TLSH code part, 4-bit columns are reduced to 1-bit, with a 0.4375 probability of being set.
This bias may be addressed by inverting every other bit, which yields the following probabilities for 2 bits:
0.24609375, 0.31640625, 0.19140625, 0.24609375
This final transformation has no impact on exact prefix matches but provides the most uniform distribution function for every suffix. This appears to improve the locality of ordered keys.
Sorting the corpus by prefix and considering nearest-neighbor pairs by TLSH distance, the absolute difference in indices had quartiles of 5, 52 and 998. As a fraction of corpus size: 1/183137, 1/17609, 1/918.
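
Roughly, in code (again a sketch; the suffix length, packing order, and which alternate bits are inverted are illustrative):

```rust
/// Reduce each 4-bit column of the TLSH code part (two 2-bit bucket values)
/// to one bit, then invert every other bit to flatten the bias.
fn refined_fingerprint(code_part: &[u8], suffix_bytes: usize) -> Vec<u8> {
    let start = code_part.len().saturating_sub(suffix_bytes);
    let mut bits = Vec::new();
    for &byte in &code_part[start..] {
        for shift in [4, 0] {
            let column = (byte >> shift) & 0b1111;
            // Logical-or of the two "top quartile" bits: set with
            // probability 1 - (3/4)^2 = 0.4375 for uniform bucket values.
            let set = (column >> 2) == 0b11 || (column & 0b11) == 0b11;
            // Invert every other bit so the key distribution is more uniform.
            bits.push(set ^ (bits.len() % 2 == 1));
        }
    }
    // Pack the bits into bytes, most significant bit first.
    bits.chunks(8)
        .map(|chunk| chunk.iter().fold(0u8, |acc, &b| (acc << 1) | u8::from(b)))
        .collect()
}
```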

simple-prefix-comb

This clustering method has a more significant impact on the distribution:

TLSH-128-QBSDIFF-LZ4-SIMPLE

I will eventually benchmark this with DedupStore integration. This method is a simplification of what I used in the prototype. (#901 (comment)) There I range-coded the "is in 4th quartile" symbols and truncated the output to 32 bits. The method above is equivalent to range coding 2 symbols at a time and truncating the output to 1 bit.

@barrbrain
Author

barrbrain commented Dec 5, 2024

I will eventually benchmark this with DedupStore integration. This method is a simplification of what I used in the prototype. (#901 (comment)) There I range-coded the "is in 4th quartile" symbols and truncated the output to 32 bits. The method above is equivalent to range coding 2 symbols at a time and truncating the output to 1 bit.

With the aid of the store driver drafted in #1468, I measured an additional 3.8% improvement in storage density with this change in clustering method. Mere equivalence would have satisfied me. 😄

@barrbrain
Copy link
Author

barrbrain commented Dec 24, 2024

An update with the same clustering method and aehobak-encoded deltas, which improve upon the efficiency of bsdiff.
The y-axis is log2(delta/baseline). The x-axis is TLSH distance. Note the significant change in scale of the y-axis.
Edit: Updated with results from revised format.

TLSH-128-AEHOBAK-LZ4-SIMPLE
