Add a delta store (similar to DedupStore) #901
Comments
Hey @barrbrain, thanks for the feedback. Have you by chance looked at DedupStore? I believe it is extremely close to what you are suggesting.
Hi @allada,
This type of compression is new to me, but would it look something like this?
This seems like an interesting problem that could potentially reduce stored data significantly, or at least improve the time it takes to pass incremental changes around workers (if a worker already has an artifact in a local memory store, it might only need to fetch a diff instead of an entirely new file). Am I understanding correctly that the dedup store does something very similar, but delta compression is more "specialized" towards incremental builds? If so, it seems intuitive that this kind of compression could outperform the more general dedup approach. cc @MarcusSorealheis It might be a very far reach, but would it theoretically be possible to offload similarity computations to a vector database like https://qdrant.tech/ or am I completely looking at the wrong thing here? 😅
@aaronmondal Thank you for the insightful link to Qdrant. This highlights the missing component in my proposal: an approximate nearest neighbours (ANN) algorithm. I note that there is an active Rust crate that implements hierarchical navigable small worlds (HNSW), hnsw_rs.
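As a rough illustration of the query such an index must answer, here is a minimal Rust sketch. A brute-force scan with Hamming distance stands in for the real thing (TLSH defines its own distance function, and in practice the HNSW graph from hnsw_rs would replace the linear scan); all names here are illustrative, not NativeLink code.

```rust
/// Bit-level Hamming distance between two fixed-width similarity digests.
/// (A stand-in for the TLSH distance function.)
fn hamming(a: &[u8; 32], b: &[u8; 32]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Return the closest stored digest within `max_dist`, if any.
/// An ANN index would answer this query without scanning every entry.
fn nearest<'a>(
    query: &[u8; 32],
    stored: &'a [[u8; 32]],
    max_dist: u32,
) -> Option<&'a [u8; 32]> {
    stored
        .iter()
        .map(|d| (hamming(query, d), d))
        .filter(|&(dist, _)| dist <= max_dist)
        .min_by_key(|&(dist, _)| dist)
        .map(|(_, d)| d)
}
```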
Only a library is needed @aaronmondal in this case, not a DB. It is a good thought.
This is interesting. I've been thinking a bit about a very similar CS problem we have: I want to be better at finding workers that already have most of the assets and run jobs on them, instead of the LRU/MRU selection we use now. I started playing around with SimHash algorithms to see if they could solve my problem, but found them not quite what I was looking for in this case. I then looked into KD-trees, but found they scale horribly, and went on to ANN (approximate nearest neighbour) algorithms; these worked great, but it felt like "bringing a tank to a knife fight". I eventually settled on a Bloom filter as the best fit for this problem. In the problem you are describing, I feel …
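A minimal sketch of that Bloom-filter idea, assuming each worker advertises a filter over the digests it already holds and the scheduler counts probable hits per job; the types and names are hypothetical, not NativeLink's scheduler API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A tiny Bloom filter over asset digests a worker already holds.
struct Bloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl Bloom {
    fn new(nbits: usize, hashes: u64) -> Self {
        Bloom { bits: vec![false; nbits], hashes }
    }

    /// Derive the i-th bit index for an item by seeding the hash with `i`.
    fn index(&self, item: &[u8], i: u64) -> usize {
        let mut h = DefaultHasher::new();
        i.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, item: &[u8]) {
        for i in 0..self.hashes {
            let idx = self.index(item, i);
            self.bits[idx] = true;
        }
    }

    /// May return false positives, never false negatives.
    fn probably_contains(&self, item: &[u8]) -> bool {
        (0..self.hashes).all(|i| self.bits[self.index(item, i)])
    }
}

/// Score a worker: how many of a job's input digests it probably has.
/// The scheduler would prefer the highest-scoring worker.
fn score(worker_assets: &Bloom, inputs: &[Vec<u8>]) -> usize {
    inputs.iter().filter(|d| worker_assets.probably_contains(d)).count()
}
```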
Notes from some offline analysis

I created a corpus to explore this issue by collecting a snapshot of a CAS with …

To test the quality of hash distance in predicting delta efficiency, for each approximate nearest neighbour the sizes of the compressed delta and the compressed raw object were computed. Separating text and binary objects gives a clearer picture, but overall they have similar characteristics. A relatively low distance threshold is sufficient to capture most of the available compression gains.

For this corpus, 9.5% of objects had neighbours within a TLSH distance of 14 and with efficient delta encodings. On average, these deltas were 92% smaller than baseline compression. Although this only represents an 11.9% improvement in storage density overall, there is a 12× improvement for incremental changes in content.

Edit: updated numbers with …
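The storage decision implied by this analysis could look roughly like the following sketch, with the distance threshold of 14 taken from the numbers above; the types and names are illustrative.

```rust
/// How an object ends up stored: either baseline-compressed, or as a
/// compressed delta against a sufficiently similar base object.
enum Encoding {
    Raw(Vec<u8>),
    Delta { base: [u8; 32], patch: Vec<u8> },
}

/// TLSH distance threshold reported in the corpus analysis above.
const MAX_TLSH_DISTANCE: u32 = 14;

/// Prefer a delta only when a close-enough neighbour exists AND the delta
/// actually beats baseline compression for this object.
fn choose_encoding(
    compressed_raw: Vec<u8>,
    // (distance, base digest, compressed delta) from the similarity lookup.
    neighbour: Option<(u32, [u8; 32], Vec<u8>)>,
) -> Encoding {
    match neighbour {
        Some((dist, base, patch))
            if dist <= MAX_TLSH_DISTANCE && patch.len() < compressed_raw.len() =>
        {
            Encoding::Delta { base, patch }
        }
        _ => Encoding::Raw(compressed_raw),
    }
}
```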
This is very interesting. The RBE working group is actively looking at and discussing implementing something very similar to this. I'll discuss with some of the internal team to see what kind of complexities and gotchas this might introduce.
I built a prototype and tested it with daily chromium builds. It was able to achieve an overall ratio of 1:4 on compressible content. Ratios for small objects were 1:15 at best, and for large objects 1:213 at best.
Thank you for all the hard work here.
An enhancement to this method is to further reduce the prefix by taking the logical OR of bit pairs. This clustering method has a more significant impact on the distribution. I will eventually benchmark this with …
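A minimal sketch of that folding step, assuming the clustering prefix is taken from the leading bytes of the similarity digest:

```rust
/// Fold each pair of adjacent bits with logical OR, halving the width of
/// the clustering prefix so that near-miss digests land in the same
/// cluster more often.
fn fold_bit_pairs(prefix: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity((prefix.len() + 1) / 2);
    let mut acc: u8 = 0;
    let mut nbits = 0;
    for &byte in prefix {
        // Walk the four bit pairs of this byte, most significant first.
        for pair in (0..4).rev() {
            let bits = (byte >> (pair * 2)) & 0b11;
            let folded = ((bits >> 1) | bits) & 1; // logical OR of the pair
            acc = (acc << 1) | folded;
            nbits += 1;
            if nbits == 8 {
                out.push(acc);
                acc = 0;
                nbits = 0;
            }
        }
    }
    if nbits > 0 {
        out.push(acc << (8 - nbits)); // pad a trailing partial byte
    }
    out
}
```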
With the aid of the store driver drafted in #1468, I measured an additional 3.8% improvement in storage density with this change in clustering method. Mere equivalence would have satisfied me. 😄
An update with the same clustering method and aehobak-encoded deltas, which improve upon the efficiency of bsdiff.
Inspiration is taken from how git packs achieve high compression rates with good random-access performance. The two key components are (1) a clustering strategy to store similar objects together and (2) a delta algorithm to encode objects predicted by an existing object (see the sketch at the end of this description). The initial proposed algorithms for analysis are TLSH for clustering and bsdiff for delta encoding.
There are Rust crates that implement both of these algorithms.
For background reading on how git achieves high density in a content-addressed store, see this blog post:
Git’s database internals I: packed object store
The combination of a DeltaStore and DedupStore is similar to the architecture of bup.
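As a sketch of how those two components could fit together (illustrative traits and names, not the actual DeltaStore interface):

```rust
/// Component 1: a clustering strategy; similar objects should map to the
/// same cluster key so they can be stored together.
trait Clustering {
    fn cluster_key(&self, data: &[u8]) -> Vec<u8>;
}

/// Component 2: a delta algorithm (bsdiff-style) that encodes an object
/// against a predicted base and reconstructs it on read.
trait DeltaCodec {
    fn encode(&self, base: &[u8], data: &[u8]) -> Vec<u8>;
    fn decode(&self, base: &[u8], patch: &[u8]) -> Vec<u8>;
}

/// How an object lands in the underlying store.
enum Stored {
    /// Whole object; becomes a delta base (seed) for its cluster.
    Seed(Vec<u8>),
    /// Patch against an existing base in the same cluster.
    Patch { base_key: Vec<u8>, patch: Vec<u8> },
}

/// Write path: try to delta-encode against a base from the same cluster,
/// falling back to storing the object whole as a new cluster seed.
fn put(
    clustering: &impl Clustering,
    codec: &impl DeltaCodec,
    find_base: impl Fn(&[u8]) -> Option<Vec<u8>>, // base lookup by cluster key
    data: &[u8],
) -> Stored {
    let key = clustering.cluster_key(data);
    match find_base(&key) {
        Some(base) => Stored::Patch { base_key: key, patch: codec.encode(&base, data) },
        None => Stored::Seed(data.to_vec()),
    }
}
```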