idea/request for help: zstd as a binary diff producer/patcher? #1935
Comments
These are all good points @gima, and we can certainly make some progress along these directions.
Yes, this is allowed. The library detects this mode automatically, and applies it any time it doesn't find a valid dictionary header at the beginning of the provided file (the content is then used as a raw dictionary).
It's already the case, although maybe only partially. The simple (one-shot) API supports it directly. The streaming implementation is slightly more complex, and requires initializing the context with the reference data as a dictionary.

Which leads us to our first issue: the dictionary API wasn't built for such a use case, and as a consequence, there is currently no way to reduce the memory load caused by loading a huge "dictionary" as a reference point. This led to the artificial limitation of 32 MB for the dictionary size: we just want to protect users from unexpected memory usage, which could be harmful to concurrent operations, or worse, might trigger some form of resource-denial attack.

There is a second limitation: as defined in the RFC, the dictionary's content is only accessible as long as the content being compressed is not larger than the window size. This requires setting the window size to a large enough value to ensure the condition holds. We could change that, and implement some form of "reasonable defaults" that would automatically look for optimal parameters given the size of both the "reference data" and the "content to compress". There's a bit of work involved, but that looks within reach. One important issue is that setting a large window size pushes the same requirement onto the decoder, which can be a problem for large window sizes.

Finally, a last issue is that such a compressed frame does not contain any reference to the "base data" that must be used as a dictionary for proper decompression to happen. Proper content regeneration is still checked with a checksum (enabled by default in the CLI), but that's about it.
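A CLI-level sketch of the window-size point above, assuming the documented behaviour of the `--long=#` and `-D` options (file names are hypothetical): enlarging the window lets more of the reference data be used, and the decoder must then be told to accept the same window.

```fish
# Use a large reference file as a raw dictionary, and enlarge the window
# (windowLog 30 here, i.e. 1 GiB) so the whole reference stays addressable.
$ zstd --long=30 -D reference.bin new-version.bin -o new-version.zst

# The same window requirement is pushed onto the decoder: without a matching
# --long (or --memory) setting, it will refuse such a large-window frame.
$ zstd -d --long=30 -D reference.bin new-version.zst -o new-version.restored
$ cmp new-version.bin new-version.restored
```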
This is a really interesting use case @gima! Thanks for posting the detailed issue. I plan to scope this out a little further this coming week. We will likely introduce a new CLI command for this.
Just wanted to inform you both that patch-from was merged into dev today. Here is a brief summary and some benchmarks: #1959 (comment). Unless something comes up, this is what will make it into 1.4.5. Would appreciate it if you could try it on your use cases and provide some feedback before the release comes around :)
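For reference, the merged feature is exposed through the CLI's `--patch-from` option. A minimal sketch of the diff/apply round trip, with hypothetical file names (exact flag requirements for very large inputs may differ between releases):

```fish
# produce a patch describing how to turn old.bin into new.bin
$ zstd --patch-from=old.bin new.bin -o new.patch.zst

# apply the patch: regenerate new.bin from old.bin plus the patch
$ zstd -d --patch-from=old.bin new.patch.zst -o new.rebuilt.bin
$ cmp new.bin new.rebuilt.bin

# for inputs larger than the default window, a matching --long=# on both
# sides may also be required (assumption; check the man page of your release)
```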
Bi-directional patching is tracked in #2063. Closing since the feature is merged.
Summary
I would like to quickly produce a diff file of the changes from some binary file A to B (where B is a changed version of A). (I managed to do this with zstd, but see the "What I tried" section below.)
It could be said (from my naive viewpoint) that finding differences between files is somewhat the domain of dictionary-based compression programs. So why reinvent the wheel (and create yet another piece of software)?
Reasoning
For starters, there currently seems to be a lack of stand-alone utilities that do this. All of them seem to be tied to something else: zsync (unmaintained, as far as I can see) is tied to URLs and HTTP, and bsdiff takes nearly 30 seconds to generate the diff file (whereas zstd does this in under a second). Bigger tools such as casync require all-or-nothing adoption of their way of doing things.
Secondly, most(?) Linux distros provide package updates as entirely new files to be downloaded. There are major bandwidth (and monetary) savings to be had here if an efficient (and easy, stand-alone) binary diff were available.
.. And again, since zstd already needs to find repetitions and their positions in files, exposing functionality to produce and apply diff files (or at least adding it to the public API) could be a good fit here.
What I tried
I managed to use zstd to produce a very small diff file of the changes from binary file A to B (with very fast creation time, less than 1 sec): around 1-2 KB for both test cases ("simple" and "complex"). This small diff file was then given to zstd as the file to be decompressed, and the original binary file (A) was given to zstd to use as the dictionary. This procedure was able to reproduce binary file B.
Transcript of the commands used:
preparing the file that's being used in this experiment:

```fish
$ cp /bin/qemu-system-x86_64 bin
```

splitting the binary file in two and showing that, when combined, the splits are equal to the original
putting second half of the file in place
listing current state of directory
test #1 ("simple"): compressing, decompressing and comparing (using the original binary file as dictionary)
test #2 ("complex"): compressing, decompressing and comparing (using the original binary file as dictionary)
listing current state of directory
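Condensed, the procedure in the transcript boils down to something like the following (a hedged sketch with illustrative file names, relying on the raw-dictionary behaviour discussed in the answers above):

```fish
# "diff": compress the new version, using the old version as a raw dictionary;
# matches against the old file are encoded as references, so the output stays tiny
$ zstd -D old.bin new.bin -o patch.zst

# "patch": decompress, supplying the same old file as the dictionary
$ zstd -d -D old.bin patch.zst -o new.restored.bin
$ cmp new.bin new.restored.bin
```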
Question #1:
As can be witnessed, I found that I can give zstd any file to be used as a dictionary. zstd happily ingests it, even if the given dictionary file was not generated using zstd's own --train argument.
.. Question: Is this allowed? Can I rely on zstd allowing me to do this in the future?
Question #2:
Would it be reasonable to expose this functionality via the API in some way, so that the "otherwise unnecessary parts" (whatever they are) could be avoided?
Question #3:
I ran into a wall when trying this procedure on files bigger than 32 MB: zstd refuses to use dictionaries larger than that:
..the error