Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

idea/request for help: zstd as a binary diff producer/patcher? #1935

Closed
gima opened this issue Dec 25, 2019 · 4 comments
Closed

idea/request for help: zstd as a binary diff producer/patcher? #1935

gima opened this issue Dec 25, 2019 · 4 comments
Assignees

Comments

@gima
Copy link

gima commented Dec 25, 2019

Summary

I would like to produce a diff file (quickly) of the changes from some binary file A to B (where B is a changed version of A). (i managed to do this with zstd, but, see below: "What I tried" section).

It could be said (from my naive viewpoint) that finding differences between files is somewhat the domain of dictionary-based compression programs. So why re-invent the wheel (and create yet another new software)?

Reasoning

For starters, there currently seems to be a lack of stand-alone utilities that do this. All of them seem to be tied to something else. Be it zsync (unmaintained, as far as I see) being tied to urls and http, bsdiff taking nearly 30 seconds to generate the diff file (whereas zstd does this in under a second). Then bigger tools such as casync require all-or-nothing adoption of their way of doing things.

Secondly, most(?) Linux distros provide package updates as a totally new files-to-be-downloaded. There are major bandwidth (and monetary) savings that could be had here if an efficient (and easy / stand alone) binary diff could be had.

.. And again, since zstd needs to find repetitions and their positions in files, exposing functionality that supports using all of this to produce and use diff files (or at least stapling this functionality to public api) could be a good fit here.

What I tried

I managed to use zstd to produce a very small diff file of changes from binary file A to B (with very fast creation time; less than 1sec) of around 1-2KB for both test cases ("simple" and "complex"). This small diff file was then given to zstd as the-file-to-be-decompressed and the original binary file (A) was given to zstd to use as dictionary. This procedure was able to reproduce the binary file B.

[Click to show transcript of the commands used] preparing the file that's being used in this experiment: ```fish $ cp /bin/qemu-system-x86_64 bin ```

splitting the binary file in two and showing that the when combined, the splits are equal to the original

$ split -n 2 -d bin bin.split-

$ cat bin.split-00 bin.split-01 | diff -s bin /dev/stdin
Files bin and /dev/stdin are identical

putting second half of the file in place

$ cat bin.split-01 bin.split-00 > binrev

$ cat bin.split-01 > bintailhalf

listing current state of directory

$ lf
16227960  ./bin
8113980   ./bin.split-00
8113980   ./bin.split-01
16227960  ./binrev
8113980   ./bintailhalf

test #1 ("simple"): compressing, decompressing and comparing (using the original binary file as dictionary)

# compressing
$ zstd -D bin --long=31 --zstd=ldmHashRateLog=25,chainLog=28 -vv -f -3 binrev -o binrev.bindict.zstd
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Loading bin as dictionary 
binrev               :  0.01%   (16227960 =>   1521 bytes, binrev.bindict.zstd) 
binrev               : Completed in 0.16 sec  (cpu load : 98%)

# decompressing
$ zstd -D bin -vv -d -o binrev.bindict.zstd.decompressed binrev.bindict.zstd
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Loading bin as dictionary 
binrev.bindict.zstd : 16227960 bytes

# comparing to original
$ diff -s binrev binrev.bindict.zstd.decompressed
Files binrev and binrev.bindict.zstd.decompressed are identical

test #2 ("complex"): compressing, decompressing and comparing (using the original binary file as dictionary)

# compressing
$ zstd -D bin --long=31 --zstd=ldmHashRateLog=25,chainLog=28 -vv -f -3 bintailhalf -o bintailhalf.bindict.zstd
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Loading bin as dictionary 
bintailhalf          :  0.01%   (8113980 =>    742 bytes, bintailhalf.bindict.zstd) 
bintailhalf          : Completed in 0.13 sec  (cpu load : 100%)

# decompressing
$ zstd -D bin -vv -d -o bintailhalf.bindict.zstd.decompressed bintailhalf.bindict.zstd
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Loading bin as dictionary 
bintailhalf.bindict.zstd: 8113980 bytes

# comparing to original
$ diff -s bintailhalf bintailhalf.bindict.zstd.decompressed
Files bintailhalf and bintailhalf.bindict.zstd.decompressed are identical

listing current state of directory

$ lf
16227960  ./bin
8113980   ./bin.split-00
8113980   ./bin.split-01
16227960  ./binrev
8113980   ./bintailhalf
742       ./bintailhalf.bindict.zstd
8113980   ./bintailhalf.bindict.zstd.decompressed
1521      ./binrev.bindict.zstd
16227960  ./binrev.bindict.zstd.decompressed

Question #1:
As can be witnessed, I found that I can give to zstd any file to be used as a dictionary. zstd happily ingests it, even if the given dictionary file was not generated using the zstd's own --train argument.
.. Question: Is this allowed? Can I rely on zstd allowing me to do this in the future?

Question #2:
Would it be reasonable to expose this functionality via the API in some way, so that the "otherwise unnecessary parts" (whatever they are) could be avoided?

Question #3:
Currently I ran against a wall when trying this procedure on files bigger than 32MB. zstd refuses to use dictionaries bigger than this:

..the error

*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Loading /usr/bin/docker as dictionary 
zstd: error 32 : Dictionary file /usr/bin/docker is too large (> 32 MB) 
Question: Is this a necessity? Could this check be removed from the source code without any ill effect? (thus allowing bigger binaries to be "diffed")
@Cyan4973
Copy link
Contributor

Cyan4973 commented Dec 26, 2019

These are all good points @gima , and we can certainly make some progresses along these directions.

I found that I can give to zstd any file to be used as a dictionary.

Yes, this is allowed.
In this mode, the dictionary is reduced to its pure content, and doesn't hold any metadata, such as a dictID field for example. It's still completely valid, for compression and decompression.

The library detects this mode automatically, and applies it any time it doesn't find a valid header from zstd --train. For completeness, it's possible to finely control this behavior (rather than letting it figure it out automatically) using the dictContentMode control variable : https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L1136 .
This is optional : only specific corner cases should ever need such control.

Would it be reasonable to expose this functionality via the API in some way,

It's already the case, although maybe partially.

The simple API is ZSTD_compress_usingDict() and ZSTD_decompress_usingDict()`. It's compatible with this scenario, though it requires both the dictionary and the target content to be present entirely in memory.

Streaming implementation is slightly more complex, and requires to initialize the context with ZSTD_CCtx_loadDictionary()
( and similarly ZSTD_DCtx_loadDictionary() for decompression ). Note that it still requires the content of the "dictionary" to be entirely present in memory.

Which leads us to our first issue : the dictionary API wasn't built for such a use case, and as a consequence, there is currently no way to reduce the memory load due to loading a huge "dictionary" as a reference point.

This led to the artificial limitation of 32 MB for the dictionary size : we just want to protect users from some unexpected memory usage, which could be harmful to concurrent operation, or worse might trigger some form of resource denial attack.
Hence, removing this limit, while perfectly possible from a library stand point, should be implemented in a way which still protects regular users from exceptional side effects, such as, for example, some opt-in mechanism, like existing --memory=# for example.

There is a second limitation : as defined in the RFC, the dictionary's content is only accessible as long as the content being compressed is not larger than the window size. This requires to set the window size to some large enough value to ensure this condition holds.
Right now, this condition must be ensured manually, because the window size is mostly a matter of compression level. This is not easy to figure it out, although the explanation is buried in the depth of documentation.

We could change that, and implement some form of "reasonable defaults", that would automatically look for optimal parameters given the size of both the "reference data" and "content to compress". There's a bit of work involved, but that looks within reach. One important issue is that setting a large window size will push the same requirement onto the decoder, which can be an issue for large window sizes.

Finally, a last issue is that such compressed frame does not contain any reference to the "base data" that must be used as a dictionary for proper decompression to happen. Proper content regeneration is still checked with a checksum (enabled by default in the CLI), but that's about it.
The user must know that a "base reference" is necessary, and which one, from some "external channel", unspecified. One could could rely on some implicit knowledge from the local context, or implement an ad-hoc system to embed this information, of even consider using a "skippable frame" to transport it alongside the compressed content. Just know that, at this stage, this is completely outside of the scope of RFC specification.

@bimbashrestha
Copy link
Contributor

This is a really interesting use case @gima! Thanks for posting the detailed issue. I plan to scope this out a little further this coming week. We will likely introduce a new cli command --diff-from=#. More on this soon.

@bimbashrestha
Copy link
Contributor

Hi @vrubleg and @gima,

Just wanted to inform you both the patch-from was merged into dev today. Here is a brief summary and some benchmarks: #1959 (comment)

Unless some things come up, this is what will make it to 1.4.5. Would appreciate it if you could try it on your use cases and provide some feedback before the release comes around:)

Bimba

@terrelln
Copy link
Contributor

terrelln commented May 1, 2020

Bi-directional patching is tracked in #2063. Closing since feature is merged.

@terrelln terrelln closed this as completed May 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants