Adding --long support for --patch-from #1959
Tested this using a tar of 4 clang versions:
After
I presume that, in this scenario, the difference is only the last 10,000,000 bytes. We can't avoid spending a few bytes per block, at least to indicate one large match (zstd blocks are at most 128 KB, so that overhead stays small). Hence, there is ground for improvement. One first question: …
OK, …
Tried this with a more likely scenario. The input file is just the dictionary with 10% of its bytes perturbed by ±5, and the gains from --long are basically non-existent.

Generating the file for compression:

```python
import random

clangs = open("clangs", "rb").read(); clangs_str = clangs.decode('latin-1')
# randomly add an int in range(-5, 5) to roughly 10% of the bytes of the original clangs
clangs_mod_str = "".join(chr(ord(a)+random.randrange(-5,5)) if random.random() > 0.9 and ord(a)+5 < 256 and ord(a)-5 >= 0 else a for a in clangs_str)
clangs_mod = clangs_mod_str.encode('latin-1'); open("clangs_mod", "wb").write(clangs_mod)
```

With --long:

```
./zstd --single-thread -f --memory=300000000 --long --patch-from=clangs clangs_mod && ./zstd -f -d --memory=300000000 --patch-from=clangs clangs_mod.zst -o tmp && diff tmp clangs_mod
clangs_mod : 32.04% (288859136 => 92544650 bytes, clangs_mod.zst)
clangs_mod.zst : 288859136 bytes
```

Without --long:

```
./old --single-thread -f --memory=300000000 --patch-from=clangs clangs_mod && ./old -f -d --memory=300000000 --patch-from=clangs clangs_mod.zst -o tmp && diff tmp clangs_mod
clangs_mod : 32.07% (288859136 => 92651457 bytes, clangs_mod.zst)
clangs_mod.zst : 288859136 bytes
```
It's better with fewer perturbations but still not stellar. Here, the input file is just the dictionary with 1% of the bytes perturbed by +1 or -1.

Generation:

```python
import random

clangs = open("clangs", "rb").read(); clangs_str = clangs.decode('latin-1')
# randomly add 1 or -1 to roughly 1% of the bytes of the original clangs
clangs_mod_str = "".join(chr(ord(a)+random.choice([1, -1])) if random.random() < 0.01 and ord(a)+5 < 256 and ord(a)-5 >= 0 else a for a in clangs_str)
# check what percentage is different
print(sum(a != b for a, b in zip(clangs_str, clangs_mod_str))/len(clangs_str))
clangs_mod = clangs_mod_str.encode('latin-1'); open("clangs_mod", "wb").write(clangs_mod)
```

With --long:

Without --long:
These scenarios are actually less representative of the target use case. In no scenario does it make sense to randomly change one byte here and there, especially at such density (yes, even 1/100 is a lot). The expected scenario is that entire sections of the reference are preserved unmodified in the target. The target differs from the reference by having entire sections removed and/or entire sections added, in no particular order or position. Of course, a few random changes may also be present here and there, but sparsely (if they are dense, it's equivalent to a complete section change). With this definition, the very first scenario was actually closer to the goal.
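For illustration, a target of that shape could be sketched from the reference with something like the following (purely illustrative; file names and section sizes are arbitrary):

```
# Keep a large prefix of the reference unchanged, drop one whole section,
# then append a brand-new section at the end (sizes chosen arbitrarily).
head -c 100000000 clangs        >  clangs_mod   # first ~100 MB kept as-is
tail -c +150000001 clangs       >> clangs_mod   # skip a ~50 MB section of the reference
head -c 10000000 /dev/urandom   >> clangs_mod   # ~10 MB of entirely new data
```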
Hmm, I see. Maybe a better scenario would be tar(clang x, clang x+2, clang x+4, clang x+6) as the dictionary and tar(clang x+1, clang x+3, clang x+5, clang x+7) as the data? Edit: I guess high data-to-data redundancy would still be an issue here. I guess we could manufacture something with low long-distance data-data redundancy and high long-distance data-dict redundancy. Anyway, we can chat in person more when you're back.
Nope, still the same issue. A better scenario would be …
As seen in the commit above, I made a stupid mistake :( I was populating the LDM hash table with the src, not the dictionary. Just an FYI. The results still don't change much though (probably because the dictionary and src have been nearly identical in the tests above). I ran another test with subsequent versions of clang using the default parameters from the commit above.
I think the implementation is doing what we would like it to now. Can someone give it a look and check whether I'm missing something obvious?
For example, at level 1, the hash table contains only 16K positions, so it will saturate very quickly (likely in the ~100 KB range, in any case way before 128 MB). However, this is a bit more complex to set up, as it requires finding a rule that depends on other compression parameters. This could be investigated later, in a dedicated optimization pass.
This looks about correct, but it isn't working, so there must be something wrong. I think it is really close to working, and if you fix a few bugs it'll be there.
Looks like there were a few things missing. The main thing wrong was that I was computing the rolling hash starting at src when I should have been computing it starting at window.base. I also needed to increase the window log (to contain both the dictionary and the src) and update the window inside ldmState when loading the dictionary. Anyway, the results are much better now.
FYI, according to the zstd specification, it's enough for the current position in … This part of the specification is well respected by the "regular" match finder and the decoder. However, the LDM has not been combined with a dictionary so far, so it may not be aware of this capability and, as a consequence, may not be able to reference matches beyond …
We should be able to fix it by respecting …
Okay, I think this is ready for a review now. The functionality is still the same after the last few commits:
Is …
Without it, the current implementation doesn't compress exceptionally well. See below. It's still better than not using long mode, but hardly. I'll have to look into how long mode works together with multi-threading to uncover why this is. I was thinking that could be part of a different PR?
Okay, that's fine. All the additional machinery must be simplified. Special exception if we have good reasons to believe that some consequences shouldn't be transparently opted into, such as huge memory usage (and then, it should be correctly documented).
Yes, I believe --memory should be manually specified. Although, I think we should bump the default max to something larger than it is now. Maybe 1 GB? That would get rid of it for most use cases, right? Maybe we can get rid of having the end user specify --long by automatically using long mode by default on large files (>128 MB)? The user can then override this by setting --long=0 or something (although I don't think many will want to do this). Also, I could just automatically turn off multithreading on patch-from and notify the user, instead of requiring that they explicitly input --single-thread, until I get multithreading working? With those changes, the CLI should just be --patch-from=FILE for most use cases.
I think I would be okay having …
Multithreading support should be easy to add. You just need to load the dictionary into its ldmState in the same way you are doing it for the single-threaded one now. Until that diff has landed, we can fail if …
Alternatively, for files larger than 128 MB, we could print a warning saying to use …
I agree.
Yes, although part of the game will be to determine the right threshold to trigger it.
Maybe @bimbashrestha will need some directions on this topic. You seem to have a pretty clear idea of what to do.
Okay then. I made --patch-from=file imply --memory=filesize (that removes --memory), and I put 32 MB as the threshold for automatically activating long mode in patch-from (that removes --long). We can experiment with the value of the threshold later.
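With those defaults in place, the invocations used earlier in this thread should reduce to something like this sketch (same file names as in the earlier tests; no explicit --memory or --long needed):

```
./zstd -f --patch-from=clangs clangs_mod -o clangs_mod.zst
./zstd -f -d --patch-from=clangs clangs_mod.zst -o tmp && diff tmp clangs_mod
```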
Didn't realize that was all we needed to do. I've just gone ahead and included the multi-threaded changes in this PR. The ratio in multi-threaded mode is a little worse than in single-threaded mode though. Is this unusual?
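For reference, the comparison here is roughly the following (same file names as in the earlier tests; the thread count is arbitrary):

```
./zstd --single-thread -f --patch-from=clangs clangs_mod -o clangs_mod.st.zst
./zstd -T4 -f --patch-from=clangs clangs_mod -o clangs_mod.mt.zst
ls -l clangs_mod.st.zst clangs_mod.mt.zst   # compare the resulting patch sizes
```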
This is probably expected, but I can help you investigate this tomorrow for learning and to double check. Basically we'll want to look at the sequences the single threaded encoder and the multithreaded encoder generate, and compare them. |
Awesome!
Question: …
I'll put up the blurb shortly :) Just need to also finish making a nice graph or two.
Hey guys, hey @bimbashrestha, here's the requested look at the … While I'm actually more interested in patching backward (reasoning: [0]), this test is done forwards (which shouldn't matter much anyway). My files are:

I'm running a zstd build from the git version 5b0a452, compiled with …

1a)
1b)
1c) 41640286 bytes -> delta saves 41470784 bytes (99.59%)

2a)
2b)
2c) 25350493 bytes -> delta saves 324882 bytes (1.28%)

3a)
3b)
3c) 803820673 bytes -> delta saves 82715 bytes (0.01%)

4a)
4b)
4c) 807843840 bytes -> delta saves 3323173 bytes (0.41%)

5a)
5b)
5c) 72891396 bytes -> delta saves 686705 bytes (0.94%)

6a)
6b)
6c) 156160084 bytes -> delta saves 100054148 bytes (64.07%)

7a)
7b)
7c) 277690047 bytes -> delta saves 134677757 bytes (48.50%)

8a)
8b)
8c) 39571286 bytes -> delta saves 37938082 bytes (95.87%)

Observations

OpenStreetMap data is sorted by ID, so new IDs appear at the end; changes don't move the data from its position, they just change its size, while deletes simply remove the data. I suspect that the XML from Wikipedia is also sorted by key, but haven't confirmed that. It contains a lot of large text chunks of static data. go-pie either holds a lot of static files which don't change much, or the compilation creates very stable binaries. The Linux kernel is hard to compress and extremely hard to diff - the latter might be caused by an unstable compilation process? qcow2 images are particularly good to diff when there are low amounts of changes (LTS distributions). The binary protocol buffer format (and probably also the export process) makes it nearly impossible to find unchanged data, so the diff barely makes an impact.

Conclusion

It looks like the worst case is the same size as a regular compressed full update. If there's enough memory …

Btw: any chance to lift the uint32 byte limit for the file size?

[0] #2063 (comment)
Thanks for the update @RubenKelevra! Just so we're on the same page here, I'm assuming …
Nope, it's all done with the same version (the quoted commit). If you'd like a comparison between this version and a different one, I can do that as well. Which commit should I choose for the comparison? :)
That's unexpected. Shouldn't …
Hmm, yeah, like @Cyan4973 mentioned, long mode should be automatically activated when the input is large enough. Is …
No, I wasn't doing it that fancy, but you can think of it like this. ;) I just didn't want to clutter my post with all those filenames. :) The …
@RubenKelevra @Cyan4973 Ah, I see the issue. It's this line here:
Line 810 in 38a6d2a
That should be … Glad we caught that. I'll put up a PR and a test fixing it shortly.
Patch From
Zstandard is introducing a new command line option, --patch-from=, which leverages our existing compressors, dictionaries, and the long range match finder to deliver a high-speed engine for producing and applying patches to files.
Patch from increases the previous maximum limit for dictionaries from 32 MB to 2 GB. Additionally, it maintains fast speeds on lower compression levels without compromising patch size by using the long range match finder (now extended to find dictionary matches). By default, Zstandard uses a heuristic based on file size and internal compression parameters to determine when to activate long mode, but it can also be specified manually as before.
Patch from also works in multi-threading mode, at a minimal compression ratio loss vs single-threaded mode.
Example usage:
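For instance, a minimal sketch based on the commands used earlier in this thread (file names are placeholders; forcing long mode manually is optional, and the window log shown is illustrative):

```
# create a patch that turns old_version into new_version
zstd --patch-from=old_version new_version -o patch.zst

# apply the patch
zstd -d --patch-from=old_version patch.zst -o new_version_restored

# long mode can still be specified manually for very large files
zstd --patch-from=old_version --long=31 new_version -o patch.zst
zstd -d --patch-from=old_version --long=31 patch.zst -o new_version_restored
```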
Benchmarks:
We compared zstd to bsdiff, a popular industry-grade diff engine. Our testing data were tarballs of different versions of source code from popular GitHub repositories. Specifically:
Patch from at level 19 (with chainLog=30 and targetLength=4 KB) remains competitive with bsdiff when comparing patch sizes.
And patch from greatly outperforms bsdiff in speed, even on its slowest setting of level 19, boasting an average speedup of ~7X. Patch from is >200X faster at level 1 and >100X faster (shown below) at level 3 vs bsdiff, while still delivering patch sizes less than 0.5% of the original file size.
And of course, there is no change to the fast zstd decompression speed.