Adding --long support for --patch-from #1959
Tested this using a tar of 4 clang versions:
After
I presume that, in this scenario, the difference is only the last 10,000,000 bytes. We can't avoid spending a few bytes per block, at least to indicate one large match (zstd blocks are at most 128 KB, so that overhead stays small). Hence, there is ground for improvement. One first question: …
OK, …
Tried this with a more likely scenario. The input file is just the dictionary with 10% of its bytes perturbed by ±5, and the gains from --long are basically non-existent.

Generating the file for compression:

```python
import random

clangs = open("clangs", "rb").read(); clangs_str = clangs.decode('latin-1')
# randomly add an int in range(-5, 5) to roughly 10% of the bytes of the original clangs
clangs_mod_str = "".join(chr(ord(a)+random.randrange(-5,5)) if random.random() > 0.9 and ord(a)+5 < 256 and ord(a)-5 >= 0 else a for a in clangs_str)
clangs_mod = clangs_mod_str.encode('latin-1'); open("clangs_mod", "wb").write(clangs_mod)
```

With --long:

```
./zstd --single-thread -f --memory=300000000 --long --patch-from=clangs clangs_mod && ./zstd -f -d --memory=300000000 --patch-from=clangs clangs_mod.zst -o tmp && diff tmp clangs_mod
clangs_mod : 32.04% (288859136 => 92544650 bytes, clangs_mod.zst)
clangs_mod.zst : 288859136 bytes
```

Without --long:

```
./old --single-thread -f --memory=300000000 --patch-from=clangs clangs_mod && ./old -f -d --memory=300000000 --patch-from=clangs clangs_mod.zst -o tmp && diff tmp clangs_mod
clangs_mod : 32.07% (288859136 => 92651457 bytes, clangs_mod.zst)
clangs_mod.zst : 288859136 bytes
```
It's better with fewer perturbations but still not stellar. Here, the input file is just the dictionary with 1% of the bytes perturbed by +1 or -1.

Generation:

```python
import random

clangs = open("clangs", "rb").read(); clangs_str = clangs.decode('latin-1')
# randomly add 1 or -1 to roughly 1% of the bytes of the original clangs
clangs_mod_str = "".join(chr(ord(a)+random.choice([1, -1])) if random.random() < 0.01 and ord(a)+5 < 256 and ord(a)-5 >= 0 else a for a in clangs_str)
# check what percentage is different
print(sum(a != b for a, b in zip(clangs_str, clangs_mod_str))/len(clangs_str))
clangs_mod = clangs_mod_str.encode('latin-1'); open("clangs_mod", "wb").write(clangs_mod)
```

With --long:

Without --long:
These scenarios are actually less representative of the target use case. In no scenario does it make sense to randomly change one byte here and there, especially at such density (yes, even 1/100 is a lot). The expected scenario is that entire sections of the reference are preserved unmodified in the target. The target differs from the reference by having entire sections removed and/or entire sections added, in no particular order or position. Of course, a few random changes may also be present here and there, but sparsely (if they are dense, it's equivalent to a complete section change). With this definition, the very first scenario was actually closer to the goal.
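For illustration, a target of that shape could be sketched from the reference with something like the following (purely illustrative; file names and section sizes are arbitrary):

```
# Keep a large prefix of the reference unchanged, drop one whole section,
# then append a brand-new section at the end (sizes chosen arbitrarily).
head -c 100000000 clangs        >  clangs_mod   # first ~100 MB kept as-is
tail -c +150000001 clangs       >> clangs_mod   # skip a ~50 MB section of the reference
head -c 10000000 /dev/urandom   >> clangs_mod   # ~10 MB of entirely new data
```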
Hmm, I see. Maybe a better scenario would be tar(clang x, clang x+2, clang x+4, clang x+6) as the dictionary and tar(clang x+1, clang x+3, clang x+5, clang x+7) as the data? Edit: I guess high data-to-data redundancy would still be an issue here. I guess we could manufacture something with low long-distance data-data redundancy and high long-distance data-dict redundancy. Anyway, we can chat in person more when you're back.
Nope, still the same issue. A better scenario would be …
As seen in the commit above, I made a stupid mistake :( I was populating the LDM hash table with the src, not the dictionary. Just an FYI. The results still don't change much though (probably because the dictionary and src have been nearly identical in the tests above). I ran another test with subsequent versions of clang using the default parameters from the commit above.
I think the implementation is doing what we would like it to now. Can someone give it a look and check whether I'm missing something obvious?
For example, at level 1, the hash table contains only 16K positions, so it will saturate very quickly (likely in the ~100 KB range, in any case way before 128 MB). However, this is a bit more complex to set up, as it requires finding a rule that depends on other compression parameters. This could be investigated later, in a dedicated optimization pass.
This looks about correct, but it isn't working, so there must be something wrong. I think it is really close to working, and if you fix a few bugs it'll be there.
Looks like there were a few things missing. The main thing wrong was that I was computing the rolling hash starting at src when I should have been computing it starting at window.base. I also needed to increase the window log (to contain both the dictionary and the src) and update the window inside ldmState when loading the dictionary. Anyway, the results are much better now.
FYI, according to the zstd specification, it's enough for the current position in … This part of the specification is well respected by the "regular" match finder and the decoder. However, the LDM has not been combined with a dictionary so far, so it may not be aware of this capability and, as a consequence, may not be able to reference matches beyond …
We should be able to fix it by respecting …
Okay, I think this is ready for a review now. The functionality is still the same after the last few commits:
Is …
Without it, the current implementation doesn't compress exceptionally well. See below. It's still better than not using long mode, but hardly. I'll have to look into how long mode works together with multi-threading to uncover why this is. I was thinking that could be part of a different PR?
Okay, that's fine. All the additional machinery must be simplified. Special exception if we have good reasons to believe that some consequences shouldn't be transparently opted into, such as huge memory usage (and then, it should be correctly documented).
Yes, I believe --memory should be manually specified. Although, I think we should bump the default max to something larger than it is now. Maybe 1 GB? That would get rid of it for most use cases, right? Maybe we can get rid of having the end user specify --long by automatically using long mode by default on large files (>128 MB)? The user can then override this by setting --long=0 or something (although I don't think many will want to do this). Also, I could just automatically turn off multithreading on patch-from and notify the user, instead of requiring that they explicitly input --single-thread, until I get multithreading working? With those changes, the CLI should just be --patch-from=FILE for most use cases.
I think I would be okay having …
Multithreading support should be easy to add. You just need to load the dictionary into its ldmState in the same way you are doing it for the single-threaded one now. Until that diff has landed, we can fail if …
Alternatively, for files larger than 128 MB, we could print a warning saying to use …
I agree.
Yes, although part of the game will be to determine the right threshold to trigger it.
Maybe @bimbashrestha will need some directions on this topic. You seem to have a pretty clear idea of what to do.
Okay then. I made --patch-from=file imply --memory=filesize (that removes --memory), and I put 32 MB as the threshold for automatically activating long mode in patch-from (that removes --long). We can experiment with the value of the threshold later.
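With those defaults in place, the invocations used earlier in this thread should reduce to something like this sketch (same file names as in the earlier tests; no explicit --memory or --long needed):

```
./zstd -f --patch-from=clangs clangs_mod -o clangs_mod.zst
./zstd -f -d --patch-from=clangs clangs_mod.zst -o tmp && diff tmp clangs_mod
```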
Didn't realize that was all we needed to do. I've just gone ahead and included the multi-threaded changes in this PR. The ratio in multi-threaded mode is a little worse than in single-threaded mode though. Is this unusual?
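For reference, the comparison here is roughly the following (same file names as in the earlier tests; the thread count is arbitrary):

```
./zstd --single-thread -f --patch-from=clangs clangs_mod -o clangs_mod.st.zst
./zstd -T4 -f --patch-from=clangs clangs_mod -o clangs_mod.mt.zst
ls -l clangs_mod.st.zst clangs_mod.mt.zst   # compare the resulting patch sizes
```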
This is probably expected, but I can help you investigate this tomorrow for learning and to double check. Basically we'll want to look at the sequences the single threaded encoder and the multithreaded encoder generate, and compare them. |
Awesome!
Question: …
I'll put up the blurb shortly :) Just need to also finish making a nice graph or two.
Hey guys, hey @bimbashrestha, here's the requested look at the … While I'm actually more interested in patching backward (reasoning: [0]), this test is done forwards (which shouldn't matter much anyway). My files are:

I'm running a zstd build from the git version 5b0a452, compiled with …

1a)
1b)
1c) 41640286 bytes -> delta saves 41470784 bytes (99.59%)

2a)
2b)
2c) 25350493 bytes -> delta saves 324882 bytes (1.28%)

3a)
3b)
3c) 803820673 bytes -> delta saves 82715 bytes (0.01%)

4a)
4b)
4c) 807843840 bytes -> delta saves 3323173 bytes (0.41%)

5a)
5b)
5c) 72891396 bytes -> delta saves 686705 bytes (0.94%)

6a)
6b)
6c) 156160084 bytes -> delta saves 100054148 bytes (64.07%)

7a)
7b)
7c) 277690047 bytes -> delta saves 134677757 bytes (48.50%)

8a)
8b)
8c) 39571286 bytes -> delta saves 37938082 bytes (95.87%)

Observations

OpenStreetMap data is sorted by ID, so new IDs appear at the end; changes don't move the data from its position, they just change its size, while deletes simply remove the data. I suspect that the XML from Wikipedia is also sorted by key, but haven't confirmed that. It contains a lot of large text chunks of static data. go-pie either holds a lot of static files which don't change much, or the compilation creates very stable binaries. The Linux kernel is hard to compress and extremely hard to diff - the latter might be caused by an unstable compilation process? qcow2 images are particularly good to diff when there are low amounts of changes (LTS distributions). The binary protocol buffer format (and probably also the export process) makes it nearly impossible to find unchanged data, so the diff barely makes an impact.

Conclusion

It looks like the worst case is the same size as a regular compressed full update. If there's enough memory …

Btw: any chance to lift the uint32 byte limit for the file size?

[0] #2063 (comment)
Thanks for the update @RubenKelevra! Just so we're on the same page here, I'm assuming …
Nope, it's all done with the same version (the quoted commit). If you'd like a comparison between this version and a different one, I can do that as well. Which commit should I choose for the comparison? :)
That's unexpected. Shouldn't …
Hmm, yeah, like @Cyan4973 mentioned, long mode should be automatically activated when the input is large enough. Is …
No, I wasn't doing it that fancy, but you can think of it like this. ;) I just didn't want to clutter my post with all those filenames. :) The …
@RubenKelevra @Cyan4973 Ah, I see the issue. It's this line here:
Line 810 in 38a6d2a
That should be … Glad we caught that. I'll put up a PR and a test fixing it shortly.
Patch From
Zstandard is introducing a new command line option, --patch-from=, which leverages our existing compressors, dictionaries, and the long range match finder to deliver a high-speed engine for producing and applying patches to files.
Patch from increases the previous maximum limit for dictionaries from 32 MB to 2 GB. Additionally, it maintains fast speeds on lower compression levels without compromising patch size by using the long range match finder (now extended to find dictionary matches). By default, Zstandard uses a heuristic based on file size and internal compression parameters to determine when to activate long mode, but it can also be specified manually as before.
Patch from also works in multi-threading mode, at a minimal compression ratio loss vs single-threaded mode.
Example usage:
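For instance, a minimal sketch based on the commands used earlier in this thread (file names are placeholders; forcing long mode manually is optional, and the window log shown is illustrative):

```
# create a patch that turns old_version into new_version
zstd --patch-from=old_version new_version -o patch.zst

# apply the patch
zstd -d --patch-from=old_version patch.zst -o new_version_restored

# long mode can still be specified manually for very large files
zstd --patch-from=old_version --long=31 new_version -o patch.zst
zstd -d --patch-from=old_version --long=31 patch.zst -o new_version_restored
```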
Benchmarks:
We compared zstd to bsdiff, a popular industry-grade diff engine. Our testing data were tarballs of different versions of source code from popular GitHub repositories. Specifically:
Patch from at level 19 (with chainLog=30 and targetLength=4 KB) remains competitive with bsdiff when comparing patch sizes.
And patch from greatly outperforms bsdiff in speed, even on its slowest setting of level 19, boasting an average speedup of ~7X. Patch from is >200X faster at level 1 and >100X faster (shown below) at level 3 vs bsdiff, while still delivering patch sizes less than 0.5% of the original file size.
And of course, there is no change to the fast zstd decompression speed.