
Adding --long support for --patch-from #1959

Merged
34 commits merged into facebook:dev on Apr 17, 2020

Conversation

bimbashrestha
Contributor

@bimbashrestha bimbashrestha commented Jan 18, 2020

Patch From
Zstandard is introducing a new command-line option, --patch-from=, which leverages our existing compressors, dictionaries, and the long range match finder to deliver a high-speed engine for producing and applying patches to files.

Patch from raises the previous maximum dictionary size from 32 MB to 2 GB. Additionally, it maintains fast speeds on lower compression levels without compromising patch size by using the long range match finder (now extended to find dictionary matches). By default, Zstandard uses a heuristic based on file size and internal compression parameters to determine when to activate long mode, but it can also be specified manually as before.
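
For a sense of the shape of that heuristic, here is a hedged sketch (illustrative only; the flag name is hypothetical, though the condition mirrors the one discussed at the end of this thread):

/* Activate long mode once the window needed to cover dictionary + input
 * outgrows what the regular match finder's table can usefully track. */
if (fileWindowLog > ZSTD_cycleLog(cParams.hashLog, cParams.strategy))
    useLongMode = 1;   /* hypothetical flag */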

Patch from also works in multi-threading mode, at a minimal compression-ratio loss versus single-threaded mode.

Example usage:

# create the patch
zstd --patch-from=<oldfile> <newfile> -o <patchfile>

# apply the patch
zstd -d --patch-from=<oldfile> <patchfile> -o <newfile>

Benchmarks:
We compared zstd to bsdiff, a popular industry-grade diff engine. Our testing data were tarballs of different versions of source code from popular GitHub repositories. Specifically:

repos = {
    # ~31mb (small file)
    "zstd": {"url": "https://github.com/facebook/zstd", "dict-branch": "refs/tags/v1.4.2", "src-branch": "refs/tags/v1.4.3"},
    # ~273mb (medium file)
    "wordpress": {"url": "https://github.com/WordPress/WordPress", "dict-branch": "refs/tags/5.3.1", "src-branch": "refs/tags/5.3.2"},
    # ~1.66gb (large file)
    "llvm": {"url": "https://github.com/llvm/llvm-project", "dict-branch": "refs/tags/llvmorg-9.0.0", "src-branch": "refs/tags/llvmorg-9.0.1"}
}

[chart: patch sizes, zstd --patch-from vs bsdiff]
Patch from at level 19 (with chainLog=30 and targetLength=4 KB) remains competitive with bsdiff when comparing patch sizes.
[chart: patch creation speed, zstd --patch-from vs bsdiff]
And patch from greatly outperforms bsdiff in speed, even on its slowest setting of level 19, boasting an average speedup of ~7X. Patch from is >200X faster on level 1 and >100X faster (shown below) on level 3 vs bsdiff, while still delivering patch sizes less than 0.5% of the original file size.
[chart: decompression speed]

And of course, there is no change to the fast zstd decompression speed.

@bimbashrestha
Contributor Author

Tested this using a tar of 4 clang versions:
Before

head -c 210000000 clangs > dict && head -c 200000000 clangs > data && ./zstd -f --memory=210000000 --single-thread --patch-from=dict data
data                 : 16.14%   (200000000 => 32275377 bytes, data.zst)

After

head -c 210000000 clangs > dict && head -c 200000000 clangs > data && ./zstd --long -f --memory=210000000 --single-thread --patch-from=dict data
data                 :  7.77%   (200000000 => 15542116 bytes, data.zst)

@Cyan4973
Contributor

Cyan4973 commented Jan 18, 2020

I presume that, in this scenario, the difference is only the last 10,000,000 bytes.
And since data is actually smaller than dict, it should consist entirely of references into dict.

We can't avoid spending a few bytes per block, at least to indicate one large match, but such a budget should be < 30 bytes per block. At zstd's 128 KB maximum block size, 200,000,000 bytes is roughly 1,500 blocks, so on the order of 45 KB of overhead at most. That's still much less than the resulting 15 MB in the test.

Hence, there is room for improvement.

A first question:
--long, by default, sets the windowLog to 27 (128 MB).
Since the size of the data to compress is 200,000,000 bytes, it requires a larger windowLog, of 28.
I presume the --patch-from=FILE command does this adaptation automatically?

@bimbashrestha
Contributor Author

I presume the --patch-from=FILE command does this adaptation automatically?

Yes, see: https://github.com/bimbashrestha/zstd/blob/a2917fe814a3c9f41b1723de8a26faaff28bfd7f/programs/fileio.c#L821
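
For reference, the adaptation amounts to choosing the smallest windowLog whose window covers the larger of the dictionary and the source. A minimal sketch (illustrative, not the exact fileio.c code; bounds follow ZSTD_WINDOWLOG_MIN/MAX):

unsigned adaptWindowLog(unsigned long long dictSize, unsigned long long srcSize)
{
    unsigned long long const larger = (dictSize > srcSize) ? dictSize : srcSize;
    unsigned windowLog = 10;   /* ZSTD_WINDOWLOG_MIN */
    while (windowLog < 31      /* ZSTD_WINDOWLOG_MAX on 64-bit */
           && ((unsigned long long)1 << windowLog) < larger)
        windowLog++;
    return windowLog;          /* e.g. 28 for 200,000,000 bytes */
}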

@Cyan4973
Contributor

OK, let's spend some time on this scenario next week.
I think it's a simple one: it's relatively simple to define an "optimal" behavior for it, and we should aim at getting it right, even if that demands new adaptations to work properly.

@bimbashrestha
Contributor Author

Tried this with a more likely scenario. The input file is just the dictionary with 10% of its bytes perturbed by ±5, and the gains from --long are basically non-existent.

Generating the file for compression:

import random
clangs = open("clangs", "rb").read(); clangs_str = clangs.decode('latin-1')

# randomly add an int in range (-5, 5) inclusive to the original clangs on 10% of the bytes
clangs_mod_str = "".join(chr(ord(a)+random.randrange(-5,6)) if random.random() > 0.9 and ord(a)+5 < 256 and ord(a)-5 >= 0 else a for a in clangs_str)

clangs_mod = clangs_mod_str.encode('latin-1'); open("clangs_mod", "wb").write(clangs_mod)

With --long

./zstd --single-thread -f --memory=300000000 --long --patch-from=clangs clangs_mod && ./zstd -f -d --memory=300000000 --patch-from=clangs clangs_mod.zst -o tmp && diff tmp clangs_mod
clangs_mod           : 32.04%   (288859136 => 92544650 bytes, clangs_mod.zst)
clangs_mod.zst       : 288859136 bytes

Without --long

./old --single-thread -f --memory=300000000 --patch-from=clangs clangs_mod && ./old -f -d --memory=300000000 --patch-from=clangs clangs_mod.zst -o tmp && diff tmp clangs_mod
clangs_mod           : 32.07%   (288859136 => 92651457 bytes, clangs_mod.zst)
clangs_mod.zst       : 288859136 bytes

@bimbashrestha
Contributor Author

It's better with fewer perturbations but still not stellar. Here, the input file is just the dictionary with 1% of the bytes perturbed by 1 or -1.

Generation:

import random
clangs = open("clangs", "rb").read(); clangs_str = clangs.decode('latin-1')

# randomly add 1 or -1 to the original clangs on 1% of the bytes
clangs_mod_str = "".join(chr(ord(a)+random.choice([1, -1])) if random.random() < 0.01 and ord(a)+5 < 256 and ord(a)-5 >= 0 else a for a in clangs_str)

# check what percentage is different 
print(sum(a != b for a, b in zip(clangs_str, clangs_mod_str))/len(clangs_str))

clangs_mod = clangs_mod_str.encode('latin-1'); open("clangs_mod", "wb").write(clangs_mod)

With --long

./zstd --single-thread -f --memory=300000000 --long --patch-from=clangs clangs_mod && ./zstd -f -d --memory=300000000 --patch-from=clangs clangs_mod.zst -o tmp && diff tmp clangs_mod
clangs_mod           : 9.26%   (288859136 => 92544650 bytes, clangs_mod.zst)
clangs_mod.zst       : 288859136 bytes

Without --long

./old --single-thread -f --memory=300000000 --patch-from=clangs clangs_mod && ./old -f -d --memory=300000000 --patch-from=clangs clangs_mod.zst -o tmp && diff tmp clangs_mod
clangs_mod           : 17.95%  (288859136 => 92651457 bytes, clangs_mod.zst)
clangs_mod.zst       : 288859136 bytes

@Cyan4973
Contributor

Cyan4973 commented Jan 22, 2020

These scenarios are actually less representative of the target use case.

In no scenario does it make sense to randomly change one byte here and there, especially at such density (yes, even 1/100 is a lot).

The expected scenario is that entire sections of the reference will be preserved unmodified in the target. The target differs from the reference by having entire sections removed, and/or entire sections added, in no particular order or position. Of course, a few random changes may also be present here and there, but sparsely (if they are dense, it's equivalent to a complete section change).

With this definition, the very first scenario was actually closer to the goal.
But there is a twist: that particular scenario was built from an aggregation of 4 similar tarballs, meaning internal repetitions are already very high. We'll see together that this is an issue.

@bimbashrestha
Contributor Author

bimbashrestha commented Jan 22, 2020

Hmm, I see. Maybe a better scenario would be tar(clang x, clang x+2, clang x+4, clang x+6) as dict and tar(clang x+1, clang x+3, clang x+5, clang x+7) as data?

Edit: I guess high data-to-data redundancy would still be an issue here. We could manufacture something with low long-distance data-data redundancy and high long-distance data-dict redundancy.

Anyway, we can chat in person more when you're back.

@Cyan4973
Contributor

Nope, still the same issue. A better scenario would be clang x as the reference, and clang x+1 as the target.

@bimbashrestha
Contributor Author

As seen in the commit above, I made a stupid mistake :( I was populating the LDM hash table with the src, not the dictionary. Just an FYI. The results still don't change much though (probably because the dictionary and src were nearly identical in the tests above). I ran another test with consecutive versions of clang, using the default parameters from the commit above.

tar cvf data ~/Downloads/clangs/llvm-9.0.1.src
tar cvf dict ~/Downloads/clangs/llvm-9.0.0.src

# with --long
./zstd --single-thread -f --memory=500000000 --long --patch-from=dict data
data                 : 11.72%   (425910272 => 49898506 bytes, data.zst)

# without --long
./old --single-thread -f --memory=500000000 --patch-from=dict data
data                 : 11.72%   (425910272 => 49905559 bytes, data.zst)

# without dictionary 
./zstd -f --long data
data                 : 11.31%   (425910272 => 48162576 bytes, data.zst)

I think the implementation is doing what we would like it to now. Can someone give it a look and check to see if I'm missing something obvious?

@Cyan4973
Contributor

Forces --long when using --patch-from on dictionaries larger than 128 MB

--long will likely be needed way before that point.
But it's more complex, as the threshold depends on the compression parameters.

For example, at level 1, the hash table contains only 16K positions, so it will saturate very quickly (likely in the ~100 KB range; in any case, way before 128 MB). --long will be able to compensate beyond the saturation point.

However, this is a bit more complex to set up, as it now requires finding a rule that depends on the other compression parameters. This could be investigated later, in a dedicated optimization pass.
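
As a toy illustration of that saturation effect (not zstd's actual tables): a table with 2^14 = 16384 slots stores at most one position per slot, so once far more than 16K positions have been inserted, entries pointing into the early dictionary are almost certainly evicted.

#define TOY_HASH_LOG 14
static unsigned toyTable[1u << TOY_HASH_LOG];   /* one position per slot */

static void toyInsert(unsigned hash, unsigned position)
{
    /* inserting silently evicts whatever position was there before */
    toyTable[hash & ((1u << TOY_HASH_LOG) - 1)] = position;
}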

@terrelln
Contributor

This looks about correct, but it isn't working, so there must be something wrong. I think it is really close to working, and if you fix a few bugs it'll be there.

@bimbashrestha
Contributor Author

Looks like there were a few things missing. The main thing wrong was that I was computing the rolling hash starting at src when I should have been computing it starting at window.base. I also needed to increase the window log (to contain both dictionary and src) and update the window inside ldmState when loading the dictionary.
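
A hedged sketch of the fix, with a toy Rabin-Karp rolling hash standing in for zstd's real one (ldmTable and the constants are illustrative): the crucial point is that seeding starts at the window base, i.e. the first dictionary byte, so dictionary positions actually land in the table.

#include <stddef.h>

#define LDM_LOG 20
static unsigned ldmTable[1u << LDM_LOG];   /* hash -> position in window */

static void seedLdm(const unsigned char* windowBase, const unsigned char* end)
{
    enum { W = 64 };                    /* rolling window length */
    unsigned long long const B = 131;   /* hash base; modulus is 2^64 */
    unsigned long long hash = 0, pow = 1;
    const unsigned char* ip = windowBase;   /* NOT src: start at the dictionary */
    size_t i;
    if ((size_t)(end - windowBase) < W) return;
    for (i = 0; i < W; i++) hash = hash * B + ip[i];
    for (i = 0; i < W - 1; i++) pow *= B;   /* B^(W-1) */
    for (; ip + W <= end; ip++) {
        ldmTable[hash & ((1u << LDM_LOG) - 1)] = (unsigned)(ip - windowBase);
        if (ip + W < end)
            hash = (hash - ip[0] * pow) * B + ip[W];   /* roll one byte forward */
    }
}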

Anyway, the results are much better now.

tar cf llvm-9.0.1 ~/Downloads/clangs/llvm-9.0.1.src
tar cf llvm-9.0.0 ~/Downloads/clangs/llvm-9.0.0.src

./zstd --single-thread --memory=500000000 --long --patch-from=llvm-9.0.0 llvm-9.0.1 \
  && ./zstd -d --memory=500000000 --patch-from=llvm-9.0.0 llvm-9.0.1.zst -o tmp \
  && diff -s tmp llvm-9.0.1

llvm-9.0.1           :  0.48%   (425910272 => 2062253 bytes, llvm-9.0.1.zst)
llvm-9.0.1.zst       : 425910272 bytes
Files tmp and llvm-9.0.1 are identical

@Cyan4973
Contributor

Cyan4973 commented Feb 3, 2020

I also needed to increase the window log (to contain both dictionary and src)

FYI, according to the zstd specification, it's enough for the current position in src to be < windowSize in order to access the entire dictionary.
In a silly case, you could have srcSize == 16 KB, therefore windowLog == 14, and it would still give access to the entire dictionary, even if it's 128 MB long, and therefore could generate offsets way larger than windowSize.

This part of the specification is well respected by the "regular" match finder and the decoder.

However, the LDM has not been combined with a dictionary so far, therefore it may not be aware of this capability and, as a consequence, may not be able to reference matches beyond windowSize. In that case, a workaround is indeed necessary.
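
A hedged sketch of that rule (illustrative, not the decoder's actual code):

/* While the current position in src is still < windowSize, an offset may
 * reach back across the entire dictionary; past that point, reach is
 * capped at windowSize. */
int offsetIsValid(unsigned long long offset,
                  unsigned long long pos,        /* bytes already produced */
                  unsigned long long windowSize,
                  unsigned long long dictSize)
{
    unsigned long long const reach =
        (pos < windowSize) ? pos + dictSize : windowSize;
    return (offset > 0) && (offset <= reach);
}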

@terrelln
Contributor

terrelln commented Feb 3, 2020

However, the LDM has not been combined with a dictionary so far, therefore it may not be aware of this capability and, as a consequence, may not be able to reference matches beyond windowSize. In that case, a workaround is indeed necessary.

We should be able to fix it by respecting loadedDictEnd in the LDM match finder. We shouldn't increase the window log to work around this issue, since that would increase the decompression memory usage.

@bimbashrestha
Contributor Author

TODO: make dictionary LDM respect loadedDictEnd like @terrelln mentioned above.

Also linking the original thread here because I forgot to before: #1935

@bimbashrestha
Contributor Author

Okay, I think this is ready for a review now. The functionality is still the same after the last few commits:

./zstd --single-thread --memory=500000000 --long -f --patch-from=llvm-9.0.0 llvm-9.0.1 \
  && ./zstd -d --memory=500000000 -f --patch-from=llvm-9.0.0 llvm-9.0.1.zst -o tmp \
  && diff -s tmp llvm-9.0.1

llvm-9.0.1           :  0.48%   (425910272 => 2062279 bytes, llvm-9.0.1.zst)
llvm-9.0.1.zst      : 425910272 bytes
Files tmp and llvm-9.0.1 are identical

@Cyan4973
Contributor

Cyan4973 commented Feb 4, 2020

Is --single-thread still required?

@bimbashrestha
Contributor Author

Is --single-thread still required?

Without it, the current implementation doesn't compress exceptionally well. See below. It's still better than not using long mode, but hardly. I'll have to look into how long mode works together with multi-threading to uncover why this is. I was thinking that could be part of a different PR?

# without single-thread
./zstd --memory=500000000 -f --long --patch-from=llvm-9.0.0 llvm-9.0.1 \
  && ./zstd -d --memory=500000000 -f --patch-from=llvm-9.0.0 llvm-9.0.1.zst -o tmp \
  && diff -s tmp llvm-9.0.1

llvm-9.0.1           : 11.30%   (425910272 => 48130136 bytes, llvm-9.0.1.zst)
llvm-9.0.1.zst      : 425910272 bytes
Files tmp and llvm-9.0.1 are identical

# --single-thread without --long
./zstd --memory=500000000 -f --single-thread --patch-from=llvm-9.0.0 llvm-9.0.1 \
  && ./zstd -d --memory=500000000 -f --patch-from=llvm-9.0.0 llvm-9.0.1.zst -o tmp \
  && diff -s tmp llvm-9.0.1

llvm-9.0.1           : 11.72%   (425910272 => 49905685 bytes, llvm-9.0.1.zst)
llvm-9.0.1.zst      : 425910272 bytes
Files tmp and llvm-9.0.1 are identical

@Cyan4973
Contributor

Cyan4973 commented Feb 4, 2020

I was thinking that could be part of a different pr?

Okay, that's fine.
Just keep in mind: at the end of this series of patches, the resulting command must be "simple" to understand and invoke from an end-user perspective.
--patch-from= is.
--memory=# --long --single-thread --patch-from= isn't.

All the additional machinery must be simplified.

Special exception if we have good reasons to believe that some consequences shouldn't be transparently opted into, such as huge memory usage (and then, it should be correctly documented).

@bimbashrestha
Contributor Author

Just keep in mind: at the end of this series of patches, the resulting command must be "simple" to understand and invoke from an end-user perspective.

Yes, I believe --memory should be manually specified. Although, I think we should bump the default max to something larger than it is now. Maybe 1 GB? That would get rid of it for most use cases, right?

Maybe we can get rid of having the end user specify --long by automatically using long mode by default on large files (>128 MB)? The user could then override this by setting --long=0 or something (although I don't think many will want to do this).

Also, I could just automatically turn off multi-threading in patch-from and notify the user, instead of requiring that they explicitly pass --single-thread, until I get multi-threading working?

With those changes, the CLI should just be --patch-from=FILE for most use cases.

@terrelln
Contributor

terrelln commented Feb 5, 2020

Yes, I believe --memory should be manually specified. Although, I think we should bump the default max to something larger than it is now. Maybe 1 GB? That would get rid of it for most use cases, right?

I think I would be okay having --patch-from=X imply --memory=FileSize and --long when the file size is known and large. I do not want to increase the default --memory: that controls the maximum memory usage of the decoder in the standard case, and 128 MB is already quite big. When users are using --patch-from=X, they are getting what they pay for if we increase the memory usage, and it only affects them. Though we should note it in the help/man page.

Also, I could just automatically turn off multi-threading in patch-from and notify the user, instead of requiring that they explicitly pass --single-thread, until I get multi-threading working?

Multithreading support should be easy to add. You just need to load the dictionary into its ldmState in the same way you are doing it for the single-threaded one now. Until that diff is landed we can fail if --single-thread isn't passed to the CLI, since we don't expect it to make it into a release.

@terrelln
Contributor

terrelln commented Feb 5, 2020

Alternatively, for files larger than 128 MB, we could print a warning saying to use --memory=X to ensure the dictionary is always in-bounds, but still compress the file with a reduced compression ratio.

@Cyan4973
Contributor

Cyan4973 commented Feb 5, 2020

I think I would be okay having --patch-from=X imply --memory=FileSize

I agree.
Maybe a (non-blocking) warning message (displayLevel 2) would also be worthwhile when the required memory becomes very large, such as > 128 MB.

Maybe we can get rid of having the end user specify --long by automatically using long mode by default on large files (>128 MB)?

Yes, although part of the game will be to determine the right threshold to trigger it.
I expect it to be far smaller than 128 MB. Any file substantially larger than the cycleLog is at risk of missing pointers into the early section of the reference file.
I guess it will be useful to run tests with intermediate sizes, and determine when it's possible to not use --long mode.

Multithreading support should be easy to add.

Maybe @bimbashrestha will need some directions on this topic. You seem to have a pretty clear idea of what to do.

@bimbashrestha
Contributor Author

Okay then, I made --patch-from=file imply --memory=filesize (that removes --memory) and I put 32 MB as the threshold for automatically activating long mode in patch-from (that removes --long). We can experiment with the value of the threshold later.

Multithreading support should be easy to add. You just need to load the dictionary into its ldmState in the same way you are doing it for the single-threaded one now. Until that diff is landed we can fail if --single-thread

Didn't realize that was all we needed to do. I just went ahead and included the multi-threaded changes in this PR. The ratio in multi-threaded mode is a little worse than in single-threaded mode though. Is this unusual?

# --single-thread
./zstd --single-thread --patch-from=llvm-9.0.0 llvm-9.0.1
llvm-9.0.1           :  0.48%   (425910272 => 2062279 bytes, llvm-9.0.1.zst)

# vs no --single-thread
./zstd --patch-from=llvm-9.0.0 llvm-9.0.1 \
  && ./zstd -d  --patch-from=llvm-9.0.0 llvm-9.0.1.zst -o tmp \
  && diff -s tmp llvm-9.0.1
llvm-9.0.1           :  0.55%   (425910272 => 2337712 bytes, llvm-9.0.1.zst)
llvm-9.0.1.zst      : 425910272 bytes
Files tmp and llvm-9.0.1 are identical

@bimbashrestha changed the title from "Adding --long support for --patch-from on --single-thread" to "Adding --long support for --patch-from" on Feb 5, 2020
@terrelln
Contributor

terrelln commented Feb 5, 2020

Didn't realize that was all we needed to do. I just went ahead and included the multi-threaded changes in this PR. The ratio in multi-threaded mode is a little worse than in single-threaded mode though. Is this unusual?

This is probably expected, but I can help you investigate it tomorrow, for learning and to double-check. Basically, we'll want to look at the sequences the single-threaded encoder and the multi-threaded encoder generate, and compare them.

@terrelln terrelln left a comment (Contributor):

Awesome!
@Cyan4973
Contributor

Question: we mentioned a need for a paragraph describing the feature, to be used at release time.
Where is it?

@bimbashrestha
Contributor Author

I'll put up the blurb shortly :) Just need to also finish making a nice graph or two.

@bimbashrestha bimbashrestha merged commit 5b0a452 into facebook:dev on Apr 17, 2020
@bimbashrestha
Contributor Author

Alright, finally merged! :) Also updated the PR description with the blurb, @Cyan4973 and @terrelln.

@RubenKelevra
Contributor

RubenKelevra commented Apr 19, 2020

Hey guys, hey @bimbashrestha,

here's the requested look at the --long parameter. I also compared a full update (a compressed file) vs a delta update with various real-world data.

While I'm actually more interested in patching backwards, for the reasons in [0], this test is done forwards (which shouldn't matter much anyway).

My files are:

  1. OpenStreetMap XML data for Luxembourg with a 2 day difference (Filesize ~390 MB)
  2. OpenStreetMap ProtocolBufferFormat data for Luxembourg with a 2 day difference (Filesize ~25 MB)
  3. OpenStreetMap ProtocolBufferFormat data for Australia & Oceania with a 1 month difference (Filesize ~766 MB)
  4. OpenStreetMap ProtocolBufferFormat data for Australia & Oceania with a 1 day difference (Filesize ~766 MB)
  5. ArchLinux x86_64 linux kernel package 5.6.4 to 5.6.5 uncompressed (Filesize ~80 MB)
  6. ArchLinux x86_64 go-pie package 1.14.1 to 1.14.2 uncompressed (Filesize ~641 MB)
  7. CentOS 7 x86_64 GenericCloud qcow2-image with version hop 1905 to 1907 (Filesize ~897 MB)
  8. en wikipedia news articles (pages) as xml one month difference (Filesize ~195 MB)

I'm running zstd built from git commit 5b0a452, compiled with -march=native -O2, on an i3-2330M. Data is stored and loaded on a SATA SSD and the system has 12 GB of memory.

1a) delta -15:

luxembourg-200417.osm :  0.88%   (410250506 => 3628882 bytes, luxembourg-200415-to-200417.osm.zst_patch)
real    1m53,091s
user    1m51,593s
sys     0m1,070s

1b) delta -15 --long:

luxembourg-200417.osm :  0.04%   (410250506 => 169502 bytes, luxembourg-200415-to-200417.osm.zst_patch)
real    1m57,944s
user    1m56,030s
sys     0m1,282s

1c) full update -15 --long:

41640286 bytes -> delta saves 41470784 bytes (99.59%)

2a) delta -15:

luxembourg-200417.osm.pbf : 98.79%   (25349898 => 25043001 bytes, luxembourg-200415-to-200417.osm.pbf.zst_patch)
real    0m5,329s
user    0m5,158s
sys     0m0,161s

2b) delta -15 --long:

luxembourg-200417.osm.pbf : 98.72%   (25349898 => 25025611 bytes, luxembourg-200415-to-200417.osm.pbf.zst_patch)
real    0m5,775s
user    0m5,636s
sys     0m0,133s

2c) full update -15 --long:

25350493 bytes -> delta saves 324882 bytes (1.28%)

3a) delta -15:

australia-oceania-200401.osm.pbf :100.00%   (803827262 => 803820669 bytes, australia-oceania-200301-to-200401.osm.pbf.zst_patch)
real    3m23,253s
user    3m9,292s
sys     0m3,464s

3b) delta -15 --long:

australia-oceania-200401.osm.pbf : 99.99%   (803827262 => 803737958 bytes, australia-oceania-200301-to-200401.osm.pbf.zst_patch)
real    3m44,705s
user    3m26,232s
sys     0m4,313s

3c) full update -15 --long:

803820673 bytes -> delta saves 82715 bytes (0.01%)

4a) delta -15:

australia-oceania-200417.osm.pbf :100.00%   (807853291 => 807844053 bytes, australia-oceania-200416-to-200417.osm.pbf.zst_patch)
real    3m21,324s
user    3m10,312s
sys     0m3,081s

4b) delta -15 --long:

australia-oceania-200417.osm.pbf : 99.59%   (807853291 => 804520667 bytes, australia-oceania-200416-to-200417.osm.pbf.zst_patch)
real    3m36,696s
user    3m27,023s
sys     0m3,033s

4c) full update -15 --long:

807843840 bytes -> delta saves 3323173 bytes (0.41%)

5a) delta -15:

linux-5.6.5.arch1-1-x86_64.pkg.tar : 86.53%   (83640320 => 72370676 bytes, linux-5.6.4-to-5.6.5.arch1-1-x86_64.pkg.tar.zst_patch)
real    0m20,083s
user    0m19,092s
sys     0m0,517s

5b) delta -15 --long:

linux-5.6.5.arch1-1-x86_64.pkg.tar : 86.33%   (83640320 => 72204691 bytes, linux-5.6.4-to-5.6.5.arch1-1-x86_64.pkg.tar.zst_patch)
real    0m20,919s
user    0m20,391s
sys     0m0,421s

5c) full update -15 --long:

72891396 bytes -> delta saves 686705 bytes (0.94%)

6a) delta -15:

go-pie-2_1.14.2-1-x86_64.pkg.tar : 12.51%   (672358400 => 84082084 bytes, go-pie-2_1.14.1-to-1.14.2-1-x86_64.pkg.tar.zst_patch)
real    5m12,541s
user    5m6,284s
sys     0m2,869s

6b) delta -15 --long:

go-pie-2_1.14.2-1-x86_64.pkg.tar :  8.34%   (672358400 => 56105936 bytes, go-pie-2_1.14.1-to-1.14.2-1-x86_64.pkg.tar.zst_patch)
real    4m55,772s
user    4m52,609s
sys     0m1,777s

6c) full update -15 --long:

156160084 bytes -> delta saves 100054148 bytes (64.07%)

7a) delta -15:

CentOS-7-x86_64-GenericCloud-1907.qcow2 : 27.55%   (942407680 => 259600155 bytes, CentOS-7-x86_64-GenericCloud-1905-to-1907.qcow2.zst_patch)
real    5m43,531s
user    5m34,114s
sys     0m3,793s

7b) delta -15 --long:

CentOS-7-x86_64-GenericCloud-1907.qcow2 : 15.18%   (942407680 => 143012290 bytes, CentOS-7-x86_64-GenericCloud-1905-to-1907.qcow2.zst_patch)
real    4m9,574s
user    4m4,923s
sys     0m3,366s

7c) full update -15 --long:

277690047 bytes -> delta saves 134677757 bytes (48.50%)

8a) delta -15:

enwikinews-20200401-pages-articles.xml :  4.53%   (204994714 => 9285772 bytes, enwikinews-20200301-to-20200401-pages-articles.xml.zst_patch)
real    2m25,614s
user    2m14,190s
sys     0m2,773s

8b) delta -15 --long:

enwikinews-20200401-pages-articles.xml :  0.80%   (204994714 => 1633204 bytes, enwikinews-20200301-to-20200401-pages-articles.xml.zst_patch)
real    1m44,225s
user    1m40,599s
sys     0m0,909s

8c) full update -15 --long:

39571286 bytes -> delta saves 37938082 bytes (95.87%)

Observations

OpenStreetMap data is sorted by ID, so new IDs appear at the end; changes don't move the data from its position, they just change its size, while deletes just remove the data.

I suspect that the XML from Wikipedia is also sorted by key, but I haven't confirmed that. It contains a lot of large chunks of static text.

go-pie either holds a lot of static files that don't change much, or the compilation creates very stable binaries.

The Linux kernel is hard to compress and extremely hard to diff; the latter might be caused by an unstable compilation process?

qcow2 images are particularly good to diff when there are few changes (LTS distributions).

The binary protocol buffer format (and probably also the export process) makes it nearly impossible to find unchanged data, so the diff barely makes an impact.

Conclusion

It looks like the worst case is the same size as a regular compressed full update.

If there's enough memory, --long should always be used, since it barely impacts compression speed but has a huge potential to decrease the patch size.

Btw: any chance of lifting the uint32 limit on the file size?

[0] #2063 (comment)

@bimbashrestha
Contributor Author

Thanks for the update @RubenKelevra!

Just so we're on the same page here, I'm assuming delta -15 is zstd --patch-from before this PR was merged, delta -15 --long is zstd --patch-from after this PR was merged, and full update is just zstd -15 without patch-from?

@RubenKelevra
Contributor

Thanks for the update @RubenKelevra!

Just so we're on the same page here, I'm assuming delta -15 is zstd --patch-from before this PR was merged, delta -15 --long is zstd --patch-from after this PR was merged, and full update is just zstd -15 without patch-from?

Nope, it's all done with the same version (the quoted commit). If you'd like a comparison between this version and a different one, I can do that as well.

Which commit should I choose for the comparison? :)

@Cyan4973
Contributor

That's unexpected. Shouldn't --long be automatically enabled when the input size is large enough, which several of these examples definitely are?

@bimbashrestha
Contributor Author

Nope, it's all done with the same version (the quoted commit)

Hmm, yeah, like @Cyan4973 mentioned, long mode should be automatically activated when the input is large enough. Is delta a script that just invokes the zstd CLI with and without the --long option?

@RubenKelevra
Contributor

Nope, it's all done with the same version (the quoted commit)

Hmm, yeah, like @Cyan4973 mentioned, long mode should be automatically activated when the input is large enough. Is delta a script that just invokes the zstd CLI with and without the --long option?

No, I wasn't doing it that fancy, but you can think of it like that. ;)

I just didn't want to clutter my post with all those filenames :)

The --long parameter is definitely not being set automatically at the moment (at least in my scenario).

@bimbashrestha
Contributor Author

@RubenKelevra @Cyan4973 Ah, I see the issue. It's this line here:

if (fileWindowLog > ZSTD_cycleLog(comprParams->hashLog, cParams.strategy)) {

That should be cParams.hashLog and not comprParams->hashLog. We had a different heuristic that used chainLog before, so I never tested to ensure that long mode was properly triggered when we use ZSTD_cycleLog.
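
i.e., per the fix just described, the corrected condition should read:

if (fileWindowLog > ZSTD_cycleLog(cParams.hashLog, cParams.strategy)) {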

Glad we caught that. I'll put up a PR and a test fixing it shortly.
