Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible performance regressions on some CPUs after #3449 (C fast loops) #3762

Closed
iksaif opened this issue Sep 18, 2023 · 8 comments
Closed
Assignees

Comments

@iksaif
Copy link

iksaif commented Sep 18, 2023

Describe the bug
It seems that #3449 introduced a performance regression on modern CPUs, this is particularly problematic on aarch64/arm64 since those don't have an ASM function and will always use the fast loop functions.

To Reproduce

wget https://github.com/facebook/zstd/releases/download/v1.1.3/github_users_sample_set.tar.zst
zstd -d github_users_sample_set.tar.zst
tar xf github_users_sample_set.tar
wget https://mattmahoney.net/dc/silesia.zip
unzip silesia.zip
tar -cf silesia.tar dickens mozilla mr nci ooffice osdb reymont samba sao webster x-ray xml

CFLAGS="-O2 -march=native" make -j8; ./zstd -b1 -r github ; ./zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar

On Darwin: (Apple clang version 14.0.3 (clang-1403.0.22.14.1))
On Linux: (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0)

Expected behavior
Performance should not degrade between v1.5.2 and v1.5.4

Screenshots and charts

  • b1: zstd -b1 -r github

  • b2: zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar

  • 🟢 - best performances for this machine

  • 🔴 - worst performances for this machine

Machine Version github silesia
m1 v1.5.2 🟢 529.5 MB/s 🔴 722.5 MB/s
m1 v1.5.4 🔴 412.3 MB/s 🟢 723.4 MB/s
m1 v1.5.4 + disableFast 🟢 529.4 MB/s 🟢 722.4 MB/s
m1 dev 🔴 410.6 MB/s 🟢 722.7 MB/s
c7g Graviton 3 51b123d 🟢 483.1 MB/s 🟢 1128.2 MB/s
c7g v1.5.2 🟢 436.1 MB/s 🟢 1122.5 MB/s
c7g v1.5.4 🔴 349.6 MB/s 🔴 703.7 MB/s
c7g v1.5.4 + disableFast 🟢 436.7 MB/s 🟢 1125.5 MB/s
c7g dev 🔴 349.3 MB/s 🔴 720.4 MB/s
c6i Intel(R) Xeon(R) Platinum 8375C 51b123d (pre Huffman ASM) 🟢 492.4 MB/s 🟠 1232.3 MB/s
c6i v1.5.2 🔴 393.4 MB/s 🟢 1589.8 MB/s
c6i v1.5.4 🔴 398.7 MB/s 🟢 1579.3 MB/s
c6i v1.5.4 + disableAsm 🔴 404.0 MB/s 🔴 907.0 MB/s
c6i v1.5.4 + disableAsm + disableFast 🟢 494.3 MB/s 🟠 1235.2 MB/s
c6i dev 🔴 397.4 MB/s 🟢 1585.9 MB/s
c6a AMD EPYC 7R13 51b123d (pre Huffman ASM) 🟢 519.3 MB/s 🟠 1209.2 MB/s
c6a v1.5.2 🔴 397.9 MB/s 🟢 1467.3 MB/s
c6a v1.5.4 🔴 397.6 MB/s 🟢 1466.8 MB/s
c6a v1.5.4 + disableAsm 🔴 396.8 MB/s 🔴 687.4 MB/s
c6a v1.5.4 + disableAsm + disableFast 🟢 476.0 MB/s 🟠 1301.4 MB/s
c6a v1.5.5 🔴 396.1 MB/s 🟢 1469.0 MB/s
c6a dev 🔴 391.8 MB/s 🟢 1466.3 MB/s

Summary:

  • v1.5.4 + disableAsm always gives the worst performances for both tests - and this versions uses the Generic C fast loops
  • v1.5.4 + disableAsm + disableFast: is slightly slower that the ASM function for the second benchmark but faster for the first one
  • For aarch64 the first benchmark was always faster with v1.5.2
  • For amd64 the first benchmark was faster before the ASM function
  • [huf] Add generic C versions of the fast decoding loops #3449 results can be replicated only on Apple silicon with the second benchmark

Desktop (please complete the following information):
See above

@iksaif
Copy link
Author

iksaif commented Sep 18, 2023

CC: @terrelln since you authored the original PR you are probably interested about this

I haven't looked at why things might be slower, but given the results it might be interesting to offer the option to not build the generic C versions of the fast decoding loops since it's unclear that they offer a significant performance boost on modern CPUs and compilers

Also maybe I did something wrong in those tests, I'm happy to re-run them if necessary

@iksaif iksaif changed the title Performance regressions on some CPUs after #3449 (C fast loops) Possible performance regressions on some CPUs after #3449 (C fast loops) Sep 18, 2023
@iksaif
Copy link
Author

iksaif commented Sep 18, 2023

This also relates to:

iksaif added a commit to iksaif/zstd that referenced this issue Sep 19, 2023
facebook#3762 seems to show
that it doesn't perform as well as we though it would in many
cases. It makes sense to at least allow users to disable them
at buildtime and runtime.
@terrelln terrelln self-assigned this Sep 28, 2023
@terrelln
Copy link
Contributor

Thanks for the report @iksaif!

I will spend some time investigating next week.

At first glance, the performance on c7g Graviton 3 looks bad across the board. But, everything else has just regressed the GitHub case, which is a bunch of small files. Which maybe suggests that we need to either improve the Fast-C & ASM loops for small literals sections, or automatically use the old code versions for small literals sections.

@iksaif
Copy link
Author

iksaif commented Oct 14, 2023

👋 any finding ? thanks!

terrelln pushed a commit to terrelln/zstd that referenced this issue Nov 20, 2023
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to eliete
this check. This is a no op, because `entry.nbBits` is guaranteed to be
less than 64.

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue facebook#3762. So if even after this change, there is a
performance regression, users can opt-out at compile time.
terrelln pushed a commit to terrelln/zstd that referenced this issue Nov 20, 2023
* Rename `ilimit` to `ilowest` and set it equal to `src` instead of
  `src + 6 + 8`. This is safe because the fast decoding loops guarantee
  to never read below `ilowest` already. This allows the fast decoder to
  run for at least two more iterations, because it consumes at most 7
  bytes per iteration.
* Continue the fast loop all the way until the number of safe iterations
 is 0. Initially, I thought that when it got towards the end, the
 computation of how many iterations of safe might become expensive. But
 it ends up being slower to have to decode each of the 4 streams
 individually, which makes sense.

This drastically speeds up the Huffman decoder on the `github` dataset
for the issue raised in facebook#3762, measured with `zstd -b1e1r github/`.

| Decoder  | Speed before | Speed after |
|----------|--------------|-------------|
| Fallback | 477 MB/s     | 477 MB/s    |
| Fast C   | 384 MB/s     | 492 MB/s    |
| Assembly | 385 MB/s     | 501 MB/s    |

We can also look at the speed delta for different block sizes of silesia
using `zstd -b1e1r silesia.tar -B#`.

| Decoder  | -B1K ∆ | -B2K ∆ | -B4K ∆ | -B8K ∆ | -B16K ∆ | -B32K ∆ | -B64K ∆ | -B128K ∆ |
|----------|--------|--------|--------|--------|---------|---------|---------|----------|
| Fast C   | +11.2% | +8.2%  | +6.1%  | +4.4%  | +2.7%   | +1.5%   | +0.6%   | +0.2%    |
| Assembly | +12.5% | +9.0%  | +6.2%  | +3.6%  | +1.5%   | +0.7%   | +0.2%   | +0.03%   |
terrelln pushed a commit that referenced this issue Nov 20, 2023
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to eliete
this check. This is a no op, because `entry.nbBits` is guaranteed to be
less than 64.

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue #3762. So if even after this change, there is a
performance regression, users can opt-out at compile time.
terrelln pushed a commit to terrelln/zstd that referenced this issue Nov 20, 2023
* Rename `ilimit` to `ilowest` and set it equal to `src` instead of
  `src + 6 + 8`. This is safe because the fast decoding loops guarantee
  to never read below `ilowest` already. This allows the fast decoder to
  run for at least two more iterations, because it consumes at most 7
  bytes per iteration.
* Continue the fast loop all the way until the number of safe iterations
 is 0. Initially, I thought that when it got towards the end, the
 computation of how many iterations of safe might become expensive. But
 it ends up being slower to have to decode each of the 4 streams
 individually, which makes sense.

This drastically speeds up the Huffman decoder on the `github` dataset
for the issue raised in facebook#3762, measured with `zstd -b1e1r github/`.

| Decoder  | Speed before | Speed after |
|----------|--------------|-------------|
| Fallback | 477 MB/s     | 477 MB/s    |
| Fast C   | 384 MB/s     | 492 MB/s    |
| Assembly | 385 MB/s     | 501 MB/s    |

We can also look at the speed delta for different block sizes of silesia
using `zstd -b1e1r silesia.tar -B#`.

| Decoder  | -B1K ∆ | -B2K ∆ | -B4K ∆ | -B8K ∆ | -B16K ∆ | -B32K ∆ | -B64K ∆ | -B128K ∆ |
|----------|--------|--------|--------|--------|---------|---------|---------|----------|
| Fast C   | +11.2% | +8.2%  | +6.1%  | +4.4%  | +2.7%   | +1.5%   | +0.6%   | +0.2%    |
| Assembly | +12.5% | +9.0%  | +6.2%  | +3.6%  | +1.5%   | +0.7%   | +0.2%   | +0.03%   |
@terrelln
Copy link
Contributor

Hi @iksaif sorry for the delay, but I have several updates:

  1. PR [huf] Improve fast C & ASM performance on small data #3827 drastically speeds up the fast C & assembly Huffman decoders on very small data like in the github corpus.
  2. PR [huf] Improve fast huffman decoding speed in linux kernel #3826 may improve c7g Graviton 3 performance. It manually unrolls the inner loops, because I've found that in certain scenarios the compiler wasn't unrolling the inner loops. If that was the case for c7g, then I would expect significant performance improvements.
  3. PR [huf] Improve fast huffman decoding speed in linux kernel #3826 also provides the ability to disable the fast C decoding loops at compile time by defining the C macro HUF_DISABLE_FAST_DECODE, in case you still see subpar performance after these PRs.

Please let me know if you see any more issues with performance after these PRs, and we will look into them. I can't guarantee a super fast turnaround time, but we always try to handle outstanding issues before we make releases.

terrelln pushed a commit that referenced this issue Nov 20, 2023
* Rename `ilimit` to `ilowest` and set it equal to `src` instead of
  `src + 6 + 8`. This is safe because the fast decoding loops guarantee
  to never read below `ilowest` already. This allows the fast decoder to
  run for at least two more iterations, because it consumes at most 7
  bytes per iteration.
* Continue the fast loop all the way until the number of safe iterations
 is 0. Initially, I thought that when it got towards the end, the
 computation of how many iterations of safe might become expensive. But
 it ends up being slower to have to decode each of the 4 streams
 individually, which makes sense.

This drastically speeds up the Huffman decoder on the `github` dataset
for the issue raised in #3762, measured with `zstd -b1e1r github/`.

| Decoder  | Speed before | Speed after |
|----------|--------------|-------------|
| Fallback | 477 MB/s     | 477 MB/s    |
| Fast C   | 384 MB/s     | 492 MB/s    |
| Assembly | 385 MB/s     | 501 MB/s    |

We can also look at the speed delta for different block sizes of silesia
using `zstd -b1e1r silesia.tar -B#`.

| Decoder  | -B1K ∆ | -B2K ∆ | -B4K ∆ | -B8K ∆ | -B16K ∆ | -B32K ∆ | -B64K ∆ | -B128K ∆ |
|----------|--------|--------|--------|--------|---------|---------|---------|----------|
| Fast C   | +11.2% | +8.2%  | +6.1%  | +4.4%  | +2.7%   | +1.5%   | +0.6%   | +0.2%    |
| Assembly | +12.5% | +9.0%  | +6.2%  | +3.6%  | +1.5%   | +0.7%   | +0.2%   | +0.03%   |
terrelln pushed a commit that referenced this issue Nov 20, 2023
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to eliete
this check. This is a no op, because `entry.nbBits` is guaranteed to be
less than 64.

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue #3762. So if even after this change, there is a
performance regression, users can opt-out at compile time.
@iksaif
Copy link
Author

iksaif commented Nov 21, 2023

Thanks for the update, I'll try this out in the next few weeks!

@iksaif
Copy link
Author

iksaif commented Nov 22, 2023

I can confirm that the latest version from git is back to v1.5.2 level of performances on c7g !

 1# 9114 files       :   7484607 ->   2603666 (x2.875),  183.3 MB/s,  435.3 MB/s
 1#silesia.tar       : 211957760 -> 132055991 (x1.605),  700.0 MB/s, 1143.6 MB/s

@iksaif iksaif closed this as completed Nov 22, 2023
@terrelln
Copy link
Contributor

Great, I'm glad to hear it! Looks like it was running into the loop unrolling problem.

gcflymoto pushed a commit to gcflymoto/zstd that referenced this issue Dec 9, 2023
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to eliete
this check. This is a no op, because `entry.nbBits` is guaranteed to be
less than 64.

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue facebook#3762. So if even after this change, there is a
performance regression, users can opt-out at compile time.
gcflymoto pushed a commit to gcflymoto/zstd that referenced this issue Dec 9, 2023
* Rename `ilimit` to `ilowest` and set it equal to `src` instead of
  `src + 6 + 8`. This is safe because the fast decoding loops guarantee
  to never read below `ilowest` already. This allows the fast decoder to
  run for at least two more iterations, because it consumes at most 7
  bytes per iteration.
* Continue the fast loop all the way until the number of safe iterations
 is 0. Initially, I thought that when it got towards the end, the
 computation of how many iterations of safe might become expensive. But
 it ends up being slower to have to decode each of the 4 streams
 individually, which makes sense.

This drastically speeds up the Huffman decoder on the `github` dataset
for the issue raised in facebook#3762, measured with `zstd -b1e1r github/`.

| Decoder  | Speed before | Speed after |
|----------|--------------|-------------|
| Fallback | 477 MB/s     | 477 MB/s    |
| Fast C   | 384 MB/s     | 492 MB/s    |
| Assembly | 385 MB/s     | 501 MB/s    |

We can also look at the speed delta for different block sizes of silesia
using `zstd -b1e1r silesia.tar -B#`.

| Decoder  | -B1K ∆ | -B2K ∆ | -B4K ∆ | -B8K ∆ | -B16K ∆ | -B32K ∆ | -B64K ∆ | -B128K ∆ |
|----------|--------|--------|--------|--------|---------|---------|---------|----------|
| Fast C   | +11.2% | +8.2%  | +6.1%  | +4.4%  | +2.7%   | +1.5%   | +0.6%   | +0.2%    |
| Assembly | +12.5% | +9.0%  | +6.2%  | +3.6%  | +1.5%   | +0.7%   | +0.2%   | +0.03%   |
xxmustafacooTR pushed a commit to xxmustafacooTR/exynos-linux-stable that referenced this issue Dec 16, 2023
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to eliete
this check. This is a no op, because  is guaranteed to be
less than 64.

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue facebook/zstd#3762 . So if even after this change, there is a
performance regression, users can opt-out at compile time.
xxmustafacooTR pushed a commit to xxmustafacooTR/exynos-linux-stable that referenced this issue Dec 17, 2023
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to eliete
this check. This is a no op, because  is guaranteed to be
less than 64.

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue facebook/zstd#3762 . So if even after this change, there is a
performance regression, users can opt-out at compile time.
xxmustafacooTR pushed a commit to xxmustafacooTR/exynos-linux-stable that referenced this issue Jan 15, 2024
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to eliete
this check. This is a no op, because  is guaranteed to be
less than 64.

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue facebook/zstd#3762 . So if even after this change, there is a
performance regression, users can opt-out at compile time.
xxRishikcooTR pushed a commit to xxRishikcooTR/kernel_samsung_exynos9810-V2 that referenced this issue Feb 4, 2024
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to eliete
this check. This is a no op, because  is guaranteed to be
less than 64.

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue facebook/zstd#3762 . So if even after this change, there is a
performance regression, users can opt-out at compile time.
hswong3i pushed a commit to alvistack/facebook-zstd that referenced this issue Mar 27, 2024
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to eliete
this check. This is a no op, because `entry.nbBits` is guaranteed to be
less than 64.

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue facebook#3762. So if even after this change, there is a
performance regression, users can opt-out at compile time.
hswong3i pushed a commit to alvistack/facebook-zstd that referenced this issue Mar 27, 2024
* Rename `ilimit` to `ilowest` and set it equal to `src` instead of
  `src + 6 + 8`. This is safe because the fast decoding loops guarantee
  to never read below `ilowest` already. This allows the fast decoder to
  run for at least two more iterations, because it consumes at most 7
  bytes per iteration.
* Continue the fast loop all the way until the number of safe iterations
 is 0. Initially, I thought that when it got towards the end, the
 computation of how many iterations of safe might become expensive. But
 it ends up being slower to have to decode each of the 4 streams
 individually, which makes sense.

This drastically speeds up the Huffman decoder on the `github` dataset
for the issue raised in facebook#3762, measured with `zstd -b1e1r github/`.

| Decoder  | Speed before | Speed after |
|----------|--------------|-------------|
| Fallback | 477 MB/s     | 477 MB/s    |
| Fast C   | 384 MB/s     | 492 MB/s    |
| Assembly | 385 MB/s     | 501 MB/s    |

We can also look at the speed delta for different block sizes of silesia
using `zstd -b1e1r silesia.tar -B#`.

| Decoder  | -B1K ∆ | -B2K ∆ | -B4K ∆ | -B8K ∆ | -B16K ∆ | -B32K ∆ | -B64K ∆ | -B128K ∆ |
|----------|--------|--------|--------|--------|---------|---------|---------|----------|
| Fast C   | +11.2% | +8.2%  | +6.1%  | +4.4%  | +2.7%   | +1.5%   | +0.6%   | +0.2%    |
| Assembly | +12.5% | +9.0%  | +6.2%  | +3.6%  | +1.5%   | +0.7%   | +0.2%   | +0.03%   |
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants