Possible performance regressions on some CPUs after #3449 (C fast loops) #3762
CC @terrelln: since you authored the original PR, you are probably interested in this. I haven't looked at why things might be slower, but given the results it might be interesting to offer the option to not build the generic C versions of the fast decoding loops, since it's unclear that they offer a significant performance boost on modern CPUs and compilers. Also, maybe I did something wrong in those tests; I'm happy to re-run them if necessary.
This also relates to:
facebook#3762 seems to show that it doesn't perform as well as we thought it would in many cases. It makes sense to at least allow users to disable them at build time and runtime.
Thanks for the report @iksaif! I will spend some time investigating next week. At first glance, the performance on
👋 any findings? Thanks!
gcc in the linux kernel was not unrolling the inner loops of the Huffman decoder, which was destroying decoding performance. The compiler was generating crazy code with all sorts of branches. I suspect because of Spectre mitigations, but I'm not certain. Once the loops were manually unrolled, performance was restored. Additionally, when gcc couldn't prove that the variable left shift in the 4X2 decode loop wasn't greater than 63, it inserted checks to verify it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to eliminate this check. This is a no-op, because `entry.nbBits` is guaranteed to be less than 64. Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the fast C loops for Issue facebook#3762. So if, even after this change, there is a performance regression, users can opt out at compile time.
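The shift-masking trick above can be illustrated with a minimal sketch (invented function names; not the actual zstd source). In C, a left shift of a 64-bit value by 64 or more is undefined behavior, so when the compiler cannot prove the count is in range it may emit defensive branches around a variable shift:

```c
#include <stdint.h>

/* If gcc cannot prove nbBits < 64, it may guard this shift with
 * extra branches, since a shift by >= 64 is undefined behavior. */
static inline uint64_t shift_unmasked(uint64_t bits, uint32_t nbBits) {
    return bits << nbBits;
}

/* Masking the count with 0x3F makes the bound visible to the compiler,
 * so the guard can be dropped. Because nbBits is guaranteed to be less
 * than 64 in the decoder, the mask never changes the result. */
static inline uint64_t shift_masked(uint64_t bits, uint32_t nbBits) {
    return bits << (nbBits & 0x3F);
}
```

For any count already below 64, the two functions return identical values, which is why the mask is semantically a no-op.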
* Rename `ilimit` to `ilowest` and set it equal to `src` instead of `src + 6 + 8`. This is safe because the fast decoding loops already guarantee to never read below `ilowest`. This allows the fast decoder to run for at least two more iterations, because it consumes at most 7 bytes per iteration.
* Continue the fast loop all the way until the number of safe iterations is 0. Initially, I thought that towards the end, the computation of how many safe iterations remain might become expensive. But it ends up being slower to have to decode each of the 4 streams individually, which makes sense.

This drastically speeds up the Huffman decoder on the `github` dataset for the issue raised in facebook#3762, measured with `zstd -b1e1r github/`.

| Decoder  | Speed before | Speed after |
|----------|--------------|-------------|
| Fallback | 477 MB/s     | 477 MB/s    |
| Fast C   | 384 MB/s     | 492 MB/s    |
| Assembly | 385 MB/s     | 501 MB/s    |

We can also look at the speed delta for different block sizes of silesia using `zstd -b1e1r silesia.tar -B#`.

| Decoder  | -B1K ∆ | -B2K ∆ | -B4K ∆ | -B8K ∆ | -B16K ∆ | -B32K ∆ | -B64K ∆ | -B128K ∆ |
|----------|--------|--------|--------|--------|---------|---------|---------|----------|
| Fast C   | +11.2% | +8.2%  | +6.1%  | +4.4%  | +2.7%   | +1.5%   | +0.6%   | +0.2%    |
| Assembly | +12.5% | +9.0%  | +6.2%  | +3.6%  | +1.5%   | +0.7%   | +0.2%   | +0.03%   |
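The "run until the number of safe iterations is 0" structure can be sketched as follows. This is a hedged outline with invented names (the real decoder in zstd's `huf_decompress.c` is far more involved); it only shows how the remaining distance to `ilowest` bounds how many worst-case iterations can still run without per-iteration bounds checks:

```c
#include <stddef.h>

enum { BYTES_PER_ITER = 7 };  /* fast loop consumes at most 7 bytes/iter */

/* Distance to the lowest readable byte, divided by the worst-case
 * per-iteration consumption, bounds the remaining safe iterations. */
static size_t safe_iterations(const unsigned char* ip,
                              const unsigned char* ilowest) {
    return (size_t)(ip - ilowest) / BYTES_PER_ITER;
}

static const unsigned char* fast_loop(const unsigned char* ip,
                                      const unsigned char* ilowest) {
    while (safe_iterations(ip, ilowest) > 0) {
        /* ... decode one unrolled batch from each of the 4 streams ... */
        ip -= BYTES_PER_ITER;  /* worst-case consumption */
    }
    return ip;  /* the careful per-symbol tail decoder takes over here */
}
```

Because the bound is recomputed each pass, the loop never reads below `ilowest`, which is why setting `ilowest = src` (rather than `src + 6 + 8`) is safe and buys extra fast iterations.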
Hi @iksaif sorry for the delay, but I have several updates:
Please let me know if you see any more issues with performance after these PRs, and we will look into them. I can't guarantee a super fast turnaround time, but we always try to handle outstanding issues before we make releases.
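For anyone who still sees a regression after these PRs, the fix adds a compile-time opt-out. A minimal sketch of one way to use it: the `HUF_DISABLE_FAST_DECODE` macro name comes from the commits above, but the exact build invocation (a zstd checkout whose Makefile honors `CPPFLAGS`) is an assumption.

```shell
# Build zstd with the fast C decode loops disabled (falls back to the
# generic decoder). Assumes a standard checkout of the zstd repository.
make CPPFLAGS="-DHUF_DISABLE_FAST_DECODE"
```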
Thanks for the update, I'll try this out in the next few weeks!
I can confirm that the latest version from git is back to
Great, I'm glad to hear it! Looks like it was running into the loop unrolling problem.
Describe the bug
It seems that #3449 introduced a performance regression on modern CPUs. This is particularly problematic on aarch64/arm64, since those platforms don't have an ASM function and will always use the fast loop functions.
To Reproduce
On Darwin: (Apple clang version 14.0.3 (clang-1403.0.22.14.1))
On Linux: (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0)
Expected behavior
Performance should not degrade between v1.5.2 and v1.5.4
Screenshots and charts
b1:
zstd -b1 -r github
b2:
zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar
🟢 - best performance for this machine
🔴 - worst performance for this machine
Summary:
- `v1.5.4 + disableAsm` always gives the worst performance for both tests, and this version uses the generic C fast loops
- `v1.5.4 + disableAsm + disableFast` is slightly slower than the ASM function for the second benchmark but faster for the first one
- `v1.5.2`
Desktop (please complete the following information):
See above