Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LoongArch SX SIMD extension implementation #981

Merged
merged 1 commit into from
Dec 4, 2024
Merged

Conversation

lrzlin
Copy link
Contributor

@lrzlin lrzlin commented Nov 28, 2024

Add LoongArch SX SIMD extension support to xxHash, which is a new RISC ISA just like RISC-V. The code itself is basically a rewritten of the SSE2 one, tested on Loongson 3A6000 processor, here are the results.

Scalar

./xxhsum --benchmark-all
xxhsum 0.8.3 by Yann Collet
compiled as 64-bit unknown little endian with GCC 14.1.1 20240720
Sample of 100 KB...
 1#XXH32                         :     102400 ->    60583 it/s ( 5916.3 MB/s)
 2#XXH32 unaligned               :     102400 ->    58909 it/s ( 5752.8 MB/s)
 3#XXH64                         :     102400 ->   117056 it/s (11431.2 MB/s)
 4#XXH64 unaligned               :     102400 ->   118304 it/s (11553.1 MB/s)
 5#XXH3_64b                      :     102400 ->   142790 it/s (13944.4 MB/s)
 6#XXH3_64b unaligned            :     102400 ->   136549 it/s (13334.9 MB/s)
 7#XXH3_64b w/seed               :     102400 ->   117327 it/s (11457.8 MB/s)
 8#XXH3_64b w/seed unaligned     :     102400 ->   114720 it/s (11203.1 MB/s)
 9#XXH3_64b w/secret             :     102400 ->   144445 it/s (14106.0 MB/s)
10#XXH3_64b w/secret unaligned   :     102400 ->   138076 it/s (13483.9 MB/s)
11#XXH128                        :     102400 ->   114568 it/s (11188.3 MB/s)
12#XXH128 unaligned              :     102400 ->   111621 it/s (10900.5 MB/s)
13#XXH128 w/seed                 :     102400 ->   144003 it/s (14062.8 MB/s)
14#XXH128 w/seed unaligned       :     102400 ->   137411 it/s (13419.1 MB/s)
15#XXH128 w/secret               :     102400 ->   151320 it/s (14777.3 MB/s)
16#XXH128 w/secret unaligned     :     102400 ->   143261 it/s (13990.4 MB/s)
17#XXH32_stream                  :     102400 ->    60783 it/s ( 5935.8 MB/s)
18#XXH32_stream unaligned        :     102400 ->    56481 it/s ( 5515.7 MB/s)
19#XXH64_stream                  :     102400 ->   116072 it/s (11335.1 MB/s)
20#XXH64_stream unaligned        :     102400 ->   105525 it/s (10305.2 MB/s)
21#XXH3_stream                   :     102400 ->   116267 it/s (11354.2 MB/s)
22#XXH3_stream unaligned         :     102400 ->   112994 it/s (11034.6 MB/s)
23#XXH3_stream w/seed            :     102400 ->   115983 it/s (11326.5 MB/s)
24#XXH3_stream w/seed unaligned  :     102400 ->   112852 it/s (11020.7 MB/s)
25#XXH128_stream                 :     102400 ->   116232 it/s (11350.8 MB/s)
26#XXH128_stream unaligned       :     102400 ->   112905 it/s (11025.8 MB/s)
27#XXH128_stream w/seed          :     102400 ->   116081 it/s (11336.1 MB/s)
28#XXH128_stream w/seed unaligne :     102400 ->   112789 it/s (11014.6 MB/s)

LoongArch SX

./xxhsum --benchmark-all
xxhsum 0.8.3 by Yann Collet
compiled as 64-bit loongarch little endian with GCC 14.1.1 20240720
Sample of 100 KB...
 1#XXH32                         :     102400 ->    60569 it/s ( 5915.0 MB/s)
 2#XXH32 unaligned               :     102400 ->    58907 it/s ( 5752.6 MB/s)
 3#XXH64                         :     102400 ->   117053 it/s (11431.0 MB/s)
 4#XXH64 unaligned               :     102400 ->   118289 it/s (11551.6 MB/s)
 5#XXH3_64b                      :     102400 ->   199076 it/s (19441.0 MB/s)
 6#XXH3_64b unaligned            :     102400 ->   189993 it/s (18554.0 MB/s)
 7#XXH3_64b w/seed               :     102400 ->   199722 it/s (19504.1 MB/s)
 8#XXH3_64b w/seed unaligned     :     102400 ->   189672 it/s (18522.7 MB/s)
 9#XXH3_64b w/secret             :     102400 ->   183999 it/s (17968.7 MB/s)
10#XXH3_64b w/secret unaligned   :     102400 ->   176080 it/s (17195.3 MB/s)
11#XXH128                        :     102400 ->   198883 it/s (19422.2 MB/s)
12#XXH128 unaligned              :     102400 ->   189925 it/s (18547.3 MB/s)
13#XXH128 w/seed                 :     102400 ->   198877 it/s (19421.6 MB/s)
14#XXH128 w/seed unaligned       :     102400 ->   188807 it/s (18438.2 MB/s)
15#XXH128 w/secret               :     102400 ->   183770 it/s (17946.3 MB/s)
16#XXH128 w/secret unaligned     :     102400 ->   175811 it/s (17169.0 MB/s)
17#XXH32_stream                  :     102400 ->    60782 it/s ( 5935.7 MB/s)
18#XXH32_stream unaligned        :     102400 ->    56488 it/s ( 5516.4 MB/s)
19#XXH64_stream                  :     102400 ->   116075 it/s (11335.4 MB/s)
20#XXH64_stream unaligned        :     102400 ->   105448 it/s (10297.7 MB/s)
21#XXH3_stream                   :     102400 ->   199438 it/s (19476.4 MB/s)
22#XXH3_stream unaligned         :     102400 ->   190148 it/s (18569.2 MB/s)
23#XXH3_stream w/seed            :     102400 ->   198714 it/s (19405.7 MB/s)
24#XXH3_stream w/seed unaligned  :     102400 ->   189577 it/s (18513.3 MB/s)
25#XXH128_stream                 :     102400 ->   199140 it/s (19447.3 MB/s)
26#XXH128_stream unaligned       :     102400 ->   190063 it/s (18560.8 MB/s)
27#XXH128_stream w/seed          :     102400 ->   198738 it/s (19408.0 MB/s)
28#XXH128_stream w/seed unaligne :     102400 ->   189668 it/s (18522.3 MB/s)

Reference
LoongArch Intrinsics Document: https://jia.je/unofficial-loongarch-intrinsics-guide/

Resolve
loongson-community/discussions#72

@Cyan4973 Cyan4973 self-assigned this Dec 2, 2024
@Cyan4973
Copy link
Owner

Cyan4973 commented Dec 2, 2024

This is a nice patch, well laid out, with the minimal amount of changes required to support LoongArch SX SIMD.
The patch also describes the expected performance gains, which are large enough (from 11-13 GB/s to 17-19 GB/s) to justify the added code.

So the remaining question is around maintenance:
at this point, the SX SIMD code is not tested during CI.
This makes it hard to check that it works as intended.
Even worse, if the feature is broken in the future by some unrelated change, we'll be blind to it.

To properly cover this scenario, we would prefer to have some kind of qemu test (assuming native hardware test is unavailable), which would be run on Github Actions at every PR.
Is that possible ?

@t-mat
Copy link
Contributor

t-mat commented Dec 2, 2024

I don't test it yet, but it seems QEMU 9.0 supports LoongArch and its LSX extension.
Though standard apt repositories doesn't include their binary for Linux systems yet.

Also, there's an official documment for qemu-system-loongarch64.
But, this document suggests to download their prebuilt binary from non-offical personal repository which may be unacceptable for our CI :(

@lrzlin
Copy link
Contributor Author

lrzlin commented Dec 2, 2024

This is a nice patch, well laid out, with the minimal amount of changes required to support LoongArch SX SIMD. The patch also describes the expected performance gains, which are large enough (from 11-13 GB/s to 17-19 GB/s) to justify the added code.

Thanks for your affirmation!

So the remaining question is around maintenance: at this point, the SX SIMD code is not tested during CI. This makes it hard to check that it works as intended. Even worse, if the feature is broken in the future by some unrelated change, we'll be blind to it.

To properly cover this scenario, we would prefer to have some kind of qemu test (assuming native hardware test is unavailable), which would be run on Github Actions at every PR. Is that possible ?

Yes it is possible, I'll trying to add qemu-loongarch64-system test to it ASAP.

@xen0n
Copy link

xen0n commented Dec 2, 2024

But, this document suggests to download their prebuilt binary from non-offical personal repository which may be unacceptable for our CI :(

For the record, the documentation needs updating now, as prebuilt LoongArch EDK2 firmware for system-level emulation has been available since this QEMU commit which will ship in QEMU 9.2.0.

Nevertheless, it is irrelevant, because linux-user emulation does not need the firmware blob at all, so I think it would be acceptable as long as relevant packages (emulator & toolchain) are available at distros' upstream repositories.

@lrzlin
Copy link
Contributor Author

lrzlin commented Dec 2, 2024

@Cyan4973 Luckily, ubuntu-24.04 includes all cross-compile toolchain and qemu-loongarch64-static we need, I've added them to CI and it has passed without any issue.

BTW, the Windows ARM32 CI seems to be broken due to the Windows SDK removed the support of ARM32, see

error MSB8037: The Windows SDK version 10.0.26100.0 for Desktop C++ ARM Apps was not found. Install the required version of Windows SDK or change the SDK version in the project property pages or by right-clicking the solution and selecting "Retarget solution"

So maybe we should deleted it? nevertheless, we could discuss this issue in another PR.

@t-mat
Copy link
Contributor

t-mat commented Dec 3, 2024

Nevertheless, it is irrelevant, because linux-user emulation does not need the firmware blob at all, so I think it would be acceptable as long as relevant packages (emulator & toolchain) are available at distros' upstream repositories.

You're right.

Luckily, ubuntu-24.04 includes all cross-compile toolchain and qemu-loongarch64-static we need

Thanks for the information. I've checked them by apt-cache search qemu | grep "^qemu-system-.*QEMU full system". But their package description doesn't match it. Since it seems that it doesn't follow the classic QEMU description, this grep pattern should be changed.

So maybe we should deleted it?

It seems recent version of Windows SDK (10.0.26100.0?) has dropped ARM32. Threfore I think it's okay to remove them from our CI.

Copy link
Owner

@Cyan4973 Cyan4973 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect

@lrzlin lrzlin changed the title Add LoongArch SX SIMD extenstion implementation Add LoongArch SX SIMD extension implementation Dec 4, 2024
@Cyan4973 Cyan4973 merged commit 9b1d788 into Cyan4973:dev Dec 4, 2024
62 of 63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants