
Optimize by prefetching on aarch64 #2040

Merged 1 commit on Apr 8, 2020

Conversation

@caoyzh (Contributor) commented Mar 16, 2020

Optimize compression by adding prefetch

Average gains    gcc 9.2.0    clang 9.0.0
level 1~2        3.10%        3.69%
level 3~4        2.49%        1.51%

Test environment

1) Measured with lzbench, with the code of zstd's dev branch transplanted in.
2) The test file is silesia.tar.
3) The test machine is as follows:

   Architecture                aarch64
   CPU name                    Armv8-A
   CPU(s)                      128
   Memory device               DDR4 2666 MT/s, 32 GB
   Number of memory devices    16

@@ -198,6 +198,9 @@ size_t ZSTD_compressBlock_doubleFast_generic(
         } }
 
         ip += ((ip-anchor) >> kSearchStrength) + 1;
+#if defined(__aarch64__)
+        PREFETCH_L1(ip+256);
+#endif
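For context, PREFETCH_L1 is zstd's portable prefetch macro. A minimal standalone sketch of the GCC/Clang path it relies on; the no-op fallback here is illustrative, not zstd's exact definition, which also covers MSVC:

/* Sketch of what a PREFETCH_L1-style macro expands to on GCC/Clang.
 * __builtin_prefetch(ptr, 0, 3) hints a read with high temporal locality,
 * i.e. keep the line in all cache levels. */
#include <stdio.h>

#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH_L1(ptr) __builtin_prefetch((ptr), 0 /* read */, 3 /* high locality */)
#else
#  define PREFETCH_L1(ptr) ((void)(ptr)) /* illustrative no-op fallback */
#endif

int main(void) {
    char buf[1024] = {0};
    PREFETCH_L1(buf + 256);   /* hint: this cache line will be read soon */
    printf("%d\n", buf[256]);
    return 0;
}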
Contributor
For X86 too?

@caoyzh (Contributor Author) commented Mar 19, 2020
> For X86 too?

There are gains on X86 too, but it may have negative effects on decompression.

x86            compression    decompression
clang 9.0.0    1.13%          1.54%
gcc 9.2.0      3.26%          -1.55%

Contributor
Don't worry about the decompression speed for compression-only changes (unless you are changing how the data gets compressed and don't have identical compressed output). That is just noise, and we shouldn't take it into account.

Contributor
I've found on x86 it is better to put this prefetch before line 142 (hash table update). Does that work on aarch64 too?

@caoyzh (Contributor Author)
> I've found on x86 it is better to put this prefetch before line 142 (hash table update). Does that work on aarch64 too?

Putting this prefetch before line 142 is worse on aarch64/clang 9.0.0 and similar on aarch64/gcc 9.2.0.

@caoyzh (Contributor Author)
@terrelln Do I need to remove the aarch64 switch?

Contributor
Let's land this patch with the #if defined(__aarch64__) guards, then put up a second patch that just deletes them. I don't want to block clear aarch64 improvements on x86 benchmarking.

@terrelln (Contributor) left a comment

I see +3% compression speed for fast and dfast strategies on a Pixel 2. This looks promising. I will measure on x86, and another aarch64 phone next.

There are more variants of the fast and dfast strategies. Can you please apply the same change to the other variants as well? They are used for dictionary and streaming compression.

@terrelln (Contributor)

I've measured a ~5% win on x86 on silesia.tar and a ~2% win for enwik7. I've found that level 3 benefits more.

I've found that certain files benefit a lot, and others don't benefit at all. At level 3 for individual files in silesia the gains are approximately:

file       gain
dickens    0-1%
mozilla    5-6%
mr         0-1%
nci        0-1%
ooffice    0-1%
osdb       0-1%
reymont    0-1%
samba      2-4%
sao        0-1%
webster    3-6%
xml        0-1%
x-ray      0-1%

I don't see any super obvious correlation between the files that benefit. I certainly want to land this PR, but I want to understand why it works well on these particular files first.

The reasons I could think of are:

  1. They have long matches.
  2. They have long runs of literals and end up in "skipping" mode (see the sketch below).
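A rough illustration of hypothesis 2: in the fast strategies the search step grows with the distance from the last match, so in skipping mode ip strides far ahead of data the loop has already touched, giving a prefetch of ip+256 time to land before the bytes are read. A standalone sketch of the step arithmetic, assuming kSearchStrength is 8 as in zstd's compress internals at the time of this PR:

/* Sketch of the skip-step arithmetic from the fast/dfast match loops:
 * ip += ((ip-anchor) >> kSearchStrength) + 1.
 * The longer the run without a match, the larger the stride. */
#include <stdio.h>

int main(void) {
    const unsigned kSearchStrength = 8;   /* assumed to match zstd */
    for (long dist = 0; dist <= 4096; dist += 512) {
        long const step = (dist >> kSearchStrength) + 1;  /* next ip advance */
        printf("ip-anchor = %4ld  ->  ip += %ld\n", dist, step);
    }
    return 0;
}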

@terrelln (Contributor)

I measured on a Xeon CPU and saw at least 2% gains on mozilla. The gains are there, but not as pronounced. This server is less stable, so some of them could be hidden in the noise.

@caoyzh (Contributor Author) commented Mar 23, 2020

> I see +3% compression speed for fast and dfast strategies on a Pixel 2. This looks promising. I will measure on x86, and another aarch64 phone next.
>
> There are more variants of the fast and dfast strategies. Can you please apply the same change to the other variants as well? They are used for dictionary and streaming compression.

Actually, we tried many variants on gcc 4.8.5/aarch64; only this patch's modification is effective on gcc 9.2.0/aarch64. The effect varies with the compiler version.

@terrelln (Contributor)

> Actually, we tried many variants on gcc 4.8.5/aarch64; only this patch's modification is effective on gcc 9.2.0/aarch64. The effect varies with the compiler version.

Do you mean that prefetching doesn't help ZSTD_compressBlock_fast_extDict_generic or ZSTD_compressBlock_doubleFast_extDict_generic? That is what I meant by variant of the function, sorry if it wasn't clear. These functions get called in streaming mode, so you can test speed by compressing with the CLI zstd someBigFile -o /dev/null.

Side note, do you have perf counters that show that cache misses go down on either aarch64 or x86? I want to make sure that we're actually getting gains from prefetching, and not just from the compiler emitting slightly different code.
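For reference, the streaming mode mentioned above can also be driven directly through libzstd's public streaming API instead of the CLI; a minimal sketch, with error handling abbreviated and the compressed output discarded to mirror -o /dev/null:

/* Sketch: driving zstd's streaming path (which is what ends up calling the
 * _extDict variants once the window slides) through the public API.
 * Compile with -lzstd. */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(int argc, char** argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    FILE* const fin = fopen(argv[1], "rb");
    if (fin == NULL) { perror("fopen"); return 1; }

    size_t const inSize  = ZSTD_CStreamInSize();   /* recommended chunk sizes */
    size_t const outSize = ZSTD_CStreamOutSize();
    void* const inBuf  = malloc(inSize);
    void* const outBuf = malloc(outSize);
    ZSTD_CStream* const zcs = ZSTD_createCStream();
    ZSTD_initCStream(zcs, 3 /* compression level */);

    size_t readBytes;
    while ((readBytes = fread(inBuf, 1, inSize, fin)) != 0) {
        ZSTD_inBuffer input = { inBuf, readBytes, 0 };
        while (input.pos < input.size) {
            ZSTD_outBuffer output = { outBuf, outSize, 0 };
            size_t const ret = ZSTD_compressStream(zcs, &output, &input);
            if (ZSTD_isError(ret)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(ret)); return 1; }
            /* output.pos bytes of compressed data are discarded here,
             * mirroring `-o /dev/null` */
        }
    }
    {   size_t remaining;
        do {
            ZSTD_outBuffer output = { outBuf, outSize, 0 };
            remaining = ZSTD_endStream(zcs, &output);   /* flush epilogue */
        } while (remaining != 0);
    }
    ZSTD_freeCStream(zcs);
    fclose(fin); free(inBuf); free(outBuf);
    return 0;
}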

@caoyzh (Contributor Author) commented Mar 24, 2020

>> Actually, we tried many variants on gcc 4.8.5/aarch64; only this patch's modification is effective on gcc 9.2.0/aarch64. The effect varies with the compiler version.
>
> Do you mean that prefetching doesn't help ZSTD_compressBlock_fast_extDict_generic or ZSTD_compressBlock_doubleFast_extDict_generic? That is what I meant by variant of the function, sorry if it wasn't clear. These functions get called in streaming mode, so you can test speed by compressing with the CLI zstd someBigFile -o /dev/null.
>
> Side note, do you have perf counters that show that cache misses go down on either aarch64 or x86? I want to make sure that we're actually getting gains from prefetching, and not just from the compiler emitting slightly different code.

Cache misses have indeed gone down on both aarch64 and x86, but the prefetch also affects instruction scheduling, which can cause negative optimizations.

@terrelln (Contributor)

Does the same patch help ZSTD_compressBlock_fast_extDict_generic and ZSTD_compressBlock_doubleFast_extDict_generic? See my comment above on how to benchmark it.

@caoyzh (Contributor Author) commented Mar 28, 2020

> Does the same patch help ZSTD_compressBlock_fast_extDict_generic and ZSTD_compressBlock_doubleFast_extDict_generic? See my comment above on how to benchmark it.

Cache misses also go down when testing with an external dictionary.

@caoyzh (Contributor Author) commented Apr 2, 2020

@terrelln This patch looks ready to merge to me; any other suggestions? Please let me know if anything else should be done before merging.

@terrelln (Contributor) left a comment

Looks good to me! I've filed follow up tasks in #2077.
