
Optimize by prefetching on aarch64 #2040

Merged 1 commit on Apr 8, 2020

Conversation

@caoyzh (Contributor) commented Mar 16, 2020

Optimize compression by adding prefetch

Average gains    gcc 9.2.0    clang 9.0.0
level 1~2        3.10%        3.69%
level 3~4        2.49%        1.51%

Test environment

1) Measured with lzbench, with the code of zstd's dev branch transplanted in.
2) The test file is silesia.tar.
3) The test machine is as follows:

   Architecture                aarch64
   CPU name                    Armv8-A
   CPU(s)                      128
   Memory device               DDR4 2666 MT/s, 32 GB
   Number of memory devices    16

@@ -198,6 +198,9 @@ size_t ZSTD_compressBlock_doubleFast_generic(
         } }
 
         ip += ((ip-anchor) >> kSearchStrength) + 1;
+#if defined(__aarch64__)
+        PREFETCH_L1(ip+256);
+#endif
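For context, PREFETCH_L1 is zstd's portable prefetch macro. A minimal standalone sketch of the GCC/Clang path it relies on; the no-op fallback here is illustrative, not zstd's exact definition, which also covers MSVC:

/* Sketch of what a PREFETCH_L1-style macro expands to on GCC/Clang.
 * __builtin_prefetch(ptr, 0, 3) hints a read with high temporal locality,
 * i.e. keep the line in all cache levels. */
#include <stdio.h>

#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH_L1(ptr) __builtin_prefetch((ptr), 0 /* read */, 3 /* high locality */)
#else
#  define PREFETCH_L1(ptr) ((void)(ptr)) /* illustrative no-op fallback */
#endif

int main(void) {
    char buf[1024] = {0};
    PREFETCH_L1(buf + 256);   /* hint: this cache line will be read soon */
    printf("%d\n", buf[256]);
    return 0;
}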
Contributor
For X86 too?

@caoyzh (Contributor Author) commented Mar 19, 2020
> For X86 too?

There are gains on X86 too, but it may have negative effects on decompression.

x86            compression    decompression
clang 9.0.0    1.13%          1.54%
gcc 9.2.0      3.26%          -1.55%

Contributor
Don't worry about the decompression speed for compression-only changes (unless you are changing how the data gets compressed and don't have identical compressed output). That is just noise, and we shouldn't take it into account.

Contributor
I've found on x86 it is better to put this prefetch before line 142 (hash table update). Does that work on aarch64 too?

@caoyzh (Contributor Author)
> I've found on x86 it is better to put this prefetch before line 142 (hash table update). Does that work on aarch64 too?

Putting this prefetch before line 142 is worse on aarch64/clang 9.0.0 and similar on aarch64/gcc 9.2.0.

@caoyzh (Contributor Author)
@terrelln Do I need to remove the aarch64 switch?

Contributor
Let's land this patch with the #if defined(__aarch64__) guards, then put up a second patch that just deletes them. I don't want to block clear aarch64 improvements on x86 benchmarking.

@terrelln (Contributor) left a comment

I see +3% compression speed for fast and dfast strategies on a Pixel 2. This looks promising. I will measure on x86, and another aarch64 phone next.

There are more variants of the fast and dfast strategies. Can you please apply the same change to the other variants as well? They are used for dictionary and streaming compression.

@terrelln (Contributor)

I've measured a ~5% win on x86 on silesia.tar and a ~2% win for enwik7. I've found that level 3 benefits more.

I've found that certain files benefit a lot, and others don't benefit at all. At level 3 for individual files in silesia the gains are approximately:

file       gain
dickens    0-1%
mozilla    5-6%
mr         0-1%
nci        0-1%
ooffice    0-1%
osdb       0-1%
reymont    0-1%
samba      2-4%
sao        0-1%
webster    3-6%
xml        0-1%
x-ray      0-1%

I don't see any super obvious correlation between the files that benefit. I certainly want to land this PR, but I want to understand why it works well on these particular files first.

The reasons I could think of are:

  1. They have long matches.
  2. They have long runs of literals and end up in "skipping" mode (see the sketch below).
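A rough illustration of hypothesis 2: in the fast strategies the search step grows with the distance from the last match, so in skipping mode ip strides far ahead of data the loop has already touched, giving a prefetch of ip+256 time to land before the bytes are read. A standalone sketch of the step arithmetic, assuming kSearchStrength is 8 as in zstd's compress internals at the time of this PR:

/* Sketch of the skip-step arithmetic from the fast/dfast match loops:
 * ip += ((ip-anchor) >> kSearchStrength) + 1.
 * The longer the run without a match, the larger the stride. */
#include <stdio.h>

int main(void) {
    const unsigned kSearchStrength = 8;   /* assumed to match zstd */
    for (long dist = 0; dist <= 4096; dist += 512) {
        long const step = (dist >> kSearchStrength) + 1;  /* next ip advance */
        printf("ip-anchor = %4ld  ->  ip += %ld\n", dist, step);
    }
    return 0;
}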

@terrelln (Contributor)

I measured on a Xeon CPU and saw at least 2% gains on mozilla. The gains are there, but not as pronounced. This server is less stable, so some of them could be hidden in the noise.

@caoyzh (Contributor Author) commented Mar 23, 2020

> I see +3% compression speed for fast and dfast strategies on a Pixel 2. This looks promising. I will measure on x86, and another aarch64 phone next.
>
> There are more variants of the fast and dfast strategies. Can you please apply the same change to the other variants as well? They are used for dictionary and streaming compression.

Actually, we tried many variants on gcc 4.8.5/aarch64; only this patch's modification is effective on gcc 9.2.0/aarch64. The effect varies with the compiler version.

@terrelln (Contributor)

> Actually, we tried many variants on gcc 4.8.5/aarch64; only this patch's modification is effective on gcc 9.2.0/aarch64. The effect varies with the compiler version.

Do you mean that prefetching doesn't help ZSTD_compressBlock_fast_extDict_generic or ZSTD_compressBlock_doubleFast_extDict_generic? That is what I meant by variant of the function, sorry if it wasn't clear. These functions get called in streaming mode, so you can test speed by compressing with the CLI zstd someBigFile -o /dev/null.

Side note, do you have perf counters that show that cache misses go down on either aarch64 or x86? I want to make sure that we're actually getting gains from prefetching, and not just from the compiler emitting slightly different code.
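For reference, the streaming mode mentioned above can also be driven directly through libzstd's public streaming API instead of the CLI; a minimal sketch, with error handling abbreviated and the compressed output discarded to mirror -o /dev/null:

/* Sketch: driving zstd's streaming path (which is what ends up calling the
 * _extDict variants once the window slides) through the public API.
 * Compile with -lzstd. */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(int argc, char** argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    FILE* const fin = fopen(argv[1], "rb");
    if (fin == NULL) { perror("fopen"); return 1; }

    size_t const inSize  = ZSTD_CStreamInSize();   /* recommended chunk sizes */
    size_t const outSize = ZSTD_CStreamOutSize();
    void* const inBuf  = malloc(inSize);
    void* const outBuf = malloc(outSize);
    ZSTD_CStream* const zcs = ZSTD_createCStream();
    ZSTD_initCStream(zcs, 3 /* compression level */);

    size_t readBytes;
    while ((readBytes = fread(inBuf, 1, inSize, fin)) != 0) {
        ZSTD_inBuffer input = { inBuf, readBytes, 0 };
        while (input.pos < input.size) {
            ZSTD_outBuffer output = { outBuf, outSize, 0 };
            size_t const ret = ZSTD_compressStream(zcs, &output, &input);
            if (ZSTD_isError(ret)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(ret)); return 1; }
            /* output.pos bytes of compressed data are discarded here,
             * mirroring `-o /dev/null` */
        }
    }
    {   size_t remaining;
        do {
            ZSTD_outBuffer output = { outBuf, outSize, 0 };
            remaining = ZSTD_endStream(zcs, &output);   /* flush epilogue */
        } while (remaining != 0);
    }
    ZSTD_freeCStream(zcs);
    fclose(fin); free(inBuf); free(outBuf);
    return 0;
}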

@caoyzh (Contributor Author) commented Mar 24, 2020

>> Actually, we tried many variants on gcc 4.8.5/aarch64; only this patch's modification is effective on gcc 9.2.0/aarch64. The effect varies with the compiler version.
>
> Do you mean that prefetching doesn't help ZSTD_compressBlock_fast_extDict_generic or ZSTD_compressBlock_doubleFast_extDict_generic? That is what I meant by variant of the function, sorry if it wasn't clear. These functions get called in streaming mode, so you can test speed by compressing with the CLI zstd someBigFile -o /dev/null.
>
> Side note, do you have perf counters that show that cache misses go down on either aarch64 or x86? I want to make sure that we're actually getting gains from prefetching, and not just from the compiler emitting slightly different code.

Cache misses have indeed gone down on both aarch64 and x86, but the prefetch also affects instruction scheduling, which can cause negative optimizations.

@terrelln (Contributor)

Does the same patch help ZSTD_compressBlock_fast_extDict_generic and ZSTD_compressBlock_doubleFast_extDict_generic? See my comment above on how to benchmark it.

@caoyzh (Contributor Author) commented Mar 28, 2020

> Does the same patch help ZSTD_compressBlock_fast_extDict_generic and ZSTD_compressBlock_doubleFast_extDict_generic? See my comment above on how to benchmark it.

Cache misses also go down when testing with an external dictionary.

@caoyzh (Contributor Author) commented Apr 2, 2020

@terrelln This patch looks ready to merge to me; any other suggestions? Please let me know if anything else should be done before merging.

@terrelln (Contributor) left a comment

Looks good to me! I've filed follow up tasks in #2077.
