Optimize by prefetching on aarch64 #2040
Conversation
@@ -198,6 +198,9 @@ size_t ZSTD_compressBlock_doubleFast_generic(
        } }

        ip += ((ip-anchor) >> kSearchStrength) + 1;
+#if defined(__aarch64__)
+        PREFETCH_L1(ip+256);
+#endif
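For context, PREFETCH_L1 is zstd's portable prefetch wrapper from lib/common/compiler.h. The sketch below approximates (from memory, not verbatim) how it typically resolves: a compiler builtin on GCC/Clang, an SSE intrinsic on MSVC x86/x64, and a no-op elsewhere, so guarding it per-architecture affects performance but never correctness.

    /* Approximation of zstd's PREFETCH_L1 (lib/common/compiler.h):
     * a pure hint that the cache line at ptr will be read soon. */
    #if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    #  include <mmintrin.h>
    #  define PREFETCH_L1(ptr)  _mm_prefetch((const char*)(ptr), _MM_HINT_T0)
    #elif defined(__GNUC__) || defined(__clang__)
    #  define PREFETCH_L1(ptr)  __builtin_prefetch((ptr), 0 /* read */, 3 /* high locality */)
    #else
    #  define PREFETCH_L1(ptr)  (void)(ptr)   /* prefetch unavailable: compiles to nothing */
    #endif

    /* Usage mirroring this PR: hint the search position ~256 bytes ahead. */
    static void prefetch_ahead(const char* ip) { PREFETCH_L1(ip + 256); }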
For X86 too?
There are gains on x86 too, but it may have negative effects on decompression.

x86          compression    decompression
clang9.0     1.13%          1.54%
gcc9.2.0     3.26%          -1.55%
Don't worry about the decompression speed for compression-only changes (unless you are changing how the data gets compressed and don't have identical compressed output). That is just noise, and we shouldn't take it into account.
I've found on x86 it is better to put this prefetch before line 142 (hash table update). Does that work on aarch64 too?
Putting this prefetch before line 142 is worse on aarch64/clang 9.0.0, and similar on aarch64/gcc 9.2.0.
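To make the two placements being compared concrete, here is a small self-contained sketch. The loop shape and the helpers (toy_update_hash, toy_find_match) are hypothetical stand-ins for the real doubleFast loop; only ip/anchor/kSearchStrength and PREFETCH_L1(ip+256) mirror the code touched by this PR, and "line 142" refers to the hash-table update at the top of the search loop.

    #include <stddef.h>
    #include <string.h>

    #ifndef PREFETCH_L1                       /* see the macro sketch above */
    #  if defined(__GNUC__) || defined(__clang__)
    #    define PREFETCH_L1(ptr) __builtin_prefetch((ptr), 0, 3)
    #  else
    #    define PREFETCH_L1(ptr) (void)(ptr)
    #  endif
    #endif

    enum { kSearchStrength = 8 };             /* same constant name as in zstd */

    static unsigned toy_table[1u << 12];

    /* Stand-in for the hashLong/hashSmall updates ("hash table update"). */
    static void toy_update_hash(const char* base, const char* ip)
    {
        unsigned v;
        memcpy(&v, ip, sizeof(v));
        toy_table[(v * 2654435761u) >> 20] = (unsigned)(ip - base);
    }

    /* Stand-in for the long/short match search; always misses in this toy. */
    static int toy_find_match(const char* ip) { (void)ip; return 0; }

    size_t toy_scan(const char* base, size_t size)
    {
        const char* ip = base;
        const char* anchor = base;
        const char* const ilimit = (size >= 8) ? base + size - 8 : base;
        size_t nbMatches = 0;

        while (ip < ilimit) {
            /* Placement (A): before the hash table update
             * (better on x86 per terrelln, worse on aarch64/clang per this thread). */
            /* PREFETCH_L1(ip + 256); */

            toy_update_hash(base, ip);

            if (toy_find_match(ip)) { nbMatches++; ip += 4; anchor = ip; continue; }

            /* No match: skip ahead, faster the longer we go without one. */
            ip += ((ip - anchor) >> kSearchStrength) + 1;

            /* Placement (B): after the skip, as in this PR. */
    #if defined(__aarch64__)
            PREFETCH_L1(ip + 256);
    #endif
        }
        return nbMatches;
    }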
@terrelln Do I need to remove the aarch64 switch?
Let's land this patch with the #if defined(__aarch64__) guard, then put up a second patch that just deletes the guards. I don't want to block clear aarch64 improvements on x86 benchmarking.
I see +3% compression speed for the fast and dfast strategies on a Pixel 2. This looks promising. I will measure on x86 and on another aarch64 phone next.
There are more variants of the fast and dfast strategies. Can you please apply the same change to the other variants as well? They are used for dictionary and streaming compression.
I've measured a ~5% win on x86. I've found that certain files benefit a lot, and others don't benefit at all; at level 3 the gains vary considerably across the individual files in silesia.
I don't see any super obvious correlation between the files that benefit. I certainly want to land this PR, but I want to understand why it works well on these particular files first. There are a few possible reasons I can think of.
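For reference, "the same change" in the other variants amounts to the same two guarded lines after the search-strength skip that each variant's loop contains. The fragment below is illustrative rather than an exact copy of the dictMatchState/extDict paths in zstd_double_fast.c; only the skip line and the prefetch itself come from this PR's diff.

        /* Approximate context inside a dictMatchState/extDict search loop:
         * when no match is found, skip ahead and prefetch the next search position. */
        ip += ((ip - anchor) >> kSearchStrength) + 1;
    #if defined(__aarch64__)
        PREFETCH_L1(ip + 256);   /* same hint as in ZSTD_compressBlock_doubleFast_generic */
    #endif
        continue;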
I measured on a Xeon CPU and saw at least 2% gains on mozilla. They are there, but not as prevalent. This server is less stable, so some of the gains could be hidden in the noise.
Actually, we tried many variants on gcc-4.8.5/aarch64; only this patch's modification is effective on gcc-9.2.0/aarch64.
Do you mean that prefetching doesn't help on gcc-4.8.5/aarch64? Side note: do you have perf counters that show that cache misses go down on either aarch64 or x86? I want to make sure that we're actually getting gains from prefetching, and not just from the compiler emitting slightly different code.
Cache misses have indeed gone down on both aarch64 and x86, but the prefetch also affects the arrangement of instructions, which can cause negative optimizations.
Does the same patch help dictionary compression?
The cache misses also go down when testing with an external dictionary.
@terrelln This patch looks ready to merge to me; any other suggestions? Please let me know if any other actions should be taken before merging.
Looks good to me! I've filed follow-up tasks in #2077.
Optimize compression by adding prefetch
Test environment