This is the Fletcher4 algorithm implemented in NEON for AArch64 / ARMv8 64-bit #5248
@rdolbeau, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ironMann, @behlendorf and @FransUrbo to be potential reviewers.
This is not useful on micro-architectures with a weak (64-bit-only) NEON implementation; the native version is slower, and the byteswap version barely faster, than scalar.
It might be more useful on e.g. the A53, which should have NEON performance similar to the A57 but weaker scalar performance. @behlendorf, if you could confirm performance on an A53 that would be great.
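For reference, Fletcher-4 is just four chained running sums over the input 32-bit words, which is why its SIMD variants split the stream into independent lanes. A minimal scalar sketch (illustrative only, not the exact ZFS implementation):

```c
#include <stdint.h>

/* One Fletcher-4 pass over `size` bytes (size is a multiple of 4). */
static void
fletcher_4_scalar_sketch(const void *buf, uint64_t size,
    uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d)
{
	const uint32_t *ip = buf;
	const uint32_t *ipend = ip + (size / sizeof (uint32_t));

	for (; ip < ipend; ip++) {
		*a += *ip;	/* first-order sum  */
		*b += *a;	/* second-order sum */
		*c += *b;	/* third-order sum  */
		*d += *c;	/* fourth-order sum */
	}
}
```

Each sum depends on the previous one, so a single scalar stream has a long dependency chain; the NEON version instead carries two 64-bit lanes per accumulator.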
Force-pushed from 4e98ca9 to fd9bc15.
@rdolbeau I think you'll need to rebase. Unfortunately, you missed 482cd9e, which changes things around a bit. Mainly, you have to adapt to the ctx object.
@ironMann Thanks for the notice. The context thing is something I wanted to have; 4*64 bits in zio_cksum_t wasn't enough storage.
@rdolbeau Also, this code needs to work in the following way:
Force-pushed from 7287839 to f0f884c.
Should be up-to-date now.
@rdolbeau I added my review and notes inline. Overall looks good!
```c
#include <linux/simd_aarch64.h>
#include <sys/spa_checksum.h>
#include <sys/byteorder.h>
```
<byteorder.h> not needed?
Fixed.
```c
uint64_t A, B, C, D;
A = ctx->aarch64_neon[0].v[0] + ctx->aarch64_neon[0].v[1];
B = 2 * ctx->aarch64_neon[1].v[0] + 2 * ctx->aarch64_neon[1].v[1] -
    ctx->aarch64_neon[0].v[1];
```
Indentation should be 4 spaces on broken lines (if GitHub is showing this correctly).
If I use 4 spaces, `make checkstyle` complains that I'm using spaces instead of tabs...
That's odd. The first line must be indented with tabs (the `C = ...` line), but when the line breaks, the last tab is replaced with 4 spaces.
PEBKAC... I had /only/ 4 spaces, not tabs then 4 spaces...
```c
{
	const uint64_t *ip = buf;
	const uint64_t *ipend = (uint64_t *)((uint8_t *)ip + size);
	uint64_t v0[2], v1[2], v2[2], v3[2];
```
You'll have to make sure this builds for both userspace and the kernel. Also, the compiler is stricter when configured with `--enable-debug`, so you should use that switch locally; the builders and testers use it as well. Additionally, it would be clearer if `uint64_t v1[2]` could be written as `zfs_fletcher_aarch64_neon_t v1`, or if `__attribute__((vector_size(16)))` could somehow be moved into the `zfs_fletcher_aarch64_neon_t` struct.
My bad - v0 to v3 are leftovers from the pre-482cd9e code, I think. They are useless. Some of A to D as well.
Update: seems fine with --enable-debug without the useless variables.
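One way ironMann's typedef suggestion above could look (a sketch assuming GCC vector extensions are acceptable here; `fletcher_neon_lane_t` is a hypothetical name, and the actual ZFS header may do this differently):

```c
#include <stdint.h>

/* Hypothetical: a 16-byte vector of two uint64_t lanes. */
typedef uint64_t fletcher_neon_lane_t __attribute__((vector_size(16)));

typedef struct zfs_fletcher_aarch64_neon {
	/* GCC vectors support subscripting, so .v[0] / .v[1] still work */
	fletcher_neon_lane_t v;
} zfs_fletcher_aarch64_neon_t;
```

With the attribute folded into the type, locals could be declared as `zfs_fletcher_aarch64_neon_t v1;` instead of repeating the attribute at each declaration.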
```c
unsigned char TMP2 __attribute__((vector_size(16)));
unsigned char SRC __attribute__((vector_size(16)));
#endif
```
`kfpu_begin()` / `kfpu_end()` around this asm.
Fixed; it disappeared when updating to 482cd9e - I wasn't paying attention, it seems :-/
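For context, kernel-side SIMD code has to bracket vector-register use so the FPU/NEON state is saved and restored; roughly (a sketch, not the PR's exact code):

```c
static void
fletcher_4_aarch64_neon_native(fletcher_4_ctx_t *ctx,
    const void *buf, uint64_t size)
{
	kfpu_begin();	/* allow NEON register use in kernel context */

	/* ... ctx load, main NEON loop, ctx store ... */

	kfpu_end();	/* release/restore the FPU state */
}
```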
```c
unsigned char SRC __attribute__((vector_size(16)));
#endif

asm("eor %[ZERO].16b,%[ZERO].16b,%[ZERO].16b\n"
```
These ctx load/store blocks are repeated for the byteswap variant as well. IMO it's clearer to make macros out of them, like this (see the sketch below).
Sorry, I'm a stickler about unnecessary asm :)
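A hedged sketch of what such a macro could look like (the macro name is illustrative and not from the PR; it assumes ACC0..ACC3 vector variables are in scope at the use site, mirroring the operand style used elsewhere in this PR):

```c
/*
 * Hypothetical: hoist the repeated ctx-load asm into one macro so
 * the native and byteswap variants can share it.
 */
#define	NEON_CTX_LOAD(ctx)						\
	asm("ld1 { %[ACC0].4s }, %[CTX0]\n"				\
	    "ld1 { %[ACC1].4s }, %[CTX1]\n"				\
	    "ld1 { %[ACC2].4s }, %[CTX2]\n"				\
	    "ld1 { %[ACC3].4s }, %[CTX3]\n"				\
	    : [ACC0] "=w" (ACC0), [ACC1] "=w" (ACC1),			\
	      [ACC2] "=w" (ACC2), [ACC3] "=w" (ACC3)			\
	    : [CTX0] "Q" ((ctx)->aarch64_neon[0]),			\
	      [CTX1] "Q" ((ctx)->aarch64_neon[1]),			\
	      [CTX2] "Q" ((ctx)->aarch64_neon[2]),			\
	      [CTX3] "Q" ((ctx)->aarch64_neon[3]))
```

A matching store macro would do the inverse with `st1`.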
```c
    : [ZERO] "=w" (ZERO),
      [ACC0] "=w" (ACC0), [ACC1] "=w" (ACC1),
      [ACC2] "=w" (ACC2), [ACC3] "=w" (ACC3)
    : [CTX0] "Q" (ctx->aarch64_neon[0].v[0]),
```
I believe this should be `[CTX0] "Q" (ctx->aarch64_neon[0])`. We had this issue before, when an optimized build did more aggressive aliasing and/or escape analysis and optimized out operations on ctx->aarch64_neon[0].v[1] because it's not referenced. Here it probably wouldn't be an issue, but it's more correct to reference the whole thing.
Fixed
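To spell out the fix (a minimal before/after paraphrasing the exchange above):

```c
/*
 * Before: only the first 64-bit half is visible to the compiler,
 * so stores to v[1] can be treated as dead under optimization.
 */
: [CTX0] "Q" (ctx->aarch64_neon[0].v[0]),

/* After: the whole 16-byte element is a declared asm input. */
: [CTX0] "Q" (ctx->aarch64_neon[0]),
```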
```c
    [CTX3] "Q" (ctx->aarch64_neon[3].v[0]));

for (; ip < ipend; ip += 2) {
	asm("ld1 { %[SRC].4s }, %[IP]\n"
```
NOTE: we don't know the alignment of `*src`, but `size` is a multiple of 64 B at least. Just make sure the load can do unaligned accesses.
AFAICT from ARM DDI 0487A.j, ld1 can do unaligned accesses. But it's not the clearest documentation I've read in my life :-/
```c
}

static void
fletcher_4_aarch64_neon_byteswap(fletcher_4_ctx_t *ctx,
```
same comments apply as for fletcher_4_aarch64_neon_native()
```c
uint64_t v0[2], v1[2], v2[2], v3[2];
uint64_t A, B, C, D;
#if defined(_KERNEL)
register unsigned char ZERO asm("v0") __attribute__((vector_size(16)));
```
This is a simple function that doesn't call into anything else. Could you just put the used SIMD regs in the clobber list instead of these declarations? It would save a ton of space...
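A minimal sketch of the clobber-list alternative being suggested (register choice is illustrative):

```c
/*
 * Zero v0 without tying a C variable to the register: v0 is simply
 * listed as a clobber, so the compiler keeps its own data out of it.
 */
asm volatile("eor v0.16b, v0.16b, v0.16b"
    : /* no outputs */
    : /* no inputs */
    : "v0");
```

The trade-off is that values can no longer flow between separate asm blocks through named registers, so the blocks would have to be merged.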
It might reduce (source) code size, but it makes the code less readable IMHO (explicit names are easier to read than register numbers, in particular for data flow between blocks). It would also make the code more dependent on compiler behavior, since data are passed between blocks. The only block-local values are TMP1 and TMP2.
Experience has taught me to write ASM blocks with belt, suspenders, and some extra glue just in case :-)
My intention was to reduce the difference between kernel and user-space code. I'm fine with this as is...
@rdolbeau sure, I'm happy to put this through its paces once the aarch64 build issue is addressed.
```c
static void
fletcher_4_aarch64_neon_init(fletcher_4_ctx_t *ctx)
{
	bzero(ctx->aarch64_neon, 4 * sizeof (zfs_fletcher_aarch64_neon_t));
```
We're not picking up sysmacros.h in the aarch64 build for some reason, resulting in bzero() being undefined. You could just use memset() here, which is what bzero() gets defined to, or add the missing header.
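The memset() form of that call would be a one-line change (same semantics as the bzero() above):

```c
/* zero the four NEON accumulator pairs without relying on bzero() */
memset(ctx->aarch64_neon, 0, 4 * sizeof (zfs_fletcher_aarch64_neon_t));
```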
I've added "sys/sysmacros.h"
Doesn't help. I've switched to <strings.h> (which _sse.c includes, and it uses bzero as well), and that seems OK with --enable-debug.
```c
unsigned char TMP2 __attribute__((vector_size(16)));
unsigned char SRC __attribute__((vector_size(16)));
#endif
```
kfpu_begin() / kfpu_end()
Forgot that one, sorry. And thanks everyone for all the reviews.
@rdolbeau here are the results from an A53 system.
Your original A57 results for comparison.
@behlendorf Thanks for the numbers. Not that useful on the A53 either, then :-( Perhaps it will be better on future server-oriented cores with more powerful NEON implementations (Vulcan?).
Force-pushed from 40a08f2 to 610c08f.
For cores with really weak SIMD capabilities (i.e. some AArch64 cores...), it's faster to just expose more instruction-level parallelism to the core by using the same algorithm as the SSE/NEON/AVX2/... versions, but in pure C... See e.g. #5317; a sketch of the idea follows below.
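A minimal sketch of that pure-C, lane-split idea (illustrative only; #5317's actual code is structured differently and also handles the final lane recombination, which is omitted here):

```c
#include <stdint.h>

/*
 * Same trick as the SIMD versions: split the stream into two
 * independent accumulator lanes so the core can overlap the
 * dependency chains, but in plain C for any architecture.
 */
static void
fletcher_4_superscalar_sketch(const uint32_t *ip, const uint32_t *ipend,
    uint64_t a[2], uint64_t b[2], uint64_t c[2], uint64_t d[2])
{
	for (; ip < ipend; ip += 2) {
		a[0] += ip[0];	a[1] += ip[1];
		b[0] += a[0];	b[1] += a[1];
		c[0] += b[0];	c[1] += b[1];
		d[0] += c[0];	d[1] += c[1];
	}
}
```

At the end, the two lanes have to be folded back into the four scalar sums, as in the A/B recombination code quoted earlier in this review.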
@rdolbeau this looks ready to merge. Let me know if you're happy with this as the final version.
@behlendorf I'm happy with it - there's not much room for improvement in code that only does sequential additions :-) I tried mixing NEON and scalar assembly, but it didn't seem very useful, and for cores with really weak NEON I ended up doing #5317 instead - and that should work on all architectures.