
This is the Fletcher4 algorithm implemented in NEON for Aarch64 / ARMv8 64 bits #5248

Merged
merged 1 commit into openzfs:master on Oct 21, 2016

Conversation

@rdolbeau (Contributor) commented Oct 8, 2016

No description provided.

@mention-bot commented:
@rdolbeau, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ironMann, @behlendorf and @FransUrbo to be potential reviewers.

@rdolbeau (Contributor, Author) commented Oct 8, 2016

This is not useful on micro-architectures with a weak NEON implementation (only 64 bits wide); there the native version is slower than scalar and the byteswap version only barely faster.
On the A57, it's only marginally faster than scalar for native, but OK for byteswap:

cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 4407214734807033 4407233933777404
implementation   native         byteswap       
scalar           2302071241     1124873346     
aarch64_neon     2542214946     2245570352     
fastest          aarch64_neon   aarch64_neon   

It might be more useful on e.g. the A53, which should have NEON performance similar to the A57 but weaker scalar performance. @behlendorf, if you could confirm performance on an A53, that would be great.
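
For context, the scalar baseline in the table is essentially four chained 64-bit sums over 32-bit words. A minimal sketch (the function name here is hypothetical; ZIO_SET_CHECKSUM is the usual ZFS macro for filling a zio_cksum_t):

static void
fletcher_4_scalar_sketch(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
	const uint32_t *ip = buf;
	const uint32_t *ipend = ip + (size / sizeof (uint32_t));
	uint64_t a = 0, b = 0, c = 0, d = 0;

	/*
	 * Each sum feeds the next, so the loop is one long dependency
	 * chain - this is what the NEON version tries to widen.
	 */
	for (; ip < ipend; ip++) {
		a += ip[0];
		b += a;
		c += b;
		d += c;
	}
	ZIO_SET_CHECKSUM(zcp, a, b, c, d);
}

The byteswap variant is identical except that each word is byte-swapped before being added to a.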

@rdolbeau changed the title from "This is the Fletch4 algorithm implemented in NEON for Aarch64 / ARMv8 64 bits" to "This is the Fletcher4 algorithm implemented in NEON for Aarch64 / ARMv8 64 bits" on Oct 8, 2016
@ironMann (Contributor) commented Oct 8, 2016

@rdolbeau I think you'll need to rebase. Unfortunately, you missed 482cd9e, which changes things around a bit. Mainly, you have to adapt to the ctx object.

@rdolbeau (Contributor, Author) commented Oct 8, 2016

@ironMann Thanks for the notice. The context object is something I wanted anyway; 4*64 bits in zio_cksum_t wasn't enough storage.
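
To illustrate the storage problem: a 2-lane NEON implementation keeps one 64-bit partial sum per lane for each of the four Fletcher accumulators, i.e. 4 x 2 x 64 bits of state, twice what zio_cksum_t offers. A sketch of the shapes involved, using the type names quoted later in this review (the exact attributes and union members are assumptions):

typedef struct {
	uint64_t v[2] __attribute__((aligned(16)));	/* one 64-bit sum per lane */
} zfs_fletcher_aarch64_neon_t;

typedef union fletcher_4_ctx {
	zio_cksum_t scalar;				/* 4 x 64 bits: too small for two lanes */
	zfs_fletcher_aarch64_neon_t aarch64_neon[4];	/* 4 x 128 bits */
	/* ... other SIMD variants elided ... */
} fletcher_4_ctx_t;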

@ironMann (Contributor) commented Oct 8, 2016

@rdolbeau Also, this code needs to work in the following way: init(ctx), N x compute(ctx), and cksum = fini(ctx). So the transformation of the ctx into the cksum has to happen in fini().
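
In other words, something like this (a sketch of the contract only; the driver loop and its variables are hypothetical):

fletcher_4_ctx_t ctx;
zio_cksum_t zc;

fletcher_4_aarch64_neon_init(&ctx);		/* zero all lane sums */
while (have_more_data)				/* N x compute(ctx) */
	fletcher_4_aarch64_neon_native(&ctx, buf, size);
fletcher_4_aarch64_neon_fini(&ctx, &zc);	/* lanes -> zio_cksum_t, here and only here */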

@rdolbeau force-pushed the fletcher-neon branch 2 times, most recently from 7287839 to f0f884c on October 8, 2016 17:39
@rdolbeau (Contributor, Author) commented Oct 8, 2016

Should be up-to-date now.

@ironMann (Contributor) left a comment

@rdolbeau I added my review and notes inline. Overall looks good!


#include <linux/simd_aarch64.h>
#include <sys/spa_checksum.h>
#include <sys/byteorder.h>
Contributor:

<byteorder.h> not needed?

Contributor Author:

Fixed.

uint64_t A, B, C, D;
A = ctx->aarch64_neon[0].v[0] + ctx->aarch64_neon[0].v[1];
B = 2 * ctx->aarch64_neon[1].v[0] + 2 * ctx->aarch64_neon[1].v[1] -
ctx->aarch64_neon[0].v[1];
Contributor:

Indentation should be 4 spaces on continuation lines (if GitHub is showing this correctly).

Contributor Author:

If I use 4 spaces, make checkstyle complains that I use spaces instead of tabs...

Contributor:

That's odd. The first line must be indented with tabs (the C = ... line), but when the line breaks, the last tab is replaced with 4 spaces on the continuation.

Contributor Author:

PEBKAC... I had /only/ 4 spaces, not tabs and then 4 spaces...
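
As an aside, for readers wondering where the constants in that reduction come from: lane 0 accumulates the even-indexed 32-bit words and lane 1 the odd-indexed ones, and each vector iteration covers two serial steps, so the per-lane sums relate to the serial Fletcher sums by fixed linear combinations. A sketch of the full reduction follows; the A and B lines are from the excerpt above, while the C and D lines are reconstructed from the same stride-2 derivation and may not match the PR verbatim:

	/* a/b/c/d lane sums live in aarch64_neon[0..3].v[0..1] */
	A = ctx->aarch64_neon[0].v[0] + ctx->aarch64_neon[0].v[1];
	B = 2 * ctx->aarch64_neon[1].v[0] + 2 * ctx->aarch64_neon[1].v[1] -
	    ctx->aarch64_neon[0].v[1];
	C = 4 * ctx->aarch64_neon[2].v[0] - ctx->aarch64_neon[1].v[0] +
	    4 * ctx->aarch64_neon[2].v[1] - 3 * ctx->aarch64_neon[1].v[1];
	D = 8 * ctx->aarch64_neon[3].v[0] - 4 * ctx->aarch64_neon[2].v[0] +
	    8 * ctx->aarch64_neon[3].v[1] - 8 * ctx->aarch64_neon[2].v[1] +
	    ctx->aarch64_neon[1].v[1];

These coefficients can be checked by hand on a 4-word input.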

{
const uint64_t *ip = buf;
const uint64_t *ipend = (uint64_t *)((uint8_t *)ip + size);
uint64_t v0[2], v1[2], v2[2], v3[2];
Contributor:

You'll have to make sure this builds for both userspace and kernel. Also, the compiler is stricter when configured with --enable-debug, so you should use that locally. Builders and testers are also using that switch.

Additionally, it would be clearer if uint64_t v1[2] could be written as zfs_fletcher_aarch64_neon_t v1. Or if __attribute__((vector_size(16))) could be moved to the zfs_fletcher_aarch64_neon_t struct somehow.

@rdolbeau (Contributor, Author) commented Oct 17, 2016

My bad - v0 to v3 are leftovers from before 482cd9e, I think. They are unused. So are some of A to D.

Update: seems fine with --enable-debug without the unused variables.

unsigned char TMP2 __attribute__((vector_size(16)));
unsigned char SRC __attribute__((vector_size(16)));
#endif

Contributor:

kfpu_begin() / kfpu_end() this asm
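
That is, in kernel context the NEON register use should be bracketed like this (a sketch; kfpu_begin()/kfpu_end() come from the linux/simd_aarch64.h header already included at the top of the file):

	kfpu_begin();
	/* ... the asm blocks that touch the NEON registers ... */
	kfpu_end();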

Contributor Author:

Fixed; it disappeared when updating to 482cd9e - I wasn't paying attention, it seems :-/

unsigned char SRC __attribute__((vector_size(16)));
#endif

asm("eor %[ZERO].16b,%[ZERO].16b,%[ZERO].16b\n"
Contributor:

These ctx load/store blocks are repeated for byteswap as well. IMO, it's clearer to make macros out of them, like this.

Sorry, I'm a stickler for laying down unnecessary asm :)
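
For illustration, the repeated ctx-load asm could be folded into something like the following (the macro name is hypothetical, and only the ZERO/CTX0/CTX3 operands appear verbatim in the hunks quoted here; the middle loads are assumed to follow the same pattern):

#define	NEON_LOAD_CTX()						\
	asm("eor %[ZERO].16b,%[ZERO].16b,%[ZERO].16b\n"		\
	    "ld1 { %[ACC0].4s }, %[CTX0]\n"			\
	    "ld1 { %[ACC1].4s }, %[CTX1]\n"			\
	    "ld1 { %[ACC2].4s }, %[CTX2]\n"			\
	    "ld1 { %[ACC3].4s }, %[CTX3]\n"			\
	    : [ZERO] "=w" (ZERO),				\
	    [ACC0] "=w" (ACC0), [ACC1] "=w" (ACC1),		\
	    [ACC2] "=w" (ACC2), [ACC3] "=w" (ACC3)		\
	    : [CTX0] "Q" (ctx->aarch64_neon[0]),		\
	    [CTX1] "Q" (ctx->aarch64_neon[1]),			\
	    [CTX2] "Q" (ctx->aarch64_neon[2]),			\
	    [CTX3] "Q" (ctx->aarch64_neon[3]))

with the native and byteswap variants both invoking it once before their main loops.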

: [ZERO] "=w" (ZERO),
[ACC0] "=w" (ACC0), [ACC1] "=w" (ACC1),
[ACC2] "=w" (ACC2), [ACC3] "=w" (ACC3)
: [CTX0] "Q" (ctx->aarch64_neon[0].v[0]),
Contributor:

I believe this should be : [CTX0] "Q" (ctx->aarch64_neon[0]). We had this issue before, where an optimized build does more aggressive aliasing and/or escape analysis and optimizes out operations on ctx->aarch64_neon[0].v[1] because it's not referenced. Here that probably wouldn't be an issue, but it's more correct to reference the whole thing.

Contributor Author:

Fixed

[CTX3] "Q" (ctx->aarch64_neon[3].v[0]));

for (; ip < ipend; ip += 2) {
asm("ld1 { %[SRC].4s }, %[IP]\n"
Contributor:

NOTE: we don't know the alignment of *src, but the size is at least a multiple of 64 B. Just make sure the load can do unaligned accesses.

Contributor Author:

AFAICT from ARM DDI 0487A.j, ld1 can do unaligned accesses. But it's not the clearest documentation I've read in my life :-/

}

static void
fletcher_4_aarch64_neon_byteswap(fletcher_4_ctx_t *ctx,
Contributor:

The same comments apply as for fletcher_4_aarch64_neon_native().

uint64_t v0[2], v1[2], v2[2], v3[2];
uint64_t A, B, C, D;
#if defined(_KERNEL)
register unsigned char ZERO asm("v0") __attribute__((vector_size(16)));
Contributor:

This is a simple leaf function that doesn't call into anything. Could you just put the used SIMD registers in the clobber list instead of these declarations? It would save a ton of space...

Contributor Author:

It might reduce (source) code size, but it makes the code less readable IMHO (explicit names are easier to read than register numbers, in particular for following the data flow between blocks). It would also make the code more dependent on compiler behavior, since data is passed between blocks. The only block-local values are TMP1 and TMP2.

Experience has taught me to write ASM blocks with belt, suspenders, and some extra glue just in case :-)
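
The two styles under discussion, sketched on the register-zeroing one-liner (a hypothetical minimal example, not code from the PR):

	/*
	 * (a) a named register variable pinned to a fixed SIMD register,
	 * as this PR does - the compiler tracks ZERO across asm blocks:
	 */
	register unsigned char ZERO asm("v0") __attribute__((vector_size(16)));
	asm("eor %[ZERO].16b,%[ZERO].16b,%[ZERO].16b" : [ZERO] "=w" (ZERO));

	/*
	 * (b) the clobber-list alternative - shorter, but the value held
	 * in v0 is invisible to the compiler between blocks:
	 */
	asm volatile("eor v0.16b,v0.16b,v0.16b" ::: "v0");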

Contributor:

My intention was to reduce the difference between the kernel and user-space code. I'm fine with this as-is...

@behlendorf (Contributor) left a comment

@rdolbeau sure, I'm happy to put this through its paces once the aarch64 build issue is addressed.

static void
fletcher_4_aarch64_neon_init(fletcher_4_ctx_t *ctx)
{
bzero(ctx->aarch64_neon, 4 * sizeof (zfs_fletcher_aarch64_neon_t));
Contributor:

We're not picking up sysmacros.h in the aarch64 build for some reason, resulting in bzero() being undefined. You could just use memset() here, which is what bzero() gets defined to, or add the missing header.
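
i.e. (a sketch of the memset() spelling of the same initialization):

	memset(ctx->aarch64_neon, 0, 4 * sizeof (zfs_fletcher_aarch64_neon_t));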

Contributor Author:

I've added "sys/sysmacros.h"

@rdolbeau (Contributor, Author) commented Oct 17, 2016

Doesn't help. I've switched to <strings.h> (which _sse.c includes, and it uses bzero() as well), and that seems OK with --enable-debug.

unsigned char TMP2 __attribute__((vector_size(16)));
unsigned char SRC __attribute__((vector_size(16)));
#endif

Contributor:

kfpu_begin() / kfpu_end()

Contributor Author:

Forgot that one, sorry. And thanks everyone for all the reviews.

@behlendorf (Contributor) commented:
@rdolbeau here are the results from an A53 system.

$ cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 1499068294333000 1499101101878000
implementation   native         byteswap       
scalar           1008227510     755880264      
aarch64_neon     1198098720     1044818671     
fastest          aarch64_neon   aarch64_neon 

Your original A57 results, for comparison:

$ cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 4407214734807033 4407233933777404
implementation   native         byteswap       
scalar           2302071241     1124873346     
aarch64_neon     2542214946     2245570352     
fastest          aarch64_neon   aarch64_neon 

@rdolbeau (Contributor, Author) commented:
@behlendorf Thanks for the numbers. Not that useful on the A53 either, then :-( Perhaps it will be better on future server-oriented cores with more powerful NEON implementations (Vulcan?).

@rdolbeau (Contributor, Author) commented:
For cores with really weak SIMD capabilities (i.e. some Aarch64 cores...), it's faster to just expose more instruction-level parallelism to the core by using the same algorithm as SSE/NEON/AVX2/..., but in pure C... See e.g. #5317.
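
The idea behind that approach, sketched as a hypothetical minimal 2-stream inner loop (not the actual #5317 code): run the same stride-2 decomposition in plain C, so the compiler sees independent accumulator chains it can schedule in parallel.

static void
fletcher_4_superscalar2_sketch(const uint32_t *ip, uint64_t words,
    uint64_t a[2], uint64_t b[2], uint64_t c[2], uint64_t d[2])
{
	/*
	 * Lane 0 takes even-indexed words, lane 1 odd-indexed ones; the
	 * lanes are recombined in fini() exactly as in the SIMD variants.
	 */
	for (uint64_t i = 0; i < words; i += 2) {
		a[0] += ip[i];		a[1] += ip[i + 1];
		b[0] += a[0];		b[1] += a[1];
		c[0] += b[0];		c[1] += b[1];
		d[0] += c[0];		d[1] += c[1];
	}
}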

@behlendorf (Contributor) commented:
@rdolbeau this looks ready to merge. Let me know if you're happy with this as the final version.

@rdolbeau (Contributor, Author) commented:
@behlendorf I'm happy with it - there's not much room for possible improvement in code that only does sequential additions :-)

I tried mixing NEON and scalar assembly, but it didn't seem to be very useful; for cores with really weak NEON, I ended up doing #5317 instead - and that should work on all architectures.

@behlendorf behlendorf merged commit 24cdeaf into openzfs:master Oct 21, 2016