
This is the Fletcher4 algorithm implemented in NEON for Aarch64 / ARMv8 64 bits #5248

Merged
merged 1 commit into openzfs:master on Oct 21, 2016

Conversation

@rdolbeau (Contributor) commented Oct 8, 2016

No description provided.

@mention-bot commented:
@rdolbeau, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ironMann, @behlendorf and @FransUrbo to be potential reviewers.

@rdolbeau (Contributor, Author) commented Oct 8, 2016

This is not useful on micro-architectures with a weak NEON implementation (only 64 bits wide); there the native version is slower than scalar and the byteswap version only barely faster.
On the A57, it's only marginally faster than scalar for native, but OK for byteswap:

cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 4407214734807033 4407233933777404
implementation   native         byteswap       
scalar           2302071241     1124873346     
aarch64_neon     2542214946     2245570352     
fastest          aarch64_neon   aarch64_neon   

It might be more useful on e.g. the A53, which should have NEON performance similar to the A57 but weaker scalar performance. @behlendorf, if you could confirm performance on an A53, that would be great.
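
For context, the scalar baseline in the table is essentially four chained 64-bit sums over 32-bit words. A minimal sketch (the function name here is hypothetical; ZIO_SET_CHECKSUM is the usual ZFS macro for filling a zio_cksum_t):

static void
fletcher_4_scalar_sketch(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
	const uint32_t *ip = buf;
	const uint32_t *ipend = ip + (size / sizeof (uint32_t));
	uint64_t a = 0, b = 0, c = 0, d = 0;

	/*
	 * Each sum feeds the next, so the loop is one long dependency
	 * chain - this is what the NEON version tries to widen.
	 */
	for (; ip < ipend; ip++) {
		a += ip[0];
		b += a;
		c += b;
		d += c;
	}
	ZIO_SET_CHECKSUM(zcp, a, b, c, d);
}

The byteswap variant is identical except that each word is byte-swapped before being added to a.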

@rdolbeau changed the title from "This is the Fletch4 algorithm implemented in NEON for Aarch64 / ARMv8 64 bits" to "This is the Fletcher4 algorithm implemented in NEON for Aarch64 / ARMv8 64 bits" on Oct 8, 2016
@ironMann (Contributor) commented Oct 8, 2016

@rdolbeau I think you'll need to rebase. Unfortunately, you missed 482cd9e, which changes things around a bit. Mainly, you have to adapt to the ctx object.

@rdolbeau (Contributor, Author) commented Oct 8, 2016

@ironMann Thanks for the notice. The context object is something I wanted anyway; 4*64 bits in zio_cksum_t wasn't enough storage.
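
To illustrate the storage problem: a 2-lane NEON implementation keeps one 64-bit partial sum per lane for each of the four Fletcher accumulators, i.e. 4 x 2 x 64 bits of state, twice what zio_cksum_t offers. A sketch of the shapes involved, using the type names quoted later in this review (the exact attributes and union members are assumptions):

typedef struct {
	uint64_t v[2] __attribute__((aligned(16)));	/* one 64-bit sum per lane */
} zfs_fletcher_aarch64_neon_t;

typedef union fletcher_4_ctx {
	zio_cksum_t scalar;				/* 4 x 64 bits: too small for two lanes */
	zfs_fletcher_aarch64_neon_t aarch64_neon[4];	/* 4 x 128 bits */
	/* ... other SIMD variants elided ... */
} fletcher_4_ctx_t;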

@ironMann (Contributor) commented Oct 8, 2016

@rdolbeau Also, this code needs to work in the following way: init(ctx), N x compute(ctx), and cksum = fini(ctx). So the transformation of the ctx into the cksum has to happen in fini().
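
In other words, something like this (a sketch of the contract only; the driver loop and its variables are hypothetical):

fletcher_4_ctx_t ctx;
zio_cksum_t zc;

fletcher_4_aarch64_neon_init(&ctx);		/* zero all lane sums */
while (have_more_data)				/* N x compute(ctx) */
	fletcher_4_aarch64_neon_native(&ctx, buf, size);
fletcher_4_aarch64_neon_fini(&ctx, &zc);	/* lanes -> zio_cksum_t, here and only here */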

@rdolbeau force-pushed the fletcher-neon branch 2 times, most recently from 7287839 to f0f884c on October 8, 2016 17:39
@rdolbeau (Contributor, Author) commented Oct 8, 2016

Should be up-to-date now.

@ironMann (Contributor) left a comment

@rdolbeau I added my review and notes inline. Overall looks good!


#include <linux/simd_aarch64.h>
#include <sys/spa_checksum.h>
#include <sys/byteorder.h>
Contributor:

<byteorder.h> not needed?

Contributor Author:

Fixed.

uint64_t A, B, C, D;
A = ctx->aarch64_neon[0].v[0] + ctx->aarch64_neon[0].v[1];
B = 2 * ctx->aarch64_neon[1].v[0] + 2 * ctx->aarch64_neon[1].v[1] -
ctx->aarch64_neon[0].v[1];
Contributor:

Indentation should be 4 spaces on continuation lines (if GitHub is showing this correctly).

Contributor Author:

If I use 4 spaces, make checkstyle complains that I use spaces instead of tabs...

Contributor:

That's odd. The first line must be indented with tabs (the C = ... line), but when the line breaks, the last tab is replaced with 4 spaces on the continuation.

Contributor Author:

PEBKAC... I had /only/ 4 spaces, not tabs and then 4 spaces...
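
As an aside, for readers wondering where the constants in that reduction come from: lane 0 accumulates the even-indexed 32-bit words and lane 1 the odd-indexed ones, and each vector iteration covers two serial steps, so the per-lane sums relate to the serial Fletcher sums by fixed linear combinations. A sketch of the full reduction follows; the A and B lines are from the excerpt above, while the C and D lines are reconstructed from the same stride-2 derivation and may not match the PR verbatim:

	/* a/b/c/d lane sums live in aarch64_neon[0..3].v[0..1] */
	A = ctx->aarch64_neon[0].v[0] + ctx->aarch64_neon[0].v[1];
	B = 2 * ctx->aarch64_neon[1].v[0] + 2 * ctx->aarch64_neon[1].v[1] -
	    ctx->aarch64_neon[0].v[1];
	C = 4 * ctx->aarch64_neon[2].v[0] - ctx->aarch64_neon[1].v[0] +
	    4 * ctx->aarch64_neon[2].v[1] - 3 * ctx->aarch64_neon[1].v[1];
	D = 8 * ctx->aarch64_neon[3].v[0] - 4 * ctx->aarch64_neon[2].v[0] +
	    8 * ctx->aarch64_neon[3].v[1] - 8 * ctx->aarch64_neon[2].v[1] +
	    ctx->aarch64_neon[1].v[1];

These coefficients can be checked by hand on a 4-word input.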

{
const uint64_t *ip = buf;
const uint64_t *ipend = (uint64_t *)((uint8_t *)ip + size);
uint64_t v0[2], v1[2], v2[2], v3[2];
Contributor:

You'll have to make sure this builds for both userspace and kernel. Also, the compiler is stricter when configured with --enable-debug, so you should use that locally. Builders and testers are also using that switch.

Additionally, it would be clearer if uint64_t v1[2] could be written as zfs_fletcher_aarch64_neon_t v1. Or if __attribute__((vector_size(16))) could be moved to the zfs_fletcher_aarch64_neon_t struct somehow.

@rdolbeau (Contributor, Author) commented Oct 17, 2016

My bad - v0 to v3 are leftovers from before 482cd9e, I think. They are unused. So are some of A to D.

Update: seems fine with --enable-debug without the unused variables.

unsigned char TMP2 __attribute__((vector_size(16)));
unsigned char SRC __attribute__((vector_size(16)));
#endif

Contributor:

kfpu_begin() / kfpu_end() this asm
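
That is, in kernel context the NEON register use should be bracketed like this (a sketch; kfpu_begin()/kfpu_end() come from the linux/simd_aarch64.h header already included at the top of the file):

	kfpu_begin();
	/* ... the asm blocks that touch the NEON registers ... */
	kfpu_end();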

Contributor Author:

Fixed; it disappeared when updating to 482cd9e - I wasn't paying attention, it seems :-/

unsigned char SRC __attribute__((vector_size(16)));
#endif

asm("eor %[ZERO].16b,%[ZERO].16b,%[ZERO].16b\n"
Contributor:

These ctx load/store blocks are repeated for byteswap as well. IMO, it's clearer to make macros out of them, like this.

Sorry, I'm a stickler for laying down unnecessary asm :)
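
For illustration, the repeated ctx-load asm could be folded into something like the following (the macro name is hypothetical, and only the ZERO/CTX0/CTX3 operands appear verbatim in the hunks quoted here; the middle loads are assumed to follow the same pattern):

#define	NEON_LOAD_CTX()						\
	asm("eor %[ZERO].16b,%[ZERO].16b,%[ZERO].16b\n"		\
	    "ld1 { %[ACC0].4s }, %[CTX0]\n"			\
	    "ld1 { %[ACC1].4s }, %[CTX1]\n"			\
	    "ld1 { %[ACC2].4s }, %[CTX2]\n"			\
	    "ld1 { %[ACC3].4s }, %[CTX3]\n"			\
	    : [ZERO] "=w" (ZERO),				\
	    [ACC0] "=w" (ACC0), [ACC1] "=w" (ACC1),		\
	    [ACC2] "=w" (ACC2), [ACC3] "=w" (ACC3)		\
	    : [CTX0] "Q" (ctx->aarch64_neon[0]),		\
	    [CTX1] "Q" (ctx->aarch64_neon[1]),			\
	    [CTX2] "Q" (ctx->aarch64_neon[2]),			\
	    [CTX3] "Q" (ctx->aarch64_neon[3]))

with the native and byteswap variants both invoking it once before their main loops.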

: [ZERO] "=w" (ZERO),
[ACC0] "=w" (ACC0), [ACC1] "=w" (ACC1),
[ACC2] "=w" (ACC2), [ACC3] "=w" (ACC3)
: [CTX0] "Q" (ctx->aarch64_neon[0].v[0]),
Contributor:

I believe this should be : [CTX0] "Q" (ctx->aarch64_neon[0]). We had this issue before, where an optimized build does more aggressive aliasing and/or escape analysis and optimizes out operations on ctx->aarch64_neon[0].v[1] because it's not referenced. Here that probably wouldn't be an issue, but it's more correct to reference the whole thing.

Contributor Author:

Fixed

[CTX3] "Q" (ctx->aarch64_neon[3].v[0]));

for (; ip < ipend; ip += 2) {
asm("ld1 { %[SRC].4s }, %[IP]\n"
Contributor:

NOTE: we don't know the alignment of *src, but the size is at least a multiple of 64 B. Just make sure the load can do unaligned accesses.

Contributor Author:

AFAICT from ARM DDI 0487A.j, ld1 can do unaligned accesses. But it's not the clearest documentation I've read in my life :-/

}

static void
fletcher_4_aarch64_neon_byteswap(fletcher_4_ctx_t *ctx,
Contributor:

The same comments apply as for fletcher_4_aarch64_neon_native().

uint64_t v0[2], v1[2], v2[2], v3[2];
uint64_t A, B, C, D;
#if defined(_KERNEL)
register unsigned char ZERO asm("v0") __attribute__((vector_size(16)));
Contributor:

This is a simple leaf function that doesn't call into anything. Could you just put the used SIMD registers in the clobber list instead of these declarations? It would save a ton of space...

Contributor Author:

It might reduce (source) code size, but it makes the code less readable IMHO (explicit names are easier to read than register numbers, in particular for following the data flow between blocks). It would also make the code more dependent on compiler behavior, since data is passed between blocks. The only block-local values are TMP1 and TMP2.

Experience has taught me to write ASM blocks with belt, suspenders, and some extra glue just in case :-)
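
The two styles under discussion, sketched on the register-zeroing one-liner (a hypothetical minimal example, not code from the PR):

	/*
	 * (a) a named register variable pinned to a fixed SIMD register,
	 * as this PR does - the compiler tracks ZERO across asm blocks:
	 */
	register unsigned char ZERO asm("v0") __attribute__((vector_size(16)));
	asm("eor %[ZERO].16b,%[ZERO].16b,%[ZERO].16b" : [ZERO] "=w" (ZERO));

	/*
	 * (b) the clobber-list alternative - shorter, but the value held
	 * in v0 is invisible to the compiler between blocks:
	 */
	asm volatile("eor v0.16b,v0.16b,v0.16b" ::: "v0");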

Contributor:

My intention was to reduce the difference between the kernel and user-space code. I'm fine with this as-is...

@behlendorf (Contributor) left a comment

@rdolbeau sure, I'm happy to put this through its paces once the aarch64 build issue is addressed.

static void
fletcher_4_aarch64_neon_init(fletcher_4_ctx_t *ctx)
{
bzero(ctx->aarch64_neon, 4 * sizeof (zfs_fletcher_aarch64_neon_t));
Contributor:

We're not picking up sysmacros.h in the aarch64 build for some reason, resulting in bzero() being undefined. You could just use memset() here, which is what bzero() gets defined to, or add the missing header.
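
i.e. (a sketch of the memset() spelling of the same initialization):

	memset(ctx->aarch64_neon, 0, 4 * sizeof (zfs_fletcher_aarch64_neon_t));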

Contributor Author:

I've added "sys/sysmacros.h"

@rdolbeau (Contributor, Author) commented Oct 17, 2016

Doesn't help. I've switched to <strings.h> (which _sse.c includes, and it uses bzero() as well), and that seems OK with --enable-debug.

unsigned char TMP2 __attribute__((vector_size(16)));
unsigned char SRC __attribute__((vector_size(16)));
#endif

Contributor:

kfpu_begin() / kfpu_end()

Contributor Author:

Forgot that one, sorry. And thanks everyone for all the reviews.

@behlendorf (Contributor) commented:
@rdolbeau here are the results from an A53 system.

$ cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 1499068294333000 1499101101878000
implementation   native         byteswap       
scalar           1008227510     755880264      
aarch64_neon     1198098720     1044818671     
fastest          aarch64_neon   aarch64_neon 

Your original A57 results, for comparison:

$ cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 4407214734807033 4407233933777404
implementation   native         byteswap       
scalar           2302071241     1124873346     
aarch64_neon     2542214946     2245570352     
fastest          aarch64_neon   aarch64_neon 

@rdolbeau (Contributor, Author) commented:
@behlendorf Thanks for the numbers. Not that useful on the A53 either, then :-( Perhaps it will be better on future server-oriented cores with more powerful NEON implementations (Vulcan?).

@rdolbeau (Contributor, Author) commented:
For cores with really weak SIMD capabilities (i.e. some Aarch64 cores...), it's faster to just expose more instruction-level parallelism to the core by using the same algorithm as SSE/NEON/AVX2/..., but in pure C... See e.g. #5317.
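
The idea behind that approach, sketched as a hypothetical minimal 2-stream inner loop (not the actual #5317 code): run the same stride-2 decomposition in plain C, so the compiler sees independent accumulator chains it can schedule in parallel.

static void
fletcher_4_superscalar2_sketch(const uint32_t *ip, uint64_t words,
    uint64_t a[2], uint64_t b[2], uint64_t c[2], uint64_t d[2])
{
	/*
	 * Lane 0 takes even-indexed words, lane 1 odd-indexed ones; the
	 * lanes are recombined in fini() exactly as in the SIMD variants.
	 */
	for (uint64_t i = 0; i < words; i += 2) {
		a[0] += ip[i];		a[1] += ip[i + 1];
		b[0] += a[0];		b[1] += a[1];
		c[0] += b[0];		c[1] += b[1];
		d[0] += c[0];		d[1] += c[1];
	}
}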

@behlendorf (Contributor) commented:
@rdolbeau this looks ready to merge. Let me know if you're happy with this as the final version.

@rdolbeau (Contributor, Author) commented:
@behlendorf I'm happy with it - there's not much room for possible improvement in code that only does sequential additions :-)

I tried mixing NEON and scalar assembly, but it didn't seem to be very useful; for cores with really weak NEON, I ended up doing #5317 instead - and that should work on all architectures.

@behlendorf behlendorf merged commit 24cdeaf into openzfs:master Oct 21, 2016