This is the parity generation/rebuild using 128-bits NEON for Aarch64. #4801

rdolbeau · 2016-06-26T14:57:19Z

This reuses the framework established for SSE and AVX2.
However, GCC is using FP registers on Aarch64, so unlike
SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax.

As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique, even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2).

This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE and AVX2 as they need the new macro.

It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.
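
To make the symbolic-name constraint concrete, here is a minimal sketch, not the macros from this patch. The names `vec_t`, `LOAD`, `STORE` and `xor_and_zero` are made up for illustration, and the non-Aarch64 branch is only a portable stand-in so the snippet builds anywhere. Each operand is named after its variable number, so a hypothetical `XOR(0, 0)` would list `[w0]` twice and GCC would reject the duplicate name, while `ZERO(0)` names it once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__GNUC__)
#include <arm_neon.h>
typedef uint8x16_t vec_t;
#define LOAD(ptr, n)  (w##n = vld1q_u8(ptr))
#define STORE(ptr, n) vst1q_u8((ptr), w##n)
/* w<dst> ^= w<src>: two operands, hence two distinct symbolic names. */
#define XOR(src, dst) \
	__asm__("eor %[w" #dst "].16b, %[w" #dst "].16b, %[w" #src "].16b" \
	    : [w##dst] "+w" (w##dst) : [w##src] "w" (w##src))
/* w<n> = 0: the register is named once, as a pure output. */
#define ZERO(n) \
	__asm__("eor %[w" #n "].16b, %[w" #n "].16b, %[w" #n "].16b" \
	    : [w##n] "=w" (w##n))
#else
/* Portable stand-in (assumption): same macro shape, plain C bodies. */
typedef struct { uint8_t b[16]; } vec_t;
#define LOAD(ptr, n)  memcpy(w##n.b, (ptr), 16)
#define STORE(ptr, n) memcpy((ptr), w##n.b, 16)
#define XOR(src, dst) \
	do { \
		for (int i_ = 0; i_ < 16; i_++) \
			w##dst.b[i_] ^= w##src.b[i_]; \
	} while (0)
#define ZERO(n)       memset(w##n.b, 0, 16)
#endif

/* out_xor = a ^ b (16 bytes each); out_zero = 16 zero bytes. */
static void
xor_and_zero(const uint8_t *a, const uint8_t *b,
    uint8_t *out_xor, uint8_t *out_zero)
{
	vec_t w0, w1;

	assert(a && b && out_xor && out_zero);
	LOAD(a, 0);
	LOAD(b, 1);
	XOR(0, 1);		/* w1 ^= w0 */
	STORE(out_xor, 1);
	ZERO(0);		/* XOR(0, 0) would repeat the name [w0] */
	STORE(out_zero, 0);
}
```

Because the C variables appear as asm operands, the compiler knows which values live across statements, which is the point of not relying on registers being left alone.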

ironMann · 2016-06-28T06:28:05Z

> GCC is using FP registers on Aarch64, so unlike
> SSE/AVX2 we can't rely on the registers being left alone
> between ASM statements.

This is technically true for SSE/AVX2 in userspace build, and IMO, we should address this by adding -mno-sse -mno-avx2 ... compiler flags, akin to kernel build flags.
Would that work here, with -mgeneral-regs-only -mnofp -mnosimd?

rdolbeau · 2016-06-28T19:06:51Z

> This is technically true for SSE/AVX2 in userspace build,

I think GCC will not use SSE registers unless you really have FP, or you're compiling at -O3 or higher. OTOH, this could change and may not apply to other compilers...

My personal preference is to not rely on the compiler's behavior, and either:

- write self-contained ASM with clobbers (my original implementations, mostly);
- make all inter-ASM dependencies explicit using C variables and extended ASM syntax (this code);
- avoid ASM and use built-ins (my favorite, not applicable to the kernel unfortunately).
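
For comparison, the built-ins option could look roughly like the following in userland. This is only a sketch under assumptions: `parity_xor` is a hypothetical helper, not a function from this patch, and the scalar branch exists so the snippet also builds on machines without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* XOR len bytes of src into dst; len must be a multiple of 16. */
static void
parity_xor(uint8_t *dst, const uint8_t *src, size_t len)
{
	assert(len % 16 == 0);
#if defined(__ARM_NEON)
	/* One 128-bit register per iteration; the compiler allocates it. */
	for (size_t i = 0; i < len; i += 16) {
		uint8x16_t d = vld1q_u8(dst + i);
		uint8x16_t s = vld1q_u8(src + i);
		vst1q_u8(dst + i, veorq_u8(d, s));
	}
#else
	/* Scalar fallback with the same result. */
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
#endif
}
```

With intrinsics there is no clobber list or symbolic name to maintain, since register allocation and dependencies are left entirely to the compiler.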

Compiler flags could be used, but that's a fragile solution IMHO.

BTW, are you otherwise OK with my small changes to the SSE/AVX2 code ?

ironMann · 2016-06-28T21:14:30Z

@rdolbeau I admit compromises had to be made. Parity gen could be done in self-contained asm blocks, but that approach would get too unruly for the reconstruction path. I refactored the routines in a way that makes sense for x86 and the kernel ABI. If something else is needed, and final result can be achieved by tweaking compiler flags, I would say it is a way to go. Kernel build adds dozens of compiler flags anyway.

Everything could be rewritten in intrinsics, to get something like the new scalar implementation. This would again add more changes to the build flow. Note that linux kernel raid6 does this for neon linux/neon.uc

I'm fine with other changes. Also if mult. table is the same as one for SSSE3 and AVX2, it could be moved to let say vdev_raidz_math.c .

rdolbeau · 2016-06-29T06:02:17Z

> If something else is needed, and final result can be achieved by tweaking compiler flags,
> I would say it is a way to go. Kernel build adds dozens of compiler flags anyway.

The kernel already has the flags I think. Userland is the question - and Makefiles can easily be changed/overridden without looking at the code, hence my belief it is a fragile solution long-term.

What about C variables? It's a bit of a pain to add, but should add some resiliency.

> Note that linux kernel raid6 does this for neon linux/neon.uc

I missed that one, thanks for the pointer.

> I'm fine with other changes. Also if mult. table is the same as one for SSSE3 and AVX2,
> it could be moved to let say vdev_raidz_math.c .

It is the same table.

Cordially,

ironMann · 2016-06-29T06:19:28Z

> hence my belief it is a fragile solution long-term.
> What about C variables? It's a bit of a pain to add, but should add some resiliency.

Variables become painful when different unrolling is required, as you've seen. I gave up on them mostly because of non-C-standardness. I figured that since the compiler should not do auto-vectorization or other SIMD-related business in these files, we can drop dependency and clobber register tracking.
For a proper fix I would go for intrinsic based solution.

We also have to make sure we don't make #4799 worse.

behlendorf · 2016-09-17T01:24:17Z

@rdolbeau sorry about this PR not getting any attention. Could you rebase it against master to resolve the conflicts.

@ironMann @tuxoko can you please review this since you're already familiar with this area of the code. I'll look it over as well and put it through its paces on an arm system.

rdolbeau · 2016-09-17T09:08:33Z

2016-09-17 3:24 GMT+02:00 Brian Behlendorf:

> @rdolbeau sorry about this PR not getting any attention. Could you rebase
> it against master to resolve the conflicts.

Done. Didn't have time to test everything properly though, sorry.

Cordially,

Romain Dolbeau

rlaager · 2016-09-20T21:48:46Z

lib/libzpool/Makefile.am

@@ -99,6 +99,8 @@ KERNEL_C = \
 	vdev_raidz_math_sse2.c \
 	vdev_raidz_math_ssse3.c \
 	vdev_raidz_math_avx2.c \
+        vdev_raidz_math_neon.c \


This (and the next line) use spaces. They should be tabs. (I checked the raw file; this isn't just the GitHub web reviewer being wrong).

rlaager · 2016-09-21T14:46:31Z

I marked my review as approved (so it isn't blocking), but I'm not actually approving this pull request. I'm not familiar enough with the code in question.

ironMann

Looks good. Some changes required, and I would suggest only adding unrolled version for aarch64 (unless contraindicated). Also naming should better reflect the ISA, and zfs parameter man page must be updated. It would be interesting to see performance results from /proc/spl/kstat/zfs/vdev_raidz_bench (and also include them in commit message)

ironMann · 2016-09-21T14:52:57Z

module/zfs/vdev_raidz_math_neon2.c

+	.gen = RAIDZ_GEN_METHODS(neon2),
+	.rec = RAIDZ_REC_METHODS(neon2),
+	.is_supported = &raidz_will_neon2_work,
+	.name = "neon2"


@rdolbeau So far we have one implementation per instruction set, and each of them uses as many regs as are available.
Both of these are AArch64 NEON (ARMv8) ISA, which is required to have 32 regs. Do you anticipate some CPUs that will run the non-unrolled version faster?
Also, the name should reflect this, because an ARMv7 implementation is possible. Maybe "aarch64-neon" and "aarch64-neon_x2" for unrolled.

The actual performance in NEON will vary between CPU implementations, and those are more varied than in the x86-64 world. More unrolling should be at least as fast, but as a trade-off you get a potentially longer tail loop for blocks of a sub-optimal size. So in theory, one should pick the smallest unrolling that gets maximum throughput... not easy.
I suspect neon2 would be a good default in most cases.
I don't have hard numbers to supply unfortunately, the systems where I have kernel headers to compile are under NDA :-(
I have access to a Jetson TX1 as well, but when I do 'make deb' (it's ubuntu-based), I get this ultimately:

make[1]: Leaving directory `/home/ubuntu/spl'
name=spl; \
version=0.7.0-rc1; \
arch=`rpm -qp ${name}-${version}.src.rpm --qf %{arch} | tail -1`;
pkg1=${name}-${version}.${arch}.rpm;
fakeroot alien --bump=0 --scripts --to-deb $pkg1;
rm -f $pkg1
spl-0.7.0-rc1.aarch64.rpm is for architecture aarch64 ; the package cannot be built on this system

Weird...

Smallest block size is 512 and all possible sizes are multiple of that, or of larger power of 2. There's no need for tail loop fixup, as long as your stride divides 512. So 2x unrolled version will not incur that penalty.
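
The divisibility argument can be checked with a small sketch. The 32-byte stride (two 16-byte vectors per iteration) is an assumption about what the 2x-unrolled variant processes, and `xor_strided` is a hypothetical name; the inner byte loop stands in for the vector operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512			/* smallest block size */
#define VEC_BYTES   16			/* one 128-bit register */
#define UNROLL      2			/* assumed unroll factor */
#define STRIDE      (VEC_BYTES * UNROLL)

/* If this holds, every valid length is a whole number of strides. */
_Static_assert(SECTOR_SIZE % STRIDE == 0, "stride must divide 512");

/* XOR src into dst, STRIDE bytes per iteration; returns iterations run. */
static size_t
xor_strided(uint8_t *dst, const uint8_t *src, size_t len)
{
	size_t iters = 0;

	assert(len % SECTOR_SIZE == 0);
	for (size_t off = 0; off < len; off += STRIDE, iters++) {
		for (size_t b = 0; b < STRIDE; b++)
			dst[off + b] ^= src[off + b];
	}
	/* No tail loop: off lands exactly on len. */
	return (iters);
}
```

Any stride that is a power of two up to 512 passes the static assertion, so further unrolling would only need a tail fixup if the stride stopped dividing 512.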

They are renamed now.

Please also add a line to the zfs-parameter man page. Unfortunately, these will not be selectable on module load, even though they are guaranteed to work by the aarch64 ISA, but only after the zfs module is loaded. If this is a big deal we can address it later.

man page modified

ironMann · 2016-09-21T14:55:17Z

module/zfs/vdev_raidz_math_neon_common.h

+
+/* Overkill... */
+#if defined(_KERNEL)
+#define	GEN_X_DEFINE_ALL() \


Please make sure this patch builds with CFLAGS="-O0" and --enable-debug. Compiler might leave all of these on stack in that case, which would break the build.

With -O0 and --enable-debug, it explodes since apparently "-Werror=unused-variable" is enabled in this case - and I haven't had time to track down every combination of macros to define just the right subset of variables.

It would be good if those are tracked down. There's also -Wframe-larger-than that's known to create issues with -O0 because the stack is not adjusted at all on that opt level.

It should mostly work now even at -O0
"the frame size of 1104 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]" - the default is 1024. If I put 1280, then the error goes away (obviously :-) and then everything seems fine with both -O0 and --enable-debug.

We have run into this problem before, with -O0. IMO, you can apply the same workaround, because it's an artificial frame, and in a leaf function. See this pragma for fix. We don't want to break the unoptimized build, and the low frame size restriction is valid for other code paths.
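
As an illustration of that kind of workaround (a sketch, not the exact lines added to this patch), GCC's diagnostic pragma can silence the frame-size warning for one source file. The function `big_frame_leaf` and its 1100-byte buffer are hypothetical, chosen only to produce a frame larger than 1024 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Silence -Wframe-larger-than for this file only. The guard is an
 * assumption to keep other compilers from seeing a GCC-specific option.
 */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic ignored "-Wframe-larger-than="
#endif

/* Hypothetical leaf function whose locals exceed a 1024-byte limit. */
static uint8_t
big_frame_leaf(uint8_t seed)
{
	uint8_t scratch[1100];
	uint8_t acc = 0;

	assert(sizeof (scratch) > 1024);
	memset(scratch, seed, sizeof (scratch));
	for (size_t i = 0; i < sizeof (scratch); i++)
		acc = (uint8_t)(acc + scratch[i]);
	return (acc);
}
```

Keeping the pragma in the one file that needs it leaves the frame-size limit in force for the rest of the tree.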

I've added the pragma

ironMann · 2016-09-21T15:05:11Z

include/linux/simd_aarch64.h

+#include <sys/types.h>
+
+#if defined(_KERNEL)
+#include <asm/cpufeature.h>


For aarch64 you need to include asm/neon.h, define kfpu_begin() as kernel_neon_begin(), and kfpu_end() as kernel_neon_end()

<asm/fpu/api.h> is an x86 thing.

Should be fixed.
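
A minimal sketch of the mapping requested above follows. The kernel branch uses the names given in the review comment; the userland no-op fallback and the caller `xor_bytes_fpu_guarded` are assumptions added for illustration.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_KERNEL)
#include <asm/neon.h>
/* In the kernel, NEON use must be bracketed explicitly. */
#define kfpu_begin()	kernel_neon_begin()
#define kfpu_end()	kernel_neon_end()
#else
/* Assumed userland fallback: nothing to save or restore. */
#define kfpu_begin()	do {} while (0)
#define kfpu_end()	do {} while (0)
#endif

/* Hypothetical caller: any vector work sits between begin and end. */
static uint8_t
xor_bytes_fpu_guarded(const uint8_t *buf, int n)
{
	uint8_t acc = 0;

	assert(n >= 0);
	kfpu_begin();
	for (int i = 0; i < n; i++)
		acc ^= buf[i];
	kfpu_end();
	return (acc);
}
```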

ironMann · 2016-09-21T15:13:39Z

cmd/raidz_test/raidz_test.h

@@ -34,6 +34,8 @@ static const char *raidz_impl_names[] = {
 	"sse2",
 	"ssse3",
 	"avx2",
+	"neon",
+	"neon2",


I don't have any hardware to test this patch, but it looks good. Please make sure you run cmd/raidz_test/raidz_test -S to unit-test this on supporting hardware. You can add -v for more verbose output. This is essentially what raidz_002_pos test will run for 5min, but we don't have any arm64 testers either.
also: see the other comment about naming, to avoid potential confusion with arm32 neon.

I've run raidz_test -S on a variety of hardware with no issue, since it's a pure userland binary it's much easier.
-v is not much use, since the raidz_test is multithreaded... and some of those systems have a lot of cores ;-) - the output is quite mangled.

ironMann · 2016-09-22T16:16:41Z

This LGTM 👍 . Too bad there are no arm64 tester machines...
@rdolbeau It would be better to squash raid-z related commits, and maybe leave simd_aarch64.h related one separate.
I'm still curious about what improvements you achieve :)

behlendorf

Architecturally this LGTM. I should have access to some aarch64 hardware in a week or two to get performance results from and do a little manual testing. Assuming everything looks good I'll merge this then. I'll also add the aarch64 hardware to the buildbot to improve the automated coverage.

rdolbeau · 2016-09-22T17:49:59Z

> @rdolbeau It would be better to squash raid-z related commits, and maybe
> leave simd_aarch64.h related one separate.

I've merged everything into a single commit with a slightly updated commit
message.

If I need to drop neon (and only keep neonx2), then I think I'll keep that
as a separate commit so that the code remains accessible in the future.

> I'm still curious about what improvements you achieve :)

If I could figure out why I can't build the packages on ubuntu (see an
earlier message), I could give you hard numbers on A57s :-/
They get picked in /proc/spl/kstat/zfs/vdev_raidz_bench, but some Aarch64
CPUs have pretty good scalar performance and only 64-bit NEON; obviously
the speed-up won't be very high on those CPUs.

Cordially,

Romain Dolbeau

behlendorf · 2016-09-22T18:01:41Z

> If I could figure out why I can't build the packages on ubuntu (see an earlier message),

> spl-0.7.0-rc1.aarch64.rpm is for architecture aarch64 ; the package cannot be built on this system

That's strange. Hopefully someone will find some time to look into it. But in the meanwhile you don't actually need to build packages. You can use the in-tree build directions described on the wiki. Then just run ./scripts/zfs.sh to load the modules and cat /proc/spl/kstat/zfs/vdev_raidz_bench.

rdolbeau · 2016-09-22T18:29:30Z

2016-09-22 20:01 GMT+02:00 Brian Behlendorf:

> But in the meanwhile you don't actually need to build packages.

For some reason I thought I needed to install spl.

Anyway, I ran into a different kind of problem:

CC [M] /home/ubuntu/rdolbeau_zfs/module/zfs/zpl_ctldir.o
/home/ubuntu/rdolbeau_zfs/module/zfs/zpl_ctldir.c: In function ‘zpl_root_iterate’:
/home/ubuntu/rdolbeau_zfs/module/zfs/zpl_ctldir.c:60:2: error: implicit declaration of function ‘dir_emit_dots’ [-Werror=implicit-function-declaration]
if (!dir_emit_dots(filp, ctx))
^
cc1: some warnings being treated as errors

The box is running "3.10.96-tegra", and google tells me 'dir_emit_dots'
showed up in 3.11? And since, as with most ARM boxes :-(, you're sort of
locked into whatever the manufacturer supplies you, I can't update the
kernel (even if I could, our admins probably wouldn't let me mess with their
carefully deployed cluster anyway :-)

... but I sort of thought, that kernel would be recent enough to run ZFS?
zfsonlinux.org claims compatibility with some 2.6? I must be missing
something.

Cordially,

Romain Dolbeau

rdolbeau · 2016-09-22T18:45:47Z

2016-09-22 20:29 GMT+02:00 Romain Dolbeau:

> /home/ubuntu/rdolbeau_zfs/module/zfs/zpl_ctldir.c:60:2: error: implicit
> declaration of function ‘dir_emit_dots’

After replacing 3 instances of 'dir_emit_dots' by the (hopefully :-)
equivalent 'dir_emit' [patch attached], I get some hard numbers on a
Jetson TX1 with 4x A57 @ 1.73 GHz:

ubuntu@lionheart30:~$ cat /proc/spl/kstat/zfs/vdev_raidz_bench
14 0 0x01 -1 0 3033321603799016 3033322707202238
implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr
original         374238204       108299160       41839226        403662514       79127510        12197176        16418637        6238009         6238642         4288504
scalar           106272331       79084683        51825239        113116426       62718695        48893024        53620500        44389129        31374830        27725964
aarch64_neon     610502681       296341328       116622748       544914564       211054470       165422137       156541643       131338726       87000925        74783358
aarch64_neonx2   706619294       296481412       165886500       545474897       210893224       165384561       187494049       145008566       108352914       92893608
fastest          aarch64_neonx2  aarch64_neonx2  aarch64_neonx2  aarch64_neonx2  aarch64_neon    aarch64_neon    aarch64_neonx2  aarch64_neonx2  aarch64_neonx2  aarch64_neonx2

Cordially,

Romain Dolbeau

behlendorf · 2016-09-22T18:53:16Z

Yeah in practice the kernel version number doesn't tell you much about the kernel. Enterprise kernels are terrible in this regard, only being surpassed by manufacturer supplied arm kernels! It looks like in this case the configure checks didn't get things quite right.

This reuses the framework established for SSE2, SSSE3 and AVX2. However, GCC is using FP registers on Aarch64, so unlike SSE/AVX2 we can't rely on the registers being left alone between ASM statements. So instead, the NEON code uses C variables and GCC extended ASM syntax. Note that since the kernel explicitly disables vector registers, they have to be locally re-enabled explicitly.

As we use the variable's number to define the symbolic name, and GCC won't allow duplicate symbolic names, numbers have to be unique, even when the code is not going to be used (e.g. the case for 4 registers when using the macro with only 2). Only the actually used variables should be declared, otherwise the build will fail in debug mode.

This requires the replacement of the XOR(X,X) syntax by a new ZERO(X) macro, which does the same thing but without repeating the argument. And perhaps someday there will be a machine where there is a more efficient way to zero a register than XOR with itself. This affects scalar, SSE2, SSSE3 and AVX2 as they need the new macro.

It's possible to write faster implementations (different scheduling, different unrolling, interleaving NEON and scalar, ...) for various cores, but this one has the advantage of fitting in the current state of the code, and thus is likely easier to review/check/merge.

The only difference between aarch64-neon and aarch64-neonx2 is that aarch64-neonx2 unrolls some functions some more.

rdolbeau · 2016-09-30T06:48:56Z

Rebased.

behlendorf · 2016-09-30T21:16:12Z

@rdolbeau thanks. My aarch64 hardware arrived yesterday so I'll give your patch a spin over the weekend and if all looks good we can get it merged.

ironMann · 2016-10-01T11:34:03Z

@rdolbeau One last thing, could you please increase kstat header and column sizes (from 12) so that aarch64_neonx2 fits with some spare space (in raidz_math_kstat_headers () and raidz_math_kstat_data())
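
The effect of the column width can be reproduced with a small sketch. `format_cell` is a hypothetical helper, not one of the kstat functions named above, and the width of 16 is only an assumed replacement for 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Left-justify text in a column of the given width.
 * Returns the number of characters the cell occupies.
 */
static int
format_cell(char *buf, size_t size, int width, const char *text)
{
	assert(width > 0);
	return (snprintf(buf, size, "%-*s", width, text));
}
```

With a width of 12, the 14-character name aarch64_neonx2 gets no padding at all, so adjacent cells run together as in the "fastest" row of the earlier Jetson TX1 output; with 16 it is followed by two spaces.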

Related to this, the platform might also benefit from a Fletcher4 NEON implementation. The code should be very similar to the ssse3 version (when PR #5164 lands).

behlendorf · 2016-10-02T17:40:56Z

Things are looking good in my local testing. No issues observed with multiple full sweeps of raidz_test, zloop.sh, or through normal usage of the filesystem. Unless anyone else has concerns I'll get this merged tomorrow.

The only surprising thing so far has been that the original implementation significantly outperforms the scalar implementation for some of the benchmarks. Since we always pick the fastest version available this isn't an issue. But I think it is a nice illustration of how optimizing for one architecture can negatively impact others.

implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr         
original         167379518       57881107        28023603        169461606       29013464        4189038         8989947         2438610         2540972         1716722         
scalar           27674740        27405707        25523297        30319685        23672195        23675958        22972115        23021825        17095114        12282135        
aarch64_neon     253769603       156215032       71931986        230271641       125803355       102654140       71169286        63563620        49951524        44429830        
aarch64_neonx2   317857040       156045525       98360721        230976900       125668264       102536359       104731363       87928541        65965877        57027894        
fastest          aarch64_neonx2  aarch64_neon    aarch64_neonx2  aarch64_neonx2  aarch64_neon    aarch64_neon    aarch64_neonx2  aarch64_neonx2  aarch64_neonx2  aarch64_neonx2

rdolbeau · 2016-10-02T17:49:04Z

@behlendorf Which kind of core does your hardware use? The NEON numbers are quite good compared to the original/scalar, so probably something with the full 128 bits, i.e. neither an X-Gene nor a ThunderX.

... so I guess the question really is, A53 or A57? :-) (though at least Qualcomm and Apple have custom cores as well).

Edit: indeed, the relative ratios between original & scalar for gen_p and rec_pqr are surprising.

behlendorf · 2016-10-02T18:10:51Z

@rdolbeau I'm using an ODROID-C2 as a test system, I've added it as an aarch64 builder to the buildbot. It uses an Amlogic ARM® Cortex®-A53(ARMv8) 1.5Ghz quad core CPU.

Edit: @ironMann due to the discrepancy between scalar and original we may want to keep original around.

ironMann · 2016-10-02T18:42:54Z

@behlendorf The new framework performs calculations row-wise, in contrast to the original column-wise calculation. This is to make less impact on CPU caches and to better utilize CPU core I/O bandwidth. But for this to work, the CPU hardware prefetcher must do its part. This usually works for a typical consumer or server grade x86 CPU. This effect is more evident on less computationally complex operations, like gen_p and rec_p. I've even seen an ARM CPU with hw prefetchers disabled due to an erratum (iMX6 Quad).

However, ABD forces a change back to column-based calculations, due to kmap() peculiarities. I expect scalar and original to be within a few percent of each other. Another option is to do calculations page-by-page on raid-z columns, which might be better for CPU caches for large IOs.
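
One possible reading of the two traversal orders, for P parity only, is sketched below. The function names are hypothetical and the byte-wise loops stand in for the vectorized code; both produce the same parity and differ only in memory access pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Column-wise: stream each data column in full into the parity buffer. */
static void
gen_p_columnwise(uint8_t *p, const uint8_t *const cols[], size_t ncols,
    size_t len)
{
	assert(ncols > 0);
	memcpy(p, cols[0], len);
	for (size_t c = 1; c < ncols; c++) {
		for (size_t i = 0; i < len; i++)
			p[i] ^= cols[c][i];	/* p is re-read per column */
	}
}

/* Row-wise: for each offset, combine all columns, then store p once. */
static void
gen_p_rowwise(uint8_t *p, const uint8_t *const cols[], size_t ncols,
    size_t len)
{
	assert(ncols > 0);
	for (size_t i = 0; i < len; i++) {
		uint8_t acc = cols[0][i];
		for (size_t c = 1; c < ncols; c++)
			acc ^= cols[c][i];	/* one stream per column */
		p[i] = acc;
	}
}
```

The row-wise form writes each parity byte once but reads from every column at the same time, which is where a working hardware prefetcher matters.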

behlendorf · 2016-10-03T17:57:43Z

@ironMann thanks for the explanation for the performance discrepancy. We'll have to see how the ABD changes impact performance. In the meanwhile this has been merged. @rdolbeau thank you for sticking with this patch and keeping it up to date!

rdolbeau · 2016-10-03T18:03:31Z

Nice, my first commit in ZFS :-)

Now, onward with the weird hardware... AVX512 is in #5219

fire · 2016-11-27T06:10:54Z

@rdolbeau Can you repost the dir_emit_dots patch?

rdolbeau · 2016-11-27T11:55:38Z

Here it is
dir_emit_dots.patch.zip

fire · 2016-11-27T15:53:43Z

Zfs seems to work on tegra tx1. Will you submit the patch so that future systems using an older kernel will be compatible?

Also there is a bit of duplication in the patch.

Thank you!

behlendorf · 2016-11-29T19:18:20Z

@rdolbeau @fire there already exists compatibility code for dir_emit_dots() in include/sys/zpl.h. However, it only gets enabled when neither HAVE_VFS_ITERATE nor HAVE_VFS_ITERATE_SHARED is defined. So what really needs to happen is that a new configure check should be added explicitly to detect dir_emit_dots().

rdolbeau mentioned this pull request Jun 26, 2016

NEON vectorized RAID-Z1/2/3 parity computation for ARM64. #4488

Closed

rdolbeau force-pushed the simd-neon branch from 6654e60 to 2c719b1 Compare July 5, 2016 07:31

rdolbeau force-pushed the simd-neon branch 2 times, most recently from fba9cea to 2ff52b1 Compare July 14, 2016 11:39

rdolbeau mentioned this pull request Jul 14, 2016

[RFC] SIMD implementation of vdev_raidz generate and reconstruct routines #4328

Closed

rdolbeau force-pushed the simd-neon branch from 2ff52b1 to 7b4e024 Compare September 17, 2016 06:23

rlaager suggested changes Sep 20, 2016

rdolbeau force-pushed the simd-neon branch from 7b4e024 to cce3a14 Compare September 21, 2016 13:41

rlaager approved these changes Sep 21, 2016

ironMann suggested changes Sep 21, 2016

rdolbeau force-pushed the simd-neon branch from 49a6603 to ffb90b7 Compare September 22, 2016 17:09

behlendorf approved these changes Sep 22, 2016

rdolbeau force-pushed the simd-neon branch from ffb90b7 to 070a652 Compare September 30, 2016 06:45

Widen fields in kstat/zfs/vdev_raidz_bench to fit longer names such a…

56a70a5

…s aarch64_neonx2

behlendorf merged commit 62a65a6 into openzfs:master Oct 3, 2016
