Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

This is the parity generation/rebuild using 128-bits NEON for Aarch64. #4801

Merged
merged 2 commits into from
Oct 3, 2016

Conversation

rdolbeau
Copy link
Contributor

This re-use the framework established for SSE and AVX2.
However, GCC is using FP registers on Aarch64, so unlike
SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax.

As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2).

This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE and AVX2 as they need the new macro.

It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.

@ironMann
Copy link
Contributor

GCC is using FP registers on Aarch64, so unlike
SSE/AVX2 we can't rely on the registers being left alone
between ASM statements.

This is technically true for SSE/AVX2 in userspace build, and IMO, we should address this by adding -mno-sse -mno-avx2 ... compiler flags, akin to kernel build flags.
Would that work here, with -mgeneral-regs-only -mnofp -mnosimd?

@rdolbeau
Copy link
Contributor Author

This is technically true for SSE/AVX2 in userspace build,

I think GCC will not use SSE registers unless you really have FP, or you're compiling at -O3 or higher. OTOH, this could change and may not apply to other compilers...

My personal preference is to not rely on the compiler's behavior, and either

  1. write self-contained ASM with clobbers (my original implementations, mostly);
  2. explicit all inter-ASM dependencies using C variables and extended ASM syntax (this code);
  3. avoid ASM and use built-ins (my favorite, not applicable to the kernel unfortunately).

Compiler flags could be used, but that's a fragile solution IMHO.

BTW, are you otherwise OK with my small changes to the SSE/AVX2 code ?

@ironMann
Copy link
Contributor

@rdolbeau I admit compromises had to be made. Parity gen could be made in self-contained asm blocks, but that approach would get too unruly for reconstruction path. I refactored routines in a way that make sense for x86 and kernel ABI. If something else is needed, and final result can be achieved by tweaking compiler flags, I would say it is a way to go. Kernel build adds dozens of compiler flags anyway.

Everything could be rewritten in intrinsics, to get something like new scalar implementation. This would again add more changes to build flow. Note that linux kernel raid6 does this for neon linux/neon.uc

I'm fine with other changes. Also if mult. table is the same as one for SSSE3 and AVX2, it could be moved to let say vdev_raidz_math.c .

@rdolbeau
Copy link
Contributor Author

If something else is needed, and final result can be achieved by tweaking compiler flags,
I would say it is a way to go. Kernel build adds dozens of compiler flags anyway.

The kernel already has the flags I think. Userland is the question - and Makefiles can easily be changes/overridden without looking at the code, hence my belief it is a fragile solution long-term.

What about C variables ? It' a bit of a pain to add, but should add some resiliency.

Note that linux kernel raid6 does this for neon linux/neon.uc

I missed that one, thanks for the pointer.

I'm fine with other changes. Also if mult. table is the same as one for SSSE3 and AVX2,
it could be moved to let say vdev_raidz_math.c .

It is the same table.

Cordially,

@ironMann
Copy link
Contributor

hence my belief it is a fragile solution long-term.
What about C variables ? It' a bit of a pain to add, but should add some resiliency.

Variables become painful when different unrolling is required, as you've seen. I gave up on them mostly because of non-C-standardness. I figured since compiler should not do auto vectorization or other SIMD related business in these files, we can drop dependency and clobber register tracking.
For a proper fix I would go for intrinsic based solution.

We also have to make sure we don't make #4799 worse.

@behlendorf
Copy link
Contributor

@rdolbeau sorry about this PR not getting any attention. Could you rebase it against master to resolve the conflicts.

@ironMann @tuxoko can you please review this since you're already familiar with this area of the code. I'll look it over as well and put it through its paces on an arm system.

@rdolbeau
Copy link
Contributor Author

2016-09-17 3:24 GMT+02:00 Brian Behlendorf [email protected]:

@rdolbeau https://github.com/rdolbeau sorry about this PR not getting
any attention. Could you rebase it against master to resolve the conflicts.

Done. Didn't have time to test everything properly though, sorry.

Cordially,

Romain Dolbeau

@@ -99,6 +99,8 @@ KERNEL_C = \
vdev_raidz_math_sse2.c \
vdev_raidz_math_ssse3.c \
vdev_raidz_math_avx2.c \
vdev_raidz_math_neon.c \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and the next line) use spaces. They should be tabs. (I checked the raw file; this isn't just the GitHub web reviewer being wrong).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@rlaager
Copy link
Member

rlaager commented Sep 21, 2016

I marked my review as approved (so it isn't blocking), but I'm not actually approving this pull request. I'm not familiar enough with the code in question.

Copy link
Contributor

@ironMann ironMann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some changes required, and I would suggest only adding unrolled version for aarch64 (unless contraindicated). Also naming should better reflect the ISA, and zfs parameter man page must be updated. It would be interesting to see performance results from /proc/spl/kstat/zfs/vdev_raidz_bench (and also include them in commit message)

.gen = RAIDZ_GEN_METHODS(neon2),
.rec = RAIDZ_REC_METHODS(neon2),
.is_supported = &raidz_will_neon2_work,
.name = "neon2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdolbeau So far we have one implementation per instruction set, and each of them use as much regs as available.
Both of these are AArch64 NEON (ARMv8) ISA, which is required to have 32 regs. Do you anticipate some CPUs that will run non-unrolled version faster?
Also, the name should reflect this, because ARMv7 implementation is possible. Maybe "aarch64-neon" and "aarch64-neon_x2" for unrolled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual performance in NEON will vary between CPU implementations, and those are more varied than in the x86-64 world. More unrolling should be at least as fast, but as a trade-off you get a potentially longer tail loop for blocks of a sub-optimal size. So in theory, one should pick the smallest unrolling that get maximum throughput... not easy.
I suspect neon2 would be a good default in most cases.
I don't have hard number to supply unfortunately, the systems where I have kernel headers to compile are under NDA :-(
I have access to a Jetson TX1 as well, but when I do 'make deb' (it's ubuntu-based), I get this ultimately:

make[1]: Leaving directory /home/ubuntu/spl' name=spl; \ version=0.7.0-rc1; \ arch=rpm -qp ${name}-${version}.src.rpm --qf %{arch} | tail -1`;
pkg1=${name}-${version}.${arch}.rpm;
fakeroot alien --bump=0 --scripts --to-deb $pkg1;
rm -f $pkg1
spl-0.7.0-rc1.aarch64.rpm is for architecture aarch64 ; the package cannot be built on this system

Weird...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smallest block size is 512 and all possible sizes are multiple of that, or of larger power of 2. There's no need for tail loop fixup, as long as your stride divides 512. So 2x unrolled version will not incur that penalty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are renamed now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add a line to zfs-parameter man page. Unfortunately, these will not be selectable on module load, even though they are guarantied to work by aarch64 ISA, but only after zfs module is loaded. If this is a big deal we can address it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

man page modified


/* Overkill... */
#if defined(_KERNEL)
#define GEN_X_DEFINE_ALL() \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure this patch builds with CFLAGS="-O0" and --enable-debug. Compiler might leave all of these on stack in that case, which would break the build.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With -O0 and --enable-debug, it explodes since apparently "-Werror=unused-variable" is enabled in this case - and I haven't had time to track down every combinations of macros to define just the right subset of variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good if those are tracked down. There's also -Wframe-larger-than that's known to create issues with -O0 because the stack is not adjusted at all on that opt level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should mostly work now even at -O0
"the frame size of 1104 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]" - the default is 1024. If I put 1280, then the error go away (obviously :-) and then everything seems fine with both -O0 and --enable-debug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have run into this problem before, with -O0. IMO, you can apply same workaround, because it's artificial frame, and in a leaf function. See this pragma for fix. We don't want to break unoptimized build, and low frame size restriction is valid for other code paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the pragma

#include <sys/types.h>

#if defined(_KERNEL)
#include <asm/cpufeature.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For aarch64 you need to include asm/neon.h, define kfpu_begin() as kernel_neon_begin(), and kfpu_end() as kernel_neon_end()

<asm/fpu/api.h> is a x86 thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed.

@@ -34,6 +34,8 @@ static const char *raidz_impl_names[] = {
"sse2",
"ssse3",
"avx2",
"neon",
"neon2",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any hardware to test this patch, but it looks good. Please make sure you run cmd/raidz_test/raidz_test -S to unit-test this on supporting hardware. You can add -v for more verbose output. This is essentially what raidz_002_pos test will run for 5min, but we don't have any arm64 testers either.
also: see the other comment about naming, to avoid potential confusion with arm32 neon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run raidz_test -S on a variety of hardware with no issue, since it's a pure userland binary it's much easier.
-v is not much use, since the raidz_test is multithreaded... and some of those systems have a lot of cores ;-) - the output is quite mangled.

@ironMann
Copy link
Contributor

This LGTM 👍 . Too bad there are no arm64 tester machines...
@rdolbeau It would be better to squash raid-z related commits, and maybe leave simd_aarch64.h related one separate.
I'm still curious about what improvements you achieve :)

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architecturally this LGTM. I should have access to some aarch64 hardware in a week or two to get performance results from and do a little manual testing. Assuming everything looks good I'll merge this then. I'll also add the aarch64 hardware to the buildbot to improve the automated coverage.

@rdolbeau
Copy link
Contributor Author

@rdolbeau It would be better to squash raid-z related commits, and maybe
leave simd_aarch64.h related one separate.

I've merged everything into a single commit with a slightly updated commit
message.

If I need to drop neon (and only keep neonx2), then I think I'll keep that
as a separate commit so that the code remains accessible in the future.

I'm still curious about what improvements you achieve :)

If I could figure out why I can't build the packages on ubuntu (see an
earlier message), I could give you hard numbers on A57s :-/
They get picked in /proc/spl/kstat/zfs/vdev_raidz_bench, but some Aarch64
CPU have pretty good scalar performance and only 64 bits NEON, obviously
the speed-up won't bet very high on those CPUs.

Cordially,

Romain Dolbeau

@behlendorf
Copy link
Contributor

behlendorf commented Sep 22, 2016

If I could figure out why I can't build the packages on ubuntu (see an earlier message),

spl-0.7.0-rc1.aarch64.rpm is for architecture aarch64 ; the package cannot be built on this system

That's strange. Hopefully someone will find some time to look in to it. But in the meanwhile you don't actually need to build packages. You can use the in-tree build directions described on the wiki. Then just run ./scripts/zfs.sh to load the modules and cat /proc/spl/kstat/zfs/vdev_raidz_bench.

@rdolbeau
Copy link
Contributor Author

2016-09-22 20:01 GMT+02:00 Brian Behlendorf [email protected]:

But in the meanwhile you don't actually need to build packages.

For some reason I thought I needed to install spl.

Anyway, I ran into a different kind of problem:

CC [M] /home/ubuntu/rdolbeau_zfs/module/zfs/zpl_ctldir.o
/home/ubuntu/rdolbeau_zfs/module/zfs/zpl_ctldir.c: In function
‘zpl_root_iterate’:
/home/ubuntu/rdolbeau_zfs/module/zfs/zpl_ctldir.c:60:2: error: implicit
declaration of function ‘dir_emit_dots’
[-Werror=implicit-function-declaration]
if (!dir_emit_dots(filp, ctx))
^
cc1: some warnings being treated as errors

The box is running "3.10.96-tegra", and google tells me 'dir_emit_dots'
showed up in 3.11? And since, as for most ARM box :-(, you're sort of
locked into whatever the manufacturer supplies you, I can't update the
kernel (even if I could our admins probably wouldn't let me mess with their
carefully deployed cluster anyway :-)

... but I sort of thought, that kernel would be recent enough to run ZFS?
zfsonlinux.org claims compatibility with some 2.6? I must be missing
something.

Cordially,

Romain Dolbeau

@rdolbeau
Copy link
Contributor Author

2016-09-22 20:29 GMT+02:00 Romain Dolbeau [email protected]:

/home/ubuntu/rdolbeau_zfs/module/zfs/zpl_ctldir.c:60:2: error: implicit
declaration of function ‘dir_emit_dots’

After replacing 3 instances of 'dir_emit_dots' by the (hopefully :-)
equivalent 'dir_emit' [patch attached], I get some hard numbers on a
Jetson TX1 with 4x A57 @ 1.73 GHz:

ubuntu@lionheart30:~$ cat /proc/spl/kstat/zfs/vdev_raidz_bench
14 0 0x01 -1 0 3033321603799016 3033322707202238
implementation gen_p gen_pq gen_pqr rec_p rec_q
rec_r rec_pq rec_pr rec_qr rec_pqr
original 374238204 108299160 41839226 403662514
79127510 12197176 16418637 6238009 6238642 4288504
scalar 106272331 79084683 51825239 113116426
62718695 48893024 53620500 44389129 31374830 27725964
aarch64_neon 610502681 296341328 116622748 544914564
211054470 165422137 156541643 131338726 87000925 74783358
aarch64_neonx2 706619294 296481412 165886500 545474897
210893224 165384561 187494049 145008566 108352914 92893608
fastest
aarch64_neonx2aarch64_neonx2aarch64_neonx2aarch64_neonx2aarch64_neonaarch64_neonaarch64_neonx2aarch64_neonx2aarch64_neonx2aarch64_neonx2

Cordially,

Romain Dolbeau

@behlendorf
Copy link
Contributor

Yeah in practice the kernel version number doesn't tell you much about the kernel. Enterprise kernels are terrible in this regard, only being surpassed by manufacturer supplied arm kernels! It looks like in this case the configure checks didn't get things quite right.

This re-use the framework established for SSE2, SSSE3 and AVX2.
However, GCC is using FP registers on Aarch64, so unlike
SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax. Note that since
the kernel explicitely disable vector registers, they
have to be locally re-enabled explicitely.

As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2). Only the actually used
variables should be declared, otherwise the build
will fails in debug mode.

This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE2, SSSE3 and AVX2 as they need the new macro.

It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.

The only difference between aarch64-neon and aarch64-neonx2
is that aarch64-neonx2 unroll some functions some more.
@rdolbeau
Copy link
Contributor Author

Rebased.

@behlendorf
Copy link
Contributor

@rdolbeau thanks. My aarch64 hardware arrived yesterday so I'll give your patch a spin over the weekend and if all looks good we can get it merged.

@ironMann
Copy link
Contributor

ironMann commented Oct 1, 2016

@rdolbeau One last thing, could you please increase kstat header and column sizes (from 12) so that aarch64_neonx2 fits with some spare space (in raidz_math_kstat_headers () and raidz_math_kstat_data())

Related to this, the platform might also benefit from Fletcher4 neon implementation. The code should be very similar to ssse3 version (when PR #5164 lands).

@behlendorf
Copy link
Contributor

Things are looking good in my local testing. No issues observed with multiple full sweeps of raidz_test, zloop.sh, or through normal usage of the filesystem. Unless anyone else has concerns I'll get this merged tomorrow.

The only surprising thing so far has been that the original implementation significantly out performs the scalar implementation for some of the benchmarks. Since we always pick the fastest version available this isn't an issue. But I think it is a nice illustration of how optimizing for one architecture can negatively impact others.

implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr         
original         167379518       57881107        28023603        169461606       29013464        4189038         8989947         2438610         2540972         1716722         
scalar           27674740        27405707        25523297        30319685        23672195        23675958        22972115        23021825        17095114        12282135        
aarch64_neon     253769603       156215032       71931986        230271641       125803355       102654140       71169286        63563620        49951524        44429830        
aarch64_neonx2   317857040       156045525       98360721        230976900       125668264       102536359       104731363       87928541        65965877        57027894        
fastest          aarch64_neonx2  aarch64_neon    aarch64_neonx2  aarch64_neonx2  aarch64_neon    aarch64_neon    aarch64_neonx2  aarch64_neonx2  aarch64_neonx2  aarch64_neonx2 

@rdolbeau
Copy link
Contributor Author

rdolbeau commented Oct 2, 2016

@behlendorf Which kind of core does you hardware use? The NEON number are quite good compared to the original/scalar, so probably something with the full 128 bits, i.e. neither a X-Gene nor a ThunderX.

... so I guess the question really is, A53 or A57? :-) (though at least Qualcomm and Apple have custom cores as well).

Edit: indeed, the relative ratio between original & scalar for gen_p and rec_pqr are surprising.

@behlendorf
Copy link
Contributor

behlendorf commented Oct 2, 2016

@rdolbeau I'm using an ODROID-C2 as a test system, I've added it as an aarch64 builder to the buildbot. It uses an Amlogic ARM® Cortex®-A53(ARMv8) 1.5Ghz quad core CPU.

Edit: @ironMann due to the discrepancy between scalar and original we may want to keep original around.

@ironMann
Copy link
Contributor

ironMann commented Oct 2, 2016

@behlendorf New framework performs calculations row-wise, in contrast to original column-wise calculation. This is to make less impact on CPU caches and to better utilize CPU core I/O bandwidth. BUT, this to work, CPU hardware prefetcher must do its part. This usually works for typical consumer or server grade x86 CPU. This effect is more evident on less computationally complex operations, like gen_p and rec_p. I've even seen an ARM CPU with hw prefetchers disabled due to an errata (iMX6 Quad)

However, ABD forces change back to column-based calculations, due to kmap() peculiarities. I expect scalar and original to be within few percents. Another option is to do calculations page-by-page on raid-z columns, which might be better for CPU caches for large IOs.

@behlendorf behlendorf merged commit 62a65a6 into openzfs:master Oct 3, 2016
@behlendorf
Copy link
Contributor

@ironMann thanks for the explanation for the performance discrepancy. We'll have to see how the ABD changes impact performance. In the meanwhile this has been merged. @rdolbeau thank you for sticking with this patch and keeping it up to date!

@rdolbeau
Copy link
Contributor Author

rdolbeau commented Oct 3, 2016

Nice, my first commit in ZFS :-)

Now, onward with the weird hardware... AVX512 is in #5219

@fire
Copy link

fire commented Nov 27, 2016

@rdolbeau Can you repost the dir_emit_dots patch?

@rdolbeau
Copy link
Contributor Author

Here it is
dir_emit_dots.patch.zip

@fire
Copy link

fire commented Nov 27, 2016

Zfs seems to work on tegra tx1. Will you submit the patch so that future systems using an older kernel will be compatible?

Also there is a bit of duplication in the patch.

Thank you!

@behlendorf
Copy link
Contributor

@rdolbeau @fire there already exists compatibility code for dir_emit_dots() in include/sys/zpl.h. However, it only get's enabled when neither HAVE_VFS_ITERATE or HAVE_VFS_ITERATE_SHARED is defined. So what really needs to happen is a new configure check should be added explicitly to detect dir_emit_dots().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants