-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
This is the parity generation/rebuild using 128-bits NEON for Aarch64. #4801
Conversation
This is technically true for SSE/AVX2 in userspace build, and IMO, we should address this by adding |
I think GCC will not use SSE registers unless you really have FP, or you're compiling at -O3 or higher. OTOH, this could change and may not apply to other compilers... My personal preference is to not rely on the compiler's behavior, and either
Compiler flags could be used, but that's a fragile solution IMHO. BTW, are you otherwise OK with my small changes to the SSE/AVX2 code ? |
@rdolbeau I admit compromises had to be made. Parity gen could be made in self-contained asm blocks, but that approach would get too unruly for reconstruction path. I refactored routines in a way that make sense for x86 and kernel ABI. If something else is needed, and final result can be achieved by tweaking compiler flags, I would say it is a way to go. Kernel build adds dozens of compiler flags anyway. Everything could be rewritten in intrinsics, to get something like new scalar implementation. This would again add more changes to build flow. Note that linux kernel raid6 does this for neon linux/neon.uc I'm fine with other changes. Also if mult. table is the same as one for SSSE3 and AVX2, it could be moved to let say |
The kernel already has the flags I think. Userland is the question - and Makefiles can easily be changes/overridden without looking at the code, hence my belief it is a fragile solution long-term. What about C variables ? It' a bit of a pain to add, but should add some resiliency.
I missed that one, thanks for the pointer.
It is the same table. Cordially, |
Variables become painful when different unrolling is required, as you've seen. I gave up on them mostly because of non-C-standardness. I figured since compiler should not do auto vectorization or other SIMD related business in these files, we can drop dependency and clobber register tracking. We also have to make sure we don't make #4799 worse. |
fba9cea
to
2ff52b1
Compare
2016-09-17 3:24 GMT+02:00 Brian Behlendorf [email protected]:
Cordially, Romain Dolbeau |
@@ -99,6 +99,8 @@ KERNEL_C = \ | |||
vdev_raidz_math_sse2.c \ | |||
vdev_raidz_math_ssse3.c \ | |||
vdev_raidz_math_avx2.c \ | |||
vdev_raidz_math_neon.c \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and the next line) use spaces. They should be tabs. (I checked the raw file; this isn't just the GitHub web reviewer being wrong).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
I marked my review as approved (so it isn't blocking), but I'm not actually approving this pull request. I'm not familiar enough with the code in question. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Some changes required, and I would suggest only adding unrolled version for aarch64 (unless contraindicated). Also naming should better reflect the ISA, and zfs parameter man page must be updated. It would be interesting to see performance results from /proc/spl/kstat/zfs/vdev_raidz_bench
(and also include them in commit message)
.gen = RAIDZ_GEN_METHODS(neon2), | ||
.rec = RAIDZ_REC_METHODS(neon2), | ||
.is_supported = &raidz_will_neon2_work, | ||
.name = "neon2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdolbeau So far we have one implementation per instruction set, and each of them use as much regs as available.
Both of these are AArch64 NEON (ARMv8) ISA, which is required to have 32 regs. Do you anticipate some CPUs that will run non-unrolled version faster?
Also, the name should reflect this, because ARMv7 implementation is possible. Maybe "aarch64-neon" and "aarch64-neon_x2" for unrolled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual performance in NEON will vary between CPU implementations, and those are more varied than in the x86-64 world. More unrolling should be at least as fast, but as a trade-off you get a potentially longer tail loop for blocks of a sub-optimal size. So in theory, one should pick the smallest unrolling that get maximum throughput... not easy.
I suspect neon2 would be a good default in most cases.
I don't have hard number to supply unfortunately, the systems where I have kernel headers to compile are under NDA :-(
I have access to a Jetson TX1 as well, but when I do 'make deb' (it's ubuntu-based), I get this ultimately:
make[1]: Leaving directory /home/ubuntu/spl' name=spl; \ version=0.7.0-rc1; \ arch=
rpm -qp ${name}-${version}.src.rpm --qf %{arch} | tail -1`;
pkg1=${name}-${version}.${arch}.rpm;
fakeroot alien --bump=0 --scripts --to-deb $pkg1;
rm -f $pkg1
spl-0.7.0-rc1.aarch64.rpm is for architecture aarch64 ; the package cannot be built on this system
Weird...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smallest block size is 512 and all possible sizes are multiple of that, or of larger power of 2. There's no need for tail loop fixup, as long as your stride divides 512. So 2x unrolled version will not incur that penalty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are renamed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add a line to zfs-parameter man page. Unfortunately, these will not be selectable on module load, even though they are guarantied to work by aarch64 ISA, but only after zfs module is loaded. If this is a big deal we can address it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
man page modified
|
||
/* Overkill... */ | ||
#if defined(_KERNEL) | ||
#define GEN_X_DEFINE_ALL() \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure this patch builds with CFLAGS="-O0" and --enable-debug. Compiler might leave all of these on stack in that case, which would break the build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With -O0 and --enable-debug, it explodes since apparently "-Werror=unused-variable" is enabled in this case - and I haven't had time to track down every combinations of macros to define just the right subset of variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good if those are tracked down. There's also -Wframe-larger-than
that's known to create issues with -O0
because the stack is not adjusted at all on that opt level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should mostly work now even at -O0
"the frame size of 1104 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]" - the default is 1024. If I put 1280, then the error go away (obviously :-) and then everything seems fine with both -O0 and --enable-debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have run into this problem before, with -O0. IMO, you can apply same workaround, because it's artificial frame, and in a leaf function. See this pragma for fix. We don't want to break unoptimized build, and low frame size restriction is valid for other code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the pragma
#include <sys/types.h> | ||
|
||
#if defined(_KERNEL) | ||
#include <asm/cpufeature.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For aarch64 you need to include asm/neon.h
, define kfpu_begin()
as kernel_neon_begin()
, and kfpu_end()
as kernel_neon_end()
<asm/fpu/api.h>
is a x86 thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed.
@@ -34,6 +34,8 @@ static const char *raidz_impl_names[] = { | |||
"sse2", | |||
"ssse3", | |||
"avx2", | |||
"neon", | |||
"neon2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any hardware to test this patch, but it looks good. Please make sure you run cmd/raidz_test/raidz_test -S
to unit-test this on supporting hardware. You can add -v
for more verbose output. This is essentially what raidz_002_pos
test will run for 5min, but we don't have any arm64 testers either.
also: see the other comment about naming, to avoid potential confusion with arm32 neon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've run raidz_test -S on a variety of hardware with no issue, since it's a pure userland binary it's much easier.
-v is not much use, since the raidz_test is multithreaded... and some of those systems have a lot of cores ;-) - the output is quite mangled.
This LGTM 👍 . Too bad there are no arm64 tester machines... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Architecturally this LGTM. I should have access to some aarch64 hardware in a week or two to get performance results from and do a little manual testing. Assuming everything looks good I'll merge this then. I'll also add the aarch64 hardware to the buildbot to improve the automated coverage.
I've merged everything into a single commit with a slightly updated commit If I need to drop neon (and only keep neonx2), then I think I'll keep that
If I could figure out why I can't build the packages on ubuntu (see an Cordially, Romain Dolbeau |
That's strange. Hopefully someone will find some time to look in to it. But in the meanwhile you don't actually need to build packages. You can use the in-tree build directions described on the wiki. Then just run |
2016-09-22 20:01 GMT+02:00 Brian Behlendorf [email protected]:
For some reason I thought I needed to install spl. Anyway, I ran into a different kind of problem: CC [M] /home/ubuntu/rdolbeau_zfs/module/zfs/zpl_ctldir.o The box is running "3.10.96-tegra", and google tells me 'dir_emit_dots' ... but I sort of thought, that kernel would be recent enough to run ZFS? Cordially, Romain Dolbeau |
2016-09-22 20:29 GMT+02:00 Romain Dolbeau [email protected]:
After replacing 3 instances of 'dir_emit_dots' by the (hopefully :-) ubuntu@lionheart30:~$ cat /proc/spl/kstat/zfs/vdev_raidz_bench Cordially, Romain Dolbeau |
Yeah in practice the kernel version number doesn't tell you much about the kernel. Enterprise kernels are terrible in this regard, only being surpassed by manufacturer supplied arm kernels! It looks like in this case the configure checks didn't get things quite right. |
This re-use the framework established for SSE2, SSSE3 and AVX2. However, GCC is using FP registers on Aarch64, so unlike SSE/AVX2 we can't rely on the registers being left alone between ASM statements. So instead, the NEON code uses C variables and GCC extended ASM syntax. Note that since the kernel explicitely disable vector registers, they have to be locally re-enabled explicitely. As we use the variable's number to define the symbolic name, and GCC won't allow duplicate symbolic names, numbers have to be unique. Even when the code is not going to be used (e.g. the case for 4 registers when using the macro with only 2). Only the actually used variables should be declared, otherwise the build will fails in debug mode. This requires the replacement of the XOR(X,X) syntax by a new ZERO(X) macro, which does the same thing but without repeating the argument. And perhaps someday there will be a machine where there is a more efficient way to zero a register than XOR with itself. This affects scalar, SSE2, SSSE3 and AVX2 as they need the new macro. It's possible to write faster implementations (different scheduling, different unrolling, interleaving NEON and scalar, ...) for various cores, but this one has the advantage of fitting in the current state of the code, and thus is likely easier to review/check/merge. The only difference between aarch64-neon and aarch64-neonx2 is that aarch64-neonx2 unroll some functions some more.
Rebased. |
@rdolbeau thanks. My aarch64 hardware arrived yesterday so I'll give your patch a spin over the weekend and if all looks good we can get it merged. |
@rdolbeau One last thing, could you please increase kstat header and column sizes (from 12) so that Related to this, the platform might also benefit from Fletcher4 neon implementation. The code should be very similar to ssse3 version (when PR #5164 lands). |
Things are looking good in my local testing. No issues observed with multiple full sweeps of The only surprising thing so far has been that the original implementation significantly out performs the scalar implementation for some of the benchmarks. Since we always pick the fastest version available this isn't an issue. But I think it is a nice illustration of how optimizing for one architecture can negatively impact others.
|
@behlendorf Which kind of core does you hardware use? The NEON number are quite good compared to the original/scalar, so probably something with the full 128 bits, i.e. neither a X-Gene nor a ThunderX. ... so I guess the question really is, A53 or A57? :-) (though at least Qualcomm and Apple have custom cores as well). Edit: indeed, the relative ratio between original & scalar for gen_p and rec_pqr are surprising. |
@rdolbeau I'm using an ODROID-C2 as a test system, I've added it as an aarch64 builder to the buildbot. It uses an Amlogic ARM® Cortex®-A53(ARMv8) 1.5Ghz quad core CPU. Edit: @ironMann due to the discrepancy between scalar and original we may want to keep original around. |
@behlendorf New framework performs calculations row-wise, in contrast to original column-wise calculation. This is to make less impact on CPU caches and to better utilize CPU core I/O bandwidth. BUT, this to work, CPU hardware prefetcher must do its part. This usually works for typical consumer or server grade x86 CPU. This effect is more evident on less computationally complex operations, like gen_p and rec_p. I've even seen an ARM CPU with hw prefetchers disabled due to an errata (iMX6 Quad) However, ABD forces change back to column-based calculations, due to |
Nice, my first commit in ZFS :-) Now, onward with the weird hardware... AVX512 is in #5219 |
@rdolbeau Can you repost the dir_emit_dots patch? |
Here it is |
Zfs seems to work on tegra tx1. Will you submit the patch so that future systems using an older kernel will be compatible? Also there is a bit of duplication in the patch. Thank you! |
@rdolbeau @fire there already exists compatibility code for |
This re-use the framework established for SSE and AVX2.
However, GCC is using FP registers on Aarch64, so unlike
SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax.
As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2).
This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE and AVX2 as they need the new macro.
It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.