For discussion: SSE-vectorized (should work on any X86-64 cpu) RAID-Z2 parity computation #3374

Status: Closed

@rdolbeau (Contributor) commented May 5, 2015

This is just a proof-of-concept to gauge interest in doing the parity computation in vector registers and to foster discussion on the subject.
I don't know whether this is actually faster than the C code. The parity function should also be (somehow) selectable at runtime, so that AVX-128 and AVX2 could be leveraged as well.

@behlendorf (Contributor):

@rdolbeau thanks for opening this. There is definitely interest in making use of SSE vector instructions on x86_64 to improve performance. In fact, at the Lustre User Group meeting in April, Rick Wagner of SDSC presented data from their new Comet system showing that this optimization would improve their performance. Here's a link to the slides; you'll want to jump down to page 25.

http://cdn.opensfs.org/wp-content/uploads/2015/04/SDSC-Data-Oasis-GEn-II_Wagner.pdf

You may also be interested in issue #2351, which includes a prototype implementation that uses these instructions to reduce the cost of checksumming. If you're interested in pursuing either of these ideas, I think it could be a very valuable optimization.

My suggestion, if you're going to tackle this, would be to start by structuring the code with a debugging option so that you can optionally run both the old-style and new-style parity calculations. This way you'll be able to quickly build confidence that your code is working properly. Obviously, getting the calculations correct here is critical.

Once you're convinced it's working right, you can disable the cross-check parity calculations and get some performance data. The perf tool is a great way to determine how much you've sped things up.
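
A minimal sketch of such a debug cross-check, assuming hypothetical parity_q_c()/parity_q_sse() entry points (the names and signatures are illustrative, not from the patch):

#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signatures for the reference and vector implementations. */
extern void parity_q_c(const uint8_t *src, uint8_t *q, size_t size);
extern void parity_q_sse(const uint8_t *src, uint8_t *q, size_t size);

/* Debug wrapper: run both implementations and assert they agree. */
static void
parity_q_checked(const uint8_t *src, uint8_t *q, size_t size)
{
	uint8_t q_ref[4096];

	assert(size <= sizeof (q_ref));
	memcpy(q_ref, q, size);

	parity_q_c(src, q_ref, size);	/* trusted scalar reference */
	parity_q_sse(src, q, size);	/* new vector code under test */
	assert(memcmp(q, q_ref, size) == 0);
}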

The bottom line is that I've talked to quite a few people who would love to see this implemented. We just haven't had the resources available to make it happen.

@behlendorf added the "Type: Performance" label on May 5, 2015
@rdolbeau (Contributor, Author) commented May 5, 2015

@behlendorf The presentation at LUG 2015 prompted me to write this code. I alternate between this and the C implementation by way of DKMS. I did compare this with the C code in temporary arrays before removing them. It "works for me" for reading/writing files and scrubbing a small test pool, alternating between this and the C module. I did try measuring the "speed" with rdtscp; it seems faster, but that measurement did not take the fpu_begin/fpu_end overhead into account.

... I'm pretty sure there are some bugs left, I just can't find them :-)

Computing the parity as a vector of 64-bit words is not very difficult; I think the real problem is more of a design issue: how to integrate this alongside AVX-128, AVX2 & NEON in a way that is compatible with ZOL and other ZFS implementations...
Cordially,

@behlendorf (Contributor):

how to integrate this along with AVX-128, AVX2 & NEON in a way that is compatible with ZOL and other ZFS implementations...

As long as the Linux-specific bits are cleanly abstracted away, it shouldn't be a problem for the other OpenZFS implementations. In this case, the Linux-optimized versions of vdev_raidz_generate_parity_p, vdev_raidz_generate_parity_pq, and vdev_raidz_generate_parity_pqr could just be wrapped by an #ifdef __linux__. The other implementations would almost certainly need something platform-specific here anyway.
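
A sketch of that wrapping, assuming the optimized variants keep the same raidz_map_t signature as the existing functions (the _sse suffix and the usability check are illustrative, not from the patch):

/* raidz_map_t comes from the ZFS vdev_raidz headers. */
static void
vdev_raidz_generate_parity_p(raidz_map_t *rm)
{
#ifdef __linux__
	if (vdev_raidz_sse_usable()) {	/* hypothetical runtime check */
		vdev_raidz_generate_parity_p_sse(rm);
		return;
	}
#endif
	vdev_raidz_generate_parity_p_c(rm);	/* portable fallback */
}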

@rdolbeau (Contributor, Author) commented May 5, 2015

Potentially other platforms might want to re-use the code...

Specifically for Linux: where can the code pick and choose the proper implementation? For regular RAID, some part of the code picks the function once and for all. It's not difficult to reimplement (cpuid is nice and easy, and I understand ZOL only works on 64-bit platforms, so SSE2 or NEON is always available), so we can pick between several alternatives... but I don't know where to store the function pointer(s) or where to fill them.

Cordially,
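
For illustration, the cpuid-based detection being described can look like the sketch below in user space, assuming a compiler that provides __get_cpuid_count() in <cpuid.h>; every parity_q_* name is hypothetical, and where ZoL should store the resulting pointer was exactly the open question:

#include <cpuid.h>	/* __get_cpuid_count(), GCC/clang */
#include <stddef.h>

typedef void (*parity_fn_t)(void *buf, size_t size);

extern void parity_q_c(void *buf, size_t size);		/* hypothetical */
extern void parity_q_sse(void *buf, size_t size);	/* hypothetical */
extern void parity_q_avx2(void *buf, size_t size);	/* hypothetical */

/* Default to the portable C implementation. */
static parity_fn_t parity_q = parity_q_c;

static void
parity_pick_implementation(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 7, subleaf 0: AVX2 is reported in EBX bit 5. */
	if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) &&
	    (ebx & (1u << 5))) {
		parity_q = parity_q_avx2;
		return;
	}
	/* SSE2 is architectural on x86-64, so the SSE path is always safe. */
	parity_q = parity_q_sse;
}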

@behlendorf (Contributor):

Potentially other platforms might want to re-use the code...

If we can structure the code such that this is possible I'm all for it. But in the end I suspect the Linux version won't end up being all that portable. I'd be happy to be wrong about that though.

As for the two RAIDZ entry points, they're the vdev_raidz_io_start and vdev_raidz_io_done functions registered down in the vdev_raidz_ops structure. However, they do a bit more than just generating or reconstructing the parity information, so instead I'd suggest using the vdev_raidz_generate_parity and vdev_raidz_reconstruct functions as entry points.

Even under Linux we're not always going to be able to use the optimized implementation. We currently support both aarch64 and ppc64 platforms, which will need to continue to work, and at some point (likely fairly soon) 32-bit platforms will be fully supported.

@behlendorf (Contributor):

@rdolbeau to make review and testing easier, could you squash your patch stack, rebase it on master, and force-update this branch.

@rdolbeau (Contributor, Author) commented May 9, 2015

@behlendorf I think it's done (assuming I used git properly...)

@behlendorf (Contributor):

@rdolbeau Thanks for tackling this! I'll do a more careful review early this week, but in the meantime a few quick comments.

  • When you next refresh this, go ahead and rebase it on master; that'll eliminate the merge commit.
  • The style issues need to be addressed; you can easily take care of them locally with the 'make checkstyle' build target. It should exit silently if no issues are found.
  • Have you considered using the following style for the assembly? Personally, I think it makes the assembly slightly more readable (if that's possible) than using asm volatile on every line. Just a thought.
#define MAKE_CST32_SSE(reg, val) \
asm volatile( \
        "movd %0,%%" #reg "\n\t" \
        "pshufd $0,%%" #reg ",%%" #reg "\n\t" \
        : : "r"(val));

Like I said, more real review comments to come soon I hope.

@thegreatgazoo:

Just a data point: perf stats while using dd to write 2 GB to a large file on raidz1 (stock ZFS):

Samples: 33K of event 'cycles', Event count (approx.): 10052972568                                                                                              
 23.56%  [kernel]             [k] fletcher_4_native
 10.56%  [kernel]             [k] vdev_raidz_generate_parity
  8.32%  [kernel]             [k] memmove
  8.04%  [kernel]             [k] _raw_spin_lock_irqsave

I'll test with this patch later.

@rdolbeau (Contributor, Author) commented Jun 5, 2015

@thegreatgazoo it would be great to have some hard numbers if you can spare the time.

@rdolbeau (Contributor, Author):

I just realised I didn't commit some performance-related changes (since [v]pcmpgtb is a better way to compute the mask than a sequence of instructions). I'll get around to it ASAP.

@rdolbeau (Contributor, Author):

I've pushed a slightly updated version, rebased to current master. Any comments welcome.

@rdolbeau (Contributor, Author):

@behlendorf I'm fine with both suggestions. I originally intended to split the file, but my test setup is a bit rough and I didn't want to mess with the build system.

@behlendorf (Contributor):

@rdolbeau wow, this is coming along nicely! Thanks again for working on this. I wish I could comment more deeply on the assembly specific bits but I'm a bit rusty! But I think with a little more work we'll be able to get this to a place where it's easy for me and others to verify it's correct and run some performance numbers to quantify how much this helps.

@behlendorf (Contributor):

@rdolbeau adding the new files to the build system is pretty straightforward. Just make sure you add the source files to the following Makefiles; one is for the user-space build, the other for the kernel. You may end up wanting to relocate some common bits into a header.

lib/libzpool/Makefile.am
module/zfs/Makefile.in

@rdolbeau (Contributor, Author):

@behlendorf I'm not sure I'm comfortable messing with the vdev_t and module stuff, I haven't done that kind of thing in years :-) I'm splitting off the files at the moment. One step at a time :-)

@behlendorf (Contributor):

@rdolbeau I'm happy to propose a patch for the vdev_t changes if you get the asm right! That's a great deal for me! By the way, when you're happy with your updated version, go ahead and squash everything together and force-update the branch.

@sempervictus (Contributor):

I've looked into merging this into one of our test branches, but everything is crunching away at ABD right now, and unfortunately there's a mismatch between the function calls for parity checks there. If ABD is slated to be the next big merge into master, would it be useful to figure out an adoption strategy for both PRs so that we end up with compatible calls and functionality after the ABD merge?

@rdolbeau (Contributor, Author):

The ASM should be OK; the latest changes are mostly for speed (the original version was a straightforward vectorisation of the C code; the current version takes advantage of some specific features of the SSE/AVX/NEON instruction sets). It has been tested out-of-kernel and in-kernel in SSE and AVX, and to a lesser extent in AVX-128. I didn't add the NEON and AVX-512 code that I can't test in-kernel. Speed has been tested out-of-kernel, and it is faster there.
... I still wouldn't trust real data to the code just yet :-)

@rdolbeau (Contributor, Author):

I wish I could comment more deeply on the assembly specific bits but I'm a bit rusty!

It has become reasonably simple, I think. The macro names should be self-explanatory; in COMPUTE8_Q_SSE everything is unrolled by 8 (eight 64-bit words, i.e., four 128-bit SSE registers), so there are 4 of each instruction:

  1. movdqa to read data [assumes everything is properly aligned]
  2. pxor to zero some registers (a 2-operand limitation; only one, not four, for the other implementations)
  3. pcmpgtb, which creates the all-ones mask in bytes where the MSB is 1 (replaces the first two lines of the macro VDEV_RAIDZ_64MUL_2)
  4. paddb to shift-by-one byte-wise, since there's no byte shift in SSE/AVX (replaces the shift and the AND with 0xFE)
  5. pand to mask with the 0x1d (identical to scalar)
  6. pxor by the now-masked 0x1d (identical to scalar)
  7. pxor by the input value (identical to scalar)
  8. movdqa to store data back [assumes everything is properly aligned]

R simply repeats steps 2-6 after the first pass. P is only XOR, so it doesn't have steps 2-6.
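
For readers more comfortable with intrinsics than raw mnemonics, here is an equivalent sketch of steps 3-6 (an illustration, not the PR's code; the PR uses inline assembly because intrinsics don't work in the kernel):

#include <emmintrin.h>	/* SSE2 intrinsics */

/* Multiply 16 GF(2^8) elements by 2 at once; the reduction polynomial
 * is x^8 + x^4 + x^3 + x^2 + 1, hence the 0x1d constant. */
static inline __m128i
gf256_mul2_sse(__m128i x)
{
	const __m128i zero = _mm_setzero_si128();
	const __m128i poly = _mm_set1_epi8(0x1d);

	/* pcmpgtb: 0xff in every byte whose MSB is set (signed compare). */
	__m128i mask = _mm_cmpgt_epi8(zero, x);

	/* paddb x,x: byte-wise shift left by one (there is no psllb). */
	__m128i shifted = _mm_add_epi8(x, x);

	/* pand + pxor: conditionally fold in the reduction constant. */
	return (_mm_xor_si128(shifted, _mm_and_si128(mask, poly)));
}

Step 7 then XORs the freshly loaded data into the result, and step 8 stores it back.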

Cordially,

Romain Dolbeau

@rdolbeau (Contributor, Author):

@sempervictus Regarding the ABD stuff, I assume you mean #3441. As far as I can tell, there are two parts to the issue:

  1. The copy code. It has been replaced by specific function calls. Since I'm not sure whether my SIMD code is significantly faster than the C, or whether the copy takes a significant amount of time, it's probably not important. However, I would suggest investigating the cost of replacing the old method (reading src once, writing the other 1/2/3 buffers) with what is done in ABD (a specific function from src to p, then copying p to q & r), since the ABD variant might be less cache-friendly at the Z2 and Z3 levels (... nitpicking).

  2. The checksum code. Basically, ABD pushes it down to vdev_raidz_*_func. Since those 3 functions contain loops pretty much identical to the old ones except for pointer names, it should be trivial to change them to a set of function pointers with multiple implementations in the exact same way I've done it. The same SIMD macros and unrolled loops should be usable by fixing some variable names. The catch is that if the ABD code changes the 'size' to be smaller (i.e., computing on small subsets of the buffer), the SIMD code will be less efficient (or not used at all for really small sizes). And if it changes the alignment, more work may be required.

@rpwagner:

Unfortunately, in addition to losing @tomgarcia, the machines that we did the testing on are headed back into production. I'll have to pause our work here on testing until I can recruit and set aside systems again.

static void vdev_raidz_pick_parity_functions(void) {
	vdev_raidz_generate_parity_p = &vdev_raidz_generate_parity_p_c;
	vdev_raidz_generate_parity_pq = &vdev_raidz_generate_parity_pq_c;
	vdev_raidz_generate_parity_pqr = &vdev_raidz_generate_parity_pqr_c;
Review comment (Member) on the hunk above:

Given that these 3 are tied together, it would be easier to read if you declared a struct with these 3 function pointers, and in this func, just assign a global pointer to the appropriate struct (see zfs_metaslab_ops for an example).
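
A sketch of what that could look like, modeled loosely on zfs_metaslab_ops (the struct and variable names are invented for illustration; raidz_map_t and the _c functions come from the PR and the ZFS headers):

typedef struct vdev_raidz_parity_ops {
	void (*gen_p)(raidz_map_t *rm);
	void (*gen_pq)(raidz_map_t *rm);
	void (*gen_pqr)(raidz_map_t *rm);
} vdev_raidz_parity_ops_t;

static const vdev_raidz_parity_ops_t vdev_raidz_parity_ops_c = {
	.gen_p = vdev_raidz_generate_parity_p_c,
	.gen_pq = vdev_raidz_generate_parity_pq_c,
	.gen_pqr = vdev_raidz_generate_parity_pqr_c,
};

/* One global pointer selects the whole implementation family at once. */
static const vdev_raidz_parity_ops_t *vdev_raidz_parity_ops =
    &vdev_raidz_parity_ops_c;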

@ahrens (Member) commented Oct 22, 2015

This is a cool idea. My only high-level concerns are:

  • Portability of ASM usage. I assume this asm is a GCC-ism. We also compile ZFS with Sun Studio (illumos) and clang (FreeBSD and OS X). Does this work on those platforms too? It would be nice to find something that did. If not, it needs to at least be behind an #ifdef for GCC.
  • Getting someone who understands the ASM and the RAID-Z math (@ahl?) to review. Unless we want to punt on the math and assume that the tests cover that.
  • Would it be possible to enable a debug/test mode (for ztest?) where we try all 4 funcs (3xASM + C) and compare the results? Not sure, might be too complicated.
  • Performance testing to show that each of the 3 new methods is useful on different hardware.

@rottegift:

FWIW, since 10.8, OS X lets 64-bit kernel extensions use vector instructions. In openzfsonosx, spl and zfs are loadable (and unloadable!) kernel extensions. Compiling them with clang -O2 -march=native produces code that runs (and seems to be as correct as unoptimized, no-arch 64-bit code). Code emitted using -march=x86-64 should run on any 64-bit Mac, and will generate vector instructions.

Current actually-running-on-this-machine code (the whole tree was built with Apple LLVM version 7.0.0 (clang-700.1.76) with options -O2 -march=native, on a Core i5 (i5-3210M, Macmini6,1)):

vdev_raidz.c source: https://gist.github.com/a18386e45ee09cb9deec
vdev_raidz.o disassembly (otool -V -t vdev_raidz.o): https://gist.github.com/5a854d10199255566376

Note that the %ymm and %xmm registers get substantial use.

The lines after ^_vdev_generate_parity in the disassembly might interest you.

(The use of vector instructions is also an obvious win in sha256.c)

@rdolbeau (Contributor, Author):

2015-10-23 13:02 GMT+02:00 rottegift [email protected]:

FWIW, 10.8, OS X lets 64-bit kernel extensions use vector instructions.

Linux doesn't, and this was written for ZoL.

The lines after ^_vdev_generate_parity in the disassembly might interest you.

gcc vectorizes the C code just fine as well (but not in the Linux kernel, of course). However, while this is fine for RAID-Z1 (i.e., XOR), it isn't for Z2 and Z3, since the compiler sticks to the 64-bit data type and doesn't revert to the shorter, faster sequence of 8-bit-datatype instructions. (You can see the subq, addq, etc. in _vdev_raidz_generate_parity_pq.)

Cordially,

Romain Dolbeau

@rdolbeau (Contributor, Author):

2015-10-23 1:43 GMT+02:00 Matthew Ahrens [email protected]:

  • Portability of ASM usage. I assume this asm is a GCC-ism. We also compile ZFS with Sun Studio (illumos) and clang (FreeBSD and OS X). Does this work on those platforms too? It would be nice to find something that did. If not, it needs to at least be behind an #ifdef for GCC.

It's known to compile with ICC and CLANG as well. They both support GCC-style extended inline asm.

  • Getting someone who understands the ASM and the RAID-Z math (@ahl https://github.com/ahl?) to review. Unless we want to punt on the math and assume that the tests cover that.

Some explanations of the code are in my comment of June 26.

Cordially,

Romain Dolbeau

@rdolbeau (Contributor, Author):

2015-10-23 13:17 GMT+02:00 Romain Dolbeau [email protected]:

gcc vectorizes the C code just fine as well (but not in the Linux kernel, of course).

In fact, in some cases the optimized C can be faster than the inline assembly, probably because the compiler can schedule better. Intrinsics would partially solve that problem, but they don't work in the Linux kernel either :-(

For AVX2 the assembly is always a win IIRC (that's the original target).

Cordially,

Romain Dolbeau

@sempervictus (Contributor):

@rdolbeau: is there any chance you could refresh this against master and possibly the current abd_next branch? We're seeing some significant penalties testing SSDs in Z2 and Z3 configurations and are hoping to use this along with the current abd2.
I did a manual merge, using your abd2 code in module/zfs/vdev_raidz.c for the parity calculations, and merging all of the conflicts as git checkout --theirs for any file that git blame did not mark as touched by you. It's building now; we'll see what it does. I'm sure I screwed up something, somewhere, at the very least by not using the updated ABD interfaces from recent changes by @tuxoko.
Thanks for the PR, and more so if you can find the time and effort to sync to current, as your efforts are sure to be more effective than mine.

@kernelOfTruth (Contributor):

+1

@tomgarcia do you by chance still have the updated patch stack from #2351 around, and could you upload it to git?

#3374 (comment)

I'm mainly using sha256 checksums and might test dedup in the near future

Many thanks in advance :)

@tomgarcia:

@kernelOfTruth Yeah, I have the code under my vectorized-checksum branch. It's a bit out of date since I haven't updated since summer, but hopefully it helps.

@sempervictus (Contributor):

@tomgarcia: any chance you could refresh those branches to their relative current state? There appear to be several SIMD approaches in the works, and this one, at least, we've tested a fair bit, so we know it doesn't kill data.

@kernelOfTruth (Contributor):

@tomgarcia much appreciated, thanks !

it didn't include 47a4a6fd but it surely spared me some additional changes :)

@sempervictus (Contributor):

Did the import, added conditionals to use asm/fpu/api.h instead of asm/i387.h on kernels newer than 4.2, and now I get:

In file included from /var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_raidz_sse.c:29:0:
./arch/x86/include/asm/fpu/api.h:27:1: error: unknown type name ‘bool’
 extern bool irq_fpu_usable(void);
 ^
./arch/x86/include/asm/fpu/api.h:46:30: error: unknown type name ‘u64’
 extern int cpu_has_xfeatures(u64 xfeatures_mask, const char **feature_name);
                              ^

during the module build by DKMS. I worry when bools are of an unknown type; it makes me think something is horribly awry, or that I'm trying to use headers I'm not supposed to.
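
The unknown bool/u64 types suggest <asm/fpu/api.h> is being included before the kernel's basic types are declared. A plausible fix (an assumption, not verified against this branch) is to pull in <linux/types.h> first and key the conditional off the kernel version:

#include <linux/types.h>	/* bool, u64 */
#include <linux/version.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 2, 0)
#include <asm/fpu/api.h>	/* kernel_fpu_begin()/kernel_fpu_end() */
#else
#include <asm/i387.h>
#endif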

@rdolbeau (Contributor, Author):

2016-03-09 21:55 GMT+01:00 RageLtMan [email protected]:

@rdolbeau: is there any chance you could refresh this against master and
possibly the current abd_next branch? We're seeing some significant
penalties testing SSDs in Z2 and Z3 configurations and are hoping to use
this and the current abd2.

I'll try to do that over the weekend, if I can find the time. I might need to rebuild my test system, I haven't used it in a while.

Cordially,

Romain Dolbeau

if (c == rm->rm_firstdatacol) {
	ASSERT(ccount == pcount);
	i = 0;
	if (ccount > 7) /* ccount is unsigned */
Review comment from @ahl on the hunk above:

why not make ccount signed?

otherwise I'd suggest indent and braces to match the rest of ZFS style (uglier though it may be in this instance)

Reply from @tuxoko (Contributor):

There's no need to make it signed and there's no need to use if.
Just a simple for (i = 0; i + 8 <= ccount; i += 8) will do.

Reply from @rdolbeau (Contributor, Author):

2016-03-12 3:42 GMT+01:00 Adam Leventhal [email protected]:

In module/zfs/vdev_raidz_avx128.c:
why not make ccount signed?

rc_size is a uint64_t. rc_size/sizeof() might produce an unsigned value larger than what can be represented in a (signed) int64_t, so ccount needs to share unsignedness with rc_size and sizeof().

2016-03-12 4:36 GMT+01:00 tuxoko [email protected]:

There's no need to make it signed and there's no need to use if.
Just a simple for (i = 0; i + 8 < ccount; i += 8) will do.

I like it when use-cases are clearly defined, and the "ccount > 7" makes it very clear even to the casual reader.

Also, written your way, someone at some point is going to think the "i+8" is re-computed at each iteration and "optimize" it by re-writing it as "i <= ccount-8", which would break the code. Code defensively :-) That's also what the comment is for: to make sure anyone wanting to collapse the 'if' and the 'for' loop will need to think about negative values.

Funny thing: the code appears as above (with a '<', no equal) in the mail I received, but as the correct '<=' when I see it on the GitHub website. Weird. [Just for completeness' sake: the '<=' is needed to deal properly with the case ccount%8 == 0; "i+7 < ccount" would work as well.]

Cordially,

Romain Dolbeau
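
A small stand-alone illustration of the hazard being discussed (not PR code): with an unsigned count, rewriting "i + 8 <= ccount" as "i <= ccount - 8" wraps around for small counts and the loop would overrun the buffer.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t ccount = 4;	/* fewer than 8 elements */
	uint64_t i = 0;

	/* Safe form: 8 <= 4 is false, so the loop body never runs. */
	printf("i + 8 <= ccount: %d\n", i + 8 <= ccount);	/* prints 0 */

	/* "Optimized" form: ccount - 8 wraps to 2^64 - 4, so the
	 * comparison is true and the loop would run wild. */
	printf("i <= ccount - 8: %d\n", i <= ccount - 8);	/* prints 1 */
	return (0);
}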

@ahl commented Mar 12, 2016

The construction of parity looks good to me. Cool stuff!

Adam


@rdolbeau (Contributor, Author):

2016-03-09 21:55 GMT+01:00 RageLtMan [email protected]:

@rdolbeau: is there any chance you could refresh this against master

Done. No conflicts :-)

and possibly the current abd_next branch?

That will take a while, if it happens. The parity logic is different, with
per-parity (P, Q, R) instead of per-ZRAID (1, 2, 3) functions. So
everything needs to be rewritten an rechecked, including the speed logic
and all functions (SSE, AVX-128, AVX2, both AVX-512 and four NEONs). Since
ABD still seems to be a moving target, I'm not sure whether it's worth the
effort...

Cordially,

Romain Dolbeau

@sempervictus (Contributor):

Thank you sir. Even without ABD, we can put this through some stress to ensure data is consistent, read it back with an unpatched version, and feed it to our destination snapshot sinks, since they write a bunch of raidz2+ data each day.

This is just a proof-of-concept to gauge the interest of doing the parity computation in vector registers. It has received very little testing. Don't use it on production systems, obviously :-)
SSE should work on any x86-64 CPU. AVX-128 should work on AVX-enabled CPUs, i.e., Sandy Bridge and later. AVX2 should work on Haswell and later. The exact variant is picked at runtime.
@behlendorf (Contributor):

@rdolbeau thanks for spurring on this work! This functionality and a new framework for supporting additional instruction sets have been merged to master. This should make it easier to support NEON!

ab9f4b0 SIMD implementation of vdev_raidz generate and reconstruct routines

@behlendorf behlendorf closed this Jun 21, 2016