For discussion: SSE-vectorized (should work on any x86-64 CPU) RAID-Z2 parity computation #3374
@rdolbeau thanks for opening this. There is definitely interest in making use of SSE-vectorized instructions on x86_64 to improve performance. In fact, at the Lustre User Group meeting in April, Rick Wagner of SDSC presented some data from their new Comet system which showed this optimization would improve their performance. Here's a link to the slides; you'll want to jump down to page 25. http://cdn.opensfs.org/wp-content/uploads/2015/04/SDSC-Data-Oasis-GEn-II_Wagner.pdf You may also be interested in issue #2351, which includes a prototype implementation for using these instructions to reduce the cost of checksumming. If you're interested in pursuing either of these ideas, I think it could be a very valuable optimization. My suggestion, if you're going to tackle this, would be to start by structuring the code with a debugging option so that you can optionally run both the old-style and new-style parity calculations. This way you'll be able to quickly build confidence that your code is working properly. Obviously, getting the calculations correct here is critical. Once you're convinced it's working right, you can disable the cross-check parity calculations and get some performance data. The perf tool is a great way to determine how much you've sped things up. But the bottom line is that I've talked to quite a few people who would love to see this implemented. We just haven't had the resources available to make it happen.
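A minimal sketch of the cross-check idea (the names gen_pq_c, gen_pq_simd, and raidz_debug_crosscheck are placeholders, not symbols from the patch):

```c
/*
 * Hypothetical cross-check harness.  gen_pq_c()/gen_pq_simd() stand in
 * for the scalar and vectorized parity generators; neither name is a
 * real symbol from this patch.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

extern void gen_pq_c(const uint8_t *src, uint8_t *p, uint8_t *q, size_t len);
extern void gen_pq_simd(const uint8_t *src, uint8_t *p, uint8_t *q, size_t len);

static int raidz_debug_crosscheck = 1;	/* hypothetical debug tunable */

static void
gen_pq(const uint8_t *src, uint8_t *p, uint8_t *q, size_t len)
{
	gen_pq_simd(src, p, q, len);

	if (raidz_debug_crosscheck) {
		/* Recompute with the C code and compare the results. */
		uint8_t *p2 = malloc(len);
		uint8_t *q2 = malloc(len);

		gen_pq_c(src, p2, q2, len);
		assert(memcmp(p, p2, len) == 0);
		assert(memcmp(q, q2, len) == 0);
		free(p2);
		free(q2);
	}
}
```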
@behlendorf The presentation at LUG 2015 prompted me to write this code. I alternate between this and the C implementation by way of DKMS. I did try to compare this with the C code in temporary arrays before removing them. It "works for me" to read/write files and scrub a small test pool, alternating this and the C module. I did try measuring the "speed" with rdtscp; it seems faster, but that didn't take into account the fpu_begin/fpu_end. ... I'm pretty sure there are some bugs left, I just can't find them :-) Computing the parity as a vector of 64-bit words is not very difficult; I think the real problem is more of a design issue: how to integrate this along with AVX-128, AVX2 & NEON in a way that is compatible with ZOL and other ZFS implementations...
As long as the Linux-specific bits are cleanly abstracted away, it shouldn't be a problem for the other OpenZFS implementations. In this case, the Linux-optimized versions of ...
Potentially other platforms might want to re-use the code... Specifically for Linux: where can the code pick and choose the proper implementation? For regular RAID, some bit of the code picks the function once and for all. It's not difficult to reimplement (cpuid is nice and easy, and I understand ZOL only works in 64-bit mode, so SSE2 or NEON are always available), so we can pick between several alternatives... but I don't know where to store the function pointer(s) or where to fill them. Cordially,
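A minimal sketch of the kind of one-time selection under discussion; the generator names are placeholders, and an in-kernel version would query CPUID directly (or boot_cpu_has()) rather than GCC's userspace __builtin_cpu_supports():

```c
/*
 * Illustrative one-time dispatch.  The _sse/_avx128/_avx2 generators
 * are placeholder names; in-kernel code would use CPUID/boot_cpu_has()
 * instead of __builtin_cpu_supports(), which is a userspace facility.
 */
typedef struct raidz_map raidz_map_t;

extern void vdev_raidz_generate_parity_pq_sse(raidz_map_t *);
extern void vdev_raidz_generate_parity_pq_avx128(raidz_map_t *);
extern void vdev_raidz_generate_parity_pq_avx2(raidz_map_t *);

static void (*vdev_raidz_generate_parity_pq)(raidz_map_t *);

static void
vdev_raidz_pick_parity_functions(void)
{
	if (__builtin_cpu_supports("avx2"))
		vdev_raidz_generate_parity_pq = vdev_raidz_generate_parity_pq_avx2;
	else if (__builtin_cpu_supports("avx"))
		vdev_raidz_generate_parity_pq = vdev_raidz_generate_parity_pq_avx128;
	else	/* SSE2 is architectural on x86-64, so always available */
		vdev_raidz_generate_parity_pq = vdev_raidz_generate_parity_pq_sse;
}
```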
If we can structure the code such that this is possible, I'm all for it. But in the end I suspect the Linux version won't end up being all that portable. I'd be happy to be wrong about that, though. As for the two RAIDZ entry points, they're the ... Even under Linux we're not always going to be able to use the optimized implementation. We currently support both aarch64 and ppc64 platforms, which will need to continue to work, and at some point (fairly soon, likely) 32-bit platforms will be fully supported.
@rdolbeau to make review and testing easier, could you squash your patch stack, rebase it on master, and force-update this branch?
@behlendorf I think it's done (assuming I used git properly...)
@rdolbeau Thanks for tackling this! I'll do a more careful review early this week, but in the meantime a few quick comments.
Like I said, more real review comments to come soon, I hope.
Just a data point. Perf stats during a dd of 2G to a large file in raidz1 (stock ZFS):

I'll test with this patch later.
@thegreatgazoo it would be great to have some hard numbers if you can spare the time.
I just realised I didn't commit some performance-related changes (since [v]pcmpgtb is a better way to compute the mask than a sequence of instructions). I'll get around to it ASAP.
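The mask in question selects the GF(2^8) bytes whose high bit is set, so they can be XORed with the reduction polynomial 0x1d after the shift; pcmpgtb against zero produces exactly that mask in one instruction, because such bytes are negative as signed values. A rough SSE2-intrinsics sketch of the multiply-by-2 step (an illustration, not the PR's actual assembly):

```c
#include <emmintrin.h>	/* SSE2 intrinsics */

/* Multiply 16 GF(2^8) bytes by 2 (the "xtime" step used by RAID-Z Q/R). */
static inline __m128i
gf_mul2_sse2(__m128i x)
{
	/* 0xff in every byte whose high bit is set: pcmpgtb(0, x). */
	__m128i mask = _mm_cmpgt_epi8(_mm_setzero_si128(), x);

	/* Byte-wise shift left by 1 (paddb, since there is no psllb). */
	__m128i xx = _mm_add_epi8(x, x);

	/* Conditionally XOR in the reduction polynomial 0x1d. */
	return (_mm_xor_si128(xx, _mm_and_si128(mask, _mm_set1_epi8(0x1d))));
}
```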
Force-pushed from d02827c to 96ab69d.
I've pushed a slightly updated version, rebased to current master. Any comments welcome.
@behlendorf I'm fine with both suggestions. I originally intended to split the file, but my test setup is a bit rough and I didn't want to mess with the build system.
@rdolbeau wow, this is coming along nicely! Thanks again for working on this. I wish I could comment more deeply on the assembly-specific bits, but I'm a bit rusty! But I think with a little more work we'll be able to get this to a place where it's easy for me and others to verify it's correct and run some performance numbers to quantify how much this helps.
@rdolbeau adding the new files to the build system is pretty straightforward. Just make sure you add the source files to the following Makefiles; one is for the user-space build, the other for the kernel. You may end up wanting to relocate some common bits into a header. A rough sketch of the additions follows.
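Illustratively, assuming the new file is named vdev_raidz_sse.c (a placeholder; the actual file names in this PR may differ), the additions would look roughly like:

```make
# lib/libzpool/Makefile.am (user-space build) -- illustrative sketch
libzpool_la_SOURCES += \
	$(top_srcdir)/module/zfs/vdev_raidz_sse.c

# module/zfs/Makefile.in (kernel build) -- illustrative sketch
$(MODULE)-objs += @top_srcdir@/module/zfs/vdev_raidz_sse.o
```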
@behlendorf I'm not sure I'm comfortable messing with the vdev_t and module stuff, I haven't done that kind of thing in years :-) I'm splitting off the files at the moment. One step at a time :-)
@rdolbeau I'm happy to propose a patch for the vdev_t changes if you get the asm right! That's a great deal for me! By the way, when you're happy with your updated version, go ahead and squash everything together and force-update the branch.
I've looked into merging this into one of our test branches, but everything is crunching away at ABD right now, and unfortunately there's a mismatch between the function calls for parity checks there. If ABD is slated to be the next big merge into master, would it be useful to figure out an adoption strategy for both PRs such that we end up with compatible calls and functionality subsequent to the ABD merge?
The ASM should be OK; the latest changes are mostly for speed (the original version was a straightforward vectorisation of the C code; the current version takes advantage of some specific features of the SSE/AVX/NEON instruction sets). It has been tested out-of-kernel and in-kernel in SSE and AVX, and to a lesser extent in AVX-128. I didn't add the NEON and AVX-512 code that I can't test in-kernel. Speed has been tested out-of-kernel, and it is faster there.
R simply repeats steps 2-6 after the first 6. P is only XOR so doesn't have ... Cordially, Romain Dolbeau
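For context, the scalar code in vdev_raidz.c does the same work on 64-bit words: P is a plain XOR, while Q and R are multiplied by 2 (R twice, i.e. by 4) before each new data word is XORed in, Horner-style. A sketch mirroring the upstream VDEV_RAIDZ_64MUL_2 macro (the surrounding variable names are illustrative):

```c
/*
 * Multiply each of the 8 GF(2^8) bytes packed in a 64-bit word by 2,
 * mirroring the VDEV_RAIDZ_64MUL_2 macro in vdev_raidz.c.
 */
#define	GF_64MUL_2(x, mask) do {					\
	(mask) = (x) & 0x8080808080808080ULL;				\
	(mask) = ((mask) << 1) - ((mask) >> 7);				\
	(x) = (((x) << 1) & 0xfefefefefefefefeULL) ^			\
	    ((mask) & 0x1d1d1d1d1d1d1d1dULL);				\
} while (0)

/* Horner-style update for one 64-bit data word d (illustrative): */
p ^= d;					/* P: plain XOR */
GF_64MUL_2(q, mask);
q ^= d;					/* Q: multiply by 2, then XOR */
GF_64MUL_2(r, mask);
GF_64MUL_2(r, mask);
r ^= d;					/* R: multiply by 4, then XOR */
```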
@sempervictus Regarding the ABD stuff, I assume you mean #3441. As far as I can tell, there are two parts to the issue: 1) The copy code. It has been replaced by specific function calls. Since I'm not sure whether my SIMD code is significantly faster than the C, or whether the copy takes a significant amount of time, it's probably not important. However, I would suggest investigating the cost of replacing the old method (reading src once, writing the other 1/2/3 buffers) with what is done in ABD (specific function from src to p, then copy p to q & r), since the ABD variant might be less cache-friendly at the Z2 and Z3 level (... nitpicking).
Unfortunately, in addition to losing @tomgarcia, the machines that we did the testing on are headed back into production. I'll have to pause our work here on testing until I can recruit and set aside systems again.
```c
static void vdev_raidz_pick_parity_functions(void) {
	vdev_raidz_generate_parity_p = &vdev_raidz_generate_parity_p_c;
	vdev_raidz_generate_parity_pq = &vdev_raidz_generate_parity_pq_c;
	vdev_raidz_generate_parity_pqr = &vdev_raidz_generate_parity_pqr_c;
```
Given that these 3 are tied together, it would be easier to read if you declared a struct with these 3 function pointers, and in this func, just assign a global pointer to the appropriate struct (see zfs_metaslab_ops for an example).
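A minimal sketch of the suggested ops-struct arrangement, modelled on the zfs_metaslab_ops pattern (the struct and variable names here are illustrative, not taken from the patch):

```c
typedef struct raidz_map raidz_map_t;

typedef struct raidz_parity_ops {
	void (*gen_p)(raidz_map_t *rm);
	void (*gen_pq)(raidz_map_t *rm);
	void (*gen_pqr)(raidz_map_t *rm);
} raidz_parity_ops_t;

/* One instance per implementation (C, SSE, AVX-128, AVX2, ...). */
static const raidz_parity_ops_t raidz_parity_ops_c = {
	vdev_raidz_generate_parity_p_c,
	vdev_raidz_generate_parity_pq_c,
	vdev_raidz_generate_parity_pqr_c,
};

/* Selected once at module init, e.g. based on CPUID. */
static const raidz_parity_ops_t *raidz_parity_ops = &raidz_parity_ops_c;
```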
This is a cool idea. My only high-level concerns are:
FWIW, since 10.8, OS X lets 64-bit kernel extensions use vector instructions. Current actually-running-on-this-machine code (the whole tree was built ...) vdev_raidz.c source: https://gist.github.com/a18386e45ee09cb9deec Note that the %ymm and %xmm registers get substantial use. The lines after ^_vdev_generate_parity in the disassembly might interest. (The use of vector instructions is also an obvious win in sha256.c)
2015-10-23 13:02 GMT+02:00 rottegift [email protected]:
Linux doesn't, and this was written for ZoL.
gcc vectorizes the C code just fine as well (but not in the Linux kernel, ...) Cordially, Romain Dolbeau
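That caveat is worth spelling out, since it is why the in-kernel implementation has to be hand-written: the Linux kernel is built with vector code generation disabled (-mno-sse, -mno-mmx) and does not preserve FPU/SIMD state across kernel entry, so any vector code must be explicitly bracketed. A schematic sketch (the function name is a placeholder):

```c
/*
 * Illustrative only: vector code in the Linux kernel must save the
 * user FPU/SIMD state itself via kernel_fpu_begin()/kernel_fpu_end().
 */
#include <asm/i387.h>		/* asm/fpu/api.h on kernels >= 4.2 */

struct raidz_map;

static void
vdev_raidz_generate_parity_pq_sse(struct raidz_map *rm)
{
	kernel_fpu_begin();	/* saves SIMD state, disables preemption */
	/* ... SSE parity loops over the map's columns ... */
	kernel_fpu_end();
}
```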
2015-10-23 1:43 GMT+02:00 Matthew Ahrens [email protected]:
Cordially, Romain Dolbeau
2015-10-23 13:17 GMT+02:00 Romain Dolbeau [email protected]:
In fact, in some cases the optimized C can be faster than the inline ... For AVX2 the assembly is always a win IIRC (that's the original target). Cordially, Romain Dolbeau
@rdolbeau: is there any chance you could refresh this against master and possibly the current abd_next branch? We're seeing some significant penalties testing SSDs in Z2 and Z3 configurations and are hoping to use this and the current abd2.
+1 @tomgarcia do you by chance still have the updated patch stack from #2351 around, and could you upload it to git? I'm mainly using sha256 checksums and might test dedup in the near future. Many thanks in advance :)
@kernelOfTruth Yeah, I have the code under my vectorized-checksum branch. It's a bit out of date since I haven't updated it since summer, but hopefully it helps.
@tomgarcia: any chance you could refresh those branches to their respective current state? There look to be several SIMD approaches in the works, and this one, at least, we've tested a fair bit, so we know it doesn't kill data.
@tomgarcia much appreciated, thanks! It didn't include 47a4a6fd, but it surely spared me some additional changes :)
Did the import, added conditionals for using asm/fpu/api.h instead of asm/i387.h on kernels newer than 4.2, and now I get:

during module build by DKMS. I worry when bools are of an unknown type; it makes me think something is horribly awry, or that I'm trying to use headers I'm not supposed to.
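The conditional described above would look roughly like this (asm/fpu/api.h replaced asm/i387.h as the home of kernel_fpu_begin()/kernel_fpu_end() in Linux 4.2):

```c
#include <linux/version.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 2, 0)
#include <asm/fpu/api.h>	/* kernel_fpu_begin()/kernel_fpu_end() */
#else
#include <asm/i387.h>
#endif
```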
2016-03-09 21:55 GMT+01:00 RageLtMan [email protected]:
I'll try to do that over the week-end, if I can find the time. I might need ... Cordially, Romain Dolbeau
```c
if (c == rm->rm_firstdatacol) {
	ASSERT(ccount == pcount);
	i = 0;
	if (ccount > 7) /* ccount is unsigned */
```
why not make ccount signed?
otherwise I'd suggest indent and braces to match the rest of ZFS style (uglier though it may be in this instance)
There's no need to make it signed and there's no need to use `if`. Just a simple `for (i = 0; i + 8 <= ccount; i += 8)` will do.
2016-03-12 3:42 GMT+01:00 Adam Leventhal [email protected]:
In module/zfs/vdev_raidz_avx128.c:
why not make ccount signed?
rc_size is a uint64_t. rc_size/sizeof() might produce an unsigned value larger than what can be represented in a (signed) int64_t, so ccount needs to share unsignedness with rc_size and sizeof().
2016-03-12 4:36 GMT+01:00 tuxoko [email protected]:
There's no need to make it signed and there's no need to use if.
Just a simple for (i = 0; i + 8 < ccount; i += 8) will do.
I like it when use-cases are clearly defined, and the "ccount > 7" makes it very clear even to the casual reader.
Also, written your way, someone at some point is going to think the "i + 8" is re-computed at each iteration and "optimize" it by rewriting it as "i <= ccount - 8", and break the code. Code defensively :-) Also, hence the comment, to make sure anyone wanting to collapse the 'if' and the 'for' loop will need to think about negative values.
Funny thing: the code appears as above (with a '<', no equal) in the mail I received, but as the correct '<=' when I see it on the GitHub website. Weird. [Just for completeness' sake: the '<=' is needed to deal properly with the case ccount % 8 == 0; "i + 7 < ccount" would work as well.]
Cordially,
Romain Dolbeau
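The wraparound hazard being debated, in a self-contained nutshell (an illustration, not code from the patch):

```c
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t ccount = 3;	/* fewer than 8 elements */
	uint64_t i;

	/* Safe: "i + 8 <= ccount" is simply false, the body never runs. */
	for (i = 0; i + 8 <= ccount; i += 8)
		printf("safe loop: i = %llu\n", (unsigned long long)i);

	/*
	 * Broken "optimization": ccount - 8 wraps around to a huge
	 * unsigned value (2^64 - 5), so "i <= ccount - 8" is true and
	 * the loop would run far past the buffer.
	 */
	printf("ccount - 8 = %llu\n", (unsigned long long)(ccount - 8));
	return (0);
}
```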
The construction of parity looks good to me. Cool stuff! Adam

On Thu, Oct 22, 2015 at 4:43 PM, Matthew Ahrens [email protected] ...
2016-03-09 21:55 GMT+01:00 RageLtMan [email protected]:
Done. No conflicts :-)
That will take a while, if it happens. The parity logic is different, with ... Cordially, Romain Dolbeau
Thank you sir, even without abd, we can put this through some stress to ...
This is just a proof-of-concept to gauge the interest of doing the parity computation in vector registers. It has received very little testing. Don't use on production systems, obviously :-) SSE should work on any x86-64 CPU. AVX-128 should work on AVX-enabled CPUs, i.e. Sandy Bridge and later. AVX2 should work on Haswell and later. The exact variant is picked at runtime.
This is just a proof-of-concept to gauge the interest of doing the parity computation in vector registers and foster discussion on the subject.
I don't know if this is actually faster than the C code. And the parity function should be (somehow) selectable at runtime, so that AVX-128 and AVX2 could be leveraged as well.