[WIP] SIMD RAIDZ and Fletcher4 on top of openzfs-abd #5020
Force-pushed b207761 to bb5ef52
@behlendorf In 6445186 I used a function constructor attribute to initialize PAGESIZE for userspace. It's supposed to be portable. What do you think?
For some reason, this abd version is using kmap instead of kmap_atomic? Edit: it also uses fixed-size chunks, which loses the physically-contiguous page merging optimization.
The constructors are portable, but I'm usually not a big fan. Unless you're already familiar with the code base, they can make it unclear how something is getting initialized, or worse, why a function that isn't called anywhere from main() is getting run. This sort of thing has caused us problems in the past. That said, they are really useful, so if we can keep their usage minimal and fast I'm OK with it.
@tuxoko I don't know much about how this patchset came to be. It does not have …
@@ -412,13 +433,18 @@ vdev_raidz_map_alloc(zio_t *zio, uint64_t unit_shift, uint64_t dcols,
 	ASSERT3U(rm->rm_nskip, <=, nparity);

 	for (c = 0; c < rm->rm_firstdatacol; c++)
-		rm->rm_col[c].rc_data = zio_buf_alloc(rm->rm_col[c].rc_size);
+		rm->rm_col[c].rc_abd =
+		    abd_alloc_linear(rm->rm_col[c].rc_size, B_FALSE);
Why are we still using a linear buffer for parity? That means, for example, for each 16M block write on a raidz2 8+2 vdev, we are allocating two 2M linear buffers here.
I've changed it to scatter. But they probably started from an older version and didn't sync.
Yes, the original parity code in vdev_raidz.c still needs a linear buffer there.
So the illumos raidz code was very different from ours, since we vectorized the raidz code. We did a quick-and-dirty patch for now just to get something up and running, but now we're resolving the technical debt, e.g. a proper kmap_atomic/kunmap_atomic based map implementation.
vdev_raidz.c is virtually identical on illumos and ZoL. However, it does need significant porting effort. I guess it was easier to have a linear buffer for the parity methods.
The commit "DLPX-40252 integrate EP-476 compressed zfs send/receive" seems to do far more than compressed zfs send/receive. For example, it added abd.c and made vdev_raidz aware of ABD. It would be nice if the commit message matched the commit's code changes.
@thegreatgazoo @tuxoko see PR #5009 for the original Linux port of this abd patch. Maybe somebody there has more answers. This is where I ported vectorized raidz and fletcher on top of that patchset.
@ironMann Can you please give more details on your todo item "remove original raidz code? ABD buffers add lot of complexity to vdev_raidz.c"? I'm working on #3497, which relies on vdev_raidz.c for everything on parity (e.g. computation, verification, reconstruction, and so on, all based on the raidz_map_t structure). The code will be published soon but is not available now. I want to understand your planned changes and their impact on my code. Thanks!
@thegreatgazoo The idea is to handle parity computation/reconstruction in … Other than the separation of concerns, one of the reasons for removing the original parity code is that raidz3 is not handled efficiently (reconstruction using …). So, if you're fine with using the top-level parity methods, …
@thegreatgazoo It seems like something got messed up with the patches. The compressed send/recv shouldn't contain abd code. Not sure why it was merged in there.
		(char *)sabd->abd_u.abd_linear.abd_buf + off;
	} else {
		size_t new_offset = sabd->abd_u.abd_scatter.abd_offset + off;
		size_t chunkcnt = abd_scatter_chunkcnt(sabd) -
It seemed that chunkcnt calculation ignored the size parameter. If true, then:
- The returned abd may have more pages than the caller actually wanted. This can be wasteful, e.g. raidz calls abd_get_offset_size(16M_zio_abd, 0, 2M) to write the 1st 2M to the 1st data drive.
- When the returned abd is later freed, it may cause leaks as the abd may have more pages than abd->abd_size would indicate.
Good catch.
@ironMann OK, I redid the patchset for 5009 so it has the patches separated appropriately. The git diff of the branches comes up empty, so it should be exactly the same. EDIT: I need to go back and do a commit-by-commit build test since I apparently messed something up.
@dpquigl Any particular reason for this patch stack to have three distinct features: compressed ARC, compressed send/recv, and ABD? I feel it would be easier to test and review them separately.
@ironMann Unfortunately that's the way that Delphix developed them. Compressed ARC is the base they started this work on. Dan Kimmel laid out a series of something like 6 or 7 patches to do in order to get ABD to apply as closely as possible, with as little deviation as possible. So in short, ABD relies on compressed ARC, and their compressed send/recv feature deviates the code enough that applying ABD on top of the tree without it was a substantial amount of work. They also did some ARC refactoring, which made it deviate quite a bit.
Authored by: George Wilson <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Ported by: David Quigley <[email protected]>

This review covers the reading and writing of compressed arc headers, sharing data between the arc_hdr_t and the arc_buf_t, and the implementation of a new dbuf cache to keep frequently accessed data uncompressed.

I've added a new member to the l1 arc hdr called b_pdata. The b_pdata always hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block for that DVA. The physical block may or may not be compressed. If compressed arc is enabled and the block on-disk is compressed, then the b_pdata will match the block on-disk and remain compressed in memory. If the block on disk is not compressed, then neither will the b_pdata. Lastly, if compressed arc is disabled, then b_pdata will always be an uncompressed version of the on-disk block.

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict any arc_buf_t's that are no longer referenced. This means that the arc will primarily have compressed blocks, as the arc_buf_t's are considered overhead and are always uncompressed. When a consumer reads a block, we first look to see if the arc_buf_hdr_t is cached. If the hdr is cached, then we allocate a new arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If the hdr already has an arc_buf_t, then we will allocate an additional arc_buf_t and bcopy the uncompressed contents from the first arc_buf_t to the new one.

Writing to the compressed arc requires that we first discard the b_pdata since the physical block is about to be rewritten. The new data contents will be passed in via an arc_buf_t (uncompressed), and during the I/O pipeline stages we will copy the physical block contents to a newly allocated b_pdata.
When an l2arc is in use, it will also take advantage of the b_pdata. The l2arc will now always write the contents of b_pdata to the l2arc. This means that when compressed arc is enabled, the l2arc blocks are identical to those stored in the main data pool. This provides a significant advantage, since we can leverage the bp's checksum when reading from the l2arc to determine if the contents are valid. If the compressed arc is disabled, then we must first transform the read block to look like the physical block in the main data pool before comparing the checksum and determining its validity.

OpenZFS Issue: https://www.illumos.org/issues/6950
- userspace: aligned buffers. A minimum of 32B alignment is needed for AVX2; kernel buffers are aligned to 512B or more.
- add abd_get_offset_size() interface
- abd_iter_map(): fix calculation of iter_mapsize
- add abd_raidz_gen_iterate() and abd_raidz_rec_iterate()

Signed-off-by: Gvozden Neskovic <[email protected]>
Force-pushed 376ac8a to 1c59781
Force-pushed 6589fc5 to 7cca664
Enable vectorized raidz code on abd buffers

Signed-off-by: Gvozden Neskovic <[email protected]>
- export ABD compatible interface from fletcher_4
- add ABD fletcher_4 tests for data and metadata ABD types

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Isaac Huang <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
Force-pushed 431dfb8 to e98a238
@ironMann now that the compressed ARC and compressed send/recv changes have been merged to master, @dpquigl is going to open a new PR with the remaining ABD patches. I've asked him to apply your RAIDZ and fletcher4 patch stack on top, so we can get all the patches needed to date applied in one place. That should make it easier to test and review.
Some inline comments.
 * plan to store this ABD in memory for a long period of time, we should
 * allocate the ABD type that requires the least data copying to do the I/O.
 *
 * Currently this is linear ABDs, however if ldi_strategy() can ever issue I/Os
The code no longer explicitly uses linear ABDs.
@@ -1259,23 +1280,28 @@ vdev_label_sync(zio_t *zio, vdev_t *vd, int l, uint64_t txg, int flags)
 	 */
 	label = spa_config_generate(vd->vdev_spa, vd, txg, B_FALSE);

-	vp = zio_buf_alloc(sizeof (vdev_phys_t));
+	vp_abd = abd_alloc_for_io(sizeof (vdev_phys_t), B_TRUE);
Since a linear buffer is needed here, why not just abd_alloc_linear() and abd_to_buf(), to save a likely extra allocation and copy?
Since there's I/O performed with this ABD, I used abd_alloc_for_io(). I'm not really sure why this interface was introduced, or whether the metadata parameter will be used to allocate a linear ABD...
-	rm->rm_col[c].rc_data = zio->io_data;
+	rm->rm_col[c].rc_abd = abd_get_offset_size(zio->io_abd, 0,
Why not combine this into the for loop below:
for (off = 0; c < acols; c++) {
	rm->rm_col[c].rc_abd = abd_get_offset_size(zio->io_abd, off,
	    rm->rm_col[c].rc_size);
	off += rm->rm_col[c].rc_size;
}
Completely valid suggestion. But most of the code is adapted to ABD mechanically, line for line, as much as possible. That way it's easier to spot possible errors that could creep in.
	/* Copy the scatterlist starting at the correct offset */
	(void) memcpy(&abd->abd_u.abd_scatter.abd_chunks,
	    &sabd->abd_u.abd_scatter.abd_chunks[new_offset / PAGESIZE],
	    chunkcnt * sizeof (void *));
I'd rather use: sizeof(abd->abd_u.abd_scatter.abd_chunks[0]).
	} else {
		/* adjust good_data to point at the start of our column */
		good = good_data;

		offset = 0;
suggestion: move this assignment into the for() below.
@@ -106,49 +107,27 @@
 #define	VDEV_RAIDZ_Q	1
 #define	VDEV_RAIDZ_R	2
It appeared that the three macros above were no longer used.
- vdev_raidz
- zio, zio_checksum
- zfs_fm
- change abd_alloc_for_io() to use abd_alloc()

Signed-off-by: Gvozden Neskovic <[email protected]>
Force-pushed e98a238 to d95a3d4
Enable ZoL SIMD on top of #5009

TODO:
- Initialize PAGESIZE in .ctor for userspace
- Use kmap_atomic() for raidz iterators. Preemption is already disabled during SIMD calculation, and kmap() is "prone to deadlocks when using in a nested fashion". Also, performance.
- Remove original raidz code? ABD buffers add a lot of complexity to vdev_raidz.c