
[WIP] SIMD RAIDZ and Fletcher4 on top of openzfs-abd #5020

Closed
wants to merge 13 commits

Conversation

ironMann
Contributor

@ironMann ironMann commented Aug 24, 2016

Enable zol simd on top of #5009

TODO:

  • use SIMD raidz with ABD
  • use SIMD fletcher4 with ABD
  • init userspace PAGESIZE in .ctor
  • use kmap_atomic() for raidz iterators. Preemption is already disabled during simd calculation, and kmap() is "prone to deadlocks when using in a nested fashion". Also, performance.
  • remove original raidz code? ABD buffers add lot of complexity to vdev_raidz.c
  • generic abd in vdev_label.c
  • remove dependency on linear abd

@ironMann
Contributor Author

ironMann commented Aug 24, 2016

@behlendorf In 6445186 I used function constructor attribute to initialize PAGESIZE for userspace. It's supposed to be portable. What do you think?

@tuxoko
Contributor

tuxoko commented Aug 26, 2016

For some reason, this abd version is using kmap() instead of kmap_atomic()?
kmap() can block, which means it cannot be used in preempt-disabled contexts.

Edit: it also uses fixed-size chunks, which loses the physically-contiguous page merging optimization.

@behlendorf
Contributor

constructor attribute to initialize PAGESIZE for userspace.

The constructors are portable, but usually I'm not a big fan. Unless you're already familiar with the code base they can make it unclear how something is getting initialized, or worse, why a function which isn't called anywhere from main() is getting run. This sort of thing has caused us problems in the past.

That said, they are really useful so if we can keep their usage minimal and fast I'm OK with it.

@ironMann
Contributor Author

@tuxoko I don't know much about how this patchset came to be. It does not have kmap_atomic() and other optimizations; perhaps they are not needed on the illumos kernel.
I'm not sure it's worth adding them now, before the upstream abd patch stabilizes.

@@ -412,13 +433,18 @@ vdev_raidz_map_alloc(zio_t *zio, uint64_t unit_shift, uint64_t dcols,
 	ASSERT3U(rm->rm_nskip, <=, nparity);
 
 	for (c = 0; c < rm->rm_firstdatacol; c++)
-		rm->rm_col[c].rc_data = zio_buf_alloc(rm->rm_col[c].rc_size);
+		rm->rm_col[c].rc_abd =
+		    abd_alloc_linear(rm->rm_col[c].rc_size, B_FALSE);


Why are we still using linear buffers for parity? That means, for example, for each 16M block write on a raidz2 8+2 vdev, we are allocating two 2M linear buffers here.

Contributor


I've changed it to scatter. But they probably started from an older version and didn't sync.

Contributor Author


Yes, the original parity code in vdev_raidz.c still needs a linear buffer there.

Contributor


The illumos patch's raidz code was very different from ours, since we vectorized the raidz code. We did a quick-and-dirty patch just to get something up and running, but now we're resolving the technical debt, e.g. a proper kmap_atomic()/kunmap_atomic() based map implementation.

Contributor Author


vdev_raidz.c is virtually identical on illumos and zol. However, it does need significant porting effort. I guess it was easier to use linear buffers for the parity methods.

@thegreatgazoo

The commit "DLPX-40252 integrate EP-476 compressed zfs send/receive" seems to do far more than compressed zfs send/receive. For example, it added abd.c and made vdev_raidz aware of abd. It'd be nice if the commit message matched the commit's code changes.

@ironMann
Contributor Author

@thegreatgazoo @tuxoko see PR #5009 for original linux port of this abd patch. Maybe somebody there has more answers. This is where I ported vectorized raidz and fletcher on top of that patchset.

@thegreatgazoo

@ironMann Can you please give more details on your todo item "remove original raidz code? ABD buffers add lot of complexity to vdev_raidz.c"? I'm working on #3497, which relies on vdev_raidz.c for everything related to parity (computation, verification, reconstruction, and so on, all based on the raidz_map_t structure). The code will be published soon but isn't available yet. I want to understand your planned changes and their impact on my code. Thanks!

@ironMann
Contributor Author

@thegreatgazoo The idea is to handle parity computation/reconstruction in vdev_raidz_math.c, and to leave the raidz 'logic' bits where they are. With the addition of the new raidz methods, the top-level parity methods (vdev_raidz_generate_parity() and vdev_raidz_reconstruct()) are 'hot-patched' to call into raidz_math_gen/rec() and ignore the rest. What we call the 'original' raidz parity implementation can still be used, but only if the user explicitly selects it with the zfs_vdev_raidz_impl module parameter. Otherwise, all raidz logic bits, such as raidz_map_t, are 100% unchanged.

Apart from that separation of concerns, one of the reasons for removing the original parity code is that raidz3 is not handled efficiently (reconstruction goes through vdev_raidz_matrix_reconstruct()). Incidentally, this gets exacerbated by the abd patch.

So, if you're fine with using the top-level parity methods, vdev_raidz_generate_parity() and vdev_raidz_reconstruct(), your work can already benefit from the vectorized implementations. Otherwise, we can see how to make that happen.

@dpquigl
Contributor

dpquigl commented Sep 1, 2016

@thegreatgazoo It seems like something got messed up with the patches. The compressed send/recv commit shouldn't contain abd code. Not sure why it was merged in there.

		(char *)sabd->abd_u.abd_linear.abd_buf + off;
	} else {
		size_t new_offset = sabd->abd_u.abd_scatter.abd_offset + off;
		size_t chunkcnt = abd_scatter_chunkcnt(sabd) -


It seems that the chunkcnt calculation ignores the size parameter. If true, then:

  • The returned abd may have more pages than the caller actually wanted. This can be wasteful, e.g. raidz calls abd_get_offset_size(16M_zio_abd, 0, 2M) to write the 1st 2M to the 1st data drive.
  • When the returned abd is later freed, it may cause leaks as the abd may have more pages than abd->abd_size would indicate.

Contributor Author


good catch.

@dpquigl
Contributor

dpquigl commented Sep 2, 2016

@ironMann Ok, I redid the patchset for 5009 so it has the patches separated appropriately. The git diff of the branches comes up empty, so it should be exactly the same.

EDIT: I need to go back and do a commit-by-commit build test, since I apparently messed something up.

@ironMann
Contributor Author

ironMann commented Sep 2, 2016

@dpquigl Any particular reason for this patch stack to have 3 distinct features: compressed ARC, compressed send/recv, and ABD? I feel it would be easier to test and review them separately.

@dpquigl
Contributor

dpquigl commented Sep 2, 2016

@ironMann Unfortunately that's the way Delphix developed them. Compressed ARC is the base they started this work on. Dan Kimmel laid out a series of something like 6 or 7 patches to apply in order, to get ABD applied with as little deviation as possible. So in short, ABD relies on compressed ARC, and their compressed send/recv feature changes the code enough that applying ABD on top of a tree without it was a substantial amount of work. They also did some ARC refactoring, which made it deviate quite a bit.

grwilson and others added 6 commits September 2, 2016 16:47
Authored by: George Wilson <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Ported by: David Quigley <[email protected]>

This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently accessed data uncompressed.

I've added a new member to the l1 arc hdr called b_pdata. The b_pdata always hangs
off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block
for that DVA. The physical block may or may not be compressed. If compressed
arc is enabled and the block on-disk is compressed, then the b_pdata will match
the block on-disk and remain compressed in memory. If the block on disk is not
compressed, then neither is the b_pdata. Lastly, if compressed arc is
disabled, then b_pdata will always be an uncompressed version of the on-disk
block.

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has an arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.

Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.

When an l2arc is in use it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled, the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining it's valid.

OpenZFS Issue: https://www.illumos.org/issues/6950
- userspace: aligned buffers. A minimum of 32B alignment is needed for AVX2. Kernel buffers are aligned to 512B or more.

- add abd_get_offset_size() interface

- abd_iter_map(): fix calculation of iter_mapsize

- add abd_raidz_gen_iterate() and abd_raidz_rec_iterate()

Signed-off-by: Gvozden Neskovic <[email protected]>
@ironMann ironMann force-pushed the openzfs-abd-raidz branch 2 times, most recently from 376ac8a to 1c59781 on September 2, 2016 17:40
@ironMann ironMann changed the title [WIP] SIMD RAIDZ on top of openzfs-abd [WIP] SIMD RAIDZ and Fletcher4 on top of openzfs-abd Sep 3, 2016
Enable vectorized raidz code on abd buffers

Signed-off-by: Gvozden Neskovic <[email protected]>
- export ABD compatible interface from fletcher_4

- add ABD fletcher_4 tests for data and metadata ABD types.

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
@ironMann ironMann force-pushed the openzfs-abd-raidz branch 3 times, most recently from 431dfb8 to e98a238 on September 7, 2016 09:56
@behlendorf
Contributor

@ironMann now that the compressed ARC and compressed send/recv changes have been merged to master @dpquigl is going to open a new PR with the remaining ABD patches. I've asked him to apply your RAIDZ and fletcher4 patch stack on top so we can get all the patches needed to date applied in one place. That should make it easier to test and review.


@thegreatgazoo thegreatgazoo left a comment


Some inline comments.

* plan to store this ABD in memory for a long period of time, we should
* allocate the ABD type that requires the least data copying to do the I/O.
*
* Currently this is linear ABDs, however if ldi_strategy() can ever issue I/Os


The code no longer explicitly uses linear ABDs.

@@ -1259,23 +1280,28 @@ vdev_label_sync(zio_t *zio, vdev_t *vd, int l, uint64_t txg, int flags)
 	 */
 	label = spa_config_generate(vd->vdev_spa, vd, txg, B_FALSE);
 
-	vp = zio_buf_alloc(sizeof (vdev_phys_t));
+	vp_abd = abd_alloc_for_io(sizeof (vdev_phys_t), B_TRUE);


Since a linear buffer is needed here, why not just abd_alloc_linear() and abd_to_buf(), to save a likely additional allocation and copy?

Contributor Author


Since there's I/O performed with this ABD I used abd_alloc_for_io(). I'm not really sure why this interface was introduced, or whether the metadata parameter will be used to allocate a linear ABD...


-	rm->rm_col[c].rc_data = zio->io_data;
+	rm->rm_col[c].rc_abd = abd_get_offset_size(zio->io_abd, 0,


Why not combine this into the for loop below:

for (off = 0; c < acols; c++) {
    rm->rm_col[c].rc_abd = abd_get_offset_size(zio->io_abd, off,
          rm->rm_col[c].rc_size);
    off += rm->rm_col[c].rc_size;
}

Contributor Author


Completely valid suggestion. But most of the code is adapted to ABD mechanically, line for line, as much as possible. That way it's easier to spot possible errors that could creep in.

	/* Copy the scatterlist starting at the correct offset */
	(void) memcpy(&abd->abd_u.abd_scatter.abd_chunks,
	    &sabd->abd_u.abd_scatter.abd_chunks[new_offset / PAGESIZE],
	    chunkcnt * sizeof (void *));


I'd rather use: sizeof(abd->abd_u.abd_scatter.abd_chunks[0]).

} else {
/* adjust good_data to point at the start of our column */
good = good_data;

offset = 0;


suggestion: move this assignment into the for() below.

@@ -106,49 +107,27 @@
#define VDEV_RAIDZ_Q 1
#define VDEV_RAIDZ_R 2


It appeared that the three macros above were no longer used.

- vdev_raidz
- zio, zio_checksum
- zfs_fm
- change abd_alloc_for_io() to use abd_alloc()

Signed-off-by: Gvozden Neskovic <[email protected]>
@behlendorf behlendorf added the "Status: Work in Progress (Not yet ready for general review)" label Sep 30, 2016
@behlendorf
Contributor

@ironMann now that ABD is merged, could you open a new PR for each of the remaining patches here which need to be reviewed and merged? I believe that's just 9939ac0 and d95a3d4. I'm closing this issue, which is now a little confusing.

@behlendorf behlendorf closed this Nov 30, 2016