
[WIP] SIMD RAIDZ and Fletcher4 on top of openzfs-abd #5020

Closed
wants to merge 13 commits

Conversation

ironMann
Contributor

@ironMann ironMann commented Aug 24, 2016

Enable zol simd on top of #5009

TODO:

  • use SIMD raidz with ABD
  • use SIMD fletcher4 with ABD
  • init userspace PAGESIZE in .ctor
  • use kmap_atomic() for raidz iterators. Preemption is already disabled during simd calculation, and kmap() is "prone to deadlocks when using in a nested fashion". Also, performance.
  • remove original raidz code? ABD buffers add lot of complexity to vdev_raidz.c
  • generic abd in vdev_label.c
  • remove dependency on linear abd

@ironMann
Contributor Author

ironMann commented Aug 24, 2016

@behlendorf In 6445186 I used function constructor attribute to initialize PAGESIZE for userspace. It's supposed to be portable. What do you think?

@tuxoko
Contributor

tuxoko commented Aug 26, 2016

For some reason, this abd version is using kmap() instead of kmap_atomic()?
kmap() can block, which means it cannot be used in preempt-disabled contexts.

Edit: it also uses fixed-size chunks, which loses the physically-contiguous page merging optimization.

@behlendorf
Contributor

constructor attribute to initialize PAGESIZE for userspace.

The constructors are portable, but usually I'm not a big fan. Unless you're already familiar with the code base they can make it unclear how something is getting initialized, or worse, why a function which isn't called anywhere from main() is getting run. This sort of thing has caused us problems in the past.

That said, they are really useful so if we can keep their usage minimal and fast I'm OK with it.

@ironMann
Contributor Author

@tuxoko I don't know much about how this patchset came to be. It does not have kmap_atomic() and other optimizations; perhaps they are not needed on the illumos kernel.
I'm not sure it's worth adding them now, before the upstream abd patch stabilizes.

@@ -412,13 +433,18 @@ vdev_raidz_map_alloc(zio_t *zio, uint64_t unit_shift, uint64_t dcols,
 	ASSERT3U(rm->rm_nskip, <=, nparity);
 
 	for (c = 0; c < rm->rm_firstdatacol; c++)
-		rm->rm_col[c].rc_data = zio_buf_alloc(rm->rm_col[c].rc_size);
+		rm->rm_col[c].rc_abd =
+		    abd_alloc_linear(rm->rm_col[c].rc_size, B_FALSE);


Why are we still using linear buffers for parity? That means, for example, for each 16M block write on a raidz2 8+2 vdev, we are allocating two 2M linear buffers here.

Contributor


I've changed it to scatter. But they probably started from an older version and didn't sync.

Contributor Author


Yes, the original parity code in vdev_raidz.c still needs a linear buffer there.

Contributor


The illumos patch's raidz code was very different from ours, since we vectorized the raidz code. We did a quick-and-dirty patch just to get something up and running, but now we're resolving the technical debt, e.g. a proper kmap_atomic()/kunmap_atomic() based map implementation.

Contributor Author


vdev_raidz.c is virtually identical on illumos and zol. However, it does need significant porting effort. I guess it was easier to use linear buffers for the parity methods.

@thegreatgazoo

The commit "DLPX-40252 integrate EP-476 compressed zfs send/receive" seems to do far more than compressed zfs send/receive. For example, it added abd.c and made vdev_raidz aware of abd. It'd be nice if the commit message matched the commit's code changes.

@ironMann
Contributor Author

@thegreatgazoo @tuxoko see PR #5009 for original linux port of this abd patch. Maybe somebody there has more answers. This is where I ported vectorized raidz and fletcher on top of that patchset.

@thegreatgazoo

@ironMann Can you please give more details on your todo item "remove original raidz code? ABD buffers add lot of complexity to vdev_raidz.c"? I'm working on #3497, which relies on vdev_raidz.c for everything related to parity (computation, verification, reconstruction, and so on, all based on the raidz_map_t structure). The code will be published soon but isn't available yet. I want to understand your planned changes and their impact on my code. Thanks!

@ironMann
Contributor Author

@thegreatgazoo The idea is to handle parity computation/reconstruction in vdev_raidz_math.c, and to leave the raidz 'logic' bits where they are. With the addition of the new raidz methods, the top-level parity methods (vdev_raidz_generate_parity() and vdev_raidz_reconstruct()) are 'hot-patched' to call into raidz_math_gen/rec() and ignore the rest. What we call the 'original' raidz parity implementation can still be used, but only if the user explicitly selects it with the zfs_vdev_raidz_impl module parameter. Otherwise, all raidz logic bits, such as raidz_map_t, are 100% unchanged.

Apart from that separation of concerns, one of the reasons for removing the original parity code is that raidz3 is not handled efficiently (reconstruction goes through vdev_raidz_matrix_reconstruct()). Incidentally, this gets exacerbated by the abd patch.

So, if you're fine with using the top-level parity methods, vdev_raidz_generate_parity() and vdev_raidz_reconstruct(), your work can already benefit from the vectorized implementations. Otherwise, we can see how to make that happen.

@dpquigl
Contributor

dpquigl commented Sep 1, 2016

@thegreatgazoo It seems like something got messed up with the patches. The compressed send/recv commit shouldn't contain abd code. Not sure why it was merged in there.

		(char *)sabd->abd_u.abd_linear.abd_buf + off;
	} else {
		size_t new_offset = sabd->abd_u.abd_scatter.abd_offset + off;
		size_t chunkcnt = abd_scatter_chunkcnt(sabd) -


It seems that the chunkcnt calculation ignores the size parameter. If true, then:

  • The returned abd may have more pages than the caller actually wanted. This can be wasteful, e.g. raidz calls abd_get_offset_size(16M_zio_abd, 0, 2M) to write the 1st 2M to the 1st data drive.
  • When the returned abd is later freed, it may cause leaks as the abd may have more pages than abd->abd_size would indicate.

Contributor Author


good catch.

@dpquigl
Contributor

dpquigl commented Sep 2, 2016

@ironMann Ok, I redid the patchset for 5009 so it has the patches separated appropriately. The git diff of the branches comes up empty, so it should be exactly the same.

EDIT: I need to go back and do a commit-by-commit build test, since I apparently messed something up.

@ironMann
Contributor Author

ironMann commented Sep 2, 2016

@dpquigl Any particular reason for this patch stack to have 3 distinct features: compressed ARC, compressed send/recv, and ABD? I feel it would be easier to test and review them separately.

@dpquigl
Contributor

dpquigl commented Sep 2, 2016

@ironMann Unfortunately that's the way Delphix developed them. Compressed ARC is the base they started this work on. Dan Kimmel laid out a series of something like 6 or 7 patches to apply in order, to get ABD applied with as little deviation as possible. So in short, ABD relies on compressed ARC, and their compressed send/recv feature changes the code enough that applying ABD on top of a tree without it was a substantial amount of work. They also did some ARC refactoring, which made it deviate quite a bit.

grwilson and others added 6 commits September 2, 2016 16:47
Authored by: George Wilson <[email protected]>
Reviewed by: Prakash Surya <[email protected]>
Reviewed by: Dan Kimmel <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Paul Dagnelie <[email protected]>
Ported by: David Quigley <[email protected]>

This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently accessed data uncompressed.

I've added a new member to the l1 arc hdr called b_pdata. The b_pdata always hangs
off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block
for that DVA. The physical block may or may not be compressed. If compressed
arc is enabled and the block on-disk is compressed, then the b_pdata will match
the block on-disk and remain compressed in memory. If the block on disk is not
compressed, then neither is the b_pdata. Lastly, if compressed arc is
disabled, then b_pdata will always be an uncompressed version of the on-disk
block.

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has an arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.

Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.

When an l2arc is in use it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled, the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining it's valid.

OpenZFS Issue: https://www.illumos.org/issues/6950
- userspace: aligned buffers. A minimum of 32B alignment is needed for AVX2. Kernel buffers are aligned to 512B or more.

- add abd_get_offset_size() interface

- abd_iter_map(): fix calculation of iter_mapsize

- add abd_raidz_gen_iterate() and abd_raidz_rec_iterate()

Signed-off-by: Gvozden Neskovic <[email protected]>
@ironMann ironMann force-pushed the openzfs-abd-raidz branch 2 times, most recently from 376ac8a to 1c59781 on September 2, 2016 17:40
@ironMann ironMann changed the title [WIP] SIMD RAIDZ on top of openzfs-abd [WIP] SIMD RAIDZ and Fletcher4 on top of openzfs-abd Sep 3, 2016
Enable vectorized raidz code on abd buffers

Signed-off-by: Gvozden Neskovic <[email protected]>
- export ABD compatible interface from fletcher_4

- add ABD fletcher_4 tests for data and metadata ABD types.

Signed-off-by: Gvozden Neskovic <[email protected]>
Signed-off-by: Gvozden Neskovic <[email protected]>
@ironMann ironMann force-pushed the openzfs-abd-raidz branch 3 times, most recently from 431dfb8 to e98a238 on September 7, 2016 09:56
@behlendorf
Contributor

@ironMann now that the compressed ARC and compressed send/recv changes have been merged to master @dpquigl is going to open a new PR with the remaining ABD patches. I've asked him to apply your RAIDZ and fletcher4 patch stack on top so we can get all the patches needed to date applied in one place. That should make it easier to test and review.


@thegreatgazoo thegreatgazoo left a comment


Some inline comments.

* plan to store this ABD in memory for a long period of time, we should
* allocate the ABD type that requires the least data copying to do the I/O.
*
* Currently this is linear ABDs, however if ldi_strategy() can ever issue I/Os


The code no longer explicitly uses linear ABDs.

@@ -1259,23 +1280,28 @@ vdev_label_sync(zio_t *zio, vdev_t *vd, int l, uint64_t txg, int flags)
 	 */
 	label = spa_config_generate(vd->vdev_spa, vd, txg, B_FALSE);
 
-	vp = zio_buf_alloc(sizeof (vdev_phys_t));
+	vp_abd = abd_alloc_for_io(sizeof (vdev_phys_t), B_TRUE);


Since a linear buffer is needed here, why not just abd_alloc_linear() and abd_to_buf(), to save a likely additional allocation and copy?

Contributor Author


Since there's I/O performed with this ABD I used abd_alloc_for_io(). I'm not really sure why this interface was introduced, or whether the metadata parameter will be used to allocate a linear ABD...


-	rm->rm_col[c].rc_data = zio->io_data;
+	rm->rm_col[c].rc_abd = abd_get_offset_size(zio->io_abd, 0,


Why not combine this into the for loop below:

for (off = 0; c < acols; c++) {
    rm->rm_col[c].rc_abd = abd_get_offset_size(zio->io_abd, off,
          rm->rm_col[c].rc_size);
    off += rm->rm_col[c].rc_size;
}

Contributor Author


Completely valid suggestion. But most of the code is adapted to ABD mechanically, line for line, as much as possible. That way it's easier to spot possible errors that could creep in.

	/* Copy the scatterlist starting at the correct offset */
	(void) memcpy(&abd->abd_u.abd_scatter.abd_chunks,
	    &sabd->abd_u.abd_scatter.abd_chunks[new_offset / PAGESIZE],
	    chunkcnt * sizeof (void *));


I'd rather use: sizeof(abd->abd_u.abd_scatter.abd_chunks[0]).

} else {
/* adjust good_data to point at the start of our column */
good = good_data;

offset = 0;


suggestion: move this assignment into the for() below.

@@ -106,49 +107,27 @@
#define VDEV_RAIDZ_Q 1
#define VDEV_RAIDZ_R 2


It appeared that the three macros above were no longer used.

- vdev_raidz
- zio, zio_checksum
- zfs_fm
- change abd_alloc_for_io() to use abd_alloc()

Signed-off-by: Gvozden Neskovic <[email protected]>
@behlendorf behlendorf added the "Status: Work in Progress (Not yet ready for general review)" label Sep 30, 2016
@behlendorf
Contributor

@ironMann now that ABD is merged, could you open a new PR for each of the remaining patches here which need to be reviewed and merged? I believe that's just 9939ac0 and d95a3d4. I'm closing this issue, which is now a little confusing.

@behlendorf behlendorf closed this Nov 30, 2016