zdb fails to import pool because asize < vdev_min_asize in draid top-level vdev #11459

Closed
mmaybee opened this issue Jan 13, 2021 · 1 comment · Fixed by #12221
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@mmaybee
Contributor

mmaybee commented Jan 13, 2021

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 18.04
Linux Kernel 5.4.0-42
Architecture x86
ZFS Version 2.0++
Commit 8f158ae

Describe the problem you're observing

A zloop run failed without producing a core file. ztest.out shows that the failure comes from zdb (attempting to verify a pool) returning an error (EINVAL):

Executing zdb -bccsv -G -d -Y -e -y -p /var/tmp/os-ztest/zloop-run ztest
zdb: can't open 'ztest': Invalid argument

% grep -i einval /usr/include/asm-generic/errno-base.h
#define EINVAL          22      /* Invalid argument */

The actual error comes from a call to vdev_open(), which occurs prior to the spa_load_failed() call that generated the error message above; vdev_open() is the function returning the reported EINVAL. Within vdev_open(), the error is set because asize < vdev_min_asize for a top-level vdev:

        /*
         * Make sure the allocatable size hasn't shrunk too much.
         */
        if (asize < vd->vdev_min_asize) {
                vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
                    VDEV_AUX_BAD_LABEL);
                return (SET_ERROR(EINVAL));
        }

Here are the current values being compared (note that asize comes from osize in this function):

(gdb) print osize
$21 = 4798283776
(gdb) print vd->vdev_min_asize
$22 = 4831838208
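
For what it's worth (my arithmetic, not part of the gdb session above), the shortfall here is:

  4831838208 - 4798283776 = 33554432 bytes

which is exactly 32 MiB, the same magnitude as the per-child scratch reserve discussed in the analysis further down.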

The top-level vdev here is of type draid. The asize of a draid vdev is computed by summing the asizes of its children (minus the space reserved for distributed spares). Note that there is a spare device currently deployed in this top-level vdev as child vdev 4:

vdev.c:195:vdev_dbgmsg_print_tree():   vdev 0: root, guid: 7397569113689487881, path: N/A, can't open
vdev.c:195:vdev_dbgmsg_print_tree():     vdev 0: draid, guid: 4525148408276004100, path: N/A, can't open
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 0: file, guid: 16492265217803933395, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.0a, healthy
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 1: file, guid: 13189481552791461187, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.1a, healthy
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 2: file, guid: 1960318212727225725, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.2b, healthy
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 3: file, guid: 795303241160842783, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.3a, healthy
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 4: spare, guid: 17473580923192177435, path: N/A, healthy
vdev.c:195:vdev_dbgmsg_print_tree():         vdev 0: file, guid: 5222868485229018091, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.4a, healthy
vdev.c:195:vdev_dbgmsg_print_tree():         vdev 1: dspare, guid: 2629688768943859102, path: draid2-0-1, healthy
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 5: file, guid: 17532226906533716578, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.5a, healthy
...

Looking at the sizes for this spare and its children, we see:

(gdb) print vd->vdev_child[4]->vdev_asize
$17 = 483131392
(gdb) print vd->vdev_child[4]->vdev_children
$18 = 2
(gdb) print vd->vdev_child[4]->vdev_child[0]->vdev_asize
$19 = 532152320
(gdb) print vd->vdev_child[4]->vdev_child[1]->vdev_asize
$20 = 483131392

The asize for the dspare is significantly smaller than the asize of the device it is sparing! The spare parent vdev reports the smaller vdev size. All other child vdevs in this draid report the larger size:

(gdb) set $i = 0
(gdb) p vd->vdev_child[$i++]->vdev_asize
$36 = 532152320
(gdb) 
$37 = 532152320
(gdb) 
$38 = 483131392
(gdb) 
$39 = 532152320
(gdb) 
$40 = 483131392
(gdb) 
$41 = 532152320
(gdb) 
$42 = 532152320
(gdb) 
$43 = 532152320
(gdb) 
$44 = 532152320
(gdb) 
$45 = 532152320
(gdb) 
$46 = 532152320
(gdb) 
$47 = 532152320
(gdb) 
$48 = 532152320

This difference in asize likely explains the unexpectedly small asize for the top-level vdev that generated this error. However, more investigation is needed to determine why the dspare has a smaller-than-expected size here.
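
As context for why the spare parent at child 4 reports the smaller of its two children's sizes: for sizing purposes a spare parent effectively behaves like a mirror, which can only promise as much allocatable space as its smallest child. A minimal sketch of that behavior, using a hypothetical helper name rather than the actual vdev_mirror_open() code:

#include <stdint.h>

/*
 * Hedged sketch, not OpenZFS code: a mirror/spare-style parent
 * advertises the minimum asize over its children, so the undersized
 * dspare (483131392 bytes) drags the whole spare group below the
 * healthy leaf size (532152320 bytes).
 */
static uint64_t
sketch_spare_parent_asize(const uint64_t *child_asize, int children)
{
        uint64_t asize = UINT64_MAX;

        for (int c = 0; c < children; c++) {
                if (child_asize[c] < asize)
                        asize = child_asize[c];
        }
        return (asize);
}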

Describe how to reproduce the problem

This is reproducible with zloop.

@mmaybee mmaybee added Type: Defect Incorrect behavior (e.g. crash, hang) Status: Triage Needed New issue which needs to be triaged labels Jan 13, 2021
@behlendorf behlendorf added Component: Test Suite Indicates an issue with the test framework or a test case and removed Status: Triage Needed New issue which needs to be triaged labels Jan 13, 2021
@ahrens
Member

ahrens commented Jun 5, 2021

The problem is with how the expected child vdev size is calculated. vdev_draid_min_asize() says that the child’s asize has to be at least 1/Nth of the entire draid’s asize, which is the same logic as raidz. However, this contradicts the code in vdev_draid_open(), which calculates the draid’s asize based on a reduced child size:

  ...rounded down to last full row and then to the last full group. An
  additional 32MB of scratch space is reserved at the end of each child
  for use by the dRAID expansion feature

So the problem is that you can replace a draid disk with one that’s vdev_draid_min_asize(), but it actually needs to be larger to accommodate the additional 32MB plus the last-row rounding. The replacement is allowed and everything works at first (since the reserved space is at the end, and we don’t try to use it yet), but when you try to close and reopen the pool, vdev_draid_open() calculates a smaller asize for the draid, because of the smaller leaf.

I think the confusion is that vdev_draid_min_asize() is correctly returning the amount of required allocatable space in a leaf, but the actual SIZE of the leaf needs to be at least 32MB more than that. ztest_vdev_attach_detach() assumes that it can attach that size of device, and it actually can (the kernel/libzpool accepts it), but it then later causes this problem. I think we want vdev_draid_min_asize() to return the required size of the leaf, not the size that draid will make available to the metaslab allocator.
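
To make the mismatch concrete, here is a simplified sketch of the two calculations described above (hypothetical names and simplified rounding, not the actual OpenZFS functions; the 32 MB reserve constant is an assumption standing in for the real row/group logic):

#include <stdint.h>

#define SKETCH_RESERVE  (32ULL << 20)   /* assumed 32 MB per-child scratch space */

/*
 * Pre-fix expectation: a replacement leaf only has to supply 1/Nth of
 * the top-level dRAID asize (raidz-style logic).
 */
static uint64_t
sketch_min_leaf_asize(uint64_t draid_asize, uint64_t nchildren)
{
        return (draid_asize / nchildren);
}

/*
 * What the open path effectively gets from a leaf: the scratch space
 * comes off the end and the remainder is rounded down to a full row,
 * so a leaf that is exactly 1/Nth big yields less than 1/Nth of usable
 * space, and the recomputed top-level asize shrinks on reopen.
 */
static uint64_t
sketch_usable_leaf_asize(uint64_t leaf_asize, uint64_t rowsize)
{
        uint64_t usable = leaf_asize - SKETCH_RESERVE;

        return (usable - (usable % rowsize));
}

With the numbers from the gdb session above, a leaf sized at exactly the 1/Nth minimum loses the reserve and the rounding on reopen, which is the shrinkage the vdev_open() check then rejects.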

@ahrens ahrens removed the Component: Test Suite Indicates an issue with the test framework or a test case label Jun 5, 2021
ahrens added a commit to ahrens/zfs that referenced this issue Jun 10, 2021
behlendorf pushed a commit that referenced this issue Jun 13, 2021
vdev_draid_min_asize() returns the minimum size of a child vdev.  This
is used when determining if a disk is big enough to replace a child.
It's also used by zdb to determine how big of a child to make to test
replacement.

vdev_draid_min_asize() says that the child’s asize has to be at least
1/Nth of the entire draid’s asize, which is the same logic as raidz.
However, this contradicts the code in vdev_draid_open(), which
calculates the draid’s asize based on a reduced child size:

  An additional 32MB of scratch space is reserved at the end of each
  child for use by the dRAID expansion feature

So the problem is that you can replace a draid disk with one that’s
vdev_draid_min_asize(), but it actually needs to be larger to accommodate
the additional 32MB.  The replacement is allowed and everything works at
first (since the reserved space is at the end, and we don’t try to use
it yet), but when you try to close and reopen the pool,
vdev_draid_open() calculates a smaller asize for the draid, because of
the smaller leaf, which is not allowed.

I think the confusion is that vdev_draid_min_asize() is correctly
returning the amount of required *allocatable* space in a leaf, but the
actual *size* of the leaf needs to be at least 32MB more than that.
ztest_vdev_attach_detach() assumes that it can attach that size of
device, and it actually can (the kernel/libzpool accepts it), but it
then later causes zdb to not be able to open the pool.

This commit changes vdev_draid_min_asize() to return the required size
of the leaf, not the size that draid will make available to the metaslab
allocator.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11459
Closes #12221
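
In outline, the fix amounts to charging the per-leaf reserve on top of the (rounded-up) 1/Nth share when computing the minimum child size. A hedged sketch under assumed names; the authoritative change is the patch in #12221:

#include <stdint.h>

#define SKETCH_RESERVE  (32ULL << 20)   /* assumed 32 MB per-child scratch space */

/*
 * Hedged sketch of the post-fix requirement: a replacement leaf must be
 * big enough to hold its share of allocatable space plus the reserved
 * scratch space, so a device accepted at attach time still satisfies
 * vdev_draid_open() on the next reopen.  Hypothetical names.
 */
static uint64_t
sketch_fixed_min_leaf_asize(uint64_t draid_min_asize, uint64_t ndisks)
{
        return (SKETCH_RESERVE + (draid_min_asize + ndisks - 1) / ndisks);
}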
behlendorf pushed a commit that referenced this issue Jun 16, 2021