ztest: vdev.c:1746: Assertion `scn->scn_phys.scn_min_txg <= vdev_dtl_min(vd) #2302

Closed
behlendorf opened this issue May 5, 2014 · 6 comments
Labels
Component: Test Suite Indicates an issue with the test framework or a test case

@behlendorf
Contributor

A long-standing issue that ztest hits occasionally.

5 vdevs, 7 datasets, 23 threads, 600 seconds...
Pass   1,  SIGKILL,   0 ENOSPC, 15.4% of  270M used,   8% done,    9m11s to go
Pass   2,  SIGKILL,   0 ENOSPC, 21.7% of  270M used,  18% done,    8m11s to go
Pass   3,  SIGKILL,   0 ENOSPC, 22.1% of  270M used,  20% done,    8m00s to go
Pass   4,  SIGKILL,   0 ENOSPC, 14.2% of  508M used,  30% done,    6m58s to go
Pass   5,  SIGKILL,   0 ENOSPC, 16.1% of  508M used,  36% done,    6m23s to go
Pass   6,  SIGKILL,   0 ENOSPC, 17.5% of  508M used,  45% done,    5m27s to go
Pass   7,  SIGKILL,   0 ENOSPC, 17.1% of  508M used,  47% done,    5m16s to go
Pass   8,  SIGKILL,   0 ENOSPC, 16.9% of  508M used,  52% done,    4m48s to go
Pass   9, Complete,   0 ENOSPC, 15.9% of  576M used,  65% done,    3m31s to go
Pass  10,  SIGKILL,   0 ENOSPC, 17.6% of  576M used,  75% done,    2m29s to go
ztest: ../../module/zfs/vdev.c:1746: Assertion `scn->scn_phys.scn_min_txg <= vdev_dtl_min(vd) (0x4e3 <= 0x3)' failed.
child died with signal 6
@csiden

csiden commented May 25, 2014

It looks like we have a bug filed for this internally at Delphix. I copied George's analysis to an illumos bug (https://www.illumos.org/issues/4890). From what I can tell, we do not have a fix for this yet.

nedbass added a commit to nedbass/zfs that referenced this issue Aug 18, 2014
An offline or unreadable vdev may trip the following assertion
during a resilver scan in ztest:

ztest: ../../module/zfs/vdev.c:1746: Assertion
`scn->scn_phys.scn_min_txg <= vdev_dtl_min(vd) (0x4e3 <= 0x3)' failed.
child died with signal 6

The following analysis is by George Wilson:

 The current scan is only resilvering a few txgs [70f, 711], yet
 this vdev has a min txg of 3. The problem is that this vdev is
 currently not readable, so the scan that was doing the resilver
 finished without copying any data to this device.

 Now a second scan comes through and the device is still offline (i.e.
 not readable), so once again no data was copied over to it. This time,
 when we check whether we should excise the DTLs from this device, we
 determine that we should, since the scan is for a txg much higher than
 the max value in this device's DTL range, but we end up tripping over
 this assertion:

        /*
         * When a resilver is initiated the scan will assign the
         * scn_max_txg value to the highest txg value that exists in
         * all DTLs. If this device's max DTL is not part of this scan
         * (i.e. it is not in the range (scn_min_txg, scn_max_txg])
         * then it is not eligible for excision.
         */
        if (vdev_dtl_max(vd) <= scn->scn_phys.scn_max_txg) {
                ASSERT3U(scn->scn_phys.scn_min_txg, <=, vdev_dtl_min(vd));

 If the device is not readable then we don't want to ever excise any
 of its DTLs, so we should return B_FALSE and not even bother with
 anything further.

References: https://www.illumos.org/issues/4890

Issue openzfs#2302

Signed-off-by: Ned Bass <[email protected]>
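
The guard proposed in the commit message above can be sketched roughly as follows. This is a reduced model and not the actual patch: the `sketch_scan_t` and `sketch_vdev_t` types and the `sketch_should_excise()` function are hypothetical stand-ins for the real scan and vdev state, and the `readable` flag is an assumed simplification of however the real code tests readability. It only illustrates how returning early for an unreadable device keeps the assertion from being evaluated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the scan's txg range. */
typedef struct sketch_scan {
	uint64_t min_txg;
	uint64_t max_txg;
} sketch_scan_t;

/* Hypothetical, reduced view of a vdev and its DTL bounds. */
typedef struct sketch_vdev {
	bool readable;		/* assumed flag, not a real vdev_t field */
	uint64_t dtl_min;
	uint64_t dtl_max;
} sketch_vdev_t;

static bool
sketch_should_excise(const sketch_scan_t *scn, const sketch_vdev_t *vd)
{
	/*
	 * Proposed guard: an unreadable device received no resilvered
	 * data, so its DTLs must never be excised.
	 */
	if (!vd->readable)
		return (false);

	/* Eligible only if the device's max DTL is covered by the scan. */
	if (vd->dtl_max <= scn->max_txg) {
		/* Stands in for the ASSERT3U quoted above. */
		assert(scn->min_txg <= vd->dtl_min);
		return (true);
	}

	return (false);
}
```

With a scan range of [0x70f, 0x711] and an unreadable device whose DTL minimum is 0x3, this sketch returns false before the assertion is reached.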
@ryao
Contributor

ryao commented Oct 9, 2014

I reproduced this locally and took the opportunity to examine the core dump. Contrary to George's analysis, the vdev involved is not offline according to its vdev_t.

That said, vdev_dtl_min() will return the value of (space_seg_t *)vd->vdev_dtl[DTL_MISSING].sm_root->avl_root->ss_start - 1 on older versions of the code or vd->vdev_dtl[DTL_MISSING]->rt_root->rs_start - 1 on the latest code. This implies that ->ss_start/->rs_start is 0x4. Coincidentally, TXG_INITIAL is 0x4 and we call vdev_dtl_dirty() to set that to TXG_INITIAL in spa_vdev_attach():

        vdev_dtl_dirty(newvd, DTL_MISSING, TXG_INITIAL,
            dtl_max_txg - TXG_INITIAL);
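
The arithmetic behind that inference can be checked with a small standalone sketch. The `dtl_seg_t` type and the `sketch_*` helpers are hypothetical and only model the "start - 1" computation and the asserted comparison described above; they are not the real range tree or vdev code. The sketch also assumes that the last two arguments to vdev_dtl_dirty() are a starting txg and a length.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value of TXG_INITIAL as stated above. */
#define	TXG_INITIAL	4ULL

/* Hypothetical stand-in for the first segment of a DTL range tree. */
typedef struct dtl_seg {
	uint64_t start;	/* first txg covered by the segment */
	uint64_t size;	/* number of txgs covered */
} dtl_seg_t;

/* Models the "start - 1" computation attributed to vdev_dtl_min(). */
static uint64_t
sketch_dtl_min(const dtl_seg_t *first)
{
	return (first->start - 1);
}

/* Models the comparison made by the failing ASSERT3U. */
static bool
sketch_assert_holds(uint64_t scn_min_txg, const dtl_seg_t *first)
{
	return (scn_min_txg <= sketch_dtl_min(first));
}
```

A segment starting at TXG_INITIAL gives a minimum of 0x3, so the comparison fails for a scan minimum of 0x4e3, which matches the values printed in the assertion message.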

My preliminary guess is that ztest attached a vdev when it was running a scrub. That said, the pool where this was triggered has an unusual geometry:

http://dpaste.com/1JGY5XP

@behlendorf
Contributor Author

@ryao yes there's some additional debugging on this one in #2613.

@ryao
Contributor

ryao commented Mar 13, 2015

This might be fixed by #3172. However, we would need to stress test with ztest for an extended period without hitting this before we could be confident of that.

@behlendorf
Contributor Author

Likely fixed by #4790.

@behlendorf
Contributor Author

Fixed by #4790.


3 participants