Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vdev_random_leaf() can loop forever #6631

Closed
tcaputi opened this issue Sep 11, 2017 · 6 comments · Fixed by #6665
Closed

vdev_random_leaf() can loop forever #6631

tcaputi opened this issue Sep 11, 2017 · 6 comments · Fixed by #6665
Assignees

Comments

@tcaputi
Copy link
Contributor

tcaputi commented Sep 11, 2017

Describe the problem you're observing

While adding encrypted dataset support to ztest I cam across an issue with the mmp code. ztest began to eat 100% CPU without making any forward progress. When I examined the process with gdb, I found that vdev_random_leaf() was looping infinitely. It seems that ztest had gotten the pool into a state where it had one top-level mirror vdev with only one child which was not writable. This causes the following code to loop infinitely:

	while (!vd->vdev_ops->vdev_op_leaf) {
		child = vd->vdev_child[spa_get_random(vd->vdev_children)];

		if (!vdev_writeable(child))
			continue;

        ...

Describe how to reproduce the problem

This came up twice while doing lots of zloop runs over the course of a week. I suspect it should be fairly easy to reproduce by creating a pool with the layout I described above.

@behlendorf
Copy link
Contributor

@ofaaland can you please take a look at this. It looks to me like we want to update vdev_random_leaf() such that it fails and returns NULL when there are no valid writable vdevs.

@ofaaland
Copy link
Contributor

@behlendorf it does that now, so something else is amiss. I'll take a look.

@behlendorf
Copy link
Contributor

@ofaaland I believe it will spin if you keep hitting the if (!vdev_writeable(child)) case where it calls continue.

@ofaaland
Copy link
Contributor

@behlendorf yes, but if that happens I believe that means vdev_writeable(vd) == B_TRUE) but vdev_writeable(vd->vdev_child[X]) == B_FALSE for all X. I had believed this to be impossible, because SCL_STATE is held for the duration of the call into vdev_writeable(vd).

Perhaps I'm wrong. That is, that a vdev can be writeable when its children are not writeable, presumably because an error has not yet percolated up the vdev tree. Do you know?

Otherwise, it seems like I'll have to look at how the vdev state changes occur to be able to tell whether this is a ztest flaw, state change flaw, or vdev_random_leaf() flaw.

@behlendorf
Copy link
Contributor

I don't believe SCL_STATE must be held when transitioning an individual vdev to a different state. Although that's commonly the case. I think it only must be taken to prevent the addition/removal of a vdev.

@ofaaland
Copy link
Contributor

OK, that certainly would explain this.

ofaaland added a commit to ofaaland/zfs that referenced this issue Sep 22, 2017
Rename it as mmp_random_leaf() since it is defined in mmp.c.

The earlier implementation could end up spinning forever if a pool had a
vdev marked writeable, none of whose children were writeable.  It also
did not guarantee that if a writeable leaf vdev existed, it would be
found.

Reimplement to recursively walk the device tree to select the leaf.  It
searches the entire tree, so that a return value of (NULL) indicates
there were no usable leaves in the pool; all were either not writeable
or had pending mmp writes.

It still chooses the starting child randomly at each level of the tree,
so if the pool's devices are healthy, the mmp writes go to random leaves
with an even distribution.  This was verified by testing using
zfs_multihost_history enabled.

Fixes openzfs#6631

Signed-off-by: Olaf Faaland <[email protected]>
behlendorf pushed a commit that referenced this issue Sep 22, 2017
Rename it as mmp_random_leaf() since it is defined in mmp.c.

The earlier implementation could end up spinning forever if a pool had a
vdev marked writeable, none of whose children were writeable.  It also
did not guarantee that if a writeable leaf vdev existed, it would be
found.

Reimplement to recursively walk the device tree to select the leaf.  It
searches the entire tree, so that a return value of (NULL) indicates
there were no usable leaves in the pool; all were either not writeable
or had pending mmp writes.

It still chooses the starting child randomly at each level of the tree,
so if the pool's devices are healthy, the mmp writes go to random leaves
with an even distribution.  This was verified by testing using
zfs_multihost_history enabled.

Reviewed by: Thomas Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes #6631 
Closes #6665
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Nov 21, 2017
Rename it as mmp_random_leaf() since it is defined in mmp.c.

The earlier implementation could end up spinning forever if a pool had a
vdev marked writeable, none of whose children were writeable.  It also
did not guarantee that if a writeable leaf vdev existed, it would be
found.

Reimplement to recursively walk the device tree to select the leaf.  It
searches the entire tree, so that a return value of (NULL) indicates
there were no usable leaves in the pool; all were either not writeable
or had pending mmp writes.

It still chooses the starting child randomly at each level of the tree,
so if the pool's devices are healthy, the mmp writes go to random leaves
with an even distribution.  This was verified by testing using
zfs_multihost_history enabled.

Reviewed by: Thomas Caputi <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Giuseppe Di Natale <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes openzfs#6631 
Closes openzfs#6665
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants