Too many vdev probe errors should suspend pool #16864
Reviewed-by: Allan Jude <[email protected]>
tests/zfs-tests/tests/functional/fault/suspend_on_probe_errors.ksh
7cdaa4d to 5645a2f
@don-brady CI is very unhappy on |
It seems to be detecting pool errors in this failure case. That's concerning but I wonder if it's somehow mistaken.
Some of the test failures occurred because the resilver had not completed when the scrub was requested, so the scrub request failed. That was fixed by adding zpool wait -t resilver. In the other case, some files were still reported as corrupted after the scrub. However, if the pool is scrubbed a second time, those errors go away. I've observed this phenomenon in practice but am not sure of the underlying reason.
Similar to what we saw in openzfs#16569, we need to consider that a replacing vdev should not be considered as fully contributing to the redundancy of a raidz vdev even though current IO has enough redundancy.

When a failed vdev_probe() is faulting a disk, it now checks if that disk is required, and if so it suspends the pool until the admin can return the missing disks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
5645a2f to ce8012b
Motivation and Context
Similar to what we saw in #16569, we need to account for the fact that a replacing vdev should not be treated as fully contributing to the redundancy of a raidz vdev, even though current IO has enough redundancy. I have seen raidz3 pools where 4 disks were missing (one involved in a replacing vdev) and the pool was still online and taking IO. This case differs from #16569 in that ZED was not running, so the vdev_probe() errors are driving the diagnosis here.
Description
When a failed vdev_probe() is faulting a disk, it now checks if that disk is required, and if so it suspends the pool until the admin can return the missing disks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
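The decision described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the OpenZFS implementation: the names Child, is_required and on_probe_failure are hypothetical, the 8-disk raidz3 layout is made up for illustration, and the real code's bookkeeping and exact thresholds may differ. The one idea it tries to capture is that a child taking part in a replacement is not counted toward redundancy when deciding whether a disk that failed its probe is still required.

```python
from dataclasses import dataclass


@dataclass
class Child:
    """Hypothetical stand-in for one child of a raidz vdev."""
    name: str
    healthy: bool = True      # still answering probes
    replacing: bool = False   # part of an in-progress replacement


def is_required(children, parity, failing):
    """Return True if the vdev cannot afford to lose `failing`.

    Assumption of this sketch: a replacing child never counts as
    fully contributing, even if it is currently serving IO.
    """
    needed = len(children) - parity
    remaining = sum(
        1 for c in children
        if c is not failing and c.healthy and not c.replacing
    )
    return remaining < needed


def on_probe_failure(children, parity, failing):
    """Model the outcome of a failed probe: fault the disk or suspend."""
    if is_required(children, parity, failing):
        return "suspend"      # wait for the admin to return the disks
    failing.healthy = False   # redundancy still covers the loss
    return "fault"


if __name__ == "__main__":
    # Hypothetical 8-wide raidz3 where d0 is being replaced.
    disks = [Child(f"d{i}") for i in range(8)]
    disks[0].replacing = True
    for i in (1, 2, 3):
        print(disks[i].name, on_probe_failure(disks, 3, disks[i]))
```

In this model, the third probe failure suspends the pool when one child is being replaced, whereas the same layout without a replacing child tolerates three faults and suspends only on the fourth.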
How Has This Been Tested?
Added a new test that verifies that probe errors from 4 disks on a raidz3 with a replacing vdev will suspend the pool. Before the change, the pool would not suspend.