Restrict raidz faulted vdev count #16569
Conversation
@don-brady can you rebase this on the latest master code? We cut over to using GitHub Actions fully for the CI, and this change needs to include those commits so it can run the tests.
Force-pushed from c5a7db5 to 29955cd
Rebased to latest master branch.
It looked right before the ZTS system was running, but afterwards something does not work as intended, I think :/
Force-pushed from 29955cd to 4c639ad
Force-pushed from 31a4ab5 to 2d5059b
Fixed the failing test cases. It turns out that the original fix was breaking vdev replace.
Thanks, Don
Specifically, a child in a replacing vdev won't count when assessing the DTL during a vdev_fault().
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
Force-pushed from 2d5059b to ec26dbb
Specifically, a child in a replacing vdev won't count when assessing the DTL during a vdev_fault().
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes openzfs#16569
Similar to what we saw in openzfs#16569, we need to consider that a replacing vdev should not be considered as fully contributing to the redundancy of a raidz vdev, even though current IO has enough redundancy. When a failed vdev_probe() is faulting a disk, it now checks if that disk is required, and if so it suspends the pool until the admin can return the missing disks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
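To make the intent of that follow-up concrete, here is a minimal, self-contained sketch of the decision it describes. This is not the actual vdev_probe() code path; the names on_probe_failure and is_required are invented for illustration, and the real check would consult the pool's DTL and redundancy state.

```c
/*
 * Toy decision sketch (not the actual OpenZFS probe-failure code): when a
 * probe failure would fault a disk, first check whether the disk is still
 * required for redundancy; if it is, suspend the pool instead of faulting
 * it, so the admin can return the missing disks.
 */
#include <stdbool.h>
#include <stdio.h>

enum action { FAULT_DISK, SUSPEND_POOL };

/* is_required: would losing this disk leave some data unreadable? */
static enum action
on_probe_failure(bool is_required)
{
	return (is_required ? SUSPEND_POOL : FAULT_DISK);
}

int
main(void)
{
	printf("probe failed, disk required  -> %s\n",
	    on_probe_failure(true) == SUSPEND_POOL ? "suspend pool" : "fault disk");
	printf("probe failed, disk redundant -> %s\n",
	    on_probe_failure(false) == SUSPEND_POOL ? "suspend pool" : "fault disk");
	return (0);
}
```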
Motivation and Context
In a ZFS pool, leaf vdevs can be faulted after tripping a diagnostic failure (via ZED). However, ZFS prevents faulting a drive if doing so would cause data loss. For example, in a raidz2, faulting more than 2 disks would cause read failures.
In determining if a vdev can be faulted, the check in vdev_dtl_required() doesn't consider the case where a marginal drive is being resilvered. If the marginal disk is considered valid during the checks, then ZFS can start encountering read errors once the faulted vdev count matches the parity count.
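To illustrate the problem, here is a toy model rather than the actual vdev_dtl_required() code; naive_can_fault and the child-state enum are invented for illustration. A child that is still being resilvered is counted as a good copy, so a fault that exhausts the parity budget is still allowed:

```c
/*
 * Toy model (not OpenZFS code): a raidz group where the naive check only
 * counts children that are already FAULTED, so a stale child that is still
 * being resilvered is treated as a healthy copy.
 */
#include <stdbool.h>
#include <stdio.h>

enum child_state { HEALTHY, FAULTED, REPLACING_STALE };

/* Naive check: allow a new fault as long as the faulted count stays <= parity. */
static bool
naive_can_fault(const enum child_state *child, int children, int nparity)
{
	int faulted = 0;

	for (int c = 0; c < children; c++)
		if (child[c] == FAULTED)
			faulted++;

	/* The REPLACING_STALE child is silently counted as a good copy. */
	return (faulted + 1 <= nparity);
}

int
main(void)
{
	/* raidz2: two parity, one child mid-replace, one already faulted. */
	enum child_state raidz2[4] = { HEALTHY, REPLACING_STALE, FAULTED, HEALTHY };

	if (naive_can_fault(raidz2, 4, 2))
		printf("fault allowed -- but reads may now fail, because the "
		    "replacing child cannot actually supply its data yet\n");
	return (0);
}
```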
Description
When a drive replacement is active, the replacing vdev should not be considered when checking whether a peer can be faulted; i.e., treat it as if it were offline and unable to contribute. Specifically, a child in a replacing vdev won't count when assessing the DTL during a vdev_fault().
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
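A minimal sketch of the adjusted policy, again as a toy model rather than the real implementation (adjusted_can_fault and the state names are invented): a child sitting under a replacing vdev is discounted just like an offline child, so the extra fault is denied:

```c
/*
 * Toy model of the adjusted check (not the actual vdev_dtl_required() code):
 * a child under a replacing vdev is treated the same as an offline child
 * when deciding whether another peer may be faulted.
 */
#include <stdbool.h>
#include <stdio.h>

enum child_state { HEALTHY, FAULTED, OFFLINE, UNDER_REPLACING };

static bool
adjusted_can_fault(const enum child_state *child, int children, int nparity)
{
	int unavailable = 0;

	for (int c = 0; c < children; c++) {
		/*
		 * FAULTED and OFFLINE children clearly cannot contribute;
		 * the change here is that a child still being replaced is
		 * also not counted as a usable copy.
		 */
		if (child[c] != HEALTHY)
			unavailable++;
	}
	return (unavailable + 1 <= nparity);
}

int
main(void)
{
	enum child_state raidz2[4] = { HEALTHY, UNDER_REPLACING, FAULTED, HEALTHY };

	if (!adjusted_can_fault(raidz2, 4, 2))
		printf("fault denied: with the replacing child discounted, "
		    "another fault would exceed raidz2 redundancy\n");
	return (0);
}
```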
How Has This Been Tested?
functional/fault/fault_limits
Example result from fault_limits showing that the third request to fault a leaf vdev in a raidz3 group was denied and turned instead into a degrade, due to the presence of a replacing vdev.
Types of changes
Checklist:
All commit messages are properly formatted and contain Signed-off-by.