Too many vdev probe errors should suspend pool #16864

Open · don-brady wants to merge 1 commit into master from vdev_probe_error_suspend
Conversation

don-brady (Contributor)

Motivation and Context

Similar to what we saw in #16569, a replacing vdev should not be counted as fully contributing to the redundancy of a raidz vdev, even when current IO has enough redundancy. I have seen raidz3 pools with 4 missing disks (one of them part of a replacing vdev) that were still online and taking IO. This case differs from #16569 in that ZED was not running, so the vdev_probe() errors are driving the diagnosis here.

Description

When a failed vdev_probe() is faulting a disk, it now checks whether that disk is required for the pool to operate, and if so suspends the pool until the admin can return the missing disks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

Added a new test that verifies that probe errors from 4 disks on a raidz3 with a replacing vdev will suspend the pool. Before the change the pool would not suspend.

  pool: testpool 
state: SUSPENDED 
status: One or more devices are faulted in response to IO failures. 
action: Make sure the affected devices are connected, then run 'zpool clear'. 
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC 
  scan: resilvered 6.47M in 00:00:03 with 8198 errors on Fri Dec 13 16:55:12 2024 
config: 
        NAME                  STATE     READ WRITE CKSUM 
        testpool              ONLINE       0     0     0 
          raidz3-0            ONLINE       0     0     0 
            /var/tmp/dev-0    ONLINE       0     0 8.16K 
            /var/tmp/dev-1    ONLINE       0     0    70 
            /var/tmp/dev-2    ONLINE       0     0    61 
            /var/tmp/dev-3    ONLINE       0     0    48 
            /var/tmp/dev-4    ONLINE       0     0    55 
            /var/tmp/dev-5    ONLINE       0     0    62 
            /var/tmp/dev-6    ONLINE       0     0 8.11K 
            sdh1              DEGRADED   195   707     0  too many errors 
            sdh2              DEGRADED   278 4.95K     0  too many errors 
            sdh3              DEGRADED   363 9.05K     0  too many errors 
            replacing-10      ONLINE       0     0 28.3K 
              sdh4            DEGRADED   453 41.5K     0  too many errors 
              /var/tmp/dev-7  ONLINE       0     0     0 
 
errors: 8168 data errors, use '-v' for a list 
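
Once the pool suspends as shown above, recovery follows the action text in the status output. Below is a minimal sketch of that flow, assuming the missing disks have been physically reconnected first (the pool name comes from the output above; the exact commands are illustrative, not part of this patch):

  # After reconnecting the missing disks, clear the error state so the
  # suspended pool can resume IO (per the ZFS-8000-HC action text).
  zpool clear testpool

  # Confirm the pool is back online and let any outstanding resilver finish.
  zpool status -v testpool
  zpool wait -t resilver testpool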

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@allanjude (Contributor) left a comment

Reviewed-by: Allan Jude <[email protected]>

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 13, 2024
module/zfs/spa.c: review comment (outdated, resolved)
don-brady force-pushed the vdev_probe_error_suspend branch from 7cdaa4d to 5645a2f on December 14, 2024 02:35
amotin (Member) commented Dec 14, 2024

@don-brady CI is very unhappy on fault/suspend_on_probe_errors test.

behlendorf (Contributor) commented

CI is very unhappy on fault/suspend_on_probe_errors test.

It seems to be detecting pool errors in this failure case. That's concerning but I wonder if it's somehow mistaken.

  21:07:33.66 SUCCESS: zpool status -v testpool
  21:07:33.70 ERROR: check_pool_status testpool errors No known data errors exited 1

don-brady (Contributor, Author) commented

So some of the test failures were caused by the scrub being requested before the resilver had completed, which made the scrub request fail. That was fixed by adding a zpool wait -t resilver. The other case is that after the scrub, some files were still showing as corrupted; however, if the pool is scrubbed a second time, those errors go away.

I've observed this phenomenon in practice but am not sure of the underlying reason.
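
For illustration, the ordering fix described above would look roughly like this in the test script (pool name and exact command sequence are assumptions, not quoted from the patch):

  # Wait for the resilver triggered by the replacement to finish before
  # scrubbing; 'zpool scrub' fails while a resilver is still in progress.
  zpool wait -t resilver testpool
  zpool scrub testpool
  zpool wait -t scrub testpool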

Similar to what we saw in openzfs#16569, we need to consider that a
replacing vdev should not be considered as fully contributing
to the redundancy of a raidz vdev even though current IO has
enough redundancy.

When a failed vdev_probe() is faulting a disk, it now checks
if that disk is required, and if so it suspends the pool until
the admin can return the missing disks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Don Brady <[email protected]>
don-brady force-pushed the vdev_probe_error_suspend branch from 5645a2f to ce8012b on December 17, 2024 18:28
Labels: Status: Code Review Needed (Ready for review and testing)
4 participants