Too many vdev probe errors should suspend pool #16864

Open · don-brady wants to merge 1 commit into master from vdev_probe_error_suspend
Conversation

don-brady (Contributor)

Motivation and Context

Similar to what we saw in #16569, a replacing vdev should not be counted as fully contributing to the redundancy of a raidz vdev, even when current IO has enough redundancy. I have seen raidz3 pools with 4 missing disks (one of them part of a replacing vdev) that were still online and taking IO. This case differs from #16569 in that ZED was not running, so the vdev_probe() errors are driving the diagnosis here.

Description

When a failed vdev_probe() is faulting a disk, it now checks whether that disk is required for the pool to operate, and if so suspends the pool until the admin can return the missing disks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

Added a new test that verifies that probe errors from 4 disks on a raidz3 with a replacing vdev will suspend the pool. Before the change the pool would not suspend.

  pool: testpool 
state: SUSPENDED 
status: One or more devices are faulted in response to IO failures. 
action: Make sure the affected devices are connected, then run 'zpool clear'. 
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC 
  scan: resilvered 6.47M in 00:00:03 with 8198 errors on Fri Dec 13 16:55:12 2024 
config: 
        NAME                  STATE     READ WRITE CKSUM 
        testpool              ONLINE       0     0     0 
          raidz3-0            ONLINE       0     0     0 
            /var/tmp/dev-0    ONLINE       0     0 8.16K 
            /var/tmp/dev-1    ONLINE       0     0    70 
            /var/tmp/dev-2    ONLINE       0     0    61 
            /var/tmp/dev-3    ONLINE       0     0    48 
            /var/tmp/dev-4    ONLINE       0     0    55 
            /var/tmp/dev-5    ONLINE       0     0    62 
            /var/tmp/dev-6    ONLINE       0     0 8.11K 
            sdh1              DEGRADED   195   707     0  too many errors 
            sdh2              DEGRADED   278 4.95K     0  too many errors 
            sdh3              DEGRADED   363 9.05K     0  too many errors 
            replacing-10      ONLINE       0     0 28.3K 
              sdh4            DEGRADED   453 41.5K     0  too many errors 
              /var/tmp/dev-7  ONLINE       0     0     0 
 
errors: 8168 data errors, use '-v' for a list 
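
Once the pool suspends as shown above, recovery follows the action text in the status output. Below is a minimal sketch of that flow, assuming the missing disks have been physically reconnected first (the pool name comes from the output above; the exact commands are illustrative, not part of this patch):

  # After reconnecting the missing disks, clear the error state so the
  # suspended pool can resume IO (per the ZFS-8000-HC action text).
  zpool clear testpool

  # Confirm the pool is back online and let any outstanding resilver finish.
  zpool status -v testpool
  zpool wait -t resilver testpool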

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@allanjude (Contributor) left a comment

Reviewed-by: Allan Jude <[email protected]>

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 13, 2024
module/zfs/spa.c: review comment (outdated, resolved)
don-brady force-pushed the vdev_probe_error_suspend branch from 7cdaa4d to 5645a2f on December 14, 2024 02:35
amotin (Member) commented Dec 14, 2024

@don-brady CI is very unhappy on fault/suspend_on_probe_errors test.

behlendorf (Contributor) commented

CI is very unhappy on fault/suspend_on_probe_errors test.

It seems to be detecting pool errors in this failure case. That's concerning but I wonder if it's somehow mistaken.

  21:07:33.66 SUCCESS: zpool status -v testpool
  21:07:33.70 ERROR: check_pool_status testpool errors No known data errors exited 1

don-brady (Contributor, Author) commented

So some of the test failures were caused by the scrub being requested before the resilver had completed, which made the scrub request fail. That was fixed by adding a zpool wait -t resilver. The other case is that after the scrub, some files were still showing as corrupted; however, if the pool is scrubbed a second time, those errors go away.

I've observed this phenomenon in practice but am not sure of the underlying reason.
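
For illustration, the ordering fix described above would look roughly like this in the test script (pool name and exact command sequence are assumptions, not quoted from the patch):

  # Wait for the resilver triggered by the replacement to finish before
  # scrubbing; 'zpool scrub' fails while a resilver is still in progress.
  zpool wait -t resilver testpool
  zpool scrub testpool
  zpool wait -t scrub testpool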

Similar to what we saw in openzfs#16569, we need to consider that a
replacing vdev should not be considered as fully contributing
to the redundancy of a raidz vdev even though current IO has
enough redundancy.

When a failed vdev_probe() is faulting a disk, it now checks
if that disk is required, and if so it suspends the pool until
the admin can return the missing disks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Don Brady <[email protected]>
don-brady force-pushed the vdev_probe_error_suspend branch from 5645a2f to ce8012b on December 17, 2024 18:28
Labels: Status: Code Review Needed (Ready for review and testing)
4 participants