
Scrub mirror children without BPs #13555

Merged
merged 1 commit into from Jun 23, 2022

Conversation

@behlendorf (Contributor) commented Jun 14, 2022

Motivation and Context

This long-standing behavior was observed when testing out pathological
failure scenarios; see #12090.

Description

When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read. This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children. Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, the other damaged
mirror child will be neither detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid. This is because sequential rebuilds
may be used with draid, and they rely solely on parity to resilver.
Furthermore, distributed spares are usually available and always
preferred, which means damage to the other mirror child will go
undetected.

For system administrators this behavior is non-intuitive and in
a worst-case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing. When the BP isn't available for
verification, the data buffers from each child are compared. They
must all be identical; if not, there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children. Since we can't
tell which child was wrong, a checksum error is logged against the
replacing or sparing mirror vdev.
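
To make the comparison step concrete, here is a minimal user-space sketch of the idea. It is illustrative only; the buffer sizes, names, and error handling are made up and it is not the actual vdev_mirror.c change.

```c
/* cc -o mirror_cmp mirror_cmp.c && ./mirror_cmp */
#include <stdio.h>
#include <string.h>

#define MIRROR_CHILDREN 2
#define BLOCK_SIZE      8

/*
 * With no BP checksum to verify against, the only way to detect silent
 * damage is to require that every online child of the replacing/sparing
 * mirror returned identical data.  Returns 0 when all children agree,
 * -1 otherwise.
 */
static int
mirror_children_match(unsigned char buf[][BLOCK_SIZE], int children)
{
    for (int c = 1; c < children; c++) {
        if (memcmp(buf[0], buf[c], BLOCK_SIZE) != 0)
            return (-1);
    }
    return (0);
}

int
main(void)
{
    /* Child 1 holds a silently corrupted byte the DTL knows nothing about. */
    unsigned char buf[MIRROR_CHILDREN][BLOCK_SIZE] = {
        { 1, 2, 3, 4, 5, 6, 7, 8 },
        { 1, 2, 3, 4, 5, 6, 0, 8 },
    };

    if (mirror_children_match(buf, MIRROR_CHILDREN) != 0) {
        /*
         * In the real change this is where the zio fails with a checksum
         * error logged against the mirror vdev, prompting the parent
         * raidz/draid vdev to issue a repair write to every child.
         */
        printf("silent damage detected, repair I/O required\n");
    } else {
        printf("all mirror children agree\n");
    }
    return (0);
}
```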

How Has This Been Tested?

Added the new redundancy_draid_damaged2 test case which
exercises this scenario. The test would reliably fail prior to this
change.

Two 400-iteration runs of the redundancy test group:

./scripts/zfs-tests.sh -T redundancy -I 400

Test: zfs/tests/zfs-tests/tests/functional/redundancy/setup (run as root) [00:00] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_draid (run as root) [01:54] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_draid1 (run as root) [00:13] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_draid2 (run as root) [00:19] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_draid3 (run as root) [00:20] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_draid_damaged1 (run as root) [01:32] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_draid_damaged2 (run as root) [01:39] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_draid_spare1 (run as root) [00:32] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_draid_spare2 (run as root) [00:11] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_draid_spare3 (run as root) [00:59] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_mirror (run as root) [00:06] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_raidz (run as root) [01:53] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_raidz1 (run as root) [00:12] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_raidz2 (run as root) [00:10] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_raidz3 (run as root) [00:14] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/redundancy_stripe (run as root) [00:04] [PASS]
Test: zfs/tests/zfs-tests/tests/functional/redundancy/cleanup (run as root) [00:00] [PASS]
...

Results Summary
PASS     6800

Running Time:   64:57:22
Percent passed: 100.0%

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Work in Progress (Not yet ready for general review) label and removed the Status: Code Review Needed (Ready for review and testing) label on Jun 15, 2022
@behlendorf behlendorf added the Status: Code Review Needed (Ready for review and testing) label and removed the Status: Work in Progress (Not yet ready for general review) label on Jun 17, 2022
@behlendorf behlendorf requested a review from tonyhutter June 17, 2022 00:00
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrators this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical; if not, there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Signed-off-by: Brian Behlendorf <[email protected]>
@behlendorf (Contributor, Author)

Thanks for the review feedback, updated.

@behlendorf behlendorf added the Status: Accepted (Ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed (Ready for review and testing) label on Jun 23, 2022
@behlendorf behlendorf merged commit ad8b9f9 into openzfs:master Jun 23, 2022
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jul 8, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrators this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical; if not, there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
behlendorf added a commit that referenced this pull request Jul 14, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrators this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical; if not, there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #13555
andrewc12 added a commit to andrewc12/openzfs that referenced this pull request Jul 30, 2022
commit e8cf3a4f7662f2d1c13684ce52b73ab0d9a12266
Author: Alek P <[email protected]>
Date:   Thu Jul 28 18:52:46 2022 -0400

    Implement a new type of zfs receive: corrective receive (-c)

    This type of recv is used to heal corrupted data when a replica
    of the data already exists (in the form of a send file for example).
    With the provided send stream, corrective receive will read from
    disk blocks described by the WRITE records. When any of the reads
    come back with ECKSUM we use the data from the corresponding WRITE
    record to rewrite the corrupted block.

    Reviewed-by: Paul Dagnelie <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Paul Zuchowski <[email protected]>
    Signed-off-by: Alek Pinchuk <[email protected]>
    Closes #9372

commit 5fae33e04771e255f3dba57263fd06eb68bd38b5
Author: Tino Reichardt <[email protected]>
Date:   Thu Jul 28 23:19:41 2022 +0200

    FreeBSD compile fix

    The file module/os/freebsd/zfs/zfs_ioctl_compat.c fails compiling
    because of this error: 'static' is not at beginning of declaration

    This commit fixes the three places within that file.

    Reviewed-by: Alexander Motin <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tino Reichardt <[email protected]>
    Closes #13702

commit 34aa0f0487705671c81262adb7646a90d15c5a12
Author: Brian Behlendorf <[email protected]>
Date:   Tue Jul 26 14:39:23 2022 -0700

    ZTS: Fix io_uring support check

    Not all Linux distribution kernels enable io_uring support by
    default.  Update the run time check to verify that the booted
    kernel was built with CONFIG_IO_URING=y.

    Reviewed-by: Tony Hutter <[email protected]>
    Reviewed-by: Tony Nguyen <[email protected]>
    Co-authored-by: George Melikov <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13648
    Closes #13685

commit 3a1ce4914172ce4c1e39123cd31b1e5245765a5e
Author: Ameer Hamza <[email protected]>
Date:   Tue Jul 26 02:04:46 2022 +0500

    Add createtxg sort support for simple snapshot iterator

    - When iterating snapshots with name only, e.g., "-o name -s name",
    libzfs uses the simple snapshot iterator and results are displayed
    in alphabetical order. This PR adds support for a faster version of
    createtxg sort by avoiding nvlist parsing for properties. Flags
    "-o name -s createtxg" will enable createtxg sort while using
    simple snapshot iterator.
    - Added support to read createtxg property directly from zfs handle
    for filesystem, volume and snapshot types instead of parsing nvlist.

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Ameer Hamza <[email protected]>
    Closes #13577

commit 8792dd24cd9599cf506d45bcaed3af78c8cd888d
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jul 25 09:52:42 2022 -0700

    ZTS: Fix occasional inherit_001_pos.ksh failure

    The mountpoint may still be busy when the `zfs unmount -a` command
    is run causing an unexpected failure.  Retry the unmount a couple
    of times since it should not remain busy for long.

        19:10:50.29 NOTE: Reading state from .../inheritance/state021.cfg
        19:10:50.32 cannot unmount '/TESTPOOL': pool or dataset is busy
        19:10:50.32 ERROR: zfs unmount -a exited 1

    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13686

commit bf61a507a276866d691a2b56866302bc42145af3
Author: Christian Schwarz <[email protected]>
Date:   Thu Jul 21 02:16:29 2022 +0200

    zdb: dump spill block pointer if present

    Output will look like so:

      $ sudo zdb -dddd -vv testpool/fs 2
      Dataset testpool/fs [ZPL], ID 260, cr_txg 8, 25K, 7 objects, rootbp DVA[0]=<0:1800be00:200> DVA[1]=<0:1c00be00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=16L/16P fill=7 cksum=d03b396cd:489ca835517:d4b04a4d0a62:1b413aac454d53

          Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
               2    1   128K    512     1K     512    512    0.00  ZFS plain file (K=inherit) (Z=inherit=lz4)
                                                     192   bonus  System attributes
          dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED SPILL_BLKPTR
          dnode maxblkid: 0
          path    /testfile
          uid     0
          gid     0
          atime   Fri Jul 15 12:36:35 2022
          mtime   Fri Jul 15 12:36:35 2022
          ctime   Fri Jul 15 12:36:51 2022
          crtime  Fri Jul 15 12:36:35 2022
          gen 10
          mode    100600
          size    0
          parent  34
          links   1
          pflags  840800000004
          SA xattrs: 248 bytes, 2 entries

              security.selinux = nutanix_u:object_r:unlabeled_t:s0\000
              user.foo = xbLQJjyVvEVPGGuRHV/gjkFFO1MdehKnLjjd36ZaoMVaUqtqFoMMYT5Ya9yywHApJNoK/1hNJfO3\012XCJWv9/QUTKamoWW9xVDE7yi8zn166RNw5QUhf84cZ3JNLnw6oN

    Spill block: 0:10005c00:200 0:14005c00:200 200L/200P F=1 B=16/16 cksum=1cdfac47a4:910c5caa557:195d0493dfe5a:332b6fde6ad547
    Indirect blocks:

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Allan Jude <[email protected]>
    Signed-off-by: Christian Schwarz <[email protected]>
    Closes #13640

commit fb087146de0118108e3b44222d2052415dcb1f7f
Author: ixhamza <[email protected]>
Date:   Thu Jul 21 05:14:06 2022 +0500

    Add support for per dataset zil stats and use wmsum counters

    ZIL kstats are reported in an inclusive way, i.e., the same counters
    are shared to capture all of the activity happening in the zil.  Added
    support to report zil stats for every dataset individually by
    combining them with the already exposed dataset kstats.

    Wmsum uses per-cpu counters and has less overhead compared to atomic
    operations.  Updated the zil kstats to use wmsum counters and avoid
    atomic operations.

    Reviewed-by: Christian Schwarz <[email protected]>
    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Ameer Hamza <[email protected]>
    Closes #13636

commit 33dba8c79224ce33dc661d545ab1d17fc3d84a0c
Author: Alexander Motin <[email protected]>
Date:   Wed Jul 20 20:02:36 2022 -0400

    Fix scrub resume from newly created hole

    It may happen that the scan bookmark points to a block that was turned
    into a part of a big hole.  In such a case dsl_scan_visitbp() may skip
    it and dsl_scan_check_resume() will not be called for it.  As a result,
    a new scan suspend won't be possible until the end of the object, which
    may take hours if the object is a multi-terabyte ZVOL on a slow HDD
    pool, stretching TXG to all that time, creating all sorts of problems.

    This patch changes the resume condition to any greater or equal block,
    so even if we miss the bookmarked block, the next one we find will
    delete the bookmark, allowing new suspend.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Ryan Moeller <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13643

commit 97fd1ea42a59b85ca29c91056ceef56c12cfae0b
Author: Tino Reichardt <[email protected]>
Date:   Thu Jul 21 02:01:32 2022 +0200

    Fix memory allocation for the checksum benchmark

    Allocation via kmem_cache_alloc() is limited to less than 4m for
    some architectures.

    This commit limits the benchmarks with the linear abd cache to 1m
    on all architectures and adds 4m + 16m benchmarks via non-linear
    abd_alloc().

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Co-authored-by: Sebastian Gottschall <[email protected]>
    Signed-off-by: Tino Reichardt <[email protected]>
    Closes #13669
    Closes #13670

commit f371cc18f81168c74314b77480862b6c516e15d5
Author: ixhamza <[email protected]>
Date:   Thu Jul 14 22:38:16 2022 +0500

    Expose ZFS dataset case sensitivity setting via sb_opts

    Makes the case sensitivity setting visible on Linux in /proc/mounts.

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Ameer Hamza <[email protected]>
    Closes #13607

commit 9fe2f262aa24e3eda716787005cd127642aed22b
Author: Tony Hutter <[email protected]>
Date:   Thu Jul 14 10:19:37 2022 -0700

    zed: Look for NVMe DEVPATH if no ID_BUS

    We tried replacing an NVMe drive using autoreplace, only
    to see zed reject it with:

    zed[27955]: zed_udev_monitor: /dev/nvme5n1 no devid source

    This happened because ZED saw that ID_BUS was not set by udev
    for the NVMe drive, and thus didn't think it was a "real drive".
    This commit allows NVMe drives to be autoreplaced even if
    ID_BUS is not set.

    Reviewed-by: Don Brady <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tony Hutter <[email protected]>
    Closes #13512
    Closes #13646

commit 1d3ba0bf01020f5459b1c28db3979129088924c0
Author: Tino Reichardt <[email protected]>
Date:   Mon Jul 11 23:16:13 2022 +0200

    Replace dead opensolaris.org license link

    The commit replaces all findings of the link:
    http://www.opensolaris.org/os/licensing with this one:
    https://opensource.org/licenses/CDDL-1.0

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tino Reichardt <[email protected]>
    Closes #13619

commit e4ab3f40df994178c5fc629c2ad07c505f5a76eb
Author: Tony Hutter <[email protected]>
Date:   Mon Jul 11 13:35:19 2022 -0700

    zed: Ignore false 'atari' partitions in autoreplace

    libudev will sometimes falsely identify an 'atari' partition on a
    blank disk, preventing it from being used in an autoreplace.  This
    seems to be a known issue.  The workaround is to just ignore the
    fake partition and continue with the autoreplace.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tony Hutter <[email protected]>
    Closes #13497
    Closes #13632

commit 677ca1e825af80f60569f84803304ccf0092728b
Author: Tony Hutter <[email protected]>
Date:   Mon Jul 11 11:35:01 2022 -0700

    rpm: Silence "unversioned Obsoletes" warnings on EL 9

    Get rid of RPM warnings on AlmaLinux 9:

    "It's not recommended to have unversioned Obsoletes"

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tony Hutter <[email protected]>
    Closes #13584
    Closes #13638

commit e6489be3470ecc1088b69a17b63a6d45d57f1c16
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jul 11 11:29:12 2022 -0700

    Linux: Align MODULE_LICENSE macro text

    Specify the lua and zstd license text in the manner in which the
    kernel MODULE_LICENSE macro requires it.  The now duplicate entries
    were merged and a comment added to make it clear what they apply to.

    Reviewed-by: Christian Schwarz <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13641

commit cb01da68057dcb9e612e8d2e97d058c46c3574af
Author: Finix1979 <[email protected]>
Date:   Fri Jul 8 02:43:58 2022 +0800

    Call nvlist_free before return

    Fixes a small kernel memory leak which would occur if a pool failed
    to import because the `DMU_POOL_VDEV_ZAP_MAP` key can't be read from
    a presumably damaged MOS config.  In the case of a missing key there
    was no leak.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Ryan Moeller <[email protected]>
    Signed-off-by: Finix1979 <[email protected]>
    Closes #13629

commit 74230a5bc1be6e5e84a5f41b26f6f65a155078f0
Author: Alexander Motin <[email protected]>
Date:   Tue Jul 5 19:27:29 2022 -0400

    Avoid memory copy when verifying raidz/draid parity

    Before this change for every valid parity column raidz_parity_verify()
    allocated new buffer and copied there existing data, then recalculated
    the parity and compared the result with the copy.  This patch removes
    the memory copy, simply swapping original buffer pointers with newly
    allocated empty ones for parity recalculation and comparison. Original
    buffers with potentially incorrect parity data are then just freed,
    while new recalculated ones are used for repair.

    On a pool of 12 4-wide raidz vdevs, storing 1.5TB of 16MB blocks, this
    change reduces memory traffic during scrub by 17% and total unhalted
    CPU time by 25%.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13613

commit 1ac7d194e5ca31d5284e410a87ad9f9669a7f5b5
Author: Alexander Motin <[email protected]>
Date:   Tue Jul 5 19:26:20 2022 -0400

    Avoid memory copies during mirror scrub

    When issuing several scrub reads for a block we may use the parent
    ZIO buffer for one of the child ZIOs.  If that read completes
    successfully, then we won't need to copy the data explicitly.  If the
    block has only one copy (typical for the root vdev, which is also a
    mirror inside), then we never need to copy -- succeed or fail as-is.
    The previous code also copied data from the buffer of every
    successfully completed child ZIO, but that just does not make sense.

    On healthy N-wide mirror this saves all N+1 (or even more in case
    of ditto blocks) memory copies for each scrubbed block, allowing
    CPU to focus mostly on check-summing.  For other vdev types it
    should save one memory copy per block copy at root vdev.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Mark Maybee <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13606

commit 6fca6195cdf3079a6e11444e42d7ce4aca10c6a1
Author: наб <[email protected]>
Date:   Thu Jun 30 20:31:09 2022 +0200

    Re-fix -Wwrite-strings on FreeBSD

    Follow up fix for a926aab902ac5c680f4766568d19674b80fb58bb.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Ahelenia Ziemiańska <[email protected]>
    Closes #13348
    Closes #13610

commit eefe83eaa68f7cb4a49c580dd940d3688e42c849
Author: Toyam Cox <[email protected]>
Date:   Thu Jun 30 13:47:58 2022 -0400

    dracut: fix boot on non-zfs-root systems

    Simply prevent overwriting root until it needs to be overwritten.

    Dracut could change this value before this module is called, but won't
    change the kernel command line.

    Reviewed-by: Andrew J. Hesford <[email protected]>
    Signed-off-by: Toyam Cox <[email protected]>
    Closes #13592

commit 5a4dd3a262106c1506bdc154d9474efa6510f33a
Author: gregory-lee-bartholomew <[email protected]>
Date:   Thu Jun 30 12:43:27 2022 -0500

    contrib: dracut: README.md

    Change zfs-snapshot-bootfs.service to zfs-rollback-bootfs.service in
    cmdline point 4 of README.md.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Gregory Bartholomew <[email protected]>
    Closes #13609

commit 34e5423f83202653bd6b153577187f6ed943c157
Author: George Amanakis <[email protected]>
Date:   Thu Jun 30 02:06:16 2022 +0200

    Fix dnode byteswapping

    If a dnode has a spill pointer, and we use DN_SLOTS_TO_BONUSLEN() then
    we will possibly include the spill pointer in the len calculation and it
    will be byteswapped. Then dnode_byteswap() will carry on and swap the
    spill pointer again. Fix this by using DN_MAX_BONUS_LEN() instead.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: George Amanakis <[email protected]>
    Closes #13002
    Closes #13015

commit 2d434e8ae4139ce14a1b058839a144bf952e79ea
Author: gregory-lee-bartholomew <[email protected]>
Date:   Wed Jun 29 18:56:04 2022 -0500

    contrib: dracut: zfs-{rollback,snapshot}-bootfs: explicit snapname fix

    Due to a missing semicolon on the ExecStart line, it wasn't possible
    to specify the snapshot name on the bootfs.{rollback,snapshot}
    kernel parameters if the boot dataset name was obtained from the
    root=zfs:... kernel parameter.

    Reviewed-by: Ahelenia Ziemiańska <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Gregory Bartholomew <[email protected]>
    Closes #13585

commit 07f2793e869196fcbcd5057d9ada377674262fe3
Author: Brian Behlendorf <[email protected]>
Date:   Wed Jun 29 15:33:38 2022 -0700

    dracut: fix typo in mount-zfs.sh.in

    Format the `zpool get` command correctly.  The -o option must
    be followed by "all" or the requested field name.

    Reviewed-by: Ahelenia Ziemiańska <[email protected]>
    Reviewed-by: George Melikov <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13602

commit 60e389ca10085acfa7cd35f79ab4465d968a942f
Author: наб <[email protected]>
Date:   Tue Jun 28 23:31:55 2022 +0200

    module: lua: ldo: fix pragma name

    /home/nabijaczleweli/store/code/zfs/module/lua/ldo.c:175:32: warning:
    unknown option after ‘#pragma GCC diagnostic’ kind [-Wpragmas]
      175 | #pragma GCC diagnostic ignored "-Winfinite-recursion"a
          |                                ^~~~~~~~~~~~~~~~~~~~~~

    Fixes: a6e8113fed8a508ffda13cf1c4d8da99a4e8133a ("Silence
    -Winfinite-recursion warning in luaD_throw()")

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Ahelenia Ziemiańska <[email protected]>
    Closes #13348

commit 404601aca07dc44273c1ab33e09067423f97c591
Author: наб <[email protected]>
Date:   Tue Jun 28 23:27:44 2022 +0200

    linux: libzfs: util: don't fallthrough to end-of-switch

    lib/libzfs/os/linux/libzfs_util_os.c:262:3: error: fallthrough
    annotation does not directly precede switch label
                    zfs_fallthrough;
                    ^
    ./lib/libspl/include/sys/feature_tests.h:34:26: note: expanded from
    macro 'zfs_fallthrough'

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Ahelenia Ziemiańska <[email protected]>
    Closes #13348
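
For context, the rule clang is enforcing is that a fallthrough annotation must directly precede a case label, so it cannot be the last statement in a switch. Below is a minimal sketch using the plain GCC/Clang attribute rather than the zfs_fallthrough wrapper mentioned above; the identifiers are illustrative.

```c
#include <stdio.h>

static const char *
classify(int n)
{
    switch (n) {
    case 0:
        printf("zero -> ");
        __attribute__((fallthrough));   /* fine: directly precedes a label */
    case 1:
        return ("small");
    default:
        /*
         * Placing the annotation here, at the end of the switch with no
         * label after it, is exactly what clang rejects.
         */
        return ("large");
    }
}

int
main(void)
{
    printf("%s\n", classify(0));
    return (0);
}
```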

commit a2f6bff976e43c58d8460c288bac13868c614447
Author: наб <[email protected]>
Date:   Sat May 28 15:19:05 2022 +0200

    tests: modernise zdb_decompress

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Ahelenia Ziemiańska <[email protected]>
    Closes #13348

commit dd66857d92d86643bda57b92fdd58f016bd1725e
Author: наб <[email protected]>
Date:   Tue Apr 19 20:49:30 2022 +0200

    Remaining {=> const} char|void *tag

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Ahelenia Ziemiańska <[email protected]>
    Closes #13348

commit a926aab902ac5c680f4766568d19674b80fb58bb
Author: наб <[email protected]>
Date:   Tue Apr 19 20:38:30 2022 +0200

    Enable -Wwrite-strings

    Also, fix leak from ztest_global_vars_to_zdb_args()

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Ahelenia Ziemiańska <[email protected]>
    Closes #13348

commit e7d90362e5d5f873e1272519da96780cf00a0e28
Author: gaoyanping <[email protected]>
Date:   Thu Jun 30 04:38:46 2022 +0800

    Fix znode group permission different from acl mask

    Zp->z_mode is set at the same time inode->i_mode
    is being changed. This has the effect of keeping both
    in sync without relying on zfs_znode_update_vfs.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: yanping.gao <[email protected]>
    Closes #13581

commit 325096545a47aa01cd5966301ba1f7a6e5eff349
Author: Kristof Provost <[email protected]>
Date:   Tue Jun 28 23:11:38 2022 +0200

    FreeBSD: only define B_FALSE/B_TRUE if NEED_SOLARIS_BOOLEAN is not set

    If NEED_SOLARIS_BOOLEAN is defined we define an enum boolean_t, which
    defines B_TRUE/B_FALSE as well. If we have both the define and the
    enum, things don't build (because the enum then translates to
    'enum { 0, 1 } boolean_t').

    While here also remove an incorrect '#else'. With it in place we only
    parse a section if the include guard is triggered. So we'd only use that
    code if this file is included twice. This is clearly unintended, and
    also means we don't get the 'boolean_t' definition. Fix this.

    Reviewed-by: Warner Losh <[email protected]>
    Reviewed-by: Ryan Moeller <[email protected]>
    Signed-off-by: Kristof Provost <[email protected]>
    Sponsored-By: Rubicon Communications, LLC ("Netgate")
    Closes #13596
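
A compilable sketch of the guard being described; the typedefs here are stand-ins rather than the actual FreeBSD headers. Without the guard, the B_FALSE/B_TRUE macros would expand inside the enum and produce the invalid 'enum { 0, 1 } boolean_t' mentioned above.

```c
#include <stdio.h>

#ifdef NEED_SOLARIS_BOOLEAN
/* Solaris-style type: the enum itself provides B_FALSE and B_TRUE. */
typedef enum { B_FALSE, B_TRUE } boolean_t;
#else
/* Otherwise provide plain macros; never both definitions at once. */
#define B_FALSE 0
#define B_TRUE  1
typedef int boolean_t;
#endif

int
main(void)
{
    boolean_t ok = B_TRUE;
    printf("%d\n", (int)ok);
    return (0);
}
```

Compile with or without -DNEED_SOLARIS_BOOLEAN; either way the names are defined exactly once.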

commit 827322991f25785ff032ad3cef84e12b1c609759
Author: Alexander Motin <[email protected]>
Date:   Tue Jun 28 14:23:31 2022 -0400

    Fix and disable blocks statistics during scrub

    Block statistics calculation during scrub I/O issue in the case of a
    sorted scrub accounted ditto blocks several times, while embedded
    blocks were not accounted at all.  This change moves the accounting
    from the issue to the scan stage, which fixes both problems and also
    avoids pool-wide locking and the lock contention it created.

    Since these statistics are quite specific and not currently exposed
    anywhere, disable their calculation by default to not waste CPU time.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13579

commit 43569ee374208e827409ec1ce4cf169d7a9a3095
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jun 20 23:36:21 2022 +0000

    Fix objtool: missing int3 after ret warning

    Resolve straight-line speculation warnings reported by objtool
    for x86_64 assembly on Linux when CONFIG_SLS is set.  See the
    following LWN article for the complete details.

    https://lwn.net/Articles/877845/

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13528
    Closes #13575

commit 8aceded193f58da9a3837fb9d828dab05ad9e82f
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jun 20 22:27:55 2022 +0000

    Fix -Wformat-overflow warning in zfs_project_handle_dir()

    Switch to using asprintf() to satisfy the compiler and resolve the
    potential format-overflow warning.  Note that the conditional before
    the sprintf() would have prevented this regardless.

        cmd/zfs/zfs_project.c: In function ‘zfs_project_handle_dir’:
        cmd/zfs/zfs_project.c:241:38: error: ‘/’ directive writing
        1 byte into a region of size between 0 and 4352
        [-Werror=format-overflow=]
        cmd/zfs/zfs_project.c:241:17: note: ‘sprintf’ output between
        2 and 4609 bytes into a destination of size 4352

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13528
    Closes #13575
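
A minimal sketch of that pattern with made-up names (not the actual zfs_project.c code): asprintf() allocates a buffer sized to the formatted result, so there is no fixed-size destination for the compiler to warn about.

```c
#define _GNU_SOURCE     /* asprintf() is a GNU extension */
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : "/tank/dir";
    const char *name = argc > 2 ? argv[2] : "entry";
    char *fullname;

    /* Instead of sprintf(fixed_buf, "%s/%s", dir, name) ... */
    if (asprintf(&fullname, "%s/%s", dir, name) == -1) {
        perror("asprintf");
        return (1);
    }
    printf("%s\n", fullname);
    free(fullname);
    return (0);
}
```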

commit f11431a31776734722e14bd6186b93a25823a0ee
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jun 20 21:54:42 2022 +0000

    Fix -Wformat-truncation warning in upgrade_set_callback()

    Extend the buffer slightly to resolve the warning.

        cmd/zfs/zfs_main.c: In function ‘upgrade_set_callback’:
        cmd/zfs/zfs_main.c:2446:22: error: ‘%llu’ directive output
        may be truncated writing between 1 and 20 bytes into a
        region of size 16 [-Werror=format-truncation=]
        cmd/zfs/zfs_main.c:2445:24: note: ‘snprintf’ output between
        2 and 21 bytes into a destination of size 16

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13528
    Closes #13575

commit c175f5ebb2f4910e2d4f38f794e3973e53baa94e
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jun 20 21:35:38 2022 +0000

    Fix -Wuse-after-free warning in dbuf_destroy()

    Move the use of the db pointer after it is freed.  It's only used as
    a tag so a dereference would never occur, but there's no reason we
    can't invert the order to resolve the warning.

        module/zfs/dbuf.c: In function 'dbuf_destroy':
        module/zfs/dbuf.c:2953:17: error:
        pointer 'db' may be used after 'free' [-Werror=use-after-free]

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13528
    Closes #13575
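
The same reordering, reduced to a stand-alone sketch (the names are invented; in ZFS the pointer is passed as an opaque tag, never dereferenced after the free):

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for code that records the pointer value as an opaque tag. */
static void
record_tag(const void *tag)
{
    printf("tag=%p\n", tag);
}

int
main(void)
{
    char *db = malloc(16);

    if (db == NULL)
        return (1);

    record_tag(db);   /* use the pointer value while it is still live ... */
    free(db);         /* ... then free it; the reverse order trips GCC 12's
                       * -Wuse-after-free even though nothing is dereferenced */
    return (0);
}
```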

commit 9619bcdefb0dff38e36f8c89d9e2980112105cbb
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jun 20 21:32:03 2022 +0000

    Fix -Wuse-after-free warning in dbuf_issue_final_prefetch_done()

    Move the use of the private pointer after it is freed.  It's only
    used as a tag so a dereference would never occur, but there's no
    harm in inverting the order to resolve the warning.

        module/zfs/dbuf.c: In function 'dbuf_issue_final_prefetch_done':
        module/zfs/dbuf.c:3204:17: error:
        pointer 'private' may be used after 'free' [-Werror=use-after-free]

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13528
    Closes #13575

commit ff7e405f83fbfcd763c4b7ed8b68258227765731
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jun 20 21:13:26 2022 +0000

    Fix -Wattribute-warning in dsl layer

    The memcpy(), memmove(), and memset() functions have been annotated
    to perform bounds checking when using FORTIFY_SOURCE.  A warning is
    now generated when writing beyond the end of the specified field.

    Alternately, the new struct_group() macro could be used to create
    an anonymous union member for use by memcpy().  However, since this
    is the only place the macro would be helpful it's preferable to
    restructure the code slightly to avoid the need for additional
    compatibility code when the macro does not exist.

    https://lore.kernel.org/lkml/[email protected]/T/

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13528
    Closes #13575

commit 18df6afdfc63b7c27cbb2b6152d76c40196e9dbb
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jun 20 19:37:38 2022 +0000

    Fix -Wattribute-warning in edonr

    The wrong union memory was being accessed in EdonRInit resulting in
    a write beyond size of field compiler warning.  Reference the correct
    member to resolve the warning.  The warning was correct, but in this
    case the mistake was harmless.

        In function ‘fortify_memcpy_chk’,
        inlined from ‘EdonRInit’ at zfs/module/icp/algs/edonr/edonr.c:494:3:
        ./include/linux/fortify-string.h:344:25: error: call to
        ‘__write_overflow_field’ declared with attribute warning:
        detected write beyond size of field (1st parameter);
        maybe use struct_group()? [-Werror=attribute-warning]

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13528
    Closes #13575

commit b0f7dd276c930129fef8575e15a36ec659e31cd2
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jun 20 22:36:38 2022 +0000

    Fix -Wattribute-warning in zfs_log_xvattr()

    Restructure the code in zfs_log_xvattr() to use a lr_attr_end
    structure when accessing lr_attr_t elements located after the
    variable sized array.  This makes the code more understandable
    and resolves the accessing beyond the end of the field warnings.

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13528
    Closes #13575

commit a6e8113fed8a508ffda13cf1c4d8da99a4e8133a
Author: Brian Behlendorf <[email protected]>
Date:   Mon Jun 20 19:53:58 2022 +0000

    Silence -Winfinite-recursion warning in luaD_throw()

    This code should be kept in line with the upstream lua version as much
    as possible.  Therefore, we simply want to silence the warning.  This
    check was enabled by default as part of -Wall in gcc 12.1.

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Alexander Motin <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13528
    Closes #13575

commit 80a650b7bb04bce3aef5e4cfd1d966e3599dafd4
Author: George Amanakis <[email protected]>
Date:   Mon Jun 27 23:17:25 2022 +0200

    Avoid panic with recordsize > 128k, raw sending and no large_blocks

    The current codebase does not support raw sending buffers with block
    size > 128kB when large_blocks is not active. This can happen in the
    codepath dsl_dataset_sync()->dmu_objset_sync()->zio_nowait() which
    calls back dmu_objset_write_done()->dsl_dataset_block_born(). If
    dsl_dataset_sync() completes its run before dsl_dataset_block_born() is
    called, we will end up not activating some of the necessary flags, while
    having blocks based on those flags written in the filesystem. A
    subsequent send will then panic.

    Fix this by directly deciding in dmu_objset_sync() whether these flags
    need to be activated later by dsl_dataset_sync(). Instead of panicking
    due to a NULL pointer dereference in dmu_dump_write() in case of a send,
    print out an error message. Also during scrub verify there are no
    contradicting filesystem flags.

    Reviewed-by: Paul Dagnelie <[email protected]>
    Signed-off-by: George Amanakis <[email protected]>
    Closes #12275
    Closes #12438

commit 1cd72b9c1348bbc71310fb6c3d49670362faf306
Author: Alexander Motin <[email protected]>
Date:   Mon Jun 27 14:08:21 2022 -0400

    Avoid two 64-bit divisions per scanned block

    Change math to make it like the ARC, using multiplications instead.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13591
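
As an illustration of the kind of rewrite being described (the names and values are invented, not the actual dsl_scan.c expressions): for positive integers, a truncating-division threshold test floor(a / b) >= p is equivalent to a >= p * b, so the divide can be replaced by a multiply.

```c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t queued = 3ULL << 30;   /* bytes queued for scanning */
    uint64_t limit = 16ULL << 30;   /* memory limit              */
    uint64_t pct = 20;              /* threshold, in percent     */

    /* Division form: one 64-bit divide per check. */
    int over_div = (queued * 100 / limit) >= pct;

    /* Multiplication form: same answer, no divide. */
    int over_mul = (queued * 100) >= (pct * limit);

    printf("%d %d\n", over_div, over_mul);
    return (0);
}
```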

commit c0bf952c846100750f526c2a32ebec17694a201b
Author: Alexander Motin <[email protected]>
Date:   Fri Jun 24 16:55:58 2022 -0400

    Several B-tree optimizations

    - Introduce first element offset within a leaf.  It allows to reduce
    by ~50% average memmove() size when adding/removing elements.  If the
    added/removed element is in the first half of the leaf, we may shift
    elements before it and adjust the bth_first instead of moving more
    elements after it.
     - Use memcpy() instead of memmove() when we know there is no overlap.
     - Switch from uint64_t to uint32_t.  It does not limit anything,
    but 32-bit arches should appreciate it greatly in hot paths.
     - Store leaf capacity in struct btree to avoid 64-bit divisions.
     - Adjust zfs_btree_insert_into_leaf() to always result in balanced
    leaves after splitting, no matter where the new element was inserted.
    Not that we care about it much, but it should also allow B-trees with
    as little as two elements per leaf instead of 4 previously.

    When scrubbing pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks this
    reduces amount of time spent in memmove() inside the scan thread from
    13.7% to 5.7% and total scrub time by ~15 seconds out of 9 minutes.
    It should also reduce spacemaps load time, but I haven't measured it.

    Reviewed-by: Paul Dagnelie <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13582

commit ccf89b39fe7f30dd53aec69e04de3f2728c7387c
Author: Alan Somers <[email protected]>
Date:   Fri Jun 24 14:28:42 2022 -0600

    Add a "zstream decompress" subcommand

    It can be used to repair a ZFS file system corrupted by ZFS bug #12762.
    Use it like this:

    zfs send -c <DS> | \
    zstream decompress <OBJECT>,<OFFSET>[,<COMPRESSION_ALGO>] ... | \
    zfs recv <DST_DS>

    Reviewed-by: Ahelenia Ziemiańska <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Allan Jude <[email protected]>
    Signed-off-by: Alan Somers <[email protected]>
    Sponsored-by:  Axcient
    Workaround for #12762
    Closes #13256

commit 1c0c729ab4165cd828fbeab404353b45b3836360
Author: Alexander Motin <[email protected]>
Date:   Fri Jun 24 12:50:37 2022 -0400

    Several sorted scrub optimizations

    - Reduce size and comparison complexity of q_exts_by_size B-tree.
    Previous code used two 64-bit divisions and many other operations to
    compare two B-tree elements.  It created enormous overhead.  This
    implementation moves the math to the upper level and stores the score
    in the B-tree elements themselves.  Since all that we need to store in
    that B-tree is the extent score and offset, those can fit into single
    8 byte value instead of 24 bytes of q_exts_by_addr element and can be
    compared with single operation.
     - Better decouple secondary tree logic from main range_tree by moving
    rt_btree_ops and related functions into dsl_scan.c as ext_size_ops.
    Those functions are very small to worry about the code duplication and
    range_tree does not need to know details such as rt_btree_compare.
     - Instead of accounting number of pending bytes per pool, that needs
    atomic on global variable per block, account the number of non-empty
    per-vdev queues, that change much more rarely.
     - When extent scan is interrupted by TXG end, continue it in the next
    TXG instead of selecting next best extent.  It allows to avoid leaving
    one truncated (and so likely not the best any more) extent each TXG.

    On top of some other optimizations this saves about 1.5 minutes out of
    10 to scrub pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks.

    Reviewed-by: Paul Dagnelie <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Tom Caputi <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13576

commit 83691bebf062d13e64a93ff80bf727cd1c162bb2
Author: Toomas Soome <[email protected]>
Date:   Fri Jun 24 19:48:10 2022 +0300

    Use macros for quotes and such

    Use Dq,Pq/Po/Pc macros. illumos dumpadm is now in section 8.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Toomas Soome <[email protected]>
    Closes #13586

commit ad8b9f940c1e39a38af61934737c1e4cf8ab5c08
Author: Brian Behlendorf <[email protected]>
Date:   Thu Jun 23 10:36:28 2022 -0700

    Scrub mirror children without BPs

    When scrubbing a raidz/draid pool, which contains a replacing or
    sparing mirror with multiple online children, only one child will
    be read.  This is not normally a serious concern because the DTL
    records are used to determine where a good copy of the data is.
    As long as the data can be read from one child the mirror vdev
    will use it to repair gaps in any of its children.  Furthermore,
    even if the data which was read is corrupt the raidz code will
    detect this and issue its own repair I/O to correct the damage
    in the mirror vdev.

    However, in the scenario where the DTL is wrong due to silent
    data corruption (say due to overwriting one child) and the scrub
    happens to read from a child with good data, then the other damaged
    mirror child will not be detected nor repaired.

    While this is possible for both raidz and draid vdevs, it's most
    pronounced when using draid.  This is because by default the zed
    will sequentially rebuild a draid pool to a distributed spare,
    and the distributed spare half of the mirror is always preferred
    since it delivers better performance.  This means the damaged
    half of the mirror will go undetected even after scrubbing.

    For system administrators this behavior is non-intuitive and in
    a worst case scenario could result in the only good copy of the
    data being unknowingly detached from the mirror.

    This change resolves the issue by reading all replacing/sparing
    mirror children when scrubbing.  When the BP isn't available for
    verification, then compare the data buffers from each child.  They
    must all be identical; if not, there's silent damage and an error
    is returned to prompt the top-level vdev to issue a repair I/O to
    rewrite the data on all of the mirror children.  Since we can't
    tell which child was wrong a checksum error is logged against the
    replacing or sparing mirror vdev.

    Reviewed-by: Mark Maybee <[email protected]>
    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13555

commit deb1213098e2dc10e6eee5e5c57bb40584e096a6
Author: Tino Reichardt <[email protected]>
Date:   Tue Jun 21 23:32:09 2022 +0200

    Fix memory allocation issue for BLAKE3 context

    The kmem_alloc(sizeof (*ctx), KM_NOSLEEP) call on FreeBSD can't be
    used in this code segment. Work around this by pre-allocating a percpu
    context array for later use.

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tino Reichardt <[email protected]>
    Closes #13568

commit b17663f571bfa1ef5e77d3c72f1610bacfc0c6ad
Author: Matthew Thode <[email protected]>
Date:   Tue Jun 21 12:37:20 2022 -0500

    Remove install of zfs-load-module.service for dracut

    The zfs-load-module.service service is not currently provided by
    the OpenZFS repository so we cannot safely assume it exists.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Matthew Thode <[email protected]>
    Closes #13574

commit d51f4ea5f9575847a271f2c99253b072a6ede07e
Author: Alexander Motin <[email protected]>
Date:   Fri Jun 17 18:38:51 2022 -0400

    FreeBSD: Improve crypto_dispatch() handling

    Handle crypto_dispatch() return values same as crp->crp_etype errors.
    On FreeBSD 12 many drivers returned same errors both ways, and lack
    of proper handling for the first ended up in assertion panic later.
    It was changed in FreeBSD 13, but there is no reason to not be safe.

    While there, skip waiting for completion, including locking and
    wakeup() call, for sessions on synchronous crypto drivers, such as
    typical aesni and software.

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13563

commit f60973998593c37f55beea05e74528c1992b7849
Author: Andrew <[email protected]>
Date:   Fri Jun 17 13:44:49 2022 -0500

    expose snapshot count via stat(2) of .zfs/snapshot (#13559)

    Increase nlinks in the stat results of .zfs/snapshot based on the
    snapshot count. This provides a quick and efficient method for
    administrators to get snapshot counts without having to use libzfs
    or list the snapdir contents.

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Andrew Walker <[email protected]>
    Closes #13559

commit 10891b37fa8f2cbf71ec529fc3808113d94d52ef
Author: ixhamza <[email protected]>
Date:   Thu Jun 16 02:26:12 2022 +0500

    libzfs: Prevent overriding of error code

    zfs_send_cb_impl fails to report error for some flags.

    Use second error variable for send_conclusion_record.

    Reviewed-by: Ryan Moeller <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Ameer Hamza <[email protected]>
    Closes #13558
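
The fix pattern, reduced to a stand-alone sketch with placeholder steps: track the final record's status in a second variable and only adopt it when the earlier work succeeded, so a clean final step cannot hide an earlier failure.

```c
#include <stdio.h>

static int send_payload(void)    { return (5); /* pretend this failed */ }
static int send_conclusion(void) { return (0); /* this part succeeds  */ }

int
main(void)
{
    int err = send_payload();
    int err2 = send_conclusion();

    if (err == 0)
        err = err2;   /* keep the first failure; don't let success overwrite it */

    printf("final status: %d\n", err);
    return (err != 0);
}
```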

commit dd8671459f59484b1c20fc1b5e1c3acaa2a290c1
Author: Alexander Motin <[email protected]>
Date:   Wed Jun 15 17:25:08 2022 -0400

    Reduce ZIO io_lock contention on sorted scrub

    During sorted scrub multiple threads (one per vdev) are issuing many
    ZIOs at the same time, all using the same scn->scn_zio_root ZIO as parent.
    It causes huge lock contention on the single global lock on that ZIO.
    Improve it by introducing per-queue null ZIOs, children to that one,
    and using them instead as proxy.

    For 12 SSD pool storing 1.5TB of 4KB blocks on 80-core system this
    dramatically reduces lock contention and reduces scrub time from 21
    minutes down to 12.5, while actual read stages (not scan) are about
    3x faster, reaching 100K blocks per second per vdev.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13553

commit bc00d2c711c5bd930fe63deae3946ed3fb2463b4
Author: crass <[email protected]>
Date:   Wed Jun 15 16:22:52 2022 -0500

    Add support for ARCH=um for x86 sub-architectures

    When building modules (as well as the kernel) with ARCH=um, the options
    -Dsetjmp=kernel_setjmp and -Dlongjmp=kernel_longjmp are passed to the C
    preprocessor for C files. This causes the setjmp and longjmp used in
    module/lua/ldo.c to be kernel_setjmp and kernel_longjmp respectively in
    the object file. However, the setjmp and longjmp that is intended to be
    called is defined in an architecture dependent assembly file under the
    directory module/lua/setjmp. Since it is an assembly and not a C file,
    the preprocessor define is not given and the names do not change. This
    becomes an issue when modpost is trying to create the Module.symvers
    and sees no defined symbol for kernel_setjmp and kernel_longjmp. To fix
    this, if the macro CONFIG_UML is defined, then setjmp and longjmp
    macros are undefined.

    When building with ARCH=um for x86 sub-architectures, CONFIG_X86 is not
    defined. Instead, CONFIG_UML_X86 is defined. Despite this, the UML x86
    sub-architecture can use the same object files as the x86 architectures
    because the x86 sub-architecture UML kernel is running with the same
    instruction set as CONFIG_X86. So the modules/Kbuild build file is
    updated to add the same object files that CONFIG_X86 would add when
    CONFIG_UML_X86 is defined.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Glenn Washburn <[email protected]>
    Closes #13547
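
A rough illustration of the preprocessor workaround (not the exact module/lua change): with ARCH=um the build renames setjmp/longjmp via -D, so the names are undefined again when CONFIG_UML is set and the calls bind to the intended implementations.

```c
#include <setjmp.h>
#include <stdio.h>

#ifdef CONFIG_UML
#undef setjmp     /* drop the -Dsetjmp=kernel_setjmp rename  */
#undef longjmp    /* drop the -Dlongjmp=kernel_longjmp rename */
#endif

static jmp_buf env;

int
main(void)
{
    if (setjmp(env) == 0) {
        printf("taking the jump\n");
        longjmp(env, 1);
    }
    printf("landed\n");
    return (0);
}
```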

commit 988431966639d791ac269011d136e85f3602df75
Author: Damian Szuberski <[email protected]>
Date:   Thu Jun 16 07:20:28 2022 +1000

    Fix clang 13 compilation errors

    ```
    os/linux/zfs/zvol_os.c:1111:3: error: ignoring return value of function
      declared with 'warn_unused_result' attribute [-Werror,-Wunused-result]
                    add_disk(zv->zv_zso->zvo_disk);
                    ^~~~~~~~ ~~~~~~~~~~~~~~~~~~~~

    zpl_xattr.c:1579:1: warning: no previous prototype for function
      'zpl_posix_acl_release_impl' [-Wmissing-prototypes]
    ```

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: szubersk <[email protected]>
    Closes #13551

commit 4ff7a8fa2f1e66c4a10ecd4fd74c9381db24fb02
Author: Allan Jude <[email protected]>
Date:   Tue Jun 14 14:27:53 2022 -0400

    Replace ZPROP_INVAL with ZPROP_USERPROP where it means a user property

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Allan Jude <[email protected]>
    Sponsored-by: Klara Inc.
    Closes #12676

commit 9e605cf155040f76f3971f056a4410ac8f500fe3
Author: Ryan Moeller <[email protected]>
Date:   Mon Jun 13 20:30:34 2022 +0000

    spl: Use a clearer name for the user namespace fd

    This fd has nothing to do with cleanup, that's just the name of the
    field in zfs_cmd_t that was used to pass it to the kernel.

    Call it what it is, an fd for a user namespace.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Allan Jude <[email protected]>
    Signed-off-by: Ryan Moeller <[email protected]>
    Closes #13554

commit def1a401f44abce71196949feaecbd66ebcddf0b
Author: Ryan Moeller <[email protected]>
Date:   Mon Jun 13 20:24:23 2022 +0000

    libzfs: zfs_userns: Don't leak the namespace fd

    zfs_userns opens a file descriptor for the kernel to look up a
    namespace, but does not close it.

    Close the fd when we're done with it.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Allan Jude <[email protected]>
    Signed-off-by: Ryan Moeller <[email protected]>
    Closes #13554

commit 482505fd4219a4c3a91f2857199436df27ad687e
Author: Julian Brunner <[email protected]>
Date:   Sat Jun 11 03:22:14 2022 +0200

    Add weekly and monthly systemd timers for trimming

    On machines using systemd, trim timers can be enabled on a per-pool
    basis. Weekly and monthly timer units are provided. Timers can be
    enabled as follows:

    systemctl enable [email protected] --now
    systemctl enable [email protected] --now

    Each timer will pull in zfs-trim@${poolname}.service, which is not
    schedule-specific.

    The manpage zpool-trim has been updated accordingly.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Julian Brunner <[email protected]>
    Closes #13544

commit 87b46d63b283d9e2c3b10945f37233d1dedef02a
Author: Alexander Motin <[email protected]>
Date:   Fri Jun 10 13:01:46 2022 -0400

    Improve sorted scan memory accounting

    Since we use two B-trees, q_exts_by_size and q_exts_by_addr, we should
    count 2x sizeof (range_seg_gap_t) per node.  And since average B-tree
    memory efficiency is about 75%, we should increase that to 3x.

    The previous code under-counted the memory usage by up to 30%.
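
    A minimal sketch of the arithmetic; the struct below is a stand-in rather
    than the real range_seg_gap_t, and only the 2x / 75% / 3x reasoning comes
    from the text above:

    ```
    #include <stdio.h>

    typedef struct range_seg_gap {
            unsigned long rs_start;
            unsigned long rs_end;
            unsigned long rs_fill;
    } range_seg_gap_t;

    int
    main(void)
    {
            /*
             * Each queued extent lives in two B-trees, and B-tree nodes are
             * only ~75% full on average: 2 / 0.75 = 2.67, rounded up to 3.
             */
            size_t old_estimate = 2 * sizeof (range_seg_gap_t);
            size_t new_estimate = 3 * sizeof (range_seg_gap_t);

            printf("old: %zu, new: %zu bytes per extent\n",
                old_estimate, new_estimate);
            return (0);
    }
    ```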

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13537

commit 4ed5e25074ffec266df38556d9b3a928c5e0dee9
Author: Will Andrews <[email protected]>
Date:   Sun Feb 21 10:19:43 2021 -0600

    Add Linux namespace delegation support

    This allows ZFS datasets to be delegated to a user/mount namespace.
    Within that namespace, only the delegated datasets are visible. This
    works very similarly to Zones/Jails on other ZFS OSes.

    As a user:
    ```
     $ unshare -Um
     $ zfs list
    no datasets available
     $ echo $$
    1234
    ```

    As root:
    ```
     # zfs list
    NAME                            ZONED  MOUNTPOINT
    containers                      off    /containers
    containers/host                 off    /containers/host
    containers/host/child           off    /containers/host/child
    containers/host/child/gchild    off    /containers/host/child/gchild
    containers/unpriv               on     /unpriv
    containers/unpriv/child         on     /unpriv/child
    containers/unpriv/child/gchild  on     /unpriv/child/gchild

     # zfs zone /proc/1234/ns/user containers/unpriv
    ```

    Back to the user namespace:
    ```
     $ zfs list
    NAME                             USED  AVAIL     REFER  MOUNTPOINT
    containers                       129M  47.8G       24K  /containers
    containers/unpriv                128M  47.8G       24K  /unpriv
    containers/unpriv/child          128M  47.8G      128M  /unpriv/child
    ```

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Will Andrews <[email protected]>
    Signed-off-by: Allan Jude <[email protected]>
    Signed-off-by: Mateusz Piotrowski <[email protected]>
    Co-authored-by: Allan Jude <[email protected]>
    Co-authored-by: Mateusz Piotrowski <[email protected]>
    Sponsored-by: Buddy <https://buddy.works>
    Closes #12263

commit a1aa8f14c864b6851649f9c3e74e9f12e6518edd
Author: Allan Jude <[email protected]>
Date:   Fri Jul 2 19:16:58 2021 +0000

    Revert parts of 938cfeb0f27303721081223816d4f251ffeb1767

    When reading and writing the UID/GID, we always want the value
    relative to the root user namespace; the kernel will take care
    of remapping it to the user namespace for us.

    Calling from_kuid(user_ns, uid) with an unmapped uid will return -1,
    as that uid is outside the scope of that namespace.  This results in
    all files inside the namespace being owned by 'nobody' and in chmod
    and chown on them being disallowed.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Allan Jude <[email protected]>
    Closes #12263

commit fc5200aa9b345972fc1b99869b03a373090b84c7
Author: Alexander Motin <[email protected]>
Date:   Thu Jun 9 18:27:36 2022 -0400

    AVL: Remove obsolete branching optimizations

    Modern Clang and GCC can successfully implement simple conditions
    without branching, using math and flag operations.  Using arrays for
    translation no longer helps as much as it did 14+ years ago.
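
    As a minimal illustration (not the OpenZFS code), a three-way comparison
    written as two plain conditions is lowered by current compilers to
    flag/setcc arithmetic with no branches, which is what makes the old
    lookup-table translation unnecessary:

    ```
    #include <stdint.h>
    #include <stdio.h>

    static int
    cmp_u64(uint64_t a, uint64_t b)
    {
            return ((a > b) - (a < b));     /* -1, 0, or 1, branch-free */
    }

    int
    main(void)
    {
            printf("%d %d %d\n", cmp_u64(1, 2), cmp_u64(2, 2), cmp_u64(3, 2));
            return (0);
    }
    ```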

    Disassembly of the code generated by Clang 13.0.0 on FreeBSD 13.1,
    Clang 14.0.4 on FreeBSD 14, and GCC 10.2.1 on Debian 11 with this
    change still shows no branching instructions.

    Profiling of CPU-bound scan stage of sorted scrub shows reproducible
    reduction of time spent inside avl_find() from 6.52% to 4.58%.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13540

commit cdeb98a116a3b31558d6d7cb861b0cbd992a27cc
Author: Ryan Moeller <[email protected]>
Date:   Wed Jun 8 16:32:38 2022 +0000

    libzfs: Rename msg bufs to errbuf for consistency

    `libzfs_pool.c` uses the name `msg`, whereas everywhere else in libzfs
    `errbuf` is used for the error message buffer.

    Use the name consistent with the rest of libzfs and use ERRBUFLEN
    instead of 1024.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Ryan Moeller <[email protected]>
    Closes #13539

commit d68a2bfa994d3d064924e8c303a9a7e8e6e02682
Author: Ryan Moeller <[email protected]>
Date:   Wed Jun 8 13:08:10 2022 +0000

    libzfs: Define the de facto standard errbuf size

    Every errbuf array in libzfs is 1024 chars.

    Define ERRBUFLEN in a shared header, and use it.
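
    A minimal sketch of the intended usage; which header ERRBUFLEN lives in
    is not shown here, and only the 1024-character size comes from the
    commit:

    ```
    #include <stdio.h>

    #define ERRBUFLEN 1024

    int
    main(void)
    {
            char errbuf[ERRBUFLEN];

            (void) snprintf(errbuf, sizeof (errbuf), "cannot open '%s'", "tank");
            (void) puts(errbuf);
            return (0);
    }
    ```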

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Ryan Moeller <[email protected]>
    Closes #13539

commit 6f73d02168ea4d4c27e95d3f23df7221c7321e07
Author: Tony Hutter <[email protected]>
Date:   Thu Jun 9 07:10:38 2022 -0700

    zvol: Support blk-mq for better performance

    Add support for the kernel's block multiqueue (blk-mq) interface in
    the zvol block driver.  blk-mq creates multiple request queues on
    different CPUs rather than having a single request queue.  This can
    improve zvol performance with multithreaded reads/writes.

    This implementation uses the blk-mq interfaces on 4.13 or newer
    kernels.  Building against older kernels will fall back to the
    older BIO interfaces.

    Note that you must set the `zvol_use_blk_mq` module param to
    enable the blk-mq API.  It is disabled by default.

    In addition, this commit lets the zvol blk-mq layer process whole
    `struct request` IOs at a time, rather than breaking them down
    into their individual BIOs.  This reduces dbuf lock contention
    and overhead versus the legacy zvol submit_bio() codepath.

    	sequential dd to one zvol, 8k volblocksize, no O_DIRECT:

    	legacy submit_bio()     292MB/s write  453MB/s read
    	this commit             453MB/s write  885MB/s read

    It also introduces a new `zvol_blk_mq_chunks_per_thread` module
    parameter. This parameter represents how many volblocksize'd chunks
    to process per zvol thread.  It can be used to tune your zvols
    for better read vs. write performance (higher values favor writes,
    lower values favor reads).

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Ahelenia Ziemiańska <[email protected]>
    Reviewed-by: Tony Nguyen <[email protected]>
    Signed-off-by: Tony Hutter <[email protected]>
    Closes #13148
    Issue #12483

commit 985c33b132f6c23a69bd808e008ae0f46131a31e
Author: Tino Reichardt <[email protected]>
Date:   Thu Jun 9 00:55:57 2022 +0200

    Introduce BLAKE3 checksums as an OpenZFS feature

    This commit adds BLAKE3 checksums to OpenZFS.  BLAKE3 has similar
    performance to Edon-R, but without the caveats around the latter.

    Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
    Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3

    Short description from Wikipedia:

      BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
      created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
      Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
      World Crypto. BLAKE3 is a single algorithm with many desirable
      features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
      and BLAKE2, which are algorithm families with multiple variants.
      BLAKE3 has a binary tree structure, so it supports a practically
      unlimited degree of parallelism (both SIMD and multithreading) given
      enough input. The official Rust and C implementations are
      dual-licensed as public domain (CC0) and the Apache License.

    Along with adding the BLAKE3 hash to the OpenZFS infrastructure, a
    new benchmarking file called chksum_bench was introduced.  When read,
    it reports the speed of the available checksum functions.

    On Linux: cat /proc/spl/kstat/zfs/chksum_bench
    On FreeBSD: sysctl kstat.zfs.misc.chksum_bench

    This is an example output of an i3-1005G1 test system with Debian 11:

    implementation      1k      4k     16k     64k    256k      1m      4m
    edonr-generic     1196    1602    1761    1749    1762    1759    1751
    skein-generic      546     591     608     615     619     612     616
    sha256-generic     240     300     316     314     304     285     276
    sha512-generic     353     441     467     476     472     467     426
    blake3-generic     308     313     313     313     312     313     312
    blake3-sse2        402    1289    1423    1446    1432    1458    1413
    blake3-sse41       427    1470    1625    1704    1679    1607    1629
    blake3-avx2        428    1920    3095    3343    3356    3318    3204
    blake3-avx512      473    2687    4905    5836    5844    5643    5374

    Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)

    implementation      1k      4k     16k     64k    256k      1m      4m
    edonr-generic     1840    2458    2665    2719    2711    2723    2693
    skein-generic      870     966     996     992    1003    1005    1009
    sha256-generic     415     442     453     455     457     457     457
    sha512-generic     608     690     711     718     719     720     721
    blake3-generic     301     313     311     309     309     310     310
    blake3-sse2        343    1865    2124    2188    2180    2181    2186
    blake3-sse41       364    2091    2396    2509    2463    2482    2488
    blake3-avx2        365    2590    4399    4971    4915    4802    4764

    Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)

    implementation      1k      4k     16k     64k    256k      1m      4m
    edonr-generic     1213    1703    1889    1918    1957    1902    1907
    skein-generic      434     492     520     522     511     525     525
    sha256-generic     167     183     187     188     188     187     188
    sha512-generic     186     216     222     221     225     224     224
    blake3-generic     153     152     154     153     151     153     153
    blake3-sse2        391    1170    1366    1406    1428    1426    1414
    blake3-sse41       352    1049    1212    1174    1262    1258    1259

    Output on Debian 5.10.0-11-arm64 system: (Pi400)

    implementation      1k      4k     16k     64k    256k      1m      4m
    edonr-generic      487     603     629     639     643     641     641
    skein-generic      271     299     303     308     309     309     307
    sha256-generic     117     127     128     130     130     129     130
    sha512-generic     145     165     170     172     173     174     175
    blake3-generic      81      29      71      89      89      89      89
    blake3-sse2        112     323     368     379     380     371     374
    blake3-sse41       101     315     357     368     369     364     360

    Structurally, the new code is mainly split into these parts:
    - 1x cross-platform generic C variant: blake3_generic.c
    - 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
    - 2x assembly for ARMv8 (NEON converted from SSE2)
    - 2x assembly for PPC64-LE (POWER8 converted from SSE2)
    - one file for switching between the implementations

    Note that the PPC64 assembly requires the VSX instruction set, and that
    the kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
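
    For orientation only, hashing a buffer with the upstream BLAKE3 reference
    C library looks roughly like the sketch below; this uses the project's
    public API, not the OpenZFS-internal checksum hooks:

    ```
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include "blake3.h"

    int
    main(void)
    {
            uint8_t digest[BLAKE3_OUT_LEN];
            const char *msg = "openzfs";
            blake3_hasher hasher;

            blake3_hasher_init(&hasher);
            blake3_hasher_update(&hasher, msg, strlen(msg));
            blake3_hasher_finalize(&hasher, digest, sizeof (digest));

            for (size_t i = 0; i < sizeof (digest); i++)
                    printf("%02x", digest[i]);
            printf("\n");
            return (0);
    }
    ```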

    Reviewed-by: Felix Dörre <[email protected]>
    Reviewed-by: Ahelenia Ziemiańska <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Tino Reichardt <[email protected]>
    Co-authored-by: Rich Ercolani <[email protected]>
    Closes #10058
    Closes #12918

commit b9d98453f9387c413f91d1d9cdb0cba8e04dbd95
Author: Brian Behlendorf <[email protected]>
Date:   Tue May 31 16:42:49 2022 -0700

    autoconf: AC_MSG_CHECKING consistency

    Make the wording more consistent for the kernel AC_MSG_CHECKING
    output (e.g. "checking whether ...").  Additionally, group some
    of the VFS interface checks with the others.  No functional change.

    Reviewed-by: Tony Hutter <[email protected]>
    Reviewed-by: Attila Fülöp <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13529

commit 4c6526208db0d3d5abf44664e74d1e28156a3db7
Author: Brian Behlendorf <[email protected]>
Date:   Tue May 31 16:30:59 2022 -0700

    Linux 5.19 compat: asm/fpu/internal.h

    As of the Linux 5.19 kernel the asm/fpu/internal.h header was
    entirely removed.  It has been effectively empty since the 5.16
    kernel and provides no required functionality.

    Reviewed-by: Tony Hutter <[email protected]>
    Reviewed-by: Attila Fülöp <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13529

commit 42cf2ad0e4e2adfa232f42e4254693467a4cc08c
Author: Alexander Motin <[email protected]>
Date:   Wed Jun 1 12:54:35 2022 -0400

    Remove wrong assertion in log spacemap

    It is typical, but not generally true, that if the log summary has more
    blocks it must also have unflushed metaslabs.  Normally, with metaslabs
    flushed in order, this holds, but there are known exceptions, such as
    device removal or a metaslab being loaded during its flush attempt.

    Before 600a02b8844, if spa_flush_metaslabs() hit a loading metaslab it
    usually stopped (unless the memory limit was also exceeded), but now it
    may flush more metaslabs, just skipping that particular one.  This
    increased the chances of the assertion firing when the skipped metaslab
    is flushed on the next iteration, if all other metaslabs in that summary
    entry have already been flushed out of order.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored-By: iXsystems, Inc.
    Closes #13486
    Closes #13513

commit bc8192cd5b14cf4182bd2b19a6a8ed4f0bbed12b
Author: Rich Ercolani <[email protected]>
Date:   Tue May 31 18:41:33 2022 -0400

    Corrected parameters for zstd early abort

    That'll teach me to try and recall them from the definition.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Rich Ercolani <[email protected]>
    Closes #13519

commit 2310dba9ebf6259515b63fda3202199831669271
Author: Allan Jude <[email protected]>
Date:   Tue May 31 18:37:46 2022 -0400

    Fix typo in zil_commit() comment block

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Ryan Moeller <[email protected]>
    Signed-off-by: Allan Jude <[email protected]>
    Closes #13518

commit a70e613070a8ca96f8214ba1ff61549cbbad0a2f
Author: Brian Behlendorf <[email protected]>
Date:   Tue May 31 14:38:00 2022 -0700

    Linux 5.18 compat: META

    Update the META file to reflect compatibility with the 5.18 kernel.

    Reviewed-by: George Melikov <[email protected]>
    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13527

commit 91350681b8c8b3f0a9b04e6ab3b8931406e87355
Author: Brian Behlendorf <[email protected]>
Date:   Fri May 27 15:56:05 2022 -0700

    Linux 5.19 compat: zap_flags_t conflict

    As of the Linux 5.19 kernel an identically named zap_flags_t typedef
    is declared in the include/linux/mm_types.h Linux header.  Sadly,
    the inclusion of this header cannot be easily avoided.  To resolve
    the conflict, a #define is used to remap the name in the OpenZFS
    sources when building against the Linux kernel.
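
    A self-contained illustration of the remapping technique; the replacement
    identifier and everything else below are invented for the example, not
    taken from the actual OpenZFS change:

    ```
    #include <stdint.h>
    #include <stdio.h>

    /* Pretend this typedef comes from an unavoidable third-party header. */
    typedef unsigned long zap_flags_t;

    /* Remap our identically named type before declaring it. */
    #define zap_flags_t my_zap_flags_t
    typedef uint64_t zap_flags_t;           /* really declares my_zap_flags_t */
    #undef zap_flags_t

    int
    main(void)
    {
            my_zap_flags_t ours = 42;
            zap_flags_t theirs = 7;         /* the third-party type is untouched */

            printf("%llu %lu\n", (unsigned long long)ours, theirs);
            return (0);
    }
    ```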

    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13515

commit d41e864181e4544eca08332b31f85318a3b0e3b3
Author: Brian Behlendorf <[email protected]>
Date:   Fri May 27 21:31:03 2022 +0000

    Linux 5.19 compat: bdev_start_io_acct() / bdev_end_io_acct()

    As of the Linux 5.19 kernel the disk_*_io_acct() helper functions
    have been replaced by the bdev_*_io_acct() functions.

    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13515

commit c2c2e7bb8b7c269904777b61f4b0a678f1ffb9a3
Author: Brian Behlendorf <[email protected]>
Date:   Fri May 27 20:44:43 2022 +0000

    Linux 5.19 compat: aops->read_folio()

    As of the Linux 5.19 kernel the readpage() address space operation
    has been replaced by read_folio().

    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13515

commit a12a5cb5b821f24f26d388094cdac79deb0e879f
Author: Brian Behlendorf <[email protected]>
Date:   Fri May 27 19:40:22 2022 +0000

    Linux 5.19 compat: blkdev_issue_secure_erase()

    Linux 5.19 commit torvalds/linux@44abff2c0 splits the secure
    erase functionality from the blkdev_issue_discard() function.
    blkdev_issue_secure_erase() must now be called to issue
    a secure erase.

    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13515

commit e2c31f2bc7d190fbd8fc5c13bac23daffc5d7b56
Author: Brian Behlendorf <[email protected]>
Date:   Fri May 27 18:20:04 2022 +0000

    Linux 5.19 compat: bdev_max_secure_erase_sectors()

    Linux 5.19 commit torvalds/linux@44abff2c0 removed the
    blk_queue_secure_erase() helper function.  The preferred
    interface is to now use the bdev_max_secure_erase_sectors()
    function to check for secure erase support.

    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13515

commit 5e4aedaca7cee981ed21ac856fd27b4682bb7888
Author: Brian Behlendorf <[email protected]>
Date:   Fri May 27 17:51:55 2022 +0000

    Linux 5.19 compat: bdev_max_discard_sectors()

    Linux 5.19 commit torvalds/linux@70200574cc removed the
    blk_queue_discard() helper function.  The preferred interface
    is to now use the bdev_max_discard_sectors() function to check
    for discard support.

    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13515

commit 5f264996f4dc2d5279afe96698688a20c281c473
Author: Brian Behlendorf <[email protected]>
Date:   Fri May 27 20:28:51 2022 +0000

    Linux 5.18 compat: bio_alloc()

    As of the Linux 5.18 kernel, bio_alloc() expects a block_device struct
    as an argument.  This removes the need for the bio_set_dev() compatibility
    code for 5.18 and newer kernels.

    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #13515

commit 152d6fda54e61042a70059c95c44b364aea0bbd8
Author: Kevin Jin <[email protected]>
Date:   Thu May 26 12:36:14 2022 -0400

    Fix inflated quiesce time caused by lwb_tx during zil_commit()

    In the current zil_commit() process, the transaction lwb_tx is assigned
    in zil_lwb_write_issue() and committed in zil_lwb_flush_vdevs_done().
    Thus, during the lwb write-out process, the txg is held in the open or
    quiescing state until zil_lwb_flush_vdevs_done() is called. If the zil's
    zio latency is high, it will cause txg_sync_thread() …
nicman23 pushed a commit to nicman23/zfs that referenced this pull request Aug 22, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrators this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical; if not, there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
nicman23 pushed a commit to nicman23/zfs that referenced this pull request Aug 22, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 12, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
beren12 pushed a commit to beren12/zfs that referenced this pull request Sep 19, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read.  This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children.  Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.

However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.

While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid.  This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance.  This means the damaged
half of the mirror will go undetected even after scrubbing.

For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.

This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing.  When the BP isn't available for
verification, then compare the data buffers from each child.  They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children.  Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.

Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#13555
Labels
Status: Accepted Ready to integrate (reviewed, tested)
3 participants