Incorrect ashift value may cause faulted pools #425

Closed
gunnarbeutner opened this issue Oct 20, 2011 · 2 comments
@gunnarbeutner

While investigating another issue I've come across this:

root@zt:~# zpool create -f -o ashift=16 tank sdb
root@zt:~# dd if=/dev/zero of=/tank/test bs=1M &
[1] 3405
root@zt:~# sleep 5; echo b > /proc/sysrq-trigger

after the reboot:

root@zt:~# zpool status
  pool: tank
 state: FAULTED
status: One or more devices could not be used because the label is missing
        or invalid.  There are insufficient replicas for the pool to continue
        functioning.
action: Destroy and re-create the pool from
        a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        FAULTED      0     0     0  corrupted data
          sdb       UNAVAIL      0     0     0  corrupted data
root@zt:~#

I'm able to reproduce this same behavior with ashift values as low as 14. The disk in this test is an iSCSI LUN (with 64kB sectors) that's passed-through to the VM by VMware as a disk with 512-byte sectors.
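
For reference, ashift is the base-2 logarithm of the sector size the pool assumes, so the values mentioned here translate to bytes as shown below. This is just a quick sketch in Python; the helper name is made up for illustration.

```python
def sector_bytes(ashift: int) -> int:
    """Sector size in bytes implied by a given ashift (2 ** ashift)."""
    return 1 << ashift


# ashift values that come up in this report
for ashift in (9, 12, 14, 16):
    print(f"ashift={ashift:2d} -> {sector_bytes(ashift)} bytes")
```

So ashift=16 corresponds to the 64 kB sectors of the underlying LUN, and ashift=9 to the 512-byte sectors the VM sees.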

There are two things I'm currently wondering about:

a) Can this also happen with lower ashift values (e.g. ashift=12 on disks with 512-byte sectors), and is it because ZFS assumes that single-sector writes are atomic? (I'm really just guessing here.)

b) Depending on how (a) turns out, is this suggestion from zpool(8) really such a great idea?

       For optimal performance, the pool sector size should be greater than or equal to the
       sector size of the underlying disks. Since the property cannot be changed after pool
       creation,  if in a given pool, you ever want to use drives that report 4KiB sectors,
       you must set ashift=12 at pool creation time.
@gunnarbeutner

Actually, even without dd/reboot the pool faults when it's exported/re-imported or when I try to scrub the pool. Hmm.

@behlendorf

That's not good. It isn't immediately clear to me why larger ashift values are harmful, but they are clearly damaging the label. A larger ashift affects the total number of uberblocks that can be stored in the fixed-size labels, and perhaps we have an overrun. It would probably be wise to restrict the maximum ashift to 12, which has been well tested.

Rudd-O pushed a commit to Rudd-O/zfs that referenced this issue Feb 1, 2012
While we initially allowed you to set your ashift as large as 17
(SPA_MAXBLOCKSIZE) that is actually unsafe.  What wasn't considered
at the time is that each uberblock written to the vdev label ring
buffer will be of this size.  Now the buffer is statically sized
to 128k and we need to be able to fit several uberblocks in it.
With a large ashift that becomes a problem.

Therefore I'm reducing the maximum configurable ashift value to 12.
This is large enough for the 4k sector drives and small enough that
we can still keep the most recent 32 uberblocks in the vdev label
ring buffer.

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#425
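
The arithmetic behind that limit can be sketched roughly as follows. This is an illustration, not the actual ZFS code: the 128k ring size comes from the commit message above, while the constant names and the 1 KiB minimum slot size are assumptions modelled on the vdev label macros.

```python
VDEV_UBERBLOCK_RING = 128 * 1024  # fixed ring size per label, in bytes (from the commit message)
UBERBLOCK_SHIFT = 10              # assumed minimum slot size of 1 KiB


def uberblock_slots(ashift: int) -> int:
    """Number of uberblocks that fit in the ring for a given ashift."""
    # Each uberblock occupies one slot of at least 2 ** ashift bytes.
    shift = max(ashift, UBERBLOCK_SHIFT)
    return VDEV_UBERBLOCK_RING >> shift


for ashift in (9, 12, 14, 16, 17):
    print(f"ashift={ashift:2d} -> {uberblock_slots(ashift)} uberblock slots")
```

Under these assumptions ashift=12 leaves the 32 slots mentioned in the commit message, while ashift=16 leaves only 2 and ashift=17 a single one.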
ahrens added a commit to ahrens/zfs that referenced this issue Sep 9, 2021
…penzfs#425)

This module provides a mechanism to have the Agent process randomly exit at
certain "interesting" points, to test recovery on restart.  To use it, set
the "die_mtbf_secs" tunable to the desired mean time between failures, in
seconds.  A time point between 0 and 2x the configured time will be selected
as the amount of time to run before dying.  At that point, a random call
site of `maybe_die_with()` will be selected to exit the process.

Note that each *call site* (source file, line, column) is equally likely to
die, not each *call* (invocation of maybe_die_with()).  For example, if
maybe_die_with() is called 1000x/sec from one call site and 1x/sec from
another call site, we will be equally likely to terminate via each of the 2
call sites.  Therefore you don't need to worry about adding a high-frequency
caller and having it "always" die on that caller.
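
A minimal sketch of the selection described above, written in Python for illustration. The class, its parameters, and the call-site strings are hypothetical. Only the two rules come from the commit message: a run time between 0 and 2x the configured MTBF, and an equal chance for every call site. Drawing the run time uniformly is an assumption.

```python
import random


class DieInjector:
    """Hypothetical stand-in for the mechanism behind maybe_die_with()."""

    def __init__(self, mtbf_secs, call_sites, rng):
        # Assumption: the run time is drawn uniformly from [0, 2 * MTBF],
        # which makes its mean equal to the configured MTBF.
        self.deadline = rng.uniform(0.0, 2.0 * mtbf_secs)
        # Every call site has the same chance, however often it is invoked.
        self.target = rng.choice(sorted(call_sites))

    def should_die(self, site, now):
        # Die only once the deadline has passed and the chosen site is hit.
        return now >= self.deadline and site == self.target


# Made-up (file, line, column) call sites: one "hot", one "cold".
sites = {"hot_path:10:5", "cold_path:20:9"}
rng = random.Random(425)
picks = [DieInjector(60.0, sites, rng).target for _ in range(10_000)]
hot_share = picks.count("hot_path:10:5") / len(picks)
print(f"hot site chosen in {hot_share:.0%} of runs")
```

Because the target is picked from the set of call sites rather than from the stream of calls, the hot site is chosen in roughly half of the runs regardless of how often it is invoked.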
sdimitro pushed a commit to sdimitro/zfs that referenced this issue May 23, 2022