
EBUSY when using hot spare #250

Closed

behlendorf opened this issue May 20, 2011 · 15 comments

@behlendorf
Contributor

Hot spares appear to be broken in the latest ZFS source. When attempting to kick in a hot spare, EBUSY is returned when opening the hot spare device. This is likely because O_EXCL is passed as an open flag. Richard Laager first noticed this issue and reported it on the zfs-discuss mailing list.

...
open("/dev/zfs", O_RDWR)                = 3
open("/etc/mtab", O_RDONLY)             = 4
open("/etc/dfs/sharetab", O_RDONLY)     = -1 ENOENT (No such file or directory)
ioctl(3, ITE_GPIO_GEN_CTRL, 0x7fff0653b950) = 0
access("/dev/sdo", F_OK)                = 0
open("/dev/sdo", O_RDWR|O_EXCL|O_DIRECT) = -1 EBUSY (Device or resource busy)
stat("/dev/sdo", {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 224), ...}) = 0
open("/dev/sdo", O_RDONLY|O_EXCL)       = -1 EBUSY (Device or resource busy)
stat("/dev/sdo", {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 224), ...}) = 0
open("/dev/sdo", O_RDONLY)              = 5
fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 224), ...}) = 0

@pyavdr
Contributor

pyavdr commented Apr 18, 2012

While digging into that case, I observed that it gives the same "device busy" error when trying to replace a cache device, which is not really a new case.

I tried to strip all the O_EXCL open modes in the code tree, but ultimately failed on an ioctl which tries to update the partition table and gives up with "device busy". Somewhere in the kernel there is the lock needed for the hot spares, and there is maybe only one hot spot to resolve that issue. I can't find the spot to release the lock before attempting to exchange the vdevs. Maybe just a call to vdev remove is needed to resolve this issue. Can't find the right spot in the code for it.

@dechamps
Contributor

Here are the actual stack traces of the open(O_EXCL) failures from userspace. They all have zpool_do_replace > zpool_do_attach_or_replace > make_root_vdev in common:

  • construct_spec > make_leaf_vdev > is_shorthand_path > is_whole_disk > open
  • check_in_use > is_spare > open
  • check_in_use > check_device > check_disk > check_slice > check_file > open
  • make_disks > zero_label > open

These calls fail because the module already opened the device in exclusive mode from kernel space via spa_import > spa_load_spares > vdev_open > vdev_disk_open.
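
To make the failure mode concrete, here is a minimal userspace sketch of the conflict (the device path is hypothetical; it needs root and an otherwise-unused block device, and relies on the Linux >= 2.6 rule that O_EXCL on a block device requests an exclusive claim):

/*
 * Minimal sketch of the O_EXCL conflict described above. /dev/sdX is
 * a placeholder; in the actual failure the first (exclusive) hold is
 * taken in-kernel by vdev_disk_open, not by another file descriptor.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Stand-in for the module's exclusive hold on the spare. */
    int holder = open("/dev/sdX", O_RDWR | O_EXCL);
    if (holder < 0) {
        perror("first O_EXCL open");
        return 1;
    }

    /* What the zpool code paths listed above effectively do. */
    int fd = open("/dev/sdX", O_RDWR | O_EXCL);
    if (fd < 0)
        printf("second O_EXCL open: %s\n", strerror(errno)); /* EBUSY */

    close(holder);
    return 0;
}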

Maybe just a call to vdev remove is needed to resolve this issue.

That's probably the simplest way. That would make replace A B equivalent to remove B && replace A B for spare devices. I'm not sure of the potential side effects of such a decision. An alternative would be to make special cases for spare devices so they aren't opened in exclusive mode, but 1) that would get messy fast and 2) it reduces protection from operator error.

@dechamps
Contributor

That would make replace A B equivalent to remove B && replace A B for spare devices. I'm not sure of the potential side effects of such a decision.

https://blogs.oracle.com/eschrock/entry/zfs_hot_spares

Note that even though the resilver is completed, the 'spare' vdev stays in-place (unlike a 'replacing' vdev). This is because the replacement is only temporary. Once the original device is replaced, then the spare will be returned to the pool.

If we remove the vdev before replacing it, then it won't be regarded as a spare vdev anymore and we lose this behavior. Maybe we can find a way to make the module close the vdev without making it forget about it entirely.

@pyavdr
Contributor

pyavdr commented Jul 12, 2012

You are absolutely right; vdev remove & vdev replace is only a bad workaround. Please read my discussion with Brian in issue #2. I have a zfsd which checks the zpool events. First, this zfsd should replace a vdev with a hot spare when the vdev becomes unavailable (there should be some checks, like SMART and I/O errors, later on). Next it may be used to do a zpool autoexpand automatically.

I saw these stack traces too, but I don't know the spa_import code path in kernel space. This lock is only needed to prevent others from using the spare vdev while it is associated with the zpool. So this open lock can be released for the needed replace situation, while keeping the state as a spare vdev.

The situation after a vdev replace with spare behaviour can be seen (issue #735) if you use files as vdevs. I simulated a zpool with some hundred vdevs and lots of spares backed by files. zpool status gives "spare in use" after a successful replace op, but the numbering of the vdev/replacing spare is then mixed up (issues #725 and #735). But when using a file as vdev... why is there no lock on the spare vdev file?

There are more errors in this code, like #725 and #735.

When trying to replace a cache device there is the same EBUSY error. I don't know whether it is related too.

@dechamps
Contributor

But when using a file as vdev... why is there no lock on the spare vdev file?

I'm guessing because the kernel only supports exclusive opens on devices, not files. Or maybe it does, but in a completely different way, so it's not implemented in ZFS (yet).
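
That guess can be checked from userspace: per open(2), O_EXCL without O_CREAT is only honored for block devices on Linux, so a file-backed vdev gets no kernel-enforced exclusivity. A small sketch (the file path is hypothetical):

/*
 * On a regular file both "exclusive" opens succeed, because O_EXCL
 * without O_CREAT only has meaning for block devices on Linux; on a
 * block device the second open would fail with EBUSY.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int a = open("/tank/spare.img", O_RDWR | O_EXCL); /* hypothetical path */
    int b = open("/tank/spare.img", O_RDWR | O_EXCL);

    printf("first fd=%d, second fd=%d\n", a, b); /* both >= 0 for a file */

    if (a >= 0) close(a);
    if (b >= 0) close(b);
    return 0;
}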

So this open lock can be released for the needed replace situation, while keeping the state as a spare vdev.

I agree, but the issue is that the module has to close the vdev before the userspace tool begins to work on it. Not only would this probably require a new ioctl code path, it also means that for a short period of time the vdev is still a spare device but is not open in the module. Consequently, the vdev would enter a new "present but closed" state. AFAIK, the current code is not designed to handle such a state.

Bottom line: trying to implement this could very well open a whole new can of bugs and regressions. Or maybe I'm wrong and this won't cause any issues, because the module isn't supposed to do any I/O on the spare device anyway. I'm not sure; this needs more investigation.

@pyavdr
Contributor

pyavdr commented Jul 12, 2012

How does Solaris handle this situation? The short period of "present spare but closed" is the same. Maybe it is possible to close and replace it in an atomic way. There may be some race conditions when a large number of spares are replaced and come back after resilvering at the same time. There should be a test script with a heavy usage scenario to make sure that this situation is handled well.

It looks like this EBUSY situation is totally different from the autoexpand EBUSY situation.

@dechamps
Contributor

How does Solaris handle this situation?

I guess it doesn't have to, because there's no exclusive locking, or maybe the userspace tools are locking-aware.

The short period of "present spare but closed" is the same.

Currently there's no "present spare but closed" period; that's my point. IIRC, the module just locks everything when it modifies the pool configuration, which is a very fast, non-blocking operation (in-memory structures only). Sure, we could lock everything during the entire time the userspace tool does its thing, but that sounds like a very bad idea.

It looks like this EBUSY situation is totally different from the autoexpand EBUSY situation.

Yup.

@behlendorf
Contributor Author

How does Solaris handle this situation?

I guess it doesn't have to, because there's no exclusive locking, or maybe the userspace tools are locking-aware.

That's right: under Solaris, exclusive device locking is just advisory; under Linux it's actually enforced when the O_EXCL flag is used. However, if you're just opening the device in user space O_RDONLY, then it's relatively safe to omit O_EXCL and you should be allowed to open the device.
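
As a sketch of that direction (read_vdev_label() is a hypothetical helper, not actual zpool code), a probe that only reads simply drops O_EXCL:

#include <fcntl.h>
#include <unistd.h>

/* A read-only probe omits O_EXCL, so it can inspect a device the ZFS
 * module already holds exclusively; the strace in the first comment
 * shows exactly this succeeding (open("/dev/sdo", O_RDONLY) = 5). */
static int read_vdev_label(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, buf, len, 0);
    close(fd);
    return n < 0 ? -1 : 0;
}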

@pyavdr
Contributor

pyavdr commented Dec 13, 2012

@behlendorf
Brian, I just want to know... is there any time for work and progress on this issue? Lots of issues have been solved this year, and it looks like the new-issue situation is calming down these days. Great work, but this #250 still needs some work too. Another one is #120, where the situation is fairly unclear. The last remaining is #735. As there may be some help with the zfsd from tigloo (see #2), we may look forward to implementing a basic zfsd, which could at least trigger a replace with a spare disk.

@behlendorf
Contributor Author

My expectation is that we'll be able to get these issues resolved next year. But obviously, the more developers contributing, the quicker this will happen.

@behlendorf
Contributor Author

I've opened pull request #1325 to address this issue but I'd love to see it get some additional testing. Please give it a try.

behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 28, 2013
The issue with hot spares in ZoL is because it opens all leaf
vdevs exclusively (O_EXCL).  On Linux, exclusive opens cause
subsequent exclusive opens to fail with EBUSY.

This could be resolved by not opening any of the devices
exclusively, which is what Illumos does, but the additional
protection offered by exclusive opens is desirable.  It cleanly
prevents you from accidentally adding an in-use non-ZFS device
to your pool.

To fix this we very slightly relaxed the usage of O_EXCL in
the following ways.

1) Functions which open the device but only read had the
   O_EXCL flag removed and were updated to use O_RDONLY.

2) A common holder was added to the vdev disk code.  This
   allows the ZFS code to internally open the device multiple
   times while non-ZFS callers may not.

3) An exception was added to make_disks() for hot spares when
   creating partition tables.  For hot spare devices which
   are already opened exclusively we skip creating the partition
   table, because this must already have been done when the disk
   was originally added as a hot spare.

Additional minor related changes include:

1) Updating check_in_use() to use a Linux partition suffix
   instead of a slice suffix.

2) is_spare() was moved above make_disks() to avoid adding
   a forward reference.  It was also slightly extended to
   allow a NULL configuration to be passed.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#250
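
To illustrate point 2 of the commit message, here is a hedged sketch of the common-holder idea against the Linux block layer API of that era; the identifiers are illustrative, not the actual vdev_disk.c code:

#include <linux/blkdev.h>

/* One module-wide holder identity shared by every ZFS open. */
static char zfs_holder_token;

static struct block_device *zfs_vdev_open_excl(const char *path)
{
        /* blkdev_get_by_path() grants a second FMODE_EXCL claim only
         * when the same holder pointer is passed, so ZFS may reopen a
         * device it already holds while non-ZFS exclusive openers
         * still get ERR_PTR(-EBUSY). */
        return blkdev_get_by_path(path,
            FMODE_READ | FMODE_WRITE | FMODE_EXCL, &zfs_holder_token);
}
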
@pyavdr
Contributor

pyavdr commented Mar 1, 2013

@behlendorf

Brian, it is basically working now ... I can add a spare disk, replace a vdev with the spare, and detach the replaced vdev afterwards. The pool resilvers on every action. Finally, adding the detached vdev again as a spare works too.
But I cannot share a spare disk between two pools, as described in the Oracle docs (http://docs.oracle.com/cd/E19082-01/817-2271/gcvcw/index.html):

zpool create stor2 mirror sdj sdk mirror sdn sdo spare sdt sdu
zpool create stor3 mirror sdx sdw spare sdt sdu
warning: device in use checking failed: Device or resource busy

But the numbering of spares in use is already wrong; that is open issue #735.

@behlendorf
Contributor Author

@pyavdr Thanks for the additional testing. I've fixed the shared spare case (good catch) and, after some additional testing, merged it into master. As for #735, I looked into it and it doesn't look like a bug. It may not be exactly what you expected, but it's working as intended; I'll comment further in #735.

@pyavdr
Contributor

pyavdr commented Mar 2, 2013

OK, here is a nice example of a shared spare vdev now:

pool: stor2
state: ONLINE
scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    stor2       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdj     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdn     ONLINE       0     0     0
        sdo     ONLINE       0     0     0
    spares
      sdt       INUSE     in use by pool 'stor3'
      sdu       AVAIL   

errors: No known data errors

pool: stor3
state: ONLINE
scan: resilvered 144K in 0h0m with 0 errors on Sat Mar 2 09:38:28 2013
config:

    NAME         STATE     READ WRITE CKSUM
    stor3        ONLINE       0     0     0
      mirror-0   ONLINE       0     0     0
        spare-0  ONLINE       0     0     0
          sdx    ONLINE       0     0     0
          sdt    ONLINE       0     0     0
        sdw      ONLINE       0     0     0
    spares
      sdt        INUSE     currently in use
      sdu        AVAIL   

errors: No known data errors

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
fuhrmannb pushed a commit to fuhrmannb/cstor that referenced this issue Nov 3, 2020
We will exchange this op before starting the rebuild, which
will exchange the size of the snapshot being rebuilt. If the
downgraded replica finds that it needs to resize the volume,
it will resize it first and then start rebuilding from the
helper.

Signed-off-by: Pawan <[email protected]>
mmaybee pushed a commit to mmaybee/openzfs that referenced this issue Apr 6, 2022
The kernel expects "invalid_features" to be an NvList-type nvpair,
but we are passing a StringArray-type, resulting in a kernel panic.
This commit changes it to return an NvList.
andrewc12 added a commit to andrewc12/openzfs that referenced this issue Aug 3, 2023
EchterAgo pushed a commit to EchterAgo/zfs that referenced this issue Sep 21, 2023