EBUSY when using hot spare #250
While digging into that case, I observed that it gives the same "device busy" error when trying to replace a cache device, which is not really a new case. I tried to strip all the O_EXCL open modes in the code tree, but finally failed on an ioctl which tries to update the partition table and gives up with "device busy". Somewhere in the kernel there is the lock needed for the hot spares, and there is maybe only one hot spot to resolve that issue. I can't find the spot to release the lock before attempting to exchange the vdisks. Maybe just a call to vdev remove is needed to resolve this issue, but I can't find the right spot in the code for it.
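For reference, a partition-table update of this kind is typically done with the BLKRRPART ioctl (an assumption, since the comment doesn't name the exact call), which returns EBUSY while the device has other users. A minimal userspace sketch; the helper name and device path are illustrative:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKRRPART */

/* Hypothetical helper: ask the kernel to re-read the partition table. */
static int
reread_partitions(const char *dev)
{
	int fd = open(dev, O_RDONLY);
	if (fd < 0) {
		fprintf(stderr, "open %s: %s\n", dev, strerror(errno));
		return (-1);
	}

	/*
	 * Fails with EBUSY while another holder (for example an
	 * exclusive kernel-space open) is using the device.
	 */
	int rc = ioctl(fd, BLKRRPART);
	if (rc < 0)
		fprintf(stderr, "BLKRRPART %s: %s\n", dev, strerror(errno));

	close(fd);
	return (rc);
}
```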
Here are the actual stack traces of
These calls fail because the module already opened the device in exclusive mode from kernel space via
That's probably the simplest way. That would make
https://blogs.oracle.com/eschrock/entry/zfs_hot_spares
If we remove the vdev before replacing it, then it won't be regarded as a spare vdev anymore and we lose this behavior. Maybe we can find a way to make the module close the vdev without making it forget about it entirely.
You are absolutely right; vdev remove & vdev replace is only a bad workaround. Please read my discussion with Brian in issue #2. I have a zfsd which checks the zpool events. First, this zfsd should replace an unavailable vdev with a hot spare in case a vdev is not available (there should be some checks like SMART and I/O errors later on). Next, it may be used to do a zpool autoexpand automatically. I saw these stack traces too, but don't know the spa_import code path in kernel space. This lock is only needed to prevent others from using the spare vdev while it is associated with the zpool, so the open lock could be released for the needed replace situation while keeping the state as a spare vdev. The situation after a vdev replace with spare behaviour can be seen (issue #735) if you use some files as vdevs. I simulated a zpool with some hundred vdevs and lots of spares backed by files. Zpool status gives "spare in use" after a successful replace op, but the numbering of the vdev/replace spare is crowded then (issues #725 and #735). But when using a file as a vdev, why is there no lock on the spare vdev file? There are more errors in this code, like #725 and #735. When trying to replace a cache device there is the same EBUSY error; I don't know whether it is related too.
I'm guessing because the kernel only supports exclusive opens on devices, not files. Or maybe it does, but in a completely different way, so it's not implemented in ZFS (yet).
I agree, but the issue is that the module has to close the vdev before the userspace tool begins to work on it. Not only would this probably require a new ioctl code path, it also means that for a short period of time the vdev is still a spare device but is not open in the module. Consequently the vdev would enter a new "present but closed" state. AFAIK, the current code is not designed to handle such a state. Bottom line is, trying to implement this could very well open a whole new can of bugs and regressions. Or maybe I'm wrong and this won't cause any issues, because the module isn't supposed to do any I/O on the spare device anyway. I'm not sure. This needs more investigation.
How does Solaris handle this situation? The short period of "present spare but closed" is the same. Maybe it is possible to close and replace it in an atomic way. There may be some race conditions when a large number of spares are replaced and come back after resilvering at the same time. There should be a test script with a heavy usage scenario to make sure that this situation is handled well. It looks like this EBUSY situation is totally different from the autoexpand EBUSY situation.
I guess it doesn't have to, because there's no exclusive locking, or maybe the userspace tools are locking-aware.
Currently there's no "present spare but closed" period; that's my point. IIRC, the module just locks everything when it modifies the pool configuration, which is a very fast, non-blocking operation (in-memory structures only). Sure, we could lock everything during the entire time the userspace tool does its thing, but that sounds like a very bad idea.
Yup.
That's right, under Solaris exclusive device locking is just advisory. Under Linux it's actually enforced when the O_EXCL flag is used. However, if you're just opening the device in user space O_RDONLY, then it's relatively safe to omit O_EXCL and you should be allowed to open the device.
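A minimal userspace sketch of the enforced semantics described above, assuming a placeholder device node /dev/sdX: while one descriptor holds the exclusive claim, a second O_EXCL open fails with EBUSY, but a plain O_RDONLY open still succeeds.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *dev = "/dev/sdX";	/* placeholder block device */

	/* First exclusive open succeeds and takes the claim. */
	int fd1 = open(dev, O_RDONLY | O_EXCL);
	if (fd1 < 0) {
		fprintf(stderr, "first open: %s\n", strerror(errno));
		return (1);
	}

	/* A second exclusive open is rejected with EBUSY... */
	int fd2 = open(dev, O_RDONLY | O_EXCL);
	if (fd2 < 0)
		printf("second O_EXCL open: %s\n", strerror(errno));
	else
		close(fd2);

	/* ...but a plain read-only open is still allowed. */
	int fd3 = open(dev, O_RDONLY);
	if (fd3 >= 0) {
		printf("plain O_RDONLY open succeeded\n");
		close(fd3);
	}

	close(fd1);
	return (0);
}
```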
@behlendorf
My expectation is that we'll be able to get these issues resolved next year. But obviously, the more developers contributing, the quicker this will happen.
I've opened pull request #1325 to address this issue, but I'd love to see it get some additional testing. Please give it a try.
The issue with hot spares in ZoL is that it opens all leaf vdevs exclusively (O_EXCL). On Linux, exclusive opens cause subsequent exclusive opens to fail with EBUSY. This could be resolved by not opening any of the devices exclusively, which is what Illumos does, but the additional protection offered by exclusive opens is desirable: it cleanly prevents you from accidentally adding an in-use non-ZFS device to your pool. To fix this we very slightly relaxed the usage of O_EXCL in the following ways:

1) Functions which open the device but only read had the O_EXCL flag removed and were updated to use O_RDONLY.

2) A common holder was added to the vdev disk code. This allows the ZFS code to internally open the device multiple times, but non-ZFS callers may not.

3) An exception was added to make_disks() for hot spares when creating partition tables. For hot spare devices which are already opened exclusively we skip creating the partition table, because this must already have been done when the disk was originally added as a hot spare.

Additional minor related changes include: 1) updating check_in_use() to use a Linux partition suffix instead of a slice suffix; 2) moving is_spare() above make_disks() to avoid adding a forward reference, and slightly extending it to allow a NULL configuration to be passed.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#250
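For point 2, the kernel keys exclusive claims on the holder pointer passed to its block-device open API, so one module-wide holder lets ZFS re-open a device it already holds while foreign exclusive opens still fail. A hedged sketch of the idea; the holder value and function name are illustrative, not the actual ZFS symbols:

```c
#include <linux/blkdev.h>
#include <linux/fs.h>

/* One holder token shared by every ZFS open (illustrative value). */
static void *zfs_vdev_holder = (void *)0x2f54;

static struct block_device *
vdev_disk_open_excl(const char *path)
{
	/*
	 * Exclusive claims are matched by holder: a second claim with
	 * the same holder succeeds, while any other exclusive opener
	 * (kernel or userspace O_EXCL) gets ERR_PTR(-EBUSY).
	 */
	return blkdev_get_by_path(path,
	    FMODE_READ | FMODE_WRITE | FMODE_EXCL, zfs_vdev_holder);
}
```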
Brian, it is basically working now ... I can add a spare disk, replace a vdev with the spare, and detach the replaced vdev afterwards. The pool resilvers on every action. Finally, adding the detached vdev again as a spare works too.
But the numbering of spares in use is still wrong; it is the open issue #735.
@pyavdr Thanks for the additional testing. I've fixed the shared spare case (good catch) and, after some additional testing, merged it into master. As for #735, I looked into it and it doesn't look like a bug. It may not be exactly what you expected, but it's working as intended; I'll comment further in #735.
Ok, there is a nice sample of a shared spare vdev now:

  pool: stor2
errors: No known data errors

  pool: stor3
errors: No known data errors
We will exchange this op before starting the rebuild, which will exchange the size of the snapshot being rebuilt. If the downgraded replica finds that it needs to resize the volume, it will resize it first and then start rebuilding from the helper. Signed-off-by: Pawan <[email protected]>
The kernel expects "invalid_features" to be an NvList-type nvpair, but we are passing a StringArray-type, resulting in a kernel panic. This commit changes it to return an NvList.
Signed-off-by: Andrew Innes <[email protected]>
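A libnvpair-flavored C sketch of the type mismatch that commit describes (the feature name is hypothetical, and the referenced commit's actual code is not shown here): the "invalid_features" nvpair must carry a nested nvlist, not a string array.

```c
#include <libnvpair.h>

/*
 * Build the "invalid_features" nvpair the way the kernel expects:
 * as a nested nvlist, not a string array. Error handling elided.
 */
static nvlist_t *
build_invalid_features(void)
{
	nvlist_t *outer, *features;

	nvlist_alloc(&outer, NV_UNIQUE_NAME, 0);
	nvlist_alloc(&features, NV_UNIQUE_NAME, 0);

	/* Hypothetical feature entry. */
	nvlist_add_uint64(features, "com.example:some_feature", 0);

	/*
	 * Wrong (panics the kernel, per the commit message):
	 *   nvlist_add_string_array(outer, "invalid_features", ...);
	 * Right: an NvList-type nvpair.
	 */
	nvlist_add_nvlist(outer, "invalid_features", features);
	nvlist_free(features);	/* nvlist_add_nvlist() copies */

	return (outer);
}
```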
Hot spares appear to be broken in the latest ZFS source. When attempting to kick in a hot spare, EBUSY is returned when opening the hot spare device. This is likely because O_EXCL is passed as an option. Richard Laager first noticed this issue and reported it on the zfs-discuss mailing list.