Fix for hotplug issues #13797
Conversation
cc: @amotin, @freqlabs
For more context: the first part of this change deals with the fact that udev caches blkid information, which includes fields parsed from vdev labels. The second part of this change deals with device removal.
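As a rough illustration of the first part (not necessarily the exact mechanism this PR uses), userspace can ask udev to re-run its blkid builtin and refresh the cached ID_FS_* properties for a device by emitting a synthetic "change" uevent; trigger_udev_change() below is a hypothetical helper, equivalent to running udevadm trigger --action=change on the device:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Ask udev to refresh its cached properties (including the
 * blkid-derived ID_FS_* fields) for a block device by writing a
 * synthetic "change" event to its sysfs uevent file.
 */
static int
trigger_udev_change(const char *devname)    /* e.g. "sda" */
{
    char path[256];
    int fd;

    (void) snprintf(path, sizeof (path), "/sys/block/%s/uevent", devname);
    fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror("open");
        return (-1);
    }
    if (write(fd, "change\n", 7) != 7) {
        perror("write");
        (void) close(fd);
        return (-1);
    }
    return (close(fd));
}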
I could go on at length about cosmetics, but I have no real objections.
A couple of thoughts before I take a look at the code:

We've debated in the past whether or not to remove a vdev from a pool on a udev remove event. On the surface it seems perfectly logical: udev tells you the disk has been removed, so remove it from the pool. Makes sense. However, at very large scale (1000s of disks), you often see odd disk behavior that you have to account for. Like multiple NVMe drives, on multiple nodes, all simultaneously disappearing and re-appearing within a few seconds of each other. I've seen SAS disks disappear/re-appear as well. And as long as the disks come back within the IO timeout amount of time, and are healthy after they come back, the user wouldn't notice anything (maybe even more so if you're running Lustre on top of ZFS, with its own timeouts/retries on top of it). Just to quantify this strangeness a little: I'm seeing over 1000 zed "EC_dev_remove" events across all our disks in the last month, of which only a handful are actual disk replacements. Had this "remove vdev on udev remove" behavior been enabled, we would have seen tons of pools fault at the same time, causing mass downtime and admin intervention. Since we didn't have it enabled, there were no issues - all the IOs rode through the intermittent period of disks disappearing/re-appearing. In short, I think you should make this behavior configurable as a new setting.

Also, I'm not in favor of the UNAVAIL -> REMOVED rename. "UNAVAIL" already has a lot of historical weight (just google "ZFS unavail"), and it might break scripts if we rename it. UNAVAIL also implies that the drive could come back to life later, which would certainly be accurate in the case of our disappearing/re-appearing drives.
@tonyhutter From that perspective I think it would be reasonable for ZED to just not make overly quick extra moves, like kicking in spares, for a few minutes after a disk disappears, in case it returns. A rebuild will likely take hours anyway, so a few minutes do not matter much. Pardon me if that logic is already there. But if a disk has really disappeared at the block level, it won't handle any new I/O, so hiding that from ZFS is not very productive. On the contrary, we have seen that disk references held by ZFS cause actual problems for NVMe and the surrounding subsystems, which can't free their resources. We also have a huge fleet of ZFS systems, with maybe not multiple thousands but up to a thousand disks each, just based on FreeBSD. And on FreeBSD a disk's disappearance is reported to ZFS immediately, straight from the kernel, without any additional daemon participation. The daemon handles only replacement. And we feel good enough about that.
We found more issues with removal detection that need to be fixed.
Yeah, something like an "autoremove after N seconds" timer would be good. We would probably want the autoremove timer to be longer than the IO timeout (plus the IO retry). The autoremove timer could be added in a future PR.
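A minimal sketch of what such a grace-period timer could look like in a ZED-style daemon, assuming hypothetical hooks device_is_present() and remove_vdev() and a configurable timeout; this is not code from this PR:

#include <stdbool.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical hooks; a real daemon would query udev and libzfs. */
extern bool device_is_present(const char *devid);
extern void remove_vdev(const char *pool, const char *devid);

/*
 * Debounce a udev remove event: only act on it if the device is still
 * missing after a grace period, so transient disappear/re-appear
 * cycles never fault the pool.
 */
static void
handle_remove_event(const char *pool, const char *devid,
    unsigned int grace_seconds)
{
    time_t deadline = time(NULL) + grace_seconds;

    while (time(NULL) < deadline) {
        if (device_is_present(devid))
            return;        /* device came back, do nothing */
        sleep(1);
    }
    remove_vdev(pool, devid);    /* still gone, mark it REMOVED */
}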
Physically removed vdevs are already supposed to transition to VDEV_STATE_REMOVED on IO errors even on Linux, but it doesn't work for several reasons.
Beyond that, this would only affect older kernels. UPDATE: even SAS disks report 0 for removable, so this ever working would be very rare indeed.
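For reference, the removable flag in question is exposed through sysfs and can be inspected with a small helper like the one below (a sketch, not OpenZFS code); as noted above, SAS and SATA disks typically report 0 here even though they are hot-pluggable:

#include <stdio.h>

/*
 * Read the kernel's "removable" flag for a block device, e.g. "sda".
 * Returns 0 or 1 on success, -1 on error.  Logic keyed off this flag
 * rarely fires for enterprise disks, which report 0.
 */
static int
disk_is_removable(const char *devname)
{
    char path[256];
    FILE *fp;
    int removable = 0;

    (void) snprintf(path, sizeof (path), "/sys/block/%s/removable",
        devname);
    if ((fp = fopen(path, "r")) == NULL)
        return (-1);
    if (fscanf(fp, "%d", &removable) != 1)
        removable = -1;
    (void) fclose(fp);
    return (removable);
}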
@freqlabs I just added an updated version and split my work into multiple commits.
I tested this a little using a draid1:4d:9c:1s pool of 9 NVMe drives, powering an NVMe drive off and back on.
@tonyhutter Thank you for your description. I debugged the issue and it seems to be related to virtio disks. The vdev should have a devid (a /dev/disk/by-id link) so it can be identified during removal. However, by default virtio disks do not seem to have a devid unless the serial attribute is hardcoded for the specified disk (systemd/systemd#17670). Even if we provide the serial attribute for virtio disks, ...
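As an illustration of the devid dependency, a small libudev check (a sketch, not part of this PR) can show whether udev reports a serial for a disk at all; without one there is no /dev/disk/by-id link to match against on removal:

#include <libudev.h>

/*
 * Return 1 if udev knows an ID_SERIAL for the given block device
 * (e.g. "vda"), 0 if not, -1 on error.  Virtio disks without a
 * serial attribute configured in QEMU typically return 0.
 */
static int
has_udev_serial(const char *sysname)
{
    struct udev *udev = udev_new();
    struct udev_device *dev;
    const char *serial = NULL;

    if (udev == NULL)
        return (-1);
    dev = udev_device_new_from_subsystem_sysname(udev, "block", sysname);
    if (dev != NULL) {
        serial = udev_device_get_property_value(dev, "ID_SERIAL");
        udev_device_unref(dev);
    }
    udev_unref(udev);
    return (serial != NULL);
}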
@ixhamza - I went back to testing on real-world NVMe devices and have some new results. The good news is that the cache device now correctly shows REMOVED on removal. The bad news is that the spares do not show REMOVED.
I also verified that in both removal cases (the cache and the spare) udev remove events were generated, so ZED should be able to key off those removal events for spares as well, although it might require some additional code.
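As a rough sketch of what that additional code might involve (not something this PR implements), ZED could match the removed device's path against the pool's spare list in the config nvlist; removed_path here is a hypothetical input taken from the udev event:

#include <libzfs.h>
#include <string.h>

/*
 * Check whether a removed device path corresponds to one of the
 * pool's hot spares.  Sketch only: error handling is minimal.
 */
static boolean_t
is_pool_spare(zpool_handle_t *zhp, const char *removed_path)
{
    nvlist_t *config, *nvroot, **spares;
    uint_t nspares, i;
    char *path;

    config = zpool_get_config(zhp, NULL);
    if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &nvroot) != 0)
        return (B_FALSE);
    if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
        &spares, &nspares) != 0)
        return (B_FALSE);
    for (i = 0; i < nspares; i++) {
        if (nvlist_lookup_string(spares[i], ZPOOL_CONFIG_PATH,
            &path) == 0 && strcmp(path, removed_path) == 0)
            return (B_TRUE);
    }
    return (B_FALSE);
}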
@tonyhutter you are right, hotplugging of spares is indeed not supported in this patch, since a spare does not contain the label that we use to identify other vdevs.
Yes, we can address handling spares next.
@behlendorf, @tonyhutter It would be appreciated if you could provide feedback or tag the relevant people to review this PR. These patches are critical for our TrueNAS BETA release.
@ixhamza apologies for the delay, I'll take a look right now
lib/libzfs/libzfs_pool.c
/*
 * Remove the specified vdev. Called from zed on udev remove event.
 */
int
zpool_vdev_remove_wanted(zpool_handle_t *zhp, const char *path)
Both zpool_vdev_remove_wanted() and zpool_vdev_remove() are exported by libzfs and have the same signature:

/*
 * Remove the given device.
 */
int
zpool_vdev_remove(zpool_handle_t *zhp, const char *path)

Can you add some additional comments to differentiate what zpool_vdev_remove_wanted() does vs zpool_vdev_remove()?
That's a very good point. zpool_vdev_remove_wanted() asynchronously removes the vdev, whereas zpool_vdev_remove() does it synchronously. I will test it again to recall what issues I faced when using zpool_vdev_remove() and will add the comment accordingly.
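For illustration, the differentiating comments could read roughly as follows (hypothetical wording, not the exact text that ended up in the PR):

/*
 * Remove the given device.  The removal is performed synchronously:
 * the call does not return until the vdev has been removed from the
 * pool configuration (or the attempt has failed).
 */
int
zpool_vdev_remove(zpool_handle_t *zhp, const char *path);

/*
 * Mark the given vdev as "remove wanted" so the removal is handled
 * asynchronously by the kernel.  Called from ZED when udev reports
 * that the underlying device has been physically removed.
 */
int
zpool_vdev_remove_wanted(zpool_handle_t *zhp, const char *path);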
👍
Just update the comment. Thanks!
@ixhamza it looks good - I really don't have any more comments. I would recommend you squash your commits and use the (slightly modified) commit message:
@tonyhutter Thank you so much for your review and for verifying it on your hardware. I have incorporated your feedback into this PR.
I've offered a few more suggestions in private for Ameer to incorporate, including fixing a nearby memory leak when "remove" events are ignored.
ZED does not take any action for disk removal events if there is no spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED on removal event. This means that if you are running zed and remove a disk, it will be properly marked as REMOVED.

Signed-off-by: Ameer Hamza <[email protected]>
Thank you @freqlabs. I have incorporated your suggested changes.
ZED does not take any action for disk removal events if there is no spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED on removal event. This means that if you are running zed and remove a disk, it will be properly marked as REMOVED.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#13797
Motivation and Context
Fix for hotplug issues
Description
ZED relies on udev to match vdev GUIDs when a device is removed. However, udev does not contain the correct blkid information for the vdev, so the vdev fails to match on detach and we have to rely on the fault handler to make the device unavailable. This PR triggers a disk change event whenever a new vdev is added, so that the blkid information is synced with udev.
This PR also changes the device state to REMOVED, instead of UNAVAIL, whenever the device is unplugged.
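For context, a minimal sketch of how a udev remove event could be turned into a call to the new libzfs entry point (the real ZED agent code does more lookup and error handling; pool discovery is assumed to have already happened):

#include <libzfs.h>

/*
 * On a udev remove event, ask ZFS to asynchronously remove the vdev
 * so its state becomes REMOVED instead of lingering as UNAVAIL.
 */
static int
on_udev_remove(libzfs_handle_t *g_zfs, const char *pool_name,
    const char *dev_path)
{
    zpool_handle_t *zhp = zpool_open_canfail(g_zfs, pool_name);
    int err;

    if (zhp == NULL)
        return (-1);
    err = zpool_vdev_remove_wanted(zhp, dev_path);
    zpool_close(zhp);
    return (err);
}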
How Has This Been Tested?
By hotplugging vdevs through QEMU