-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved libblkid/libudev integration #4523
Conversation
Is it necessary to set a name at all? What happens with the udev rules if you don't? |
Unfortunately I can't think of a better way of doing this. What with the ability to rename pools and reguid a pool it's hard to make a unique yet persistent name which is still meaningful. I guess this is as good as you can ask for it. |
@behlendorf I could be missing it (I'm battling a cold at the moment), but I don't see the question about no label in #4517. |
I tested #4523. It remedies the systemd-udev errors appearing in /var/log/messages but the zpool create issue still occurs. It appears systemd-udev is still interfering with the zpool create process. The partitions are made correctly and have the new style of zfs-random64 partition names.
However I can manually create device and partition symlinks in a directory not controlled by systemd-udev I am able to create a pool. I am also able to export it, delete the manually created links and then import the pool. The import uses the /dev/disk/by-vdev/... device links with no errors. I will continue to try and determine the root cause. |
It appears that systemd-udev-settle service may be too slow for the zpool create process. First, Run 'strace zpool create spaceballs raidz A0 A1 A2 A3' The zpool create fails. strace output here (1648 lines): http://pastebin.com/gL6VUwYf Next, Look at performance of systemd using 'systemd-analyze blame'
Looking at /dev/disk/by-vdev the device links are created properly, pointing to the right devices and partitions. The device partitions can be written to and read from. A zpool can be successfully created on these same devices by calling out their /dev/sdX names as opposed to dev links in /dev/disk/by-[path,vdev,id]. In the pastebin (linked above) if you look at lines 479-866, 994-1126, 1254-1383 and 1504-1632 the sequence below repeats for each drive declared in the zpool create command.
It may be that a longer sleep is needed to allow for systemd-udev-settle to finish creating the links. That or perform an exorcism on systemd forcing it to leave RHEL/CentOS 7 and return to the level of Hell where it came from. |
@AeonJJohnson based on the strace output you posted you're definitely past all the usual partition creation logic after the failure occurs. If you take a close look at the log you'll see it polling for the partitions to be created and eventually all four are.
After that it attempts to do the ZFS_IOC_POOL_CREATE ioctl() which odds fails with ENOENT. That's why you're seeing such a generic error message about not being able to create the pool. Unfortunately, the strace log isn't going to show us what error path we hit in the kernel to cause this and there are a few of them. We're going to need to get a more detailed log from ftrace.
|
@behlendorf Here is an ftrace of the zpool create operation. (16MB) |
@AeonJJohnson updated fix which fully resolves the issue. The patch relies in part on the new libudev support in master so it won't apply cleanly to 0.6.5.x. However, we can safely drop those hunks and apply a version to the release branch. |
@rlaager @DeHackEd @don-brady this patch is ready for review, if you have time please look it over and provide feedback. It should address multiple issues which have been hard to reliably reproduce in the past. |
@behlendorf I don't know how much review I can add, except that I'm still unclear on why it's necessary to name the ZFS partitions at all. If it's not necessary, simply not setting a name is much simpler (both in terms of less code and for the system administrator) than setting these randomized names. |
@behlendorf based on reliance on libudev and master when do you expect it would land in a release? 0.7? Sooner? |
@AeonJJohnson definitely by 0.7. No promises but a version of it may make it in to 0.6.5.7. Based on the issue tracker this has clearly been a sore spot for some systems and should be addressed and it's a low risk change change. @rlaager yeah, I'm not sold on the unique partition labels myself. They're better then systemd warnings but in practice I'm not convinced |
@@ -260,6 +260,91 @@ udev_device_is_ready(struct udev_device *dev) | |||
} | |||
|
|||
/* | |||
* Wait timeout milliseconds for a newly created device to be available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really pedantic, I know, but it should be "timeout_ms milliseconds" (with the _ms).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good with pedantic, carry on! The more eyes on this the better. Easy fix, will do.
@behlendorf @rlaager /dev/disk/by-partlabel isn't really useful but the multiple devices with the same partition name triggers errors reported by systemd-udev and may be causing additional time delays in getting the device links created and stable. We already know we are dealing with a sluggish and bloated systemd environment. Making unique part names makes systemd happy and likely quicker to finish executing enumerating and creating links. I think the usefulness of the unique partition names ends there. |
|
||
/* Must have a pathname and it must be absolute. */ | ||
if (v->vdev_path == NULL || v->vdev_path[0] != '/') { | ||
v->vdev_stat.vs_aux = VDEV_AUX_BAD_LABEL; | ||
return (EINVAL); | ||
return (SET_ERROR(EINVAL)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the three SET_ERROR changes logically related to this patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in the sense that it would be been far easier to identify the root cause of the problem if they had existed. Initially it wasn't at all clear what part of create path was returning ENOENT.
Is this 50ms delay per-device? (I think so.) Are devices processed serially? (I think so.) Are you concerned about this adding up to meaningful amounts of time on large pools? This may be a dumb question (since I don't fully understand this)... |
@AeonJJohnson: When you say, "Making unique part names makes systemd happy and likely quicker to finish executing enumerating and creating links.", are you saying that unique partition names are better than non-unique partition names? If so, I don't disagree. My question is: are unique partition names better than no partition names? |
@rlaager as someone who as had to perform crisis storage brain surgery on non-backed-up data storage I would say that partitions having unique partition names would be useful in the event you had to get elbow deep in the guts to save data. Outside of those situations where you are drawing diagrams and performing trauma surgery there isn't really a need for partition names. But when you are, valuable. |
@AeonJJohnson Thanks. That's something I hadn't considered. |
The delays in
From what I observed in my testing, checking all the links, as opposed to just one, gave a better indication of if udev had fully settled. Somewhat to my surprise I didn't see a better way to do this with libudev. We could definitely rely on a fallback version as you suggested but my subjective impression was it wouldn't be as accurate. My feeling was the more we could catch prior to the ioctl() the better. |
Thanks! I have no other review questions/comments. |
@rlaager I've addressed your comments, and refreshed the patch with a small fix to I've also added a patch to this stack which resolves the issues from #3043. I've renamed the PR so it's no something of a catch-all for liblkid/libudev integration bugs. |
if (id == 0) | ||
id = (((uint64_t)rand()) << 32) | (uint64_t)rand(); | ||
|
||
sprintf(label_name, "zfs-%016llx", (u_longlong_t) id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use: snprintf(label_name, label_size, "zfs-%016llx", (u_longlong_t) id);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you just need a unique ID, could you make id
static and (atomically) increment it on every zpool_label_name()
call? That would remove the need for the random number code.
zpool_label_name() {
static uint64_t id = 0;
id++;
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would work for zpool create
, but the trouble is it needs to be a persistent unique ID. We need to avoid the case where a pool ends up with two partitions using the same name. You could see this happening quickly if every zpool replace
ended up generating zfs-0
as the partition name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
snprintf()
is a good idea.
When partitioning a device a name may be specified for each partion. Internally zfs doesn't use this partition name for anything so it has always just been set to "zfs". However this isn't optimal because udev will create symlinks using this name in /dev/disk/by-partlabel/. If the name isn't unique then all the links cannot be created. Therefore a random 64-bit value has been added to the partition label, i.e "zfs-1234567890abcdef". Additional information could be encoded here but since partitions may be reused that could result in confusion and it was decided against. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#4517
When ZFS partitions a block device it must wait for udev to create both a device node and all the device symlinks. This process takes a variable length of time and depends factors such how many links must be created, the complexity of the rules, etc. Complicating the situation further it is not uncommon for udev to create and then remove a link multiple times while processing the rules. Given the above, the existing scheme of waiting for an expected partition to appear by name isn't 100% reliable. At this point udev may still remove and recreate think link resulting in the kernel modules being unable to open the device. In order to address this the zpool_label_disk_wait() function has been updated to use libudev. Until the registered system device acknowledges that it in fully initialized the function will wait. Once fully initialized all device links are checked and allowed to settle for 50ms. This makes it far more certain that all the device nodes will existing when the kernel modules need to open them. For systems without libudev an alternate zpool_label_disk_wait() was implemented which includes a settle time. In addition, the kernel modules were updated to include retry logic for this ENOENT case. Due to the improved checks in the utilities it is unlikely this logic will be invoked, however in the rare event it is needed to will prevent a failure. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#3708 Issue openzfs#4077 Issue openzfs#4144 Issue openzfs#4214 Issue openzfs#4517
Disable the additional EFI debugging in all builds. Some users run debugging builds in production and the extra log messages can cause confusion. Beyond that the log messages are rarely userful. Signed-off-by: Brian Behlendorf <[email protected]>
When importing a pool using the blkid cache only the device node path was added to the list of known paths for a device. This results in 'zpool import' always using the sdX names in preference to the 'path' name stored in the label. To fix the issue the blkid import path has been updated to add both the 'path', 'devid', and 'devname' names from the label to known paths. Sanity check is done to ensure these paths do refer to the same device identified by blkid. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#3043
Latest changes look good to me. |
Disable the additional EFI debugging in all builds. Some users run debug builds in production and the extra log messages can cause confusion. Beyond that the log messages are rarely useful. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#4523
When importing a pool using the blkid cache only the device node path was added to the list of known paths for a device. This results in 'zpool import' always using the sdX names in preference to the 'path' name stored in the label. To fix the issue the blkid import path has been updated to add both the 'path', 'devid', and 'devname' names from the label to the known paths. A sanity check is done to ensure these paths do refer to the same device identified by blkid. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#4523 Closes openzfs#3043
When importing a pool using the blkid cache only the device node path was added to the list of known paths for a device. This results in 'zpool import' always using the sdX names in preference to the 'path' name stored in the label. To fix the issue the blkid import path has been updated to add both the 'path', 'devid', and 'devname' names from the label to the known paths. A sanity check is done to ensure these paths do refer to the same device identified by blkid. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#4523 Closes openzfs#3043
When ZFS partitions a block device it must wait for udev to create both a device node and all the device symlinks. This process takes a variable length of time and depends on factors such how many links must be created, the complexity of the rules, etc. Complicating the situation further it is not uncommon for udev to create and then remove a link multiple times while processing the udev rules. Given the above, the existing scheme of waiting for an expected partition to appear by name isn't 100% reliable. At this point udev may still remove and recreate think link resulting in the kernel modules being unable to open the device. In order to address this the zpool_label_disk_wait() function has been updated to use libudev. Until the registered system device acknowledges that it in fully initialized the function will wait. Once fully initialized all device links are checked and allowed to settle for 50ms. This makes it far more likely that all the device nodes will exist when the kernel modules need to open them. For systems without libudev an alternate zpool_label_disk_wait() was updated to include a settle time. In addition, the kernel modules were updated to include retry logic for this ENOENT case. Due to the improved checks in the utilities it is unlikely this logic will be invoked. However, if the rare event it is needed it will prevent a failure. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Richard Laager <[email protected]> Closes openzfs#4523 Closes openzfs#3708 Closes openzfs#4077 Closes openzfs#4144 Closes openzfs#4214 Closes openzfs#4517
When importing a pool using the blkid cache only the device node path was added to the list of known paths for a device. This results in 'zpool import' always using the sdX names in preference to the 'path' name stored in the label. To fix the issue the blkid import path has been updated to add both the 'path', 'devid', and 'devname' names from the label to the known paths. A sanity check is done to ensure these paths do refer to the same device identified by blkid. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#4523 Closes openzfs#3043
When ZFS partitions a block device it must wait for udev to create both a device node and all the device symlinks. This process takes a variable length of time and depends on factors such how many links must be created, the complexity of the rules, etc. Complicating the situation further it is not uncommon for udev to create and then remove a link multiple times while processing the udev rules. Given the above, the existing scheme of waiting for an expected partition to appear by name isn't 100% reliable. At this point udev may still remove and recreate think link resulting in the kernel modules being unable to open the device. In order to address this the zpool_label_disk_wait() function has been updated to use libudev. Until the registered system device acknowledges that it in fully initialized the function will wait. Once fully initialized all device links are checked and allowed to settle for 50ms. This makes it far more likely that all the device nodes will exist when the kernel modules need to open them. For systems without libudev an alternate zpool_label_disk_wait() was updated to include a settle time. In addition, the kernel modules were updated to include retry logic for this ENOENT case. Due to the improved checks in the utilities it is unlikely this logic will be invoked. However, if the rare event it is needed it will prevent a failure. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Richard Laager <[email protected]> Closes openzfs#4523 Closes openzfs#3708 Closes openzfs#4077 Closes openzfs#4144 Closes openzfs#4214 Closes openzfs#4517
When ZFS partitions a block device it must wait for udev to create both a device node and all the device symlinks. This process takes a variable length of time and depends on factors such how many links must be created, the complexity of the rules, etc. Complicating the situation further it is not uncommon for udev to create and then remove a link multiple times while processing the udev rules. Given the above, the existing scheme of waiting for an expected partition to appear by name isn't 100% reliable. At this point udev may still remove and recreate think link resulting in the kernel modules being unable to open the device. In order to address this the zpool_label_disk_wait() function has been updated to use libudev. Until the registered system device acknowledges that it in fully initialized the function will wait. Once fully initialized all device links are checked and allowed to settle for 50ms. This makes it far more likely that all the device nodes will exist when the kernel modules need to open them. For systems without libudev an alternate zpool_label_disk_wait() was updated to include a settle time. In addition, the kernel modules were updated to include retry logic for this ENOENT case. Due to the improved checks in the utilities it is unlikely this logic will be invoked. However, if the rare event it is needed it will prevent a failure. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Signed-off-by: Richard Laager <[email protected]> Closes openzfs#4523 Closes openzfs#3708 Closes openzfs#4077 Closes openzfs#4144 Closes openzfs#4214 Closes openzfs#4517
Disable the additional EFI debugging in all builds. Some users run debug builds in production and the extra log messages can cause confusion. Beyond that the log messages are rarely useful. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#4523
When importing a pool using the blkid cache only the device node path was added to the list of known paths for a device. This results in 'zpool import' always using the sdX names in preference to the 'path' name stored in the label. To fix the issue the blkid import path has been updated to add both the 'path', 'devid', and 'devname' names from the label to the known paths. A sanity check is done to ensure these paths do refer to the same device identified by blkid. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#4523 Closes openzfs#3043
When partitioning a device a name may be specified for each partion.
Internally zfs doesn't use this partition name for anything so it
has always just been set to "zfs".
However this isn't optimal because udev will create symlinks using
this name in /dev/disk/by-partlabel/. If the name isn't unique
then all the links cannot be created.
Therefore a random 64-bit value has been added to the partition
label, i.e "zfs-1234567890abcdef". Additional information could
be encoded here since partitions may be reused that could result
in confusion and was decided against.
Signed-off-by: Brian Behlendorf [email protected]
Issue #4517