
Issue 808 #827

Closed
wants to merge 2 commits

Conversation

behlendorf
Contributor

This is a reworked version of the dechamps/zfs@2f30140 patch to move partition rescanning from userspace to the kernel. I had two primary concerns with the original patch.

  1. It used symbols which are not available to us on all kernels, and on older kernels no legacy APIs were available as a fallback.

  2. Even if those symbols were available, they are not quite what we need. Mainly using blkdev_get() to get the device is not 100% identical to vdev_bdev_open(). The former function does not properly claim the block device with a holder while the latter does. The only available API to do this requires that the device be opened by path.

This reworked (largely untested!) patch attempts to work around the above issues by using more common symbols. To do this we unlink the original device node, trigger the BLKRRPART ioctl, and then block waiting for the expected device to appear. Unfortunately, even this approach relies on the get_gendisk() symbol, which exists in RHEL6 but is not exported. So after implementing this I still don't much care for it.

Upon further reflection I believe the only safe and portable thing to do here is call back into user space to rescan the devices. This should allow us to easily rescan the partition tables, trigger udev, and then block waiting for the required links to be created. The vdev_elevator_switch() function provides an example of how this can be done.

We can use either the partprobe or hdparm -z utilities to force the rescan and then rely on the udevadm utility to block until udev settles and the devices are available. Something like:

partprobe [device]; udevadm trigger; udevadm settle;
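For illustration, a minimal sketch of what such a usermode helper call might look like from the module, patterned after vdev_elevator_switch(); the function name, argv contents, and error handling are assumptions, not part of any existing patch:

```c
#include <linux/kernel.h>
#include <linux/kmod.h>

/*
 * Sketch: rescan the partition table for `path` via a usermode helper,
 * then block until udev settles and the expected links exist.
 */
static int
vdev_disk_rescan(const char *path)
{
	char cmd[256];
	char *argv[] = { "/bin/sh", "-c", cmd, NULL };
	char *envp[] = { "HOME=/",
	    "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

	snprintf(cmd, sizeof (cmd),
	    "partprobe %s; udevadm trigger; udevadm settle", path);

	/* UMH_WAIT_PROC: block until the helper process exits. */
	return (call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC));
}
```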

Comments? Suggestions?

behlendorf and others added 2 commits July 12, 2012 12:22
Previously, the zfs.release file was created at 'make install' time.
This is slightly problematic when the file is needed without running
'make install'. Because of this, the step creating the file was removed
from 'make install' and replaced with a more appropriate zfs.release.in
file.

As a result, the zfs.release file will now be created earlier as part
of the 'configure' step as opposed to the 'make install' step.

Signed-off-by: Prakash Surya <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.

This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.

Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.
@behlendorf
Contributor Author

@dechamps Can you comment on the proposed user-mode helper approach described above? If you agree it's the only safe thing to do, can you propose a patch? As for this particular patch, I just wanted to push it so you could have a look, even though, as I say, I'm not too fond of it.

@dechamps
Contributor

Mainly using blkdev_get() to get the device is not 100% identical to vdev_bdev_open(). The former function does not properly claim the block device with a holder while the latter does. The only available API to do this requires that the device be opened by path.

I believe it does, actually: blkdev_get_by_path just calls lookup_bdev then blkdev_get. lookup_bdev locks the device by calling bd_acquire which calls bdget which calls iget5_locked. My patch is just calling bdget directly then blkdev_get, so the behavior is pretty much the same. In fact, earlier versions of my patch actually had bugs like the one you're describing, but they were obvious to spot (the kernel complains very loudly when the reference count is off), so I feel confident that the final version is behaving correctly.
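For reference, the call chain in question, paraphrased from the 2.6.38-era fs/block_dev.c (a sketch from memory, not verbatim kernel source):

```c
/* Paraphrase of blkdev_get_by_path(), circa 2.6.38; not verbatim. */
struct block_device *
blkdev_get_by_path(const char *path, fmode_t mode, void *holder)
{
	struct block_device *bdev;
	int err;

	bdev = lookup_bdev(path);	/* bd_acquire -> bdget */
	if (IS_ERR(bdev))
		return (bdev);

	err = blkdev_get(bdev, mode, holder);	/* claims with holder */
	if (err)
		return (ERR_PTR(err));

	return (bdev);
}
```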

Of course, your first argument (unavailable symbols) still holds.

Upon further reflection I believe the only safe and portable thing to do here is call back into user space to rescan the devices. This should allow us to easily rescan the partition tables, trigger udev, and then block waiting for the required links to be created. The vdev_elevator_switch() function provides an example of how this can be done.

I'll take a closer look at your propositions and your patch on Monday. At first glance however, I find calling userspace tools in such a critical code path quite scary: remember that during all this time, the vdev we're expanding is closed, locked and unusable. Now imagine for a second that the userspace tools you're calling need to read from the vdev you're expanding (which could very well happen if we're running ZFS on root), and you have a nice deadlock on your hands. Even without actually calling userspace, just waiting for udev (which is a userspace tool) to settle would probably result in the same issue.

Note that this is an issue for vdev_elevator_switch too, although in this case the deadlock is less likely to happen because it's just calling /bin/sh, which is very likely to be in cache. Also, in the case of vdev_elevator_switch, the fix is easy: just run the usermode helper asynchronously. But I digress.
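(A sketch of that asynchronous variant, reusing the argv/envp shape from the earlier sketch; UMH_NO_WAIT means the caller never blocks on userspace:)

```c
/*
 * Fire-and-forget: don't wait for the helper, so the caller can never
 * deadlock behind userspace. The exit status is lost, which is
 * acceptable for a best-effort elevator switch.
 */
(void) call_usermodehelper(argv[0], argv, envp, UMH_NO_WAIT);
```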

Also, I have to ask: what does fb7eb3e have to do with all this?

@behlendorf
Contributor Author

I believe it does, actually: blkdev_get_by_path just calls lookup_bdev then blkdev_get.

Let me revisit this, perhaps I was too hasty in my initial assessment.

At first glance however, I find calling userspace tools in such a critical code path quite scary:

Yes, I'm not thrilled about it either. I'd much prefer to do this entirely in the kernel. But that all hinges on us finding a reliable and portable way to get the block device. I suppose we could provide patches for the kernels which don't expose the right API but I really want to avoid that if at all possible.

Also, I have to ask: what does fb7eb3e have to do with all this?

Absolutely nothing. It's an unrelated change which was ready to be merged; it relates to getting proper DKMS support for RPMs in place. Hopefully, the rest of that change will get finalized in the next week or two. See:

openzfs/spl#136
#830

@dechamps
Contributor

Yes, I'm not thrilled about it either. I'd much prefer to do this entirely in the kernel. But that all hinges on us finding a reliable and portable way to get the block device. I suppose we could provide patches for the kernels which don't expose the right API but I really want to avoid that if at all possible.

I would agree with you if this was a critical feature, but it's really not. Most users will never use dynamic vdev expansion, and some would be happy with offline vdev expansion (I mean zpool online -e; zpool export; partprobe; zpool import which, if I'm not mistaken, should work with the patches you've already merged). I mean, we're talking about a subset of users who need dynamic vdev expansion, and can't make do with offline expansion, and run old kernels, and are unable to upgrade their kernels. I would bet it's a very small subset, and it's shrinking over time. So maybe we could simply disable the feature on kernels which don't expose the necessary interfaces, and return a meaningful error message with a suggestion to upgrade the kernel.

I must insist on the potential deadlock I mentioned; in fact, there is a very real, very common scenario where the deadlock has a high probability of happening:

  • Consider a VM using ZoL on root with a single virtual disk.
  • The admin notices the VM is running out of disk space, so they expand the virtual disk (most virtualization software, like ESXi, can do this).
  • The admin then issues zpool online -e.
  • ZFS closes the vdev, then tries to run some userspace tools to rescan the partition table.
  • The userspace tools are not entirely in cache (this is likely when talking about partprobe or udev rule files), so they try to load from the filesystem, i.e. ZFS.
  • Deadlock. Admin cries.

In my opinion, when faced with a partial but safe solution and a complete but dangerous solution, we should definitely pick the former.

@behlendorf
Contributor Author

You make a good point. In this particular case it may not be worth the trouble of supporting older kernels if we're forced to do it in a fragile way. However, at a minimum we do need to cleanly handle the compatibility issues at build time and generate a useful error message to users.

The big sticking points for compatibility appear to be:

  • torvalds/linux@e525fd8 - 2.6.37 API change which added the holder to blkdev_get(). Incidentally, this is where I got the idea that the holder was not being set; I was reading the RHEL6 source (yes, I should know better).
  • torvalds/linux@b6ac23a - 2.6.34 API change which exports the get_gendisk() symbol.

That means it looks like at a minimum a 2.6.37 kernel is required. That's a little bit unfortunate since I'd have liked to see this support available in an enterprise distribution like RHEL6. I guess I'll have to harass them and see if they are willing to backport those upstream commits.

I'll refresh this pull request today with your original patch, some minor cleanup, a few fixes, and autoconf checks for the above cases. If you can review and verify those changes that would be helpful!

@behlendorf
Contributor Author

@dechamps Alright, I've refreshed the patch for your review: behlendorf/zfs@ebfff83. Changes include:

  • blkdev_get() and get_gendisk() autoconf checks. In case either function is not available, support for online expansion is disabled (see the sketch after this list). At the moment there will be no notification via the tools that online expansion isn't supported; we should probably add something.
  • Squashed your two patches together
  • Minor formatting fixes (80 column, etc)
  • Added proper usage of IS_ERR/ERR_PTR/PTR_ERR functions which is required because vdev_bdev_open() will return errors this way.
  • Verified build on a sampling of 2.6.26 - 3.4 kernels.
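Roughly what the guarded call site might look like; the autoconf macros and helper names below are illustrative guesses, not necessarily those used in behlendorf/zfs@ebfff83:

```c
/* Names here (HAVE_* guards, vdev_disk_rescan) are hypothetical. */
static struct block_device *
vdev_disk_open_expanding(const char *path, fmode_t mode, void *holder)
{
#if defined(HAVE_3ARG_BLKDEV_GET) && defined(HAVE_GET_GENDISK)
	/* Rescan the partition table, then reopen the partition. */
	(void) vdev_disk_rescan(path);

	/*
	 * vdev_bdev_open() reports failure via ERR_PTR(); callers must
	 * check with IS_ERR() and unwrap the errno with PTR_ERR().
	 */
	return (vdev_bdev_open(path, mode, holder));
#else
	/* Required kernel interfaces missing: expansion unsupported. */
	return (ERR_PTR(-ENOTSUP));
#endif
}
```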

@dechamps
Contributor

Alright, I've refreshed the patch for your review: behlendorf/zfs@ebfff83

I tested this patch and confirm that the feature works. I'm OK with the changes you made. You can go ahead and merge it.

behlendorf closed this Jul 17, 2012
@behlendorf
Contributor Author

Merged into master as b5a2880

pcd1193182 pushed a commit to pcd1193182/zfs that referenced this pull request Sep 26, 2023
In order to aggregate contiguous i/os together, we dispatch all i/os
within the same `DISK_AGG_CHUNK` (default 1MB) of the disk to one
thread.  This thread is responsible for aggregating and issuing i/os for
this chunk.  However, the thread can only perform one `pread()` syscall
at a time.  So, if we have several i/os queued to one `reader_thread()`,
and they are not actually aggregated together, the queued i/os are
delayed, waiting for each previous i/o to be completed.  Performance
could be increased if we issued all the queued i/os in parallel.

Given the current design of the `reader_thread`s, the simplest
improvement is to adjust the compromise between ability to aggregate,
and ability to issue non-aggregatable i/os in parallel.  This commit
decreases the default `DISK_AGG_CHUNK` from 1MB to 128KB.  Note that our
goal is to create i/os of at least 16KB, so that we hit the throughput
limit of a typical disk, rather than the IOPS limit.  128KB is more than
adequate for this goal.
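A sketch of the dispatch rule the commit message describes; the function name and the exact mapping are assumptions based on that description:

```c
#include <stdint.h>

/* Default aggregation chunk, reduced from 1MB to 128KB. */
#define DISK_AGG_CHUNK	(128 * 1024)

/*
 * All i/os whose offsets fall inside the same DISK_AGG_CHUNK-sized
 * window go to the same reader thread so they can be merged into one
 * pread(); i/os in different chunks land on different threads and can
 * therefore be issued in parallel.
 */
static unsigned int
reader_thread_for(uint64_t offset, unsigned int nthreads)
{
	return ((unsigned int)((offset / DISK_AGG_CHUNK) % nthreads));
}
```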