Issue 808 #827
Conversation
Previously, the zfs.release file was created at 'make install' time. This is slightly problematic when the file is needed without running 'make install'. Because of this, the step creating the file was removed from 'make install' and replaced with a more appropriate zfs.release.in file. As a result, the zfs.release file will now be created earlier, as part of the 'configure' step rather than the 'make install' step.

Signed-off-by: Prakash Surya <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Currently, zpool online -e (dynamic vdev expansion) doesn't work on whole disks because we're invoking ioctl(BLKRRPART) from userspace while ZFS still has a partition open on the disk, which results in EBUSY. This patch moves the BLKRRPART invocation from the zpool utility to the module. Specifically, this is done just before opening the device in vdev_disk_open() which is called inside vdev_reopen(). This requires jumping through some hoops to get to the disk device from the partition device, and to make sure we can still open the partition after the BLKRRPART call. Note that this new code path is triggered on dynamic vdev expansion only; other actions, like creating a new pool, are unchanged and still call BLKRRPART from userspace.
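For illustration only, here is a minimal sketch of the in-kernel rescan idea described above. It is not the actual patch: it assumes `ioctl_by_bdev()` and `bdev->bd_contains` are available on the target kernel, and it skips the re-open handling the commit message mentions.

```c
#include <linux/fs.h>
#include <linux/blkdev.h>

/*
 * Illustrative sketch only (not the actual patch): re-read the partition
 * table of the whole disk from inside the kernel, reaching the parent
 * disk through the partition's block_device.
 */
static int
rescan_partitions_sketch(struct block_device *part_bdev)
{
	/* For a partition, bd_contains points at the whole-disk device. */
	struct block_device *disk_bdev = part_bdev->bd_contains;

	/* In-kernel equivalent of the userspace ioctl(fd, BLKRRPART). */
	return ioctl_by_bdev(disk_bdev, BLKRRPART, 0);
}
```

The real change also has to cope with the partition device going away and coming back across the rescan, which is the "jumping through some hoops" the commit message refers to.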
@dechamps Can you comment on the proposed user mode helper approach described above? If you agree it's the only safe thing to do, can you propose a patch? As for this particular patch, I just wanted to push it so you could have a look, even though, as I say, I'm not too fond of it.
I believe it does, actually: Of course, your first argument (unavailable symbols) still holds.
I'll take a closer look at your propositions and your patch on Monday. At first glance, however, I find calling userspace tools in such a critical code path quite scary: remember that during all this time, the vdev we're expanding is closed, locked and unusable. Now imagine for a second that the userspace tools you're calling need to read from the vdev you're expanding (which could very well happen if we're running ZFS on root), and you have a nice deadlock on your hands. Even without actually calling userspace, just waiting for udev (which is a userspace tool) to settle would probably result in the same issue. Note that this is an issue for

Also, I have to ask: what does fb7eb3e have to do with all this?
Let me revisit this, perhaps I was too hasty in my initial assessment.
Yes, I'm not thrilled about it either. I'd much prefer to do this entirely in the kernel. But that all hinges on us finding a reliable and portable way to get the block device. I suppose we could provide patches for the kernels which don't expose the right API but I really want to avoid that if at all possible.
Absolutely nothing. It's an unrelated change which was ready to be merged; it relates to getting proper DKMS support for RPMs in place. Hopefully, the rest of that change will get finalized in the next week or two. See:
I would agree with you if this were a critical feature, but it's really not. Most users will never use dynamic vdev expansion, and some would be happy with offline vdev expansion.

I must insist on the potential deadlock I mentioned; in fact, there is a very real, very common scenario where the deadlock has a high probability of happening:
In my opinion, when faced with a partial but safe solution and a complete but dangerous solution, we should definitely pick the former.
You make a good point. In this particular case it may not be worth the trouble of supporting older kernels if we're forced to do it in a fragile way. However, at a minimum we do need to cleanly handle the compatibility issues at build time and generate a useful error message to users. The big sticking points for compatibility appear to be:
That means it looks like at a minimum a 2.6.37 kernel is required. That's a little unfortunate, since I'd have liked to see this support available in an enterprise distribution like RHEL6. I guess I'll have to harass them and see if they are willing to backport those upstream commits. I'll refresh this pull request today with your original patch, some minor cleanup, a few fixes, and autoconf checks for the above cases. If you can review and verify those changes, that would be helpful!
@dechamps Alright, I've refreshed the patch for your review (behlendorf/zfs@ebfff83); changes include:
I tested this patch and confirm that the feature works. I'm OK with the changes you made. You can go ahead and merge it.
Merged into master as b5a2880.
In order to aggregate contiguous i/os together, we dispatch all i/os within the same `DISK_AGG_CHUNK` (default 1MB) of the disk to one thread. This thread is responsible for aggregating and issuing i/os for this chunk. However, the thread can only perform one `pread()` syscall at a time. So, if we have several i/os queued to one `reader_thread()`, and they are not actually aggregated together, the queued i/os are delayed, waiting for each previous i/o to be completed. Performance could be increased if we issued all the queued i/os in parallel. Given the current design of the `reader_thread`s, the simplest improvement is to adjust the compromise between ability to aggregate, and ability to issue non-aggregatable i/os in parallel. This commit decreases the default `DISK_AGG_CHUNK` from 1MB to 128KB. Note that our goal is to create i/os of at least 16KB, so that we hit the throughput limit of a typical disk, rather than the IOPS limit. 128KB is more than adequate for this goal.
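As a hypothetical sketch of the dispatch policy described above: each i/o is routed to the reader thread that owns its chunk of the disk. `DISK_AGG_CHUNK` follows the commit message; the thread count and the mapping function are illustrative, not taken from the actual code.

```c
#include <stdint.h>

#define DISK_AGG_CHUNK		(128 * 1024)	/* was 1MB, now 128KB */
#define NUM_READER_THREADS	16		/* illustrative value */

/*
 * All offsets within the same DISK_AGG_CHUNK-aligned region map to the
 * same reader thread, so i/os in that region can be aggregated into a
 * single pread() by that thread.
 */
static inline unsigned int
reader_thread_for_offset(uint64_t offset)
{
	return ((offset / DISK_AGG_CHUNK) % NUM_READER_THREADS);
}
```

With a smaller chunk, nearby but non-aggregatable i/os are more likely to land on different threads and issue in parallel, while i/os close enough to aggregate still share a thread.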
This is a reworked version of the dechamps/zfs@2f30140 patch to move partition rescanning from userspace to the kernel. I had two primary concerns with the original patch:
1. It used symbols which were not available to us on all kernels, and the older legacy APIs were not available either.
2. Even if those symbols were available, they are not quite what we need. Mainly, using blkdev_get() to get the device is not 100% identical to vdev_bdev_open(): the former does not properly claim the block device with a holder while the latter does. The only available API to do this requires that the device be opened by path.
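For reference, a minimal sketch of the kind of by-path open being referred to, assuming `blkdev_get_by_path()` is available on the target kernel (older kernels exposed `open_bdev_exclusive()` instead); this is illustrative, not the actual ZFS macro:

```c
#include <linux/fs.h>
#include <linux/blkdev.h>

/*
 * Illustrative only: open a block device by path so that it is
 * exclusively claimed with a holder, similar in spirit to
 * vdev_bdev_open().  A bare reference obtained without a by-path open
 * does not carry this exclusive claim.
 */
static struct block_device *
open_and_claim(const char *path, void *holder)
{
	return blkdev_get_by_path(path,
	    FMODE_READ | FMODE_WRITE | FMODE_EXCL, holder);
}
```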
This reworked (largely untested!) patch attempts to work around the above issues by using more common symbols. To do this we unlink the original device node, trigger the BLKRRPART, and then block waiting for the expected device to appear. Unfortunately, even this approach relies on the get_gendisk() symbol which exists in RHEL6 but is not exported. So after implementing this I still don't much care for it.
Upon further reflection I believe the only safe and portable thing to do here is to call back into user space to rescan the devices. This should allow us to easily rescan the partition tables, trigger udev, and then block waiting for the required links to be created. The vdev_elevator_switch() function provides an example of how this can be done.
We can use either the `partprobe` or `hdparm -z` utilities to force the rescan and then rely on the `udevadm` utility to block until udev settles and the devices are available. Something like:
Comments? Suggestions?