vdev_iokit branch #178
@lundman In both, if multiple vdevs are found with a matching guid or pool name, the one with the highest txg number is used. by_path simply checks for a matching BSD name and fails if it can't be found or opened.
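A rough sketch of that selection rule, with illustrative names rather than the actual vdev_iokit_util.cpp symbols:

```cpp
#include <stdint.h>
#include <stddef.h>

/* One entry per discovered disk whose label matched the guid/pool name. */
typedef struct candidate {
	struct candidate	*next;
	uint64_t		label_txg;	/* txg read from the vdev label */
} candidate_t;

/*
 * Prefer the candidate whose label carries the highest txg, i.e. the
 * most recently written configuration.
 */
static candidate_t *
pick_best_candidate(candidate_t *list)
{
	candidate_t *best = NULL;

	for (candidate_t *c = list; c != NULL; c = c->next) {
		if (best == NULL || c->label_txg > best->label_txg)
			best = c;
	}
	return (best);
}
```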
Primary path is as close to Illumos vdev_disk.c as possible. 2 and 3) I haven't tested failures and unexpected disconnects yet. I verified that mirrored and raidZ were functional at one point, but haven't retested with some of the more recent changes. Most of the vdev_iokit.c code follows Illumos vdev_disk.c: I started with a proof of concept and then reworked the file to follow upstream as much as possible. vdev_iokit_util.cpp is more loosely based on bits of vdev_geom.c from FreeBSD. ZoL is similar to upstream vdev_disk.c, but diverges in vdev_disk_open: it instead uses vdev_bdev_open (actually a #define macro, sketched below) and vdev_disk_rrpart to detect and open disks. And yes, it relies on udev by-id or custom udev rules to locate changed disks.
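For reference, the ZoL macro mentioned above has roughly this shape - from memory of the 2014-era vdev_disk.h, not a verbatim copy; the HAVE_* feature guards vary by kernel version:

```cpp
/* Approximate shape of ZoL's vdev_bdev_open macro: */
#ifdef HAVE_OPEN_BDEV_EXCLUSIVE
#define	vdev_bdev_open(path, md, hold)	open_bdev_exclusive(path, md, hold)
#define	vdev_bdev_close(bdev, md)	close_bdev_exclusive(bdev, md)
#else
#define	vdev_bdev_open(path, md, hold)	blkdev_get_by_path(path, md, hold)
#define	vdev_bdev_close(bdev, md)	blkdev_put(bdev, md)
#endif
```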
I'm running this on my retina MacBook Pro and haven't had any issues: 10.9.3, using a GPT partition on the internal SSD as a single-disk pool. For testing I use external USB 2 and USB 3 hard drives and flash drives, as well as VMware Fusion VMs. Please let me know if you have additional questions!
Yes, the code similarities between vdev_iokit, geom, and Solaris are comforting, and it is good you have exercised those paths. I have tested this branch as well, and have not experienced any issues (though I don't have the path-renumbering issue). At some point, I am hoping you will remove much of the commented-out code and work on indenting so we can give it a proper review :) I am of course guilty of such things too...
True. I recently browsed these on github.com and saw that the whitespace etc. is all over the place - I have been working from Xcode mostly. I'd be happy to do some cleanups.
I used to have emacs set to uboot's coding standard, which is quite unusual (spaces, not tabs). Only recently (last week) changed emacs to use tabs again, as ZFS uses tabs. We could consider following ZOL's practice; I think they have a script to check coding style. Dunno if we have to be so strict, but it could be something we should aim toward.
heh all me :)
About testing changed device paths - I may be restating the obvious, but in case it helps: the simplest test is creating two disk images and attaching one after the other, then reversing the order. You can also simulate device renumbering by connecting/reconnecting USB devices in different orders, or by making any change to the disk list between connects - opening or creating a disk image, ramdisk, zvol, etc.
Ahh so simple, and so beautiful. Yes, that should have been obvious :)
Tested mirrored and raidz with the latest vdev-iokit branch. Haven't tested device failure while in use, or path renumbering. I'll look into resolving the missing-vdev issue.
Fixed; it was due to an addition to vdev_get_size.
This is much better, thanks. I also appreciate that you brought it in line with master. This can probably be merged into master soon - do people feel ready for it?
I count seven added #if 0s. Any chance we can clean out the dead code before merging?
I've been beating vdev-iokit head (+ f0a31c6) quite a bit and it seems pretty solid.
@rottegift That sounds like unreleased snapshot holds, most likely from an interrupted zfs send/recv? I use a one-liner that recursively lists every snapshot and checks each for holds.
Yeah, I realised it was unreleased snapshots after I left the comment, so deleted the comment, but probably not fast enough to stop you from seeing an email copy. (The unreleased snapshots go away across an export/import or reboot, hence the confusion.) c.f. #173 P.S.: batching the snapshot names in small groups is faster. (Didn't really examine closely, but zfs holds is unhappy with more than about ten args.)
@rottegift, @lundman, and @ilovezfs (and the whole OpenZFS on OS X community), I just committed some enhancements and cleanups. This addressed a few issues, mostly minor, some fairly important. I believe there are a few areas to review, as well as future enhancement areas.

1) Flush write cache: when called async, we could return ZIO_PIPELINE_STOP and then, when the op completes, call zio_interrupt. Instead, we currently wait for the sync to complete, call zio_interrupt, then return ZIO_PIPELINE_STOP just after. This is probably OK - it should have the same end result, but I'm not sure whether there are any negative implications. It may be as simple as adding a 'cacheflush' taskq to perform the sync and callback (see the sketch below) - I'm open to suggestions.

2) ashift: I haven't experimented with setting ashift recently, but in the past setting it would cause pool import to fail (it sees 'corrupt' vdevs). I'd like to review this and set it correctly. vdev.c assigns ashift according to what is in the pool's configuration, and checks whether child vdevs have their own ashift property. This is working fine with every pool I've tried - including pools from vdev-iokit, master, and other distributions. I also tested zpool create with no ashift specified (uses 9), -o ashift=9, and -o ashift=12, and verified with zdb that all worked as expected. I'm not sure what would happen with a more complex pool layout - for example a zpool created with ashift=12 but with one vdev inadvertently added with the default ashift of 9, and vice versa. However, for the common use cases - pools with default ashift, ashift=9, and ashift=12 - this has been working fine with every pool I've tried. I've imported and used pools created on FreeBSD and on the master branch of OpenZFS on OS X.

3) Simple block devices: as a potential workaround, I know that userland uses vdev_file_ops for both files and block devices, so I'm sure we could find another way to interface with standard block devices if needed.
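To make the idea concrete, a hedged sketch of that 'cacheflush' taskq approach - not the actual vdev-iokit code; cacheflush_tq and the vdev_iokit_sync signature are assumptions:

```cpp
#include <sys/zio.h>
#include <sys/taskq.h>

extern taskq_t *cacheflush_tq;			/* assumed: created at module init */
extern int vdev_iokit_sync(void *, zio_t *);	/* assumed helper signature */

static void
vdev_iokit_flush_task(void *arg)
{
	zio_t *zio = (zio_t *)arg;

	/* Perform the blocking cache flush off the issuing thread. */
	zio->io_error = vdev_iokit_sync(zio->io_vd->vdev_tsd, zio);

	/* Signal completion exactly as an async read/write would. */
	zio_interrupt(zio);
}

/*
 * Then, in vdev_iokit_io_start(), the DKIOCFLUSHWRITECACHE case becomes:
 *
 *	(void) taskq_dispatch(cacheflush_tq, vdev_iokit_flush_task,
 *	    zio, TQ_SLEEP);
 *	return (ZIO_PIPELINE_STOP);
 */
```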
2) ashift: There is indeed a difference here compared with other distributions. It took us quite a while to work out why: we need to use the vdev ashift value on the block number when translating offset requests, whereas upstream always uses 512. We eventually drilled down to the fact that Darwin is actually large-block aware in the lowest layer, while upstream sticks with 512 for block-to-offset translation. Some upstream distributions use a byte offset to avoid this; Darwin still uses a block number. Upstream code always converts to 512-byte blocks (lbtodb), whereas we scale by the device's real block size, and in the iokit code the equivalent call takes a byte offset directly - see the sketch below. I've been out of commission for a few days, but hope to get back into code review and testing this branch.
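Roughly, the three translations look like this (illustrative, not verbatim from any of the trees):

```cpp
/*
 * illumos vdev_disk.c: block numbers are always 512-byte units;
 * lbtodb() just shifts the byte offset right by DEV_BSHIFT (9).
 */
bp->b_lblkno = lbtodb(zio->io_offset);

/*
 * Darwin's buf layer is large-block aware, so on OS X the block
 * number must be scaled by the vdev's real block size instead:
 */
buf_setblkno(bp, zio->io_offset >> vd->vdev_ashift);

/*
 * IOMedia::read()/write() take the byte offset directly, so the
 * iokit path needs no block-number translation at all:
 */
media->read(client, zio->io_offset, buffer, NULL, &completion);
```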
Each call to strategy allocating a new context struct per IO seems worth avoiding if we can.
@rottegift Glad to hear that! @lundman yes, I agree. At least the vdev_iokit_context_t is currently a struct of just a few pointers. The completion struct is defined in IOStorage.h, and IOStorageCompletionAction is also just a pointer to the callback function - roughly:
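From memory of IOStorage.h (check the header on a given SDK for the exact declarations), plus a guess at the context layout - the vdev_iokit_context fields here are hypothetical:

```cpp
#include <IOKit/IOTypes.h>

class IOMemoryDescriptor;

/* IOStorage.h (paraphrased): the completion callback type... */
typedef void (*IOStorageCompletionAction)(void *target, void *parameter,
    IOReturn status, UInt64 actualByteCount);

/* ...and the completion struct: just a target, action, and parameter. */
struct IOStorageCompletion {
	void			*target;
	IOStorageCompletionAction action;
	void			*parameter;
};

/* A guess at vdev_iokit_context_t as described above: a few pointers. */
typedef struct vdev_iokit_context {
	IOMemoryDescriptor	*io_buf_md;	/* wraps zio->io_data */
	IOStorageCompletion	completion;	/* target/action/parameter */
} vdev_iokit_context_t;
```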
But yes, it still is an alloc per IO in this rev. I don't know if you noticed the IOCommandPools I tested in the previous rev, a bit of an experiment: ccde144. Along with the allocation, the bigger issue is IOBufferMemoryDescriptor::withAddress and ::prepare being called from vdev_iokit_strategy (called from vdev_iokit_io_start). At least afterwards it can issue the read() and write() calls async - see the sketch below. The best way to optimize this would be changing vdev_iokit_strategy so that all of its actual work is performed asynchronously, correct? (Perhaps using taskq?) Other async/sync function calls: currently this vdev-iokit branch (partially) issues async reads and writes, with a completion callback. However, it would also be good to issue the flush-cache requests async with a completion callback, since it's async on all upstream repos. Down the line, possibly Unmap too - if/when we have an acceptable upstream ZFS trim patch to merge :) Looking at master, it seems vdev_disk.c has a synchronous vdev_disk_io_start, blocking for the duration of reads and writes. Aside from that there are a few minor issues with vdev_disk which I addressed in this branch. Another example is doAsyncReadWrite in zvolIO.cpp. Would taskqs be an appropriate way to handle these function calls asynchronously (by calling C -> C++ extern/wrapper functions like we're already doing)?
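For context, the shape of the current strategy path as I understand it - names like vdev_iokit_io_intr are hypothetical, and error handling is elided:

```cpp
/*
 * Wrap the zio's buffer and wire it down; this is the synchronous
 * per-IO cost under discussion.
 */
IOMemoryDescriptor *md = IOBufferMemoryDescriptor::withAddress(
    zio->io_data, zio->io_size,
    (zio->io_type == ZIO_TYPE_READ) ? kIODirectionIn : kIODirectionOut);
(void) md->prepare();

/*
 * Hand the IOKit layer a completion so read()/write() can return
 * immediately; the callback eventually ends up in zio_interrupt().
 */
IOStorageCompletion completion;
completion.target    = NULL;
completion.action    = vdev_iokit_io_intr;	/* hypothetical name */
completion.parameter = zio;

if (zio->io_type == ZIO_TYPE_READ)
	media->read(client, zio->io_offset, md, NULL, &completion);
else
	media->write(client, zio->io_offset, md, NULL, &completion);
```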
Yes, I estimated your context struct to be small, so the size itself is not the concern. As for the reads and writes, I was under the impression that calling IOMedia::write was already async. Similarly for the 3rd item: I was under the impression that calling the cache flush could be done async as well. But I am new to iokit, so expect confusion :)
Yes, IOMedia::write is async, however it is called after the IOBufferMemoryDescriptor is allocated and prepared. The io_context and buffer-MD allocation/prepare are already being done in vdev_iokit_strategy. About the cache flush: I noticed that other platforms issue an async call, so I was thinking it would be best to handle this the same way as other platforms - an async flush, with cleanup in the equivalent of vdev_disk_ioctl_done - rather than blocking. Perhaps this would be a good question to post to the OpenZFS mailing list. I have noticed that on a working system (not experiencing obvious issues), spindump shows some pretty long stack traces, where a whole stack is blocked waiting on other calls; this included vdev_iokit_io_start, vdev_iokit_strategy, vdev_iokit_sync, etc. Some off-cpu flame graphs are at https://gist.github.com/evansus/e0e34b60ba6dd993b4be I'm new to flame graphs, though, and I wouldn't be surprised if my dtrace script is poor; I started with a basic script from Brendan Gregg's blog and changed it to profile some vdev-iokit functions. How does that look? Also, from the IOBufferMemoryDescriptor documentation, I was under the impression that a buffer MD does allocate its own memory.
I had hoped to use IOMemoryDescriptor instead, but MD::withAddress hasn't successfully read or written data into the zio->io_data buffer, while bufferMD::withAddress works fine. Edit: never mind, I read the number wrong - 80 milliseconds, not 800.
The flame graphs are neat, but I'm not sure I can help there; it's a bit like black magic. I'm sure you saw my wiki entry on them already, which is pretty much the culmination of my knowledge. Once I started to grep out just the ZFS calls it became more useful though. I didn't check the iokit sources, but the description is:
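Paraphrasing the IOMemoryDescriptor.h doc comment (the exact wording varies by SDK):

```cpp
/*
 * withAddress - Create an IOMemoryDescriptor to describe one virtual
 * range of the kernel task. The memory consists of a single virtual
 * range mapped into the kernel map.
 *
 * static IOMemoryDescriptor *withAddress(void *address,
 *     IOByteCount withLength, IODirection withDirection);
 */
```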
Which implies to me that we create a new descriptor using the given address, but do not allocate more memory. We will need to peek at the sources to know for sure. Anyway, explore anything that catches your fancy. I was hoping you would move the context into the zio, since it seems undesirable to allocate the context in strategy - and worse, to deal with failure of said allocation :)
@evansus: when cache vdevs are missing at import, vdev_iokit and master don't automatically deal with the vdevs appearing subsequently, even if the devices match what zpool status -v expects. In master, if the devices have not been renamed, a zpool online pool dev / zpool clear pool dev seems to work. In vdev-iokit I had to zpool remove the cache vdevs and zpool add them back with their new names. Maybe you could add a fast USB 3 thumb drive to one of your test setups to use as a cache vdev as you carry on with the iokit work - they work well for l2arc in front of spinny disks.
@rottegift Please try the latest commits to the vdev-iokit branch; cache and log devices survive renumbering after the latest changes (also updated from the current master branch). It seems to function on my end with a few tests: I tried both manual import and import from cachefile after opening disk images in different sequences, etc. Pool import is successful in all cases I've tried, with one minor issue to resolve - the stored vdev path can be left stale. The solution will be updating vdev_path whenever it is necessary to import by physpath or guid, which I'm looking into.
@rottegift Ah, I re-read your comment and see what you mean. I just tested importing a pool with missing cache devices, then attached the disk image at a new device node - and I ran into the scenario you described. Renumbered disks that are all available at import time should be fine, though. @ilovezfs this does address some of the issues we discussed recently, but broke …
Interesting; a RAM issue is unlikely in that case. On another note, have you tested with cpus=1 in boot.plist (or just set in nvram)?
"have you tested with cpus=1" No, not at all. If the panics and/or hangs recur, I will. |
In looking around, I believe that some builds I did were done with the wrong clang (MacPorts 3.4 built from source rather than Xcode's clang). I've now quintuple-checked everything on that front, and will hope for no recurrence.
Hmmm, nope. Another one (two scrubs were in progress). I'll try the cpus=1 boot arg now.
Biting the bullet and doing extended testing in Apple Hardware Test. memtest all 2 was OK in single-user mode.
"Test results: No trouble found" "Total Time Testing: 1 hour 9 mins 42 secs" So I think we can rule out hardware issues. |
@rottegift Is this limited to vdev_iokit?
@ilovezfs: no. I'm now trying with cpus=1 as suggested above. It's very slow. What will come first, panic or death by old age? :-) Assuming this does crash, the next step is to build a totally pristine new boot drive with nothing but zfs on it (ideally the latest ftp build) and try to replicate the conditions in which crashes happen (mostly lots of small writes onto slow media, especially into zvols).
Hm, I don't suppose you already know some bare-minimum set of LaunchDaemons needed to make zfs work (pool import/export, dataset mount/unmount/etc.) in single-user mode?
I'd just do sudo make install. I wonder if this is zed/zpool.cache related (line 70 in e34a376).
Maybe try building with zfs_autoimport_disable = 1, and do not install the plists.
I'll try zfs_autoimport_disable = 1 (and master) after running for a bit with cpus=1. The same vdev_iokit source code is on another mac (I'm typing on it now) running 10.8 with a vastly different load pattern, and it has had significantly longer uptimes in recent days; autoimport is working just fine on this machine. On the crashier machine (the subject of the hardware tests above) vdev-iokit happily autoimports and runs for hours. So I'm not sure it will make much difference, but I'll try it just to eliminate your lines of thought.
Well, cpus=1 is nice for showing off how good Mac OS X (and zfs) are at multithreading and multiprocessing. sysctl vm.loadavg says it all: those sha256 checksums sure keep a single processor busy.
@evansus I think the only thing cpus=1 is telling me is that even a light load on this machine cannot be handled by a one-cpu system. It has only: imported ssdpool (which is doing a scrub, so lots of sha256), imported Trinity (ditto), fired up the UI and Safari, Terminal, Activity Monitor, Console and a couple of others, and has been waiting about ten minutes for "sudo -s" to even display a Password: prompt. I'm going to attempt a graceful shutdown when it gives me the sudo shell, since ssh-ing in is hopeless and there is nothing giving any useful ability to inspect the workings of the system.
Interesting - looks like the scrub load is what's overwhelming the single cpu. I wonder if that could be reproduced by creating a small test pool and running zpool scrub on it, first without cpus=1, then with cpus=1 if replicated (and of course master vs vdev-iokit). Otherwise, yes, scrubbing the large pools with cpus=1 would take an eternity to complete :) I still haven't replicated that panic, but incidentally I don't think I've been using checksum=sha256 on test pools; typically I use something close to a default zpool create, without specifying the checksum. But I take it there haven't been any panics while running single-cpu, which might indicate a synchronization issue. It's tricky, since replicating one of the specific panics while using cpus=1 would confirm it is not a multi-threading issue.
Yes, I understand the idea of eliminating a whole range of synchronization and contention problems by going with one cpu, but the system hasn't even managed to mount all three of the pools it normally does, let alone reach a point at which I can do one of the tasks I think are most likely to correlate with panics. Yes, checksum=sha256 on practically everything, and compression=lz4 on everything; a typical create for me sets both, and I even do that sort of thing on non-rpool pools on illumos, for example.
Actually, I take that back: the laptop's and mini's pools are checksum=sha256; only the test pools created in a VM had the default checksum. Also, I did experience a panic when attempting a scrub on my laptop, but it is a unique scenario - that issue is fixable, but unrelated. I need to test this on my Mac Mini after updating it to ToT vdev-iokit, and in VMs with my updated hot-spares script.
@ilovezfs : " zfs_autoimport_disable = 1 " I'm going to try your latest commits in master for a bit without this, and then with it. I'll raise any problems in another issue, and leave this one only for vdev_iokit for now (unless you prefer it to all be here until we figure out what's causing these panics and hangs). |
A couple more while dealing with the aftermath of the comments attached at the end of 95ff805. These two were in vdev-iokit as I was replacing devices that had gotten stomped with bad labels. The first one was perhaps triggered by a zfs send sourced from one of the DEGRADED pools.
I got a panic in master earlier, but unfortunately it did not leave a crash dump. :-(
@lundman Re: #201 (comment) - yep, about a week ago I updated find_by_path to validate vdev GUIDs unless creating or splitting the pool. If the validation fails there, then find_by_guid can kick in. I also had to update the logic for cache and spare devices, since they are labeled differently. See ed1463a, 4857136, 0c6e5cc, 69dcad8 and 170ee29. That resolved the remaining degraded or faulted imports due to renumbered disks, since the devices are checked while there's still a chance to search for them (see the sketch below). I've tested with renumbered data, log, cache, and spare vdevs and haven't had issues recently. @rottegift has reported some panics that I haven't been able to replicate or completely decipher, but his pools, with several vdevs of each type per pool, are great test cases. The only two issues I have run into had to do with …
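A hypothetical sketch of that flow - the function names only approximate the vdev_iokit_util.cpp helpers mentioned in this thread:

```cpp
/*
 * Locate the on-disk device for a vdev: by path first, falling back
 * to a guid search across all disks. Names are illustrative.
 */
static int
vdev_iokit_locate(vdev_iokit_t *dvd, vdev_t *vd, boolean_t create_or_split)
{
	if (vdev_iokit_find_by_path(dvd, vd->vdev_path) == 0) {
		/*
		 * Unless creating or splitting, insist the label's guid
		 * matches, so a renumbered disk that landed at the old
		 * path is rejected here rather than opened blindly.
		 */
		if (create_or_split || vdev_iokit_validate_guid(dvd, vd) == 0)
			return (0);
	}

	/* Path lookup failed or guid mismatched: search all disks by guid. */
	return (vdev_iokit_find_by_guid(dvd, vd->vdev_guid));
}
```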
I got a different panic in vdev-iokit with zfs_vdev_async_write_active_min_dirty_percent -> 5 and zfs_dirty_data_max_percent -> 5 (see #201 (comment)). Qualitatively it took longer to crash, but not by a huge amount, and it happened during the multi-zfs-send task mentioned above that comment, with afaict little or no zvol load.
On reboot, importing is stuck on a zvol; a spindump is at https://gist.github.com/rottegift/9611fa933259ec45a7cd
Sadly, another one of these.
Got home to this. I don't even know what could have done it. It was happily scrubbing a pool and doing not much else while I was out.
And another. Unfortunately no time yet to pop onto IRC.
@evansus: import badness with 1f77701 (and with spl @ issue201), with #define DEBUG 1 in zfs/cmd/zfs_util/zfs_util.c.
vdev-iokit and zfs.util should probably not compete at startup.
This is all from 2014, so closing it for now.
Thanks for trying out the iokit replacement. It does indeed appear to function well, and receives almost identical benchmark scores to master.
https://github.com/openzfsonosx/zfs/blob/vdev-iokit/module/zfs/vdev_iokit.c#L88
Better watch out with those comments - you just took out the clearing on line 93. But since it is the free function, it doesn't matter ;)
https://github.com/openzfsonosx/zfs/blob/vdev-iokit/module/zfs/vdev_iokit.c#L340
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/vdev_disk.c#L416
Is the plan to extend it to attempt open_by_guid as well? I see you implemented something in vdev_iokit_util.cpp; is it part of your roadmap to add that ability? If it can handle the device /dev/diskX moving, it would make this quite attractive.
:)