WIP: Forward-port async dmu support to 2.1 #12166
Conversation
Sorry buildbots - this was like throwing screws in the garbage disposal. Until we can get those 3 commits addressed and force-push this entire branch as one commit atop master, I don't think the buildbots will be able to make sense of this (I could turn it into one gross diff off master so it builds, if that's preferred).
Rebase 2.1 rc6 atop fbf26c2 (openzfs#10377), including updates for: 668115f98f1, e330514ad08, ece24c1.
The rebase was executed skipping the following commits to permit testing while requesting assistance from appropriate contributors:
64e0fe1 - ping @amotin for assistance
e439ee8 - ping @behlendorf for assistance
336bb3662b - ping @amotin for assistance
DO NOT MERGE THIS - IT IS A DIFF OF A REBASE WHICH HAS SKIPPED COMMITS; the commits above *MUST* be resolved before this can be applied to a current branch.
Testing: Built into 5.10.41-grsec (with grsec 2.1 ZFS patch applied). Zloop execution for 4h with no crashes. FIO and bonnie++ tests in a VM against a zvol over a loopback file inside a qcow2 atop a zpool (on 2.1 without this) on an NVMe drive. FIO runs atop three ~1 GB/s Ceph pool RBDs in a raidz as an 8k block size ZVOL.
Force-pushed from f44cc93 to dd3f3fc
Well, thankfully the grsec patches are rather unforgiving and caught this little gem:
which the wizard himself suggested may be caused by … because … Going to dig into it some more this evening, but really hoping that maintainers take notice and weigh in, given that I'm probably not the brightest bulb in the shed when it comes to this stuff, and bothering smarter people who otherwise spend their time making the world more secure also doesn't seem proper (ping @behlendorf).
Is it possible for the …
The passing tests are concerning, but then again, we might not be testing "the right thing."
Spender says that the NULL pointer deref here should be caught by upstream kernels as well, so it should not require a grsecurity/PaX-patched kernel for others to reproduce. It happens pretty consistently for me on this branch, but it did take having that 1 GB/s underlying block device before it started happening more or less consistently, so it is probably best to test against NVMe backing stores or fast distributed flash.
@amotin: do you happen to have some time to take a look at this, both the BSD patches it impacts and the DMU prefetch bit commit? I think I have some mismatch with upstream so the commit hash isn't displaying properly in the description, but that's one of the major ones I had to skip while trying to get this working.
@sempervictus Heh, I was assigned to this PR because I have volunteered to help Brian out. I do have quite a bit of knowledge about the dmu and zvol code paths, but the crash above falls within the new code path introduced in this PR, which I have no experience with. If you can reproduce this issue reliably, I suggest you start adding some debugging to try to pinpoint what is happening.
@mmaybee - thanks for the clarification. Roger, wilco.
Actually, @mmaybee, would you be able to take a look at the commits I had to skip for now?
I can't help, but I would really like to see this merged. Hopefully it can get in!
There's a bunch of work that needs to happen for this to merge in - and unfortunately I'm still pretty much completely heads-down, too much so to even parse out the stragglers and accumulated merge issues since then, much less start tackling them in any organized manner. I am, however, open to bounty payouts to devs who do have cycles and can knock this out.
Have you tried contacting @wca, AFAIK the original author of these commits, if he is available?
@scineram: my understanding is that the author of the branch from which I pulled this moved to another company and stopped work on it at that point. That said, if @wca is the actual originator, or if anyone can help fill in the provenance of this code so that we could try to find someone familiar enough to finish it, it would be great to have that data memorialized here.
@sempervictus my hope is that once the async DMU and CoW changes are in, there will be no in-ZFS blocking for I/O under any normal situation, and if userspace uses …
Bounty was mentioned in this thread, so if you are interested in crowd-funding ZFS please see: #13397
Which regressions would merging this introduce?
It shouldn't build at all, will probably eat your data and your pets if it does, and should be considered tire-fire-ware until a competent party has looked into it and marked up where and what needs changing.
I wonder if iXsystems would be willing/able to hire someone to do the work. |
We have no particular plans at this time. Considering how much it complicates the ZFS code, I'm personally not sure I'd like to see it in.
How big are the potential performance wins? Could enough of it be merged to at least undo the zvol performance regression? (#8472 and #11407, among others). |
@amotin - I think it's inevitable: high-performance IO stacks are async these days, with massive parallelism and shallow queues. If we don't adopt something like this (I take your point about complexity; every time I dig into this PR my headache gets worse), we will become the modern version of tape drives - just archival storage (and S3 will be winning that fight anyway because "it's easy" and inherently distributed in its various formats).
We have to work at the speed of modern storage because people pay a lot for that speed, and they won't or can't give up that investment: performance of large operations dictates cost of operations rather directly in the cloud, and on-prem once the costs are fully broken down. The time ZFS spends blocked and waiting on operations is literally money to cloud users, and reason enough for their consumers to demand something faster. If we stick to the synchronous approach, we simply can't support real-world workloads at the performance expectations of their users.
Take iSCSI backed by ZVOLs, for example: you can get passable performance from a new ZVOL today, but after you fill its extents even once, it will degrade horrifically and become functionally unusable for many consumers while it figures out whether it can place data into a hole within the extent of the ZVOL. If that calculation is done async, in the background, then the "front-end" can service other requests - unless the IO issued to the zvol was intentionally synchronous as part of some serialized set of actions.
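The front-end/background split described above can be illustrated with a toy model - plain Python asyncio, not ZFS code; the 50 ms `find_hole` delay is a made-up stand-in for the hole-placement lookup. Twenty writes whose placement lookups run concurrently finish in roughly one lookup's worth of wall time instead of twenty:

```python
import asyncio
import time

async def find_hole(offset):
    # Hypothetical stand-in for the expensive "can this block land in an
    # existing hole?" lookup a filled ZVOL must perform per write.
    await asyncio.sleep(0.05)
    return offset

async def handle_write(offset, completions):
    # The front-end hands the lookup to the event loop and is free to
    # accept further requests while this one is pending.
    slot = await find_hole(offset)
    completions.append(slot)

async def main():
    completions = []
    start = time.monotonic()
    # Issue 20 writes without blocking on each placement lookup.
    await asyncio.gather(*(handle_write(o, completions) for o in range(20)))
    return completions, time.monotonic() - start

completions, elapsed = asyncio.run(main())
# Serially this would take 20 * 0.05 = 1.0s; concurrently it is ~0.05s.
print(f"{len(completions)} writes placed in {elapsed:.2f}s")
```

A synchronous front-end would pay the full serial cost; the async version only blocks a given request on its own lookup, which is the behavior the PR aims for in the DMU.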
Is the solution to port parts of ZFS to another language? Rust has great async support and is already supported in the Linux kernel. I see no reason why FreeBSD and illumos could not also add support. |
After 3 years of inactivity this PR got stale again and grew new conflicts. Considering how complicated it is on top of the already complicated DMU code, I am not very optimistic about its future. One way or another, in its present shape this PR is not very useful, so I am closing it. If somebody decides to give it another spin, it can always be found and resurrected.
Motivation and Context
This effort was being implemented by @mattmacy prior to his departure for another venture earlier this year. The work contained here should breathe new life into ZFS, now that the world has been running on NVMe for several years and NVMe-oF appliances are becoming the standard datacenter block-storage paradigm: shallow-but-wide queue arrangements for parallel IOPS.
This is not a complete forward-port; as noted below, I had to skip 3 commits and am requesting their authors'/contributors' assistance.
This PR should not be merged until these are resolved, as it would introduce 3 regressions into the codebase.
Description
Rebase 2.1 rc6 atop fbf26c2 (#10377), including updates for:
668115f98f1
e330514ad08
ece24c1
The rebase was executed skipping the following commits to permit
testing while requesting assistance from appropriate contributors:
64e0fe1 - ping @amotin for assistance
e439ee8 - ping @behlendorf for assistance
336bb3662b - ping @amotin for assistance
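A rebase that drops specific commits can be driven non-interactively by rewriting their `pick` lines to `drop` in the interactive-rebase todo list. A minimal sketch in a throwaway repository (all names and the `skipme` subject are hypothetical; the real rebase targeted fbf26c2 and the hashes listed above; assumes GNU sed):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
echo a > f; git add f; git commit -qm base
echo b >> f; git commit -aqm keep1
echo c > g; git add g; git commit -qm skipme
echo d > h; git add h; git commit -qm keep2
# Non-interactive "interactive" rebase: git hands the todo file to
# GIT_SEQUENCE_EDITOR, where sed marks the unwanted commit as "drop".
GIT_SEQUENCE_EDITOR='sed -i "/skipme/s/^pick/drop/"' \
    git rebase -i -q HEAD~3
git log --oneline
```

After the rebase, `skipme` (and the file `g` it introduced) is gone while `keep1` and `keep2` are replayed, which is the shape of the skip-and-test workflow described above.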
How Has This Been Tested?
Built into 5.10.41-grsec (with grsec 2.1 ZFS patch applied).
Zloop execution for 4h with no crashes.
FIO and bonnie++ tests in a VM against a zvol over a loopback file inside a qcow2 atop a zpool (on 2.1 without this) on an NVMe drive.
FIO runs atop three ~1 GB/s Ceph pool RBDs in a raidz as an 8k block size ZVOL.
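A fio job approximating the 8k ZVOL workload described above might look like the following - this is an illustrative sketch, not the job file actually used in testing, and the zvol path is a placeholder you would substitute with your own:

```ini
; Hypothetical fio job: random 8k I/O against a ZVOL-backed device.
[global]
bs=8k
iodepth=32
direct=1
ioengine=libaio
runtime=60
time_based=1

[randrw-zvol]
rw=randrw
filename=/dev/zvol/tank/testvol
```

Running this against both a fresh and a fully written ZVOL is one way to observe the degradation-after-fill behavior discussed in the conversation.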
Types of changes
Checklist:
Signed-off-by