Linux 4.14, 4.19, 5.0+ compat: SIMD save/restore #9406
Conversation
cursory glance looks okay, I'll take a closer look and test early next week!
Updated to address @Fabian-Gruenbichler's feedback. Thanks!
as discussed in #9346, this PR is apparently still incomplete/broken on some systems.. I'll try to dig in today - please don't merge it yet ;)
@Fabian-Gruenbichler any assistance would be appreciated, and nothing will be merged until we get this sorted out. Unfortunately, I still haven't been able to reproduce the issue in my testing. I saw in #9346 you were able to reproduce the issue though, would you mind sharing your reproduction steps?
SETUP
Debian Buster, bare metal:
the following FPU information is reported on startup:
RAIDZ-1 with 4 SSDs, default 0.8.2 settings
the test consists of the following operations in parallel:
the scrub gets repeated from time to time once it has finished. the whole procedure is left running until mprime reports errors, or until a few test rounds have passed successfully. I tested the following kernels:
each individual kernel+code combination got tested with the following sequence:
I tested this PR, current master, and current master with the following diff applied:
RESULTS
works - no ill-effects noticed
sometimes the errors get reported within seconds, sometimes it takes a few minutes. I am very happy about the KABI check parallelization btw, without it this would have taken ages :-P

PR
=> very consistently broken

master
=> breakage switches from no save/restore < 5.2 to broken save/restore >= 5.2 ?

patched master
=> save/restore works on < 5.2 (with backported _err restore), breaks on >= 5.2 ?

5.2 introduced the "defer FPU state load until return to userspace" change, which seems like a likely culprit for why (patched) master breaks then. but OTOH, this would mean that the code in master never worked for any kernel (< 5.2 was broken since it did no copy/restore whatsoever, and >= 5.2 did a broken copy/restore)? I don't understand yet why the PR is broken across the board - it does make sense that it is broken on >= 5.2 since it does essentially the same as master (I started some tests yesterday, but those were all on 5.3, which I only realized late in the testing was broken for master as well, so most of that testing went in the wrong direction :-/)
okay, so I dug a little deeper and AFAIU the situation is as follows:

kernel <= 5.1

kernel >= 5.2

it seems likely that we trigger a situation where we modified the FPU state, but the kernel thinks it is still valid, and thus skips the restore of the state that the kernel saved automatically on context-switch. we can't do the invalidation that the kernel does though, since we don't have access to the needed helpers. I'm a bit stumped on how to proceed - but I'll try to think some more tonight/tomorrow.
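For context, the kernel-side validity check being referred to looks roughly like the following; this is paraphrased from memory of the 5.2-era arch/x86/include/asm/fpu/internal.h, so treat it as a sketch rather than verbatim kernel source:

/*
 * Paraphrased sketch, not verbatim: on context switch / return to
 * userspace the FPU registers are only reloaded when this per-cpu
 * ownership tracking says the hardware no longer holds the incoming
 * task's state.
 */
static inline int fpregs_state_valid(struct fpu *fpu, unsigned int cpu)
{
	return (fpu == this_cpu_read(fpu_fpregs_owner_ctx) &&
	    cpu == fpu->last_cpu);
}

If that check passes the restore is skipped, which is exactly the failure mode described above: out-of-tree code has clobbered the registers but has no way to update this tracking.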
force-pushed from b7be616 to 11170d9
@Fabian-Gruenbichler thanks for digging into this. I ended up coming to basically the same conclusion: we're somehow leaving the kernel FPU in an inconsistent state despite the save/restore to the task struct. Given that, I decided to try a slightly different approach. The updated PR now uses a per-cpu variable to save/restore the FPU state to ensure nothing can modify it. As before, the basic idea is to put everything back as we found it, so what the kernel thinks is valid is still valid and there's no need to invalidate the register state. This is still a WIP and needs some more work/cleanup, but my early results are encouraging.
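To make the approach concrete, here is a minimal sketch of the begin/end bracketing described above; zfs_kfpu_fpregs, kfpu_save_state() and kfpu_restore_state() are illustrative placeholder names, not necessarily the identifiers used by the actual patch.

#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/smp.h>

struct fpu;				/* opaque here; defined by the kernel */
extern struct fpu **zfs_kfpu_fpregs;	/* one private save area per possible CPU */
void kfpu_save_state(struct fpu *fpu);	/* hypothetical save/restore helpers */
void kfpu_restore_state(struct fpu *fpu);

static inline void
kfpu_begin(void)
{
	/*
	 * Stay on this CPU and keep interrupt handlers out so nothing
	 * else can touch the FPU or this CPU's save area while we use it.
	 */
	preempt_disable();
	local_irq_disable();

	/* Save the current hardware FPU state into our private buffer. */
	kfpu_save_state(zfs_kfpu_fpregs[smp_processor_id()]);
}

static inline void
kfpu_end(void)
{
	/*
	 * Put back exactly what we found, so whatever FPU state the
	 * kernel believes is loaded really is still loaded and no
	 * invalidation of its tracking is needed.
	 */
	kfpu_restore_state(zfs_kfpu_fpregs[smp_processor_id()]);

	local_irq_enable();
	preempt_enable();
}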
giving it a spin atm - the approach looks sound, initial testing as well! :) should be more robust since we just need to watch for new save/restore instructions appearing - that is usually known quite a while before they show up in actual hardware, so we should not be in for surprises hopefully ;)
been running it for ~2h without any issues cropping up - looking good ;)
some additional questions/suggestions, mostly stylistic.
for_each_possible_cpu(cpu) {
	zfs_kfpu_fpregs[cpu] = kmalloc_node(sizeof (struct fpu),
	    GFP_KERNEL, cpu_to_node(cpu));
is it guaranteed that all variants of saving overwrite their respective fpregs_state value completely, so that restoring does not leak (uninitialized) memory? or do we need to initialize this here and re-initialize it after each restore (or alternatively, before each save?)
Good thought. I've updated the patch to zero this memory when initially allocated. As for the need to re-initialize after a restore: since we're using dedicated memory for this, only available to zfs, I don't believe we need to re-initialize it after each restore.
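For illustration, one way to get the zero-on-allocation behaviour (a sketch, not necessarily the exact form the updated patch uses) is to switch the allocation shown above to kzalloc_node():

for_each_possible_cpu(cpu) {
	/*
	 * Zeroed at allocation so a restore can never expose stale
	 * heap contents, even if a save were ever partial.
	 */
	zfs_kfpu_fpregs[cpu] = kzalloc_node(sizeof (struct fpu),
	    GFP_KERNEL, cpu_to_node(cpu));
	if (zfs_kfpu_fpregs[cpu] == NULL)
		return (-ENOMEM);
}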
sorry for not being verbose enough - what I am worried about is
task A does something with FPU
ZFS saves state (does this always write the full fsave/fxsave/xsave struct?)
ZFS restores state
task A
kernel clears HW FPU state if needed
task B
ZFS saves state (same question?)
ZFS restore (if the two questions above are not a yes, then this might now restore partial state from task A)
task B
or
task A does something with FPU
ZFS saves state (same question)
ZFS restores state
ZFS saves state (same question)
ZFS restores state
task A (potentially sees (partial) state from ZFS FPU operations in HW registers)
it's probably just a question of reading the spec - I won't have time for that today unfortunately, but hopefully on Monday.
Codecov Report
@@            Coverage Diff             @@
##           master    #9406      +/-   ##
==========================================
- Coverage   79.17%   79.03%    -0.14%
==========================================
  Files         412      412
  Lines      123602   123577       -25
==========================================
- Hits        97864    97674      -190
- Misses      25738    25903      +165
Continue to review full report at Codecov.
force-pushed from f95fa71 to 4d5cfde
Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struct. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads. So we remove the code which was added to use task queues to ensure some functions ran in kernel threads. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#9346
force-pushed from 4d5cfde to b6a1d5b
This is ready for another round of review and testing.
It might make sense to split the revert of the taskq dispatching into its own commit - it's not yet part of 0.8.x, so that would make for easier backporting of the rest ;) Currently giving this another round of battering, will report back with results.
@Fabian-Gruenbichler were you able to perform any testing with the PR? Any results?
sorry for not posting an update - yes, I left our reproducer and some additional stress-testing running for quite a while without any visible issues. besides the mask constification (which definitely falls into the 'nit' category) there was only the part about whether we need clearing / re-initialization on each begin/end cycle:
not sure whether you saw that or not? I am not very fond of GitHub's review interface.. basically the question is - do we need to care about
and
both could leak ZFS or other data to the task that runs after ZFS.
Sorry I missed that comment!
Good question. Yes, that is my understanding. As long as we set all of the bits in the xsave mask (which we do), we should be saving and restoring the registers exactly as they were and not leaking any ZFS state. We could try to get fancier and only save/restore the registers we're going to use, but I wanted to try and keep things simple.
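For reference, "all of the bits in the xsave mask" refers to the requested-feature bitmap passed in EDX:EAX to XSAVE/XRSTOR; with every bit set, all state components enabled in XCR0 are saved and restored. A minimal illustrative sketch follows - zfs_xsave()/zfs_xrstor() are made-up names, and the buffer is assumed to be a 64-byte aligned XSAVE area of sufficient size:

static inline void
zfs_xsave(void *addr)
{
	/* Requested-feature bitmap in EDX:EAX, all bits set. */
	unsigned int low = ~0U, high = ~0U;

	asm volatile("xsave %[buf]"
	    : [buf] "+m" (*(char *)addr)
	    : "a" (low), "d" (high)
	    : "memory");
}

static inline void
zfs_xrstor(void *addr)
{
	unsigned int low = ~0U, high = ~0U;

	asm volatile("xrstor %[buf]"
	    : /* no outputs */
	    : [buf] "m" (*(char *)addr), "a" (low), "d" (high)
	    : "memory");
}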
@Fabian-Gruenbichler if you're satisfied with the current PR, could you please approve it? If not, then let's see if we can get any remaining concerns sorted out.
@tonyhutter Are you planning to release the 0.8.3 version if this PR goes upstream?
XSAVE/XRSTOR (and variants) don't always save the full state even with an all-1 mask, but they keep track of what got saved and (re-)initialize the rest upon restore.
FXSAVE/FSAVE always save the full state AFAICT (unless SSE is disabled altogether ;)), FSAVE even clears the registers afterwards. FXRSTOR/FRSTOR thus always restore the full state as well.
the remaining stylistic changes can be pulled in as follow-ups.
@interduo yes, we'll include it in the next release assuming we don't find any other issues.
@tonyhutter my question was a little bit different - will this PR be a reason for the next release?
On October 26, 2019 1:21 pm, vorsich-tiger wrote:
> ... assuming we don't find any other issues.
@Fabian-Gruenbichler
@behlendorf
I'd very much appreciate it if you'd help establish here the final, valid set of assumptions that this new solution is based on.
I'm asking because, while trying to understand the validity of the new solution, I kept wondering what kinds of implicit assumptions about kernel/scheduler behaviour can be made (and which cannot).
Specifically I'd like to get *reasoned* confirmation that the following *potential* loophole in the scheme does *not* exist:
Let me start with a short summary of how *I* understand the new approach.
Please correct me where I'm wrong:
1. the zfs module (when the kernel FPU API is not available) no longer intends to rely on any specific kernel FPU handling behaviour
2. thus, zfs has a private per-cpu FPU state save/restore buffer
3. when any zfs in-kernel thread starts using the FPU, it saves the existing/current FPU state to the zfs save buffer of the CPU the thread is currently running on
4. when any zfs in-kernel thread stops using the FPU, it restores the "original" FPU state from the zfs save buffer of that same CPU
5. OTOH - the current kernels seem to track one FPU state per *thread* though
My worries now concern the following:
What if there exists any intentionally supported kernel/scheduler (e.g. an RT kernel) which preempts a zfs in-kernel thread between 3. and 4.?
There I'd expect a problem if the scheduler assigns the CPU which was active at 3. (but did *not* yet arrive at 4., so it has effectively "corrupted" the FPU state) to a completely different thread - in that case the kernel's view would be that the FPU's state is identical to the state saved at 3., right?
The kernel *might* decide to *not* restore the FPU state from the new thread's save buffer if that thread and the zfs thread alternate on that CPU, right?
What assumption can be made that this does not happen?
Is this guaranteed by general kernel design rules?
preemption and interrupts are disabled between 3 and 4, just like for the regular kernel mechanism (otherwise none of this would work ;)).
@interduo I would say that this patch is a contributing factor to wanting to put out a new release, but there are other reasons to create a 0.8.3 release as well (like misc bugfixes that have gone into master). I imagine we'll start putting together a 0.8.3 patch list in the upcoming weeks.
@tonyhutter is there any hope of seeing 0.8.3 this month?
@interduo probably not, since Thanksgiving is this week. I was planning to talk to @behlendorf tomorrow and go over the potential patch list though. There are ~260 patches in master that aren't in the release branch, and we need to figure out which ones we want to include.
So maybe there is a need to do a 0.9.x-rc1 release and then include all the patches? Ok, thanks for the answer - we are patiently waiting for the release :)
FYI - we actually plan to skip 0.9.x and go straight to a 2.0 release in 2020. 2.0 will include FreeBSD support. See slide 21 here:
The 2.0 release will include all the patches from master. I imagine we'll do 2.0-rc releases like we did with 0.7.0 and 0.8.0.
Ah, the good old Windows/iPhone trick of skipping version 9. @interduo I think a few months is an extremely fast release cycle for a filesystem. Right now I'm also eager for SIMD fixes, but except for that I don't want what's taking care of all my data to change often at all.
I've had some general instability since updating to 0.8.2-2 (debian backports), including general protection faults not apparently in ZFS code (alas, I've lost all the backtraces), and just hangs that went unlogged. Debian do have their own patch to the patch:
Any chance they've screwed up the FPU restore, what would be the symptoms, and would they manifest at random places outside the ZFS code? What's the patch they should be applying?
@spacelama What you mentioned in the changelog is exactly this patch: https://salsa.debian.org/zfsonlinux-team/zfs/blob/master/debian/patches/series#L14 And I'd say that we do not patch ZFS without forwarding the changes upstream, except for the distro-specific ones.
@tonyhutter is there any hope of seeing 0.8.3 in December or January?
Did this make it into the 0.8.3 release? I see in the changes:
But I don't see 10fa254. |
Motivation and Context
Issue #9346
Description
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to unconditionally save and restore the FPU state.
How Has This Been Tested?
Manually tested on a 4.19 and 5.2 kernel using the very effective reproducer provided in #9346. While scrubbing a pool to force a large number of checksum operations, execute mprime -t to run the internal mprime stress tests. Without this change mprime errors out almost immediately; with this PR applied no errors have been observed for either kernel.
The xsave, fxsr, and fnsave save/restore call paths were each tested by modifying the code to force each call path, and then limiting the allowed SIMD instructions as appropriate (e.g. sse-only for fxsr).
@Fabian-Gruenbichler would you mind reviewing this change?
Types of changes
Checklist: