Page faults on Sapphire Rapids #14989
Wonder if there's something spicy about AVX512 save/restore that's not captured in our custom save/restore. @AttilaFueloep
I can't imagine Sapphire Rapids' save/restore context for avx512 is very different from any other avx512 CPU. The VM detail seems germane, though. Maybe the VM is stomping on avx512 register state because it doesn't realize it's on an avx512 CPU? Well, I guess it must have, or it wouldn't have selected the avx512 Fletcher4 implementation. Something tells me the fault is somewhere in the hypervisor, though.
FWIW, the VM is running Windows 11 Pro with |
Oh wait, no, it sounds like the host OS is what's panicking, not the VM. So yeah, my guess is that your VM is stomping all over your avx512 context because your hypervisor isn't presenting itself as an avx512 CPU.
This seems like it should break more things than just ZFS if the guest can clobber the host's SIMD state, though...
Yeah, seems likely. Any better luck with a Linux guest?
I've tried a few restarts, but couldn't trigger it with either Windows or Linux VMs. I believe it'll take me a while to reproduce this. However, it does look like some sort of CPU bug where the processor gets stuck in a buggy state that, for some reason, avx512f plus launching a VM triggers. I'll update once I have more crashes. On the other hand, I have not yet observed this happening with avx512bw.
This one is new (it crashed at
I also had a Windows VM running at the time of the crash (I almost always have one running for controlling the watercooling loop), though this time it didn't crash when starting the VM, but some time after. Since both my original crash and this one seem to indicate |
Interestingly, I tried running a Windows KVM VM on my Ice Lake box with AVX512, running prime95, and scrubbing a pool using fletcher4 in a loop, and nothing has broken in a day. (Oh, and kernel 6.2.16, just to be as close as I could get it. I may go try 6.2.15 to see if there's somehow a secret bug fixed in the point release, but I didn't see anything suspicious in the changelog...) So maybe it really is somehow Sapphire Rapids specific...
Hi, this is just a wild guess, but maybe try to disable preemption with |
I did try running y-cruncher with all AVX512 tests turned on for 8 hours, 4 of them with a VM running, and no crashes there either. It really seems to require some specific processor state to trigger this crash. Today I updated the kernel to 6.3.6, and it took me around 7 reboots to trigger a crash, then two in a row. (Not sure if that's a coincidence or the processor somehow got stuck in that state across reboots, but after the second crash I did a power cycle, and four reboots later it still doesn't crash…) Interestingly, with 6.3.6 my call trace now includes
|
Thanks for the suggestion! I'll try this. Another member on the Level1Techs forum also suggested disabling AMX with
Oh. I think I might understand the bad place, here. If

So we might be well off right now setting the bit to not save/restore AMX, since we're not using it anywhere, and then refactoring how we allocate preservation state, because XTILEDATA is very big indeed, comparatively.

e: I could be wrong, I'm not an expert, but if we were undersizing the save/restore area, I think this might be what that explosion would look like... it could also be an erratum, I'm not ruling that out. If it is the erratum and we're mispreserving the AMX state, then assuming we're not touching the AMX state, masking off the AMX bits might still be a good plan.

e2: if you wanted a hacky experiment patch that doesn't do any of the checking (for whether we have the defines for this) that would be needed if we actually merged it:
|
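The actual experiment patch isn't reproduced above. Purely as an illustration of the idea, a hypothetical sketch (made-up helper names, not the real ZFS code, whose save/restore lives in kfpu_begin()/kfpu_end() in include/os/linux/kernel/linux/simd_x86.h) of masking the AMX components out of the XSAVE/XRSTOR requested-feature bitmap might look like this:

```c
/*
 * Hypothetical sketch only -- not the patch from this thread.
 * Idea: drop the AMX components (XTILECFG = XSTATE bit 17,
 * XTILEDATA = bit 18) from the requested-feature bitmap passed to
 * XSAVE/XRSTOR, so ZFS never saves or restores AMX state at all.
 */
#include <stdint.h>

#define	XFEATURE_MASK_XTILECFG	(1ULL << 17)
#define	XFEATURE_MASK_XTILEDATA	(1ULL << 18)
#define	XFEATURE_MASK_AMX	(XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)

/* buf must be 64-byte aligned and sized per CPUID leaf 0xD. */
static inline void
xsave_no_amx(uint8_t *buf, uint64_t rfbm)
{
	rfbm &= ~XFEATURE_MASK_AMX;	/* never save the AMX tiles */
	__asm__ __volatile__("xsave %[dst]"
	    : [dst] "=m" (*buf)
	    : "a" ((uint32_t)rfbm), "d" ((uint32_t)(rfbm >> 32))
	    : "memory");
}

static inline void
xrstor_no_amx(const uint8_t *buf, uint64_t rfbm)
{
	rfbm &= ~XFEATURE_MASK_AMX;	/* never restore the AMX tiles */
	__asm__ __volatile__("xrstor %[src]"
	    :
	    : [src] "m" (*buf), "a" ((uint32_t)rfbm), "d" ((uint32_t)(rfbm >> 32))
	    : "memory");
}
```

For scale, XTILEDATA alone is 8 KiB of XSAVE state, versus roughly 2.5 KiB for everything up through AVX-512, which is why it dominates the preservation buffer mentioned above.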
Thanks, I'm testing this patch. Let's see over the next few days whether I get any crashes. For reference, this is what my |
|
I've been trying to reproduce the crash with the above patch applied, with multiple reboots and multiple VM relaunches over the past 2 days. So far, not a single crash.
Never noticed this on Zen 4 (supports AVX512 but not AMX), so this seems to be an Intel-specific issue. Nice find!
Not just Intel-specific, either: it doesn't happen on my Ice Lake box (which has AVX512F/BW, but not AMX).
I have not had a single crash in over a week after applying @rincebrain's patch (even after countless reboots and VM starts/shutdowns). I think XTILEDATA might indeed be the cause of the issue.
Does this show up as slight IO wait in the beginning (once the first VM starts using a ZFS volume) that then grows over time until the entire file system gets unresponsive? At the same time, I stumbled over this issue while using ZFS with Proxmox on a new build with an Intel Xeon Silver 4116. Not sure if it's really the same issue at play, but the only error I found in the syslog is a similar page fault with |
Xeon Silver 4116 should be Skylake, not SPR, so not this issue, whatever's broken, I think. I can't...imagine why clearing AMX would matter on those CPUs, since I just double-checked, and they absolutely shouldn't have that. (4416, on the other hand, if you misspoke, would indeed be SPR, and potentially blow up like this.)
Apologies for the typo, 4416+ (Sapphire Rapids). The 4116 would be quite a dated choice for a new build.
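For anyone wanting to double-check whether a given CPU actually exposes the AMX state components (and how much XSAVE space they take), a small stand-alone CPUID probe does the trick. This is a hypothetical helper, not part of ZFS; it assumes an x86-64 build with gcc/clang's <cpuid.h>:

```c
/* Hypothetical stand-alone probe: report the AMX XSAVE components. */
#include <cpuid.h>
#include <stdio.h>

int
main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 0xD, sub-leaf 0: EDX:EAX = XCR0 bits the CPU supports. */
	if (!__get_cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx))
		return (1);

	unsigned long long xcr0 = ((unsigned long long)edx << 32) | eax;
	printf("XTILECFG  (bit 17) supported: %s\n", (xcr0 >> 17) & 1 ? "yes" : "no");
	printf("XTILEDATA (bit 18) supported: %s\n", (xcr0 >> 18) & 1 ? "yes" : "no");

	/* Sub-leaves 17 and 18 give each AMX component's size and offset. */
	for (unsigned int comp = 17; comp <= 18; comp++) {
		if (!__get_cpuid_count(0xD, comp, &eax, &ebx, &ecx, &edx))
			continue;
		printf("component %u: %u bytes at offset %u\n", comp, eax, ebx);
	}
	return (0);
}
```

On a Sapphire Rapids part this should report both bits as supported, with XTILEDATA weighing in at 8192 bytes; on Skylake-era Xeons like the 4116 (or Ice Lake, or Zen 4), both bits come back clear.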
Intel SPR erratum SPR4 says that if you trip into a vmexit while doing FPU save/restore, your AMX register state might misbehave... and by misbehave, I mean save all zeroes incorrectly, leading to explosions if you restore it.

Since we're not using AMX for anything, the simple way to avoid this is to just not save/restore those when we do anything, since we're killing preemption of any sort across our save/restores.

If we ever decide to use AMX, it's not clear that we have any way to mitigate this, on Linux...but I am not an expert.

Fixes: openzfs#14989
Signed-off-by: Rich Ercolani <[email protected]>
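To connect that back to the earlier sketch: the commit describes ZFS bracketing its vectorized checksum work with its own save/restore while preemption is disabled, so masking AMX out of that bracket is all that's needed. A hypothetical shape of that bracket (not the real code; kfpu_begin()/kfpu_end() are the actual ZFS entry points) might be:

```c
/*
 * Hypothetical shape of the bracket the commit message describes; not
 * the real ZFS code.  fpu_buf stands in for ZFS's per-CPU FPU save
 * area, and xsave_no_amx()/xrstor_no_amx() are the made-up helpers
 * sketched earlier in the thread.
 */
#include <stddef.h>
#include <stdint.h>

void
checksum_with_simd(uint8_t *fpu_buf, const void *data, size_t size)
{
	/* preempt_disable() happens here in the real kfpu_begin() */
	xsave_no_amx(fpu_buf, ~0ULL);	/* save everything except AMX */

	(void) data;
	(void) size;	/* the AVX-512 fletcher_4 loop would run here */

	xrstor_no_amx(fpu_buf, ~0ULL);	/* restore everything except AMX */
	/* preempt_enable() happens here in the real kfpu_end() */
}
```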
Just want to report that the spr nobiscuit patch works great; I haven't had a single crash for over a month already :-)
I have never actually measured this, but some disk access was possible for a brief moment after the crash. I've also seen reports on the Level1Techs forum of Xeon Gold 6438Y+ systems having this issue on Proxmox as well (with a Linux guest), so this does seem to affect all SPRs.
Intel SPR erratum SPR4 says that if you trip into a vmexit while doing FPU save/restore, your AMX register state might misbehave... and by misbehave, I mean save all zeroes incorrectly, leading to explosions if you restore it.

Since we're not using AMX for anything, the simple way to avoid this is to just not save/restore those when we do anything, since we're killing preemption of any sort across our save/restores.

If we ever decide to use AMX, it's not clear that we have any way to mitigate this, on Linux...but I am not an expert.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rich Ercolani <[email protected]>
Closes openzfs#14989
Closes openzfs#15168
Sigh. Got bit by that one. The fix is in ZFS 2.2, I'm told; it hasn't been backported to 2.1.x.
System information
Describe the problem you're observing
Under certain circumstances, ZFS may panic in fletcher_4_avx512f_native while starting a VM on a 4th Gen Xeon (Sapphire Rapids). The system (the host) becomes unresponsive soon after this happens, and any attempt to access the filesystem simply freezes (e.g., launching a new program hangs, but an already-running program continues to work as long as it doesn't do any filesystem access). The problem usually goes away on the next boot after pressing the reset button. So far, I've been trying to find a way to reproduce this, but it seems to happen at random. During all crashes, I've observed the following:
The machine is running a Xeon w9-3495X with ECC memory (everything stock, no overclocking). IPMI shows no errors in the event logs.
I've also observed the crash both with the VM on a zvol and with the VM as a disk image on a ZFS mount.
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs
This is from the latest crash:
This is from the subsequent reboot:
zpool:
```
$ zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:03:12 with 0 errors on Sat Jun 17 15:05:39 2023
config:

	NAME                                                      STATE     READ WRITE CKSUM
	zroot                                                     ONLINE       0     0     0
	  mirror-0                                                ONLINE       0     0     0
	    nvme-SOLIDIGM_SSDPFKKW020X7_SSC1N514011201I6N-part2  ONLINE       0     0     0
	    nvme-SOLIDIGM_SSDPFKKW020X7_SSC1N514011101I0I-part2  ONLINE       0     0     0
	  mirror-1                                                ONLINE       0     0     0
	    nvme-SOLIDIGM_SSDPFKKW020X7_SSC1N514011201I3R-part1  ONLINE       0     0     0
	    nvme-SOLIDIGM_SSDPFKKW020X7_SSC1N514011301H6Y-part1  ONLINE       0     0     0

errors: No known data errors
```
I've also posted a full dmesg from two occurrences of the crash here (there are more, but I was not able to capture them):
https://gist.github.com/sirn/0a9489444b4e9627ee5c2aa1bf60c242