Page Fault on Pool Import #15513
Comments
I've tried to assert all the overflows I could think of in replay in #15517. It would be good if somebody could try reproducing it with a debug ZFS build and maybe my patch. |
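As a rough illustration of the kind of overflow check meant here (this is not the code from #15517), a record that carries a count of block pointers can be validated against its own declared length before anything walks the array. The names `chk_record_fits`, `CHK_HDR_SIZE` and `CHK_BP_SIZE` are hypothetical, and the sizes are assumptions chosen for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes for this sketch; not taken from the OpenZFS headers. */
#define	CHK_HDR_SIZE	72u	/* fixed part of the record */
#define	CHK_BP_SIZE	128u	/* one block pointer */

/*
 * Return true if a record of reclen bytes can really hold nbps block
 * pointers after its fixed header.
 */
static bool
chk_record_fits(uint64_t reclen, uint64_t nbps)
{
	/* A record shorter than its own header is already corrupt. */
	if (reclen < CHK_HDR_SIZE)
		return (false);

	/*
	 * Divide rather than multiply: nbps * CHK_BP_SIZE can wrap around
	 * for a corrupted nbps and make a naive comparison pass.
	 */
	return (nbps <= (reclen - CHK_HDR_SIZE) / CHK_BP_SIZE);
}
```

In a debug build a check like this would typically sit behind an assertion, so a corrupted count stops replay with a clear message instead of a page fault further down.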
@amotin I'm happy to repro with a debug build if someone can point me to how to install debug modules on Arch Linux. FYI I get similar behavior when building against |
@dpendolino I did the following using zfs-dkms-git aur: create
download the patch from #15517 with into the aur
modify
create a snapshot before installing and also clone install with |
@amotin also tried to trigger #15485
it happens when running OpenWrt Kernel Build - this line:
I'm not so familiar with Linux kernel builds, but building OpenWrt from scratch as in #15485 (comment) triggers it reliably here. The failed import also happened after a crash caused by this (using older zfs git from hope this somehow helps. |
source of
This happened directly after triggering the issue and rebooting the machine due to the hang (also zfs debug build): |
@mtippmann Your last panic looks like it could be a different flavor of the earlier one. I've extended my assertions patch to catch that scenario as well. My patch only catches the consequences after reboot, though; I am still not sure what happens before. Is it a coincidence to see block cloning involved there, or is it the cause/trigger? |
I'll rebuild and post the crash on next reboot - as for block cloning it's used but not intentionally by me.
Arch has coreutils 9.4. It happened the first time on an encrypted pool, also running git at the time: building OpenWrt triggered the import panic after reboot - from a quick look at I then recreated the pool without encryption, and I can trigger the oops when building OpenWrt, but import still seems to work fine (except now with the debug build). I know that's not super helpful. I don't think I've hit that fixed bug regarding cloning on non-encrypted / encrypted datasets, so this seems to be something else. |
Correction: read-only import still works fine. |
Last screenshots are with the updated patch from #15517. |
I'm happy to keep testing new patches if folks think we're close, but if not, then I may just need to rebuild in order to get my laptop up and running again. |
On a pool with current git zfs and block cloning disabled the issue can't be triggered by me anymore. So I guess it's related to block cloning. |
@mtippmann is it possible to disable block cloning on a pool that can't be imported read/write? I assume not, but it would be really nice to find a way to not have to rebuild. |
@dpendolino Once pool corruption has happened, it has already happened: according to the provided panics with my assertions patch, the ZIL is really corrupted. We should diagnose what is going on before the reboot: what causes the original panic, and probably some memory corruption that we see as a corrupted ZIL. |
@amotin gotcha, then let me know anything else you need from me, and I'll rebuild later tonight. |
I don't know - the new pool without block-cloning was recreated with |
I think I've found the cause of the crash during the encrypted pool import: #15543 -- encryption for block clone ZIL records was not done correctly. It does not explain the original crash you see during the build; that is likely a different issue. PS: It will not fix already corrupted pools, only prevent new corruption. |
Tested #15566 and #15543 with zfs git on a pool just upgraded to all features in current git. Import still works (so #15543 seems to work), but it still crashes when building OpenWrt during vdso generation... @amotin unfortunately no debug build; if it's useful I can rerun with debug enabled. @robn this also only happens when block cloning is active and not without block cloning. Might be interesting.
|
Setting |
It should be purely textual change to make the code more readable. Should cause no functional difference. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Edmund Nadolski <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15543 Closes #15513
In case of crash cloned blocks need to be claimed on pool import. It is only possible if they (lr_bps) and their count (lr_nbps) are not encrypted but only authenticated, similar to block pointer in lr_write_t. Few other fields can be and are still encrypted. This should fix panic on ZIL claim after crash when block cloning is actively used. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tom Caputi <[email protected]> Reviewed-by: Sean Eric Fagan <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Edmund Nadolski <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15543 Closes openzfs#15513
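The split described in this commit message can be pictured with a small sketch. It is not the OpenZFS code: the `sketch_*` types are simplified stand-ins, the XOR routine only stands in for real encryption, the treatment of the common header as authenticated-only is an assumption, and the "non-hole" test in the claim loop is hypothetical. It shows only that the count and the block pointers stay readable without the key while the other fields do not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for illustration; NOT the real OpenZFS definitions. */
typedef struct { uint64_t word[16]; } sketch_blkptr_t;

typedef struct {
	uint64_t lrc_txtype;
	uint64_t lrc_reclen;
	uint64_t lrc_txg;
	uint64_t lrc_seq;
} sketch_lr_t;

typedef struct {
	sketch_lr_t lr_common;		/* assumed authenticated only */
	uint64_t lr_foid;		/* encrypted */
	uint64_t lr_offset;		/* encrypted */
	uint64_t lr_length;		/* encrypted */
	uint64_t lr_blksz;		/* encrypted */
	uint64_t lr_nbps;		/* authenticated only */
	sketch_blkptr_t lr_bps[];	/* authenticated only */
} sketch_lr_clone_range_t;

/* Allocate a zeroed record with room for nbps block pointers. */
static sketch_lr_clone_range_t *
sketch_alloc(uint64_t nbps)
{
	sketch_lr_clone_range_t *lr =
	    calloc(1, sizeof (*lr) + nbps * sizeof (sketch_blkptr_t));

	if (lr != NULL)
		lr->lr_nbps = nbps;
	return (lr);
}

/* Byte range [*start, *end) of the record that gets encrypted. */
static void
sketch_encrypted_range(size_t *start, size_t *end)
{
	*start = offsetof(sketch_lr_clone_range_t, lr_foid);
	*end = offsetof(sketch_lr_clone_range_t, lr_nbps);
}

/* Toy XOR "cipher" over the encrypted range; applying it twice undoes it. */
static void
sketch_toy_crypt(sketch_lr_clone_range_t *lr, uint8_t key)
{
	size_t start, end;
	uint8_t *bytes = (uint8_t *)lr;

	assert(lr != NULL);
	sketch_encrypted_range(&start, &end);
	for (size_t i = start; i < end; i++)
		bytes[i] ^= key;
}

/* What a claim pass can do without the key: walk the block pointers. */
static uint64_t
sketch_claim_count(const sketch_lr_clone_range_t *lr)
{
	uint64_t claimed = 0;

	for (uint64_t i = 0; i < lr->lr_nbps; i++) {
		if (lr->lr_bps[i].word[0] != 0)	/* hypothetical non-hole test */
			claimed++;
	}
	return (claimed);
}
```

If `lr_nbps` or `lr_bps` fell inside the encrypted range, the loop in `sketch_claim_count` would be reading ciphertext at import time, which matches the kind of failure this thread reports on encrypted pools.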
System information
Describe the problem you're observing
This is an encrypted single-disk root pool that will no longer boot. Any attempt to import the pool in a live environment causes the following page fault:
dmesg.txt
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs
# zpool import -R /mnt -f reddwarf-zroot
With the readonly flag set, the pool will import and the data is there.
# zpool import -R /mnt -f -o readonly=on reddwarf-zroot