Reboot on panic, oops, and zfs "panic" #288

Merged
merged 1 commit into from
Jun 6, 2024

barrucadu
Owner

The default behaviour of halting the system on actual panics and zfs "panics" isn't very useful to me, since I'm not a kernel developer. I want the system to just reboot. An oops is also often not recoverable, so just panic / reboot on those as well.
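
The diff itself isn't shown in this conversation. As a rough sketch of how this behaviour could be expressed on NixOS, assuming the standard `kernel.panic` and `kernel.panic_on_oops` sysctls and the SPL `spl_panic_halt` module parameter (the 10 second timeout is an illustrative value, not taken from this PR):

```nix
{
  # Assumption: these values are illustrative, not copied from the merged commit.
  boot.kernel.sysctl = {
    # Reboot this many seconds after a kernel panic (0 means never reboot).
    "kernel.panic" = 10;
    # Treat an oops as a panic, so it also triggers the reboot above.
    "kernel.panic_on_oops" = 1;
  };

  # Make a failed SPL/ZFS assertion (VERIFY3 and friends) call the kernel's
  # panic() instead of only halting the offending thread.
  boot.extraModprobeConfig = ''
    options spl spl_panic_halt=1
  '';
}
```

With something like this in place, a zfs "panic" becomes a real kernel panic, and the panic timeout then takes care of the reboot.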

This is motivated by a few recent zfs "panics" on nyarlathotep:

VERIFY3(sa.sa_magic == SA_MAGIC) failed (8192 == 3100762)
PANIC at zfs_quota.c:88:zpl_get_file_info()
Showing stack for process 411118
CPU: 11 PID: 411118 Comm: nix Tainted: P           O       6.1.92 #1-NixOS
Hardware name: ASUS System Product Name/PRIME B650M-A AX, BIOS 0421 08/19/2022
Call Trace:
 <TASK>
 dump_stack_lvl+0x44/0x5c
 spl_panic+0xf0/0x108 [spl]
 ? srso_alias_return_thunk+0x5/0x7f
 ? dnode_cons+0x2a1/0x2c0 [zfs]
 zpl_get_file_info+0x227/0x240 [zfs]
 dmu_objset_userquota_get_ids+0x243/0x4b0 [zfs]
 dnode_setdirty+0x33/0xe0 [zfs]
 dnode_allocate+0x160/0x1d0 [zfs]
 dmu_object_alloc_impl+0x35f/0x3f0 [zfs]
 zap_create_norm_dnsize+0x4f/0xa0 [zfs]
 zfs_mknode+0xe9d/0x1000 [zfs]
 zfs_mkdir+0x51a/0x710 [zfs]
 zpl_mkdir+0xc7/0x1d0 [zfs]
 vfs_mkdir+0x9c/0x140
 do_mkdirat+0x142/0x170
 __x64_sys_mkdir+0x45/0x60
 do_syscall_64+0x34/0x80
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7fbf6b5a395b
Code: 0f 1e fa 48 89 f2 b9 00 01 00 00 48 89 fe bf 9c ff ff ff e9 d7 cc ff ff 0f 1f 80>
RSP: 002b:00007ffe3fb3b398 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 00007ffe3fb3b7e0 RCX: 00007fbf6b5a395b
RDX: 0000560a818fdd1c RSI: 00000000000001ff RDI: 0000560fe1d64660
RBP: 00007ffe3fb3b3a0 R08: 0000000000000007 R09: 0000000000000006
R10: 0000000000000007 R11: 0000000000000206 R12: 0000000000000000
R13: 00007ffe3fb3c190 R14: 00007ffe3fb3b540 R15: 00007ffe3fb3b570
 </TASK>

Unfortunately it doesn't actually say which pool it is, but I think it's `local` because non-NAS processes begin to fail, e.g. logs about nix hanging, prometheus stops writing data, grafana crashes...

@barrucadu barrucadu merged commit 0552e14 into master Jun 6, 2024
2 checks passed
@barrucadu barrucadu deleted the panic branch June 6, 2024 10:31