Just after boot a focal image panics propolis #333

Open

leftwo opened this issue Mar 8, 2023 · 9 comments

@leftwo (Contributor) commented Mar 8, 2023

Running the latest Omicron (on sn21), I encountered a panic in propolis-server after the first boot of a newly created Ubuntu Focal instance.

The tail of the propolis-server log looks like this:

Dec 28 03:06:02.914 INFO piix3pm ignored cfg read, register: SmbusBase, offset: 0, dev: chipset
Dec 28 03:06:02.914 INFO piix3pm ignored cfg read, register: DevResA, offset: 0, dev: chipset
Dec 28 03:06:02.915 INFO piix3pm ignored cfg read, register: DevResB, offset: 0, dev: chipset
Dec 28 03:06:02.915 INFO piix3pm ignored cfg read, register: DevResC, offset: 0, dev: chipset
Dec 28 03:06:02.915 INFO piix3pm ignored cfg read, register: DevResI, offset: 0, dev: chipset
Dec 28 03:06:02.915 INFO piix3pm ignored cfg read, register: DevResJ, offset: 0, dev: chipset
Dec 28 03:06:03.125 INFO rdmsr, msr: 3221225705, vcpu: 0, component: vcpu_tasks
Dec 28 03:06:43.620 WARN [1] flow control start 
Dec 28 03:06:44.621 WARN [1] flow control sending more work
Dec 28 03:06:44.621 WARN [1] flow control end 
thread 'vcpu-2' panicked at 'vCPU 2: Unhandled VM exit: InstEmul(InstEmul { inst_data: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 0 })', bin/propolis-server/src/lib/vm/mod.rs:450:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Dec 27 19:06:46 Stopping because all processes in service exited. ]
[ Dec 27 19:06:46 Executing stop method (:kill). ]

The stack of the panicking thread is:

-------------- thread# 140 / lwp# 140 [vcpu-2] ---------------                              
 fffff5ffef15863a _lwp_kill () + a                                                          
 fffff5ffef0ec58e raise (6) + 1e                                                            
 fffff5ffef0c5ff8 abort () + 58                                                             
 0000000001d2e7c9 panic_abort::__rust_start_panic::abort::hc1d5f3609ef6507d () + 9          
 0000000001d2e7b9 __rust_start_panic () + 9                                                 
 0000000001d19fac rust_panic () + 1c                                                        
 0000000001d19e34 std::panicking::rust_panic_with_hook::h3e4550c354b0930e () + 2c4          
 0000000001d19b56 std::panicking::begin_panic_handler::{{closure}}::he4c4fc5cc7d7a749 () + c6                                                                                           
 0000000001d16ef0 std::sys_common::backtrace::__rust_end_short_backtrace::hf95a242dcdd81a73 () + 20                                                                                     
 0000000001d19871 rust_begin_unwind () + 71                                                 
 0000000001d71a83 core::panicking::panic_fmt::h8e01a776a1dd4222 () + 33                     
 0000000000f12cf8 propolis_server::vm::SharedVmState::unhandled_vm_exit::hc565154474be8e59 () + 68                                                                                      
 00000000010532ea propolis_server::vcpu_tasks::VcpuTasks::vcpu_loop::h3e160f3429fdb39d () + 9ba                                                                                         
 000000000102a9b7 std::sys_common::backtrace::__rust_begin_short_backtrace::h00a3785c156ed8ed () + 57                                                                                   
 0000000000f99f8d core::ops::function::FnOnce::call_once{{vtable.shim}}::h1d68dceaad48552a () + bd                                                                                      
 0000000001d22d97 std::sys::unix::thread::Thread::new::thread_start::he5f444ef683dcd93 () + 27                                                                                          
 fffff5ffef15121c _thrp_setup (fffff5ffee734a40) + 6c
 fffff5ffef151530 _lwp_start ()
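
For context, here is a minimal sketch of the pattern at play (hypothetical names and a stubbed hypervisor call, not propolis's actual internals): each vCPU thread runs an exit-dispatch loop, and any exit the server does not know how to service is treated as fatal, which is the panic seen above for an InstEmul exit carrying zero instruction bytes (len: 0).

// Hedged sketch; VmExit, run_vcpu, and vcpu_loop are illustrative names.
#[derive(Debug)]
enum VmExit {
    Rdmsr(u32),
    Wrmsr(u32, u64),
    // Instruction-emulation exit: raw bytes plus how many are valid.
    InstEmul { inst_data: [u8; 15], len: u8 },
}

// Stub standing in for "run this vCPU until it exits to userspace".
fn run_vcpu(_id: u32) -> VmExit {
    // Reproduces the log above: an InstEmul exit with no instruction bytes.
    VmExit::InstEmul { inst_data: [0; 15], len: 0 }
}

fn vcpu_loop(id: u32) {
    loop {
        match run_vcpu(id) {
            VmExit::Rdmsr(_) | VmExit::Wrmsr(..) => { /* emulate the access */ }
            // Exits the server cannot handle take down the whole VM:
            unhandled => panic!("vCPU {}: Unhandled VM exit: {:?}", id, unhandled),
        }
    }
}

fn main() {
    vcpu_loop(2); // panics immediately with the stubbed exit
}
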
@leftwo (Contributor, Author) commented Mar 8, 2023

Core file available here: catacomb:/data/staff/core/gimlet-sn21/propolis-issue-333

@leftwo (Contributor, Author) commented Mar 8, 2023

Here is the propolis-server log:
propolis-server.log

@leftwo (Contributor, Author) commented Mar 9, 2023

The Crucible version (from the Omicron package-manifest) is:

source.commit = "fb671895e8adb3cab5e801bbbe8728997178aba4"

@leftwo (Contributor, Author) commented Mar 9, 2023

I was able to reproduce this on a second attempt.

I suspect a version mismatch between the Crucible revision Propolis uses and the one Omicron uses may be having an impact here.
Propolis has: f8bcd8ccc8cb649f510cdf0f58fbd6d1a40708ed
whereas Omicron has: fb671895e8adb3cab5e801bbbe8728997178aba4

There are some differences in the message passing protocol between these two versions.
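
To illustrate why that matters, here is a hedged sketch with invented message names (Hello/HelloAck), not Crucible's actual wire protocol: if the two sides do not agree on a protocol version up front, messages whose layout changed between revisions get misdecoded, and the resulting failure can surface far from the real cause.

// Hypothetical version handshake; all names here are illustrative.
const PROTOCOL_VERSION: u32 = 4;

#[derive(Debug)]
enum Message {
    Hello { version: u32 },
    HelloAck { version: u32 },
    // ... data-plane messages whose encoding may differ across revisions ...
}

// Refuse a peer whose protocol revision doesn't match ours, instead of
// letting mismatched message layouts produce confusing downstream failures.
fn check_peer(hello: &Message) -> Result<(), String> {
    match hello {
        Message::Hello { version } if *version == PROTOCOL_VERSION => Ok(()),
        Message::Hello { version } => Err(format!(
            "peer speaks protocol v{}, we speak v{}", version, PROTOCOL_VERSION
        )),
        other => Err(format!("expected Hello, got {:?}", other)),
    }
}

fn main() {
    // A peer one revision behind is rejected with a clear error.
    println!("{:?}", check_peer(&Message::Hello { version: 3 }));
}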

@leftwo (Contributor, Author) commented Mar 9, 2023

Confirmed this was the issue.
I updated both propolis and omicron to use the same crucible rev, and my panic on boot went away.

I've filed an issue on the Crucible repo to better track and report over-the-wire version mismatches, and we probably need to move Omicron to match the Propolis version (or make them all use the latest Crucible).

leftwo closed this as completed Mar 9, 2023
leftwo reopened this Mar 9, 2023
@leftwo (Contributor, Author) commented Mar 9, 2023

sigh
I confirmed this too quickly. The instance did boot, but propolis still panicked after booting.

I also confirmed that Omicron uses a Propolis build whose Crucible revision matches the one Omicron itself uses, so there was no mismatch in the stock Omicron bits.

@jmpesp (Contributor) commented Mar 9, 2023

I'm also seeing a similar panic with different values:

thread 'vcpu-3' panicked at 'vCPU 3: Unhandled VM exit: InstEmul(InstEmul { inst_data: [87, 194, 74, 179, 4, 116, 251, 176, 123, 154, 255, 255, 61, 97, 0], len: 15 })', bin/propolis-server/src/lib/vm/mod.rs:436:9
stack backtrace:
   0:          0x1cddb9c - std::backtrace_rs::backtrace::libunwind::trace::h9b534fc6095f520f
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:          0x1cddb9c - std::backtrace_rs::backtrace::trace_unsynchronized::hfcd8a890f991a10d
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:          0x1cddb9c - std::sys_common::backtrace::_print_fmt::hdf432e5f0940fa9c
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/sys_common/backtrace.rs:65:5
   3:          0x1cddb9c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5693a803500ade80
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/sys_common/backtrace.rs:44:22
   4:          0x1d352ca - core::fmt::write::h94f582bdfeeeb1a6
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/fmt/mod.rs:1209:17
   5:          0x1cce984 - std::io::Write::write_fmt::hd3f00507505f204d
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/io/mod.rs:1682:15
   6:          0x1cdd970 - std::sys_common::backtrace::_print::h21d778336866aef0
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/sys_common/backtrace.rs:47:5
   7:          0x1cdd970 - std::sys_common::backtrace::print::h8238ed32ff302292
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/sys_common/backtrace.rs:34:9
   8:          0x1ce06d6 - std::panicking::default_hook::{{closure}}::h933e4cf16faea6e2
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:267:22
   9:          0x1ce036c - std::panicking::default_hook::h9576f50ee054cc3c
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:286:9
  10:          0x1ce0ffc - std::panicking::rust_panic_with_hook::hb09aca5ad1ad2161
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:688:13
  11:          0x1ce0d75 - std::panicking::begin_panic_handler::{{closure}}::h189c984e260f383e
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:579:13
  12:          0x1cde040 - std::sys_common::backtrace::__rust_end_short_backtrace::hcab55adb02d49332
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/sys_common/backtrace.rs:137:18
  13:          0x1ce0aa1 - rust_begin_unwind
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/panicking.rs:575:5
  14:          0x1d31c53 - core::panicking::panic_fmt::hc653fbe903263e77
                               at /rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/core/src/panicking.rs:65:14
  15:           0xf3c3f8 - propolis_server::vm::SharedVmState::unhandled_vm_exit::h4a61df6e3da484d8

This is booting a helios-1.0.21472 guest using:

omicron: cb3d713a
propolis: d9434e5

I haven't seen a Focal image panic yet on my setup.

@jmpesp (Contributor) commented Mar 9, 2023

I have three dead propolis zones, and the panic seems to correlate with reading and writing the same MSR:

Mar 09 22:41:36.153 INFO rdmsr, msr: 3221291039, vcpu: 0, component: vcpu_tasks
Mar 09 22:41:36.153 INFO wrmsr, value: 70368744177664, msr: 3221291039, vcpu: 0, component: vcpu_tasks
thread 'vcpu-3' panicked at 'vCPU 3: Unhandled VM exit: InstEmul(InstEmul { inst_data: [87, 194, 74, 179, 4, 116, 251, 176, 123, 154, 255, 255, 61, 97, 0], len: 15 })', bin/propolis-server/src/lib/vm/mod.rs:436:9

Mar 09 22:42:56.858 INFO rdmsr, msr: 3221291039, vcpu: 0, component: vcpu_tasks
Mar 09 22:42:56.858 INFO wrmsr, value: 70368744177664, msr: 3221291039, vcpu: 0, component: vcpu_tasks
thread 'vcpu-7' panicked at 'vCPU 7: Unhandled VM exit: InstEmul(InstEmul { inst_data: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 0 })', bin/propolis-server/src/lib/vm/mod.rs:436:9

Mar 09 22:49:10.878 INFO wrmsr, value: 70368744177664, msr: 3221291039, vcpu: 0, component: vcpu_tasks
thread 'vcpu-5' panicked at 'vCPU 5: Unhandled VM exit: InstEmul(InstEmul { inst_data: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 0 })', bin/propolis-server/src/lib/vm/mod.rs:436:9

@jordanhendricks (Contributor) commented Mar 10, 2023

Just throwing this in here since I looked it up earlier. The MSR in @jmpesp's rdmsr/wrmsr output corresponds to MSR C001_001F, which is "Northbridge Configuration 1" (NB_CFG1).

The value it's writing is 0x4000_0000_0000, which is setting bit 46.

Per the AMD BIOS and Kernel Developer's Guide (BKDG), that is:

EnableCf8ExtCfg: enable CF8 extended configuration cycles. Reset: 0. 1=Allows the IO configuration space access method, IOCF8 and IOCFC, to be used to generate extended configuration cycles by enabling IOCF8[27:24].
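
The arithmetic is easy to double-check (plain Rust; the constants are copied verbatim from the log lines above):

fn main() {
    // MSR number and written value exactly as logged by propolis.
    let msr: u64 = 3221291039;
    let value: u64 = 70368744177664;

    // 3221291039 == 0xC001001F, i.e. AMD MSR C001_001F (NB_CFG1).
    println!("msr   = {:#x}", msr);
    // 70368744177664 == 0x4000_0000_0000 == 1 << 46, so bit 46
    // (EnableCf8ExtCfg in the BKDG excerpt above) is being set.
    println!("value = {:#x}, bit {}", value, value.trailing_zeros());
}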
