
Changed behavior: chained shutdown of netvm client VMs when upstream VM exits? #7266

Closed
brendanhoar opened this issue Feb 12, 2022 · 7 comments
Labels
C: kernel
P: default (Priority: default. Default priority for new issues, to be replaced given sufficient information.)
R: duplicate (Resolution: Another issue exists that is very similar to or subsumes this one.)
T: bug (Type: bug report. A problem or defect resulting in unintended behavior in something that exists.)

Comments

@brendanhoar

brendanhoar commented Feb 12, 2022


Qubes OS release

R4.0 updated with current-testing and kernel-latest (dom0 and VMs)

Brief summary

When an upstream network-providing VM exits, client VMs that depend upon that network-providing VM exit as well.

This forced shutdown behavior is new. I had noticed it on a test R4.1rc4 + updates machine, but ran across it today on my daily driver, which is R4.0.

Steps to reproduce

Start personal VM. This autostarts four other VMs: sys-net, sys-mirage-vpn-to-net, sys-vpn, sys-mirage-vms-to-vpn. The mirage VPNs are low-memory usage firewalls.

Edited:
Exiting an upstream VM (e.g. executing shutdown -h now in sys-vpn, or pause/kill on the mirage VMs) causes the whole chain of downstream Linux VMs to shut down as well. If a mirage VM is reached, it does not shut down.
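The steps above can be sketched as a dom0 shell session (a sketch only: the VM names are taken from this report, qvm-start/qvm-shutdown/qvm-ls are the standard Qubes dom0 tools, and the script is guarded so it exits cleanly on a machine that is not a Qubes dom0):

```shell
#!/bin/sh
# Reproduction sketch; assumption: run in dom0 on an affected (5.16.5-1) VM kernel.
if ! command -v qvm-start >/dev/null 2>&1; then
    echo 'not dom0; skipping'
    exit 0
fi
qvm-start personal            # autostarts sys-net ... sys-mirage-vms-to-vpn
qvm-shutdown --wait sys-vpn   # exit an upstream network-providing VM
sleep 10                      # give the chained shutdown time to propagate
qvm-ls --running              # downstream Linux qubes are unexpectedly gone
```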

Expected behavior

No other VMs should be forced to exit.

VMs should not automatically shut down unless:

  1. the user explicitly shuts the VM down;
  2. the VM is a disposable VM and the invoking app exits;
  3. the user has configured idle shutdown timeouts.

Actual behavior

VMs in use are unexpectedly shut down, in a chained or tree fashion, potentially causing data loss, and certainly causing annoyance.

In particular, I've been in the habit of manually shutting down the entire networking "conduit" and then restarting it after template updates; in Qubes, that "conduit" is often 3-4 VMs. As I noted before, this "also shut down the networking clients" behavior is new.

Brendan

@brendanhoar brendanhoar added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Feb 12, 2022
@Minimalist73

I experienced this too on 4.1.
My NetVM was stuck for some reason, so I killed it with the Qube Manager, and all the qubes attached to it instantly shut down as well. It didn't behave this way before, and it's annoying.

@marmarek
Member

It's more likely a kernel panic, specifically #7257. There is no intentional feature like this. Please check /var/log/xen/console/guest-*.log of the relevant qubes to confirm.
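A quick way to confirm this (a sketch, assuming the default dom0 log location mentioned above) is to grep the guest console logs for the panic string; the same match is demonstrated here against one sample console-log line so the snippet runs anywhere:

```shell
#!/bin/sh
# In dom0, the real check would be (assumption: default Qubes log path):
#   grep -l 'Kernel panic' /var/log/xen/console/guest-*.log
# Demonstration of the same match against a sample console-log line:
sample='[   32.795580] Kernel panic - not syncing: Fatal exception'
printf '%s\n' "$sample" | grep -o 'Kernel panic - not syncing'
```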

@Minimalist73

@marmarek Here's what I get when the qube shuts down:

[2022-02-13 00:22:49] [   32.794981] #PF: supervisor read access in kernel mode
[2022-02-13 00:22:49] [   32.794989] #PF: error_code(0x0000) - not-present page
[2022-02-13 00:22:49] [   32.794998] PGD 0 P4D 0 
[2022-02-13 00:22:49] [   32.795003] Oops: 0000 [#1] PREEMPT SMP PTI
[2022-02-13 00:22:49] [   32.795011] CPU: 3 PID: 64 Comm: xenwatch Not tainted 5.16.5-1.fc32.qubes.x86_64 #1
[2022-02-13 00:22:49] [   32.795024] RIP: 0010:free_netdev+0xa3/0x1a0
[2022-02-13 00:22:49] [   32.795037] Code: ff 48 89 df e8 1e de 00 00 48 8b 43 50 48 8b 08 48 8d b8 a0 fe ff ff 48 8d a9 a0 fe ff ff 49 39 c4 75 26 eb 47 e8 bd d4 6c ff <48> 8b 85 60 01 00 00 48 8d 95 60 01 00 00 48 89 ef 48 2d 60 01 00
[2022-02-13 00:22:49] [   32.795062] RSP: 0018:ffffc90000b3fd60 EFLAGS: 00010286
[2022-02-13 00:22:49] [   32.795070] RAX: 0000000000000000 RBX: ffff8880ed769000 RCX: 0000000000000000
[2022-02-13 00:22:49] [   32.795082] RDX: 0000000000000001 RSI: ffffc90000b3fc90 RDI: 00000000ffffffff
[2022-02-13 00:22:49] [   32.795093] RBP: fffffffffffffea0 R08: 0000000000000001 R09: 0000000000000000
[2022-02-13 00:22:49] [   32.795104] R10: 0000000000000000 R11: 0000000000000003 R12: ffff8880ed769050
[2022-02-13 00:22:49] [   32.795115] R13: ffff888006c75f88 R14: ffff888003fc2b80 R15: ffff88800855a880
[2022-02-13 00:22:49] [   32.795126] FS:  0000000000000000(0000) GS:ffff8880f5d80000(0000) knlGS:0000000000000000
[2022-02-13 00:22:49] [   32.795138] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2022-02-13 00:22:49] [   32.795147] CR2: 0000000000000000 CR3: 00000000a9d98004 CR4: 00000000003706e0
[2022-02-13 00:22:49] [   32.795159] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2022-02-13 00:22:49] [   32.795170] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[2022-02-13 00:22:49] [   32.795181] Call Trace:
[2022-02-13 00:22:49] [   32.795186]  <TASK>
[2022-02-13 00:22:49] [   32.795193]  xennet_remove+0x65/0x80 [xen_netfront]
[2022-02-13 00:22:49] [   32.795204]  xenbus_dev_remove+0x6d/0xf0
[2022-02-13 00:22:49] [   32.795213]  __device_release_driver+0x17a/0x240
[2022-02-13 00:22:49] [   32.795223]  device_release_driver+0x24/0x30
[2022-02-13 00:22:49] [   32.795232]  bus_remove_device+0xd8/0x140
[2022-02-13 00:22:49] [   32.795239]  device_del+0x18b/0x410
[2022-02-13 00:22:49] [   32.795246]  ? _raw_spin_unlock+0x16/0x30
[2022-02-13 00:22:49] [   32.795254]  ? klist_iter_exit+0x14/0x20
[2022-02-13 00:22:49] [   32.795262]  device_unregister+0x13/0x60
[2022-02-13 00:22:49] [   32.795268]  xenbus_dev_changed+0x18e/0x1f0
[2022-02-13 00:22:49] [   32.795276]  xenwatch_thread+0xc0/0x1a0
[2022-02-13 00:22:49] [   32.795284]  ? do_wait_intr_irq+0xa0/0xa0
[2022-02-13 00:22:49] [   32.795291]  ? read_reply+0x160/0x160
[2022-02-13 00:22:49] [   32.795298]  kthread+0x158/0x180
[2022-02-13 00:22:49] [   32.795306]  ? set_kthread_struct+0x40/0x40
[2022-02-13 00:22:49] [   32.795313]  ret_from_fork+0x22/0x30
[2022-02-13 00:22:49] [   32.795322]  </TASK>
[2022-02-13 00:22:49] [   32.795326] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device rfkill ipt_REJECT nf_reject_ipv4 xt_state xt_conntrack xenfs nft_counter nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel xen_netfront snd_pcm snd_timer snd soundcore pcspkr xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn parport_pc ppdev lp parport drm fuse sunrpc bpf_preload ip_tables overlay xen_blkfront
[2022-02-13 00:22:49] [   32.795412] CR2: 0000000000000000
[2022-02-13 00:22:49] [   32.795420] ---[ end trace b594eee2680b0682 ]---
[2022-02-13 00:22:49] [   32.795428] RIP: 0010:free_netdev+0xa3/0x1a0
[2022-02-13 00:22:49] [   32.795437] Code: ff 48 89 df e8 1e de 00 00 48 8b 43 50 48 8b 08 48 8d b8 a0 fe ff ff 48 8d a9 a0 fe ff ff 49 39 c4 75 26 eb 47 e8 bd d4 6c ff <48> 8b 85 60 01 00 00 48 8d 95 60 01 00 00 48 89 ef 48 2d 60 01 00
[2022-02-13 00:22:49] [   32.795462] RSP: 0018:ffffc90000b3fd60 EFLAGS: 00010286
[2022-02-13 00:22:49] [   32.795470] RAX: 0000000000000000 RBX: ffff8880ed769000 RCX: 0000000000000000
[2022-02-13 00:22:49] [   32.795482] RDX: 0000000000000001 RSI: ffffc90000b3fc90 RDI: 00000000ffffffff
[2022-02-13 00:22:49] [   32.795493] RBP: fffffffffffffea0 R08: 0000000000000001 R09: 0000000000000000
[2022-02-13 00:22:49] [   32.795504] R10: 0000000000000000 R11: 0000000000000003 R12: ffff8880ed769050
[2022-02-13 00:22:49] [   32.795515] R13: ffff888006c75f88 R14: ffff888003fc2b80 R15: ffff88800855a880
[2022-02-13 00:22:49] [   32.795526] FS:  0000000000000000(0000) GS:ffff8880f5d80000(0000) knlGS:0000000000000000
[2022-02-13 00:22:49] [   32.795537] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2022-02-13 00:22:49] [   32.795547] CR2: 0000000000000000 CR3: 00000000a9d98004 CR4: 00000000003706e0
[2022-02-13 00:22:49] [   32.795558] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2022-02-13 00:22:49] [   32.795569] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[2022-02-13 00:22:49] [   32.795580] Kernel panic - not syncing: Fatal exception
[2022-02-13 00:22:49] [   32.795621] Kernel Offset: disabled

@brendanhoar
Author

Ah, a kernel panic; that makes more sense. I suspect that if I go back and retest, I'll find the mirage VMs end up being, ahem, firewalls in the shutdown chain. If so, I'll update the OP.

Brendan

@andrewdavidwong andrewdavidwong added C: kernel needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Feb 13, 2022
@andrewdavidwong andrewdavidwong added this to the Release 4.0 updates milestone Feb 13, 2022
@unman
Member

unman commented Feb 13, 2022 via email

@brendanhoar
Author

brendanhoar commented Feb 13, 2022

Thinkpad W520:
Confirmed as repeatable under R4.0 with VM kernel 5.16.5-1.fc25.
Confirmed as non-repeatable under R4.0 with VM kernel 5.15.14-1.fc25.

GPD Pocket 3:
From memory, I saw the issue several times on this R4.1 system. Its VMs were also running 5.16.5-1.

Agreed that this is likely a duplicate of: #7257

B
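For anyone hitting this in the meantime, a possible interim workaround sketch based on the confirmation above (assumptions: run in dom0, the earlier VM kernel package is still installed, and the exact kernel name must match a directory under /var/lib/qubes/vm-kernels/; the version string below is illustrative):

```shell
#!/bin/sh
# Workaround sketch; guarded so it exits cleanly outside a Qubes dom0.
if ! command -v qvm-prefs >/dev/null 2>&1; then
    echo 'not dom0; skipping'
    exit 0
fi
ls /var/lib/qubes/vm-kernels/              # list the installed VM kernels
qvm-prefs personal kernel                  # show the qube's current VM kernel
qvm-prefs personal kernel 5.15.14-1.fc32   # pin a known-good kernel (name illustrative)
```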

@andrewdavidwong
Member

This appears to be a duplicate of an existing issue. If so, please comment on the appropriate existing issue instead. If anyone believes this is not really a duplicate, please leave a comment briefly explaining why. We'll be happy to take another look and, if appropriate, reopen this issue. Thank you.

@andrewdavidwong andrewdavidwong added R: duplicate Resolution: Another issue exists that is very similar to or subsumes this one. and removed needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Feb 13, 2022