Instance stuck in Starting with a Propolis zone but no server process #3319

Open · gjcolombo opened this issue Jun 8, 2023 · 1 comment

Labels: Sled Agent (Related to the Per-Sled Configuration and Management), virtualization (Propolis Integration & VM Management)

@gjcolombo (Contributor) commented:

Seen in dogfood after creating an instance, stopping it, and trying to start it again. I don't have better repro steps right now.

The sled agent logs on this sled rotated around the time the relevant Propolis was created. The old log shows sled agent telling Nexus that the VMM is starting, but then it cuts off without a clear indication of why there's no Propolis process in the zone:

// gjc: this is the instance of interest

23:22:49.756Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
    state = InstanceRuntimeState { run_state: Starting, sled_id: 585d0736-3f48-45a2-a9c0-a43d20796a96, propolis_id: 091d25a0-35cc-4556-8b14-a6f6d82fe93d, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:105::26]:12400), migration_id: None, propolis_gen: Generation(2), ncpus: InstanceCpuCount(2), memory: ByteCount(8589934592), hostname: "bootme", gen: Generation(11), time_updated: 2023-06-07T23:22:49.756949218Z }
23:22:49.838Z INFO SledAgent (PortManager): Created OPTE port
    port = Port { inner: PortInner { name: "opte6", _ip: 172.30.0.5, mac: MacAddr6([168, 64, 37, 248, 158, 101]), slot: 0, vni: Vni { inner: 2783369 }, gateway: Gateway { mac: MacAddr6([168, 64, 37, 255, 119, 119]), ip: 172.30.0.1 }, vnic: "vopte6" } }
23:22:49.891Z INFO SledAgent (InstanceManager): Configuring new Omicron zone: oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d
    instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
23:22:49.923Z INFO SledAgent (InstanceManager): Installing Omicron zone: oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d
    instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
23:22:52.755Z INFO SledAgent (InstanceManager): Zone booting
    instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
    zone = oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d
WARNING: Failed to delete OPTE port overlay VNIC while dropping port. The VNIC will not be cleaned up properly, and the xde device itself will not be deleted. Both the VNIC and the xde device must be deleted out of band, and it will not be possible to recreate the xde device until then. Error: DeleteVnicError { name: "vopte6", err: CommandFailure(CommandFailureInfo { command: "/usr/sbin/dladm delete-vnic vopte6", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "dladm: vnic deletion failed: link busy\n" }) }

// gjc: old log cuts off here
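
One aside on that final WARNING: it means the vopte6 VNIC and its underlying xde device now have to be deleted out of band. Assuming the usual tooling, that cleanup would look roughly like the lines below once whatever is holding the link busy lets go; the opteadm subcommand in particular is a guess at the OPTE CLI, not something verified on this sled:

BRM42220014 # dladm delete-vnic vopte6
BRM42220014 # opteadm delete-xde opte6
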
// gjc: new log begins here

BRM42220014 # looker -f ./oxide-sled-agent\:default.log   
23:57:59.026Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    local_addr = [fd00:1122:3344:105::1]:12345
    remote_addr = [fd00:1122:3344:106::4]:44379
23:57:59.026Z INFO SledAgent (InstanceManager): Mapping virtual NIC to physical host
    mapping = SetVirtualNetworkInterfaceHost { virtual_ip: 172.30.0.8, virtual_mac: MacAddr(MacAddr6([168, 64, 37, 252, 135, 30])), physical_host_ip: fd00:1122:3344:108::1, vni: Vni(7796804) }
23:57:59.026Z INFO SledAgent (dropshot (SledAgent)): request completed
    local_addr = [fd00:1122:3344:105::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:106::4]:44379
    req_id = b1839181-d657-40ac-a432-f880158cc52b
    response_code = 204
    uri = /v2p/1c626997-d332-49a2-a81e-1c5cfcd3d600

The zone exists:

BRM42220014 # zoneadm list
global
oxz_switch
oxz_ntp
oxz_crucible_oxp_53c0c042-5225-42df-81cf-6dbecf8e2582
oxz_crucible_oxp_3d9dfd38-97b4-41de-b0d0-4d2034f463b7
oxz_crucible_oxp_3a555b82-8caa-4373-8ac7-6a7b1950ec0a
oxz_crucible_oxp_8a51df51-d69c-45e2-9b71-0aa7b4fd58f4
oxz_crucible_oxp_e4d786f7-5735-4426-8c82-8b4604ae5484
oxz_crucible_oxp_a58f6b4f-b1b4-4f09-a72f-7aa17e7bda2e
oxz_crucible_oxp_9ceafe70-961e-438a-91bb-eb6eec9d6318
oxz_crucible_oxp_c24e47d1-0aae-4018-a21e-b0a63787cf9e
oxz_crucible_oxp_074fb480-5745-4a42-a3cb-b1b465cbd57f
oxz_crucible_oxp_50fd51c4-12af-4ac2-bda7-1b245705245a
oxz_propolis-server_2f422cf8-393c-4401-9ce6-820cede5dec5
oxz_propolis-server_f60ad21f-bad6-4d1d-8876-1d0529e86c13
oxz_propolis-server_3e387595-e2f6-4f51-97c9-10f81e93c3c7
oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d

But it has no Propolis service:

root@oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d:~# svcs | grep propolis
root@oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d:~# 
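
For next time: svcs with no arguments only lists enabled instances, so the checks below would distinguish a service that was never imported from one that's present but disabled or in maintenance, and would turn up any log it left behind (grepping for "propolis" assumes the service name contains it):

root@oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d:~# svcs -a | grep propolis
root@oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d:~# ls /var/svc/log | grep propolis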

And there are no leftover logs from a prior invocation of the service. It's been several hours since this happened, but at least right now there's plenty of ramdisk space available on the sled:

Filesystem             Size   Used  Available Capacity  Mounted on
rpool/ROOT/ramdisk    3.81G  1.32G      2.49G    35%    /

Asking the instance to stop while it's in this state currently pushes it into a zombie Destroyed state; that's probably a result of #3260, the fix for which merged today and hasn't been picked up into dogfood yet. (The zone is successfully torn down in this case, so if the instance were instead "Stopped"--as it should be once the fix for #3260 is in place--it should be possible to start it again.)

So the things to follow up on here are:

  • Why didn't this Propolis start (or why did it fail)?
  • If we can't figure out why this Propolis didn't start, how do we make it possible to figure that out?
  • Why wasn't sled agent able to detect this condition and mark the instance as Failed?
  • What is the appropriate remediation for an instance that's in this state? Is it just to stop the instance?
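
On that last point: once the #3260 fix is deployed (so a stop lands in Stopped rather than the zombie Destroyed state), the remediation to validate would be stopping the instance through the external API and confirming the orphaned zone is actually gone before starting again, along these lines:

BRM42220014 # zoneadm list | grep 091d25a0    # zone present before the stop
(stop the instance via the API)
BRM42220014 # zoneadm list | grep 091d25a0    # should print nothing afterward
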
@gjcolombo added the Sled Agent and virtualization labels Jun 8, 2023
@gjcolombo added this to the FCS milestone Jun 8, 2023
@bnaecker (Collaborator) commented Jun 8, 2023:

We've previously seen similar situations in issues like #1115 and #1120: the Propolis zone started, but the service itself failed very quickly, for reasons that were sometimes hard to see. The core problem there seemed to be resolved by #1124. It's possible either that #1124 didn't fully fix the issue or that this one is unrelated.

@morlandi7 modified the milestones: FCS → 1.0.2 Aug 15, 2023
@askfongjojo modified the milestones: 1.0.2 → MVP Aug 21, 2023