Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sled-agent] Propolis server SMF service listen address is not always set correctly #1115

Closed
bnaecker opened this issue May 25, 2022 · 0 comments · Fixed by #1124
Closed
Labels
Sled Agent Related to the Per-Sled Configuration and Management

Comments

@bnaecker
Copy link
Collaborator

bnaecker commented May 25, 2022

I've noticed that occasionally a call to create/start a VM via the Oxide CLI seems to hang and then fail after about a minute. Inspecting the Nexus log, I see that the request timed out:

{"msg":"request completed","v":0,"name":"nexus","level":30,"time":"2022-05-25T05:32:37.432223873Z","hostname":"oxz_nexus","pid":20389,"uri":"//organizations/o/projects/p/instances/vm1/start","method":"POST","req_id":"85612f14-783f-40ea-982f-cc888d2320f8","remote_addr":"[fd00:1122:3344:101::1]:34100","local_addr":"[fd00:1122:3344:101::3]:12220","component":"dropshot_external","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","error_message_external":"Internal Server Error","error_message_internal":"CommunicationError: error sending request for url (http://[fd00:1122:3344:101::1]:12345/instances/8724453d-7959-44ed-809f-1bc9d02ebe9a): operation timed out","response_code":"500"}

Looking at the sled agent, I can see it hang after hitting this line:

{"msg":"Started propolis in zone: oxz_propolis-server_5d23c6ef-b4c0-4f0a-90de-fb5001133a5c","v":0,"name":"SledAgent","level":30,"time":"2022-05-25T05:31:45.446013495Z","hostname":"feldspar","pid":14194,"instance_id":"8724453d-7959-44ed-809f-1bc9d02ebe9a","component":"InstanceManager"}
{"msg":"halt_and_remove_logged: Previous zone state: Running","v":0,"name":"SledAgent","level":30,"time":"2022-05-25T05:32:39.406791793Z","hostname":"feldspar","pid":14194,"zone":"oxz_propolis-server_5d23c6ef-b4c0-4f0a-90de-fb5001133a5c","instance_id":"8724453d-7959-44ed-809f-1bc9d02ebe9a","component":"InstanceManager"}
{"msg":"Stopped and uninstalled zone","v":0,"name":"SledAgent","level":30,"time":"2022-05-25T05:32:39.408091477Z","hostname":"feldspar","pid":14194,"zone":"oxz_propolis-server_5d23c6ef-b4c0-4f0a-90de-fb5001133a5c","instance_id":"8724453d-7959-44ed-809f-1bc9d02ebe9a","component":"InstanceManager"}

It just sits there for about a minute, then removes the zone and cleans up. I was able to snag the Propolis logs during this hang, and we see this:

bnaecker@feldspar : ~/omicron $ pfexec zlogin $(zoneadm list | tail -1) 'tail +0 -f $(svcs -L propolis-server)'
[ May 24 22:31:45 Enabled. ]
[ May 24 22:31:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/propolis-server/bin/propolis-server run /var/svc/manifest/site/propolis-server/config.toml unknown &"). ]
[ May 24 22:31:45 Method "start" exited with status 0. ]
error: Invalid value for '<PROPOLIS_IP:PORT>': invalid IP address syntax
[ May 24 22:31:45 Stopping because all processes in service exited. ]
[ May 24 22:31:45 Executing stop method (:kill). ]
[ May 24 22:31:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/propolis-server/bin/propolis-server run /var/svc/manifest/site/propolis-server/config.toml unknown &"). ]
[ May 24 22:31:45 Method "start" exited with status 0. ]
error: Invalid value for '<PROPOLIS_IP:PORT>': invalid IP address syntax
[ May 24 22:31:45 Stopping because all processes in service exited. ]
[ May 24 22:31:45 Executing stop method (:kill). ]
[ May 24 22:31:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/propolis-server/bin/propolis-server run /var/svc/manifest/site/propolis-server/config.toml unknown &"). ]
[ May 24 22:31:45 Method "start" exited with status 0. ]
error: Invalid value for '<PROPOLIS_IP:PORT>': invalid IP address syntax
[ May 24 22:31:45 Stopping because all processes in service exited. ]
[ May 24 22:31:45 Executing stop method (:kill). ]
[ May 24 22:31:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/propolis-server/bin/propolis-server run /var/svc/manifest/site/propolis-server/config.toml unknown &"). ]
[ May 24 22:31:45 Method "start" exited with status 0. ]
error: Invalid value for '<PROPOLIS_IP:PORT>': invalid IP address syntax
[ May 24 22:31:45 Stopping because all processes in service exited. ]
[ May 24 22:31:45 Executing stop method (:kill). ]
[ May 24 22:31:45 Restarting too quickly, changing state to maintenance. ]

That shows that the propolis-server binary was started with a listening address of "unknown", which is obviously wrong. That comes from the default value of the config/server_addr SMF property. The sled agent is supposed to be setting that property here to the actual IP address that Nexus provided for that Propolis server. That doesn't seem to always happen.

I modified the sled agent to print the values of that property for both the generic propolis-server SMF service, and for the specific instance of that service that it adds. Those seem to always show:

{"msg":"Generic Propolis server SMF properties: config/server_addr  astring  [fd00:1122:3344:101::17]:12400\n","v":0,"name":"SledAgent","level":30,"time":"2022-05-25T05:31:45.331659446Z","hostname":"feldspar","pid":14194,"instance_id":"8724453d-7959-44ed-809f-1bc9d02ebe9a","component":"InstanceManager"}
{"msg":"Propolis server SMF instance properties: ","v":0,"name":"SledAgent","level":30,"time":"2022-05-25T05:31:45.408559664Z","hostname":"feldspar","pid":14194,"instance_id":"8724453d-7959-44ed-809f-1bc9d02ebe9a","component":"InstanceManager"}

That is, the "generic" SMF service svc:/system/illumos/propolis-server seems to get the provided value. The specific instance of the service, in this case svc:/system/illumos/propolis-server:vm-5d23c6ef-b4c0-4f0a-90de-fb5001133a5c has an empty string as its property value. Presumably this is because the instance is supposed to inherit properties from its parent configuration, unless they've been overridden. Unfortunately, those are always the values, when the instance provision succeeds or hangs.

However, when it hangs, I've also seen that inspecting the property value with svccfg directly from the global zone shows that it remains unmodified:

bnaecker@feldspar : ~/omicron $ pfexec zlogin $(zoneadm list | tail -1) 'svccfg -v -s propolis-server listprop config/server_addr'
config/server_addr  astring  unknown

I really don't have much else yet, but it's getting late and I wanted to record my debugging state. I've experimented a bit with setting the property directly on the instance, but that fails with the error: svccfg: No such property group "config". I've also tried calling svcprop refresh, but that seems to have no effect.

Overall, it appears there's a race or some other inconsistency in how the actual SMF service instance we're launching gets the value of its property config/server_addr. My current steps to repro this are, unfortunately, just starting/stopping an instance a ton of times. I can usually hit this after a median of about 5 retries, but it's extremely variable.

@smklein smklein added the Sled Agent Related to the Per-Sled Configuration and Management label May 26, 2022
leftwo pushed a commit that referenced this issue Jan 28, 2024
Crucible changes
Remove a superfluous copy during write serialization (#1087)
Update to progenitor v0.5.0, pull in required Omicron updates (#1115)
Update usdt to v0.5.0 (#1116)
Do not panic on reinitialize of a downstairs client. (#1114)
Bump (tracing-)opentelemetry(-jaeger) (#1113)
Make the Guest -> Upstairs queue fully async (#1086)
Switch to per-block ownership (#1107)
Handle timeout in the client IO task (#1109)
Enforce buffer alignment (#1106)
Block size buffers (#1105)
New dtrace probes and a counter struct in the Upstairs. (#1104)
Implement read decryption offloading (#1089)
Remove Arc + Mutex from Buffer (#1094)
Comment cleanup and rename of DsState::Repair -> Reconcile (#1102)
do not panic the dynamometer for OOB writes (#1101)
Allow dsc to start the downstairs in read-only mode. (#1098)
Use the omicron-zone-package methods for topo sorting (#1099)
Package with topological sorting (#1097)
Fix clippy lints in dsc (#1095)

Propolis changes:
PHD: demote artifact store logs to DEBUG, enable DEBUG on CI (#626)
PHD: fix missing newlines in serial.log (#622)
PHD: fix run_shell_command with multiline commands (#621)
PHD: fix `--artifact-directory` not doing anything (#618)
Update h2 dependency
Update Crucible (and Omicron) dependencies
PHD: refactor guest serial console handling (#615)
phd: add basic "migration-from-base" tests + machinery (#609)
phd: Ensure min disk size fits read-only parents (#611)
phd: automatically fetch `crucible-downstairs` from Buildomat (#604)
Mitigate behavior from illumos#16183
PHD: add guest adapter for WS2022 (#607)
phd: include error cause chain in failure output (#606)
add QEMU pvpanic ISA device (#596)
Add crucible-mem backend
Make crucible opt parsing more terse in standalone
leftwo added a commit that referenced this issue Jan 29, 2024
Crucible changes
Remove a superfluous copy during write serialization (#1087) Update to
progenitor v0.5.0, pull in required Omicron updates (#1115) Update usdt
to v0.5.0 (#1116)
Do not panic on reinitialize of a downstairs client. (#1114) Bump
(tracing-)opentelemetry(-jaeger) (#1113)
Make the Guest -> Upstairs queue fully async (#1086) Switch to per-block
ownership (#1107)
Handle timeout in the client IO task (#1109)
Enforce buffer alignment (#1106)
Block size buffers (#1105)
New dtrace probes and a counter struct in the Upstairs. (#1104)
Implement read decryption offloading (#1089)
Remove Arc + Mutex from Buffer (#1094)
Comment cleanup and rename of DsState::Repair -> Reconcile (#1102) do
not panic the dynamometer for OOB writes (#1101) Allow dsc to start the
downstairs in read-only mode. (#1098) Use the omicron-zone-package
methods for topo sorting (#1099) Package with topological sorting
(#1097)
Fix clippy lints in dsc (#1095)

Propolis changes:
PHD: demote artifact store logs to DEBUG, enable DEBUG on CI (#626) 
PHD: fix missing newlines in serial.log (#622)
PHD: fix run_shell_command with multiline commands (#621) 
PHD: fix `--artifact-directory` not doing anything (#618) Update h2
dependency
Update Crucible (and Omicron) dependencies
PHD: refactor guest serial console handling (#615) 
phd: add basic "migration-from-base" tests + machinery (#609) 
phd: Ensure min disk size fits read-only parents (#611) 
phd: automatically fetch `crucible-downstairs` from Buildomat (#604)
Mitigate behavior from illumos#16183
PHD: add guest adapter for WS2022 (#607)
phd: include error cause chain in failure output (#606) add QEMU pvpanic
ISA device (#596)
Add crucible-mem backend
Make crucible opt parsing more terse in standalone

Co-authored-by: Alan Hanson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Sled Agent Related to the Per-Sled Configuration and Management
Projects
None yet
2 participants