[sled-agent] Propolis server SMF service listen address is not always set correctly #1115
smklein added the Sled Agent label (Related to the Per-Sled Configuration and Management) on May 26, 2022
leftwo pushed a commit that referenced this issue on Jan 28, 2024:
Crucible changes Remove a superfluous copy during write serialization (#1087) Update to progenitor v0.5.0, pull in required Omicron updates (#1115) Update usdt to v0.5.0 (#1116) Do not panic on reinitialize of a downstairs client. (#1114) Bump (tracing-)opentelemetry(-jaeger) (#1113) Make the Guest -> Upstairs queue fully async (#1086) Switch to per-block ownership (#1107) Handle timeout in the client IO task (#1109) Enforce buffer alignment (#1106) Block size buffers (#1105) New dtrace probes and a counter struct in the Upstairs. (#1104) Implement read decryption offloading (#1089) Remove Arc + Mutex from Buffer (#1094) Comment cleanup and rename of DsState::Repair -> Reconcile (#1102) do not panic the dynamometer for OOB writes (#1101) Allow dsc to start the downstairs in read-only mode. (#1098) Use the omicron-zone-package methods for topo sorting (#1099) Package with topological sorting (#1097) Fix clippy lints in dsc (#1095) Propolis changes: PHD: demote artifact store logs to DEBUG, enable DEBUG on CI (#626) PHD: fix missing newlines in serial.log (#622) PHD: fix run_shell_command with multiline commands (#621) PHD: fix `--artifact-directory` not doing anything (#618) Update h2 dependency Update Crucible (and Omicron) dependencies PHD: refactor guest serial console handling (#615) phd: add basic "migration-from-base" tests + machinery (#609) phd: Ensure min disk size fits read-only parents (#611) phd: automatically fetch `crucible-downstairs` from Buildomat (#604) Mitigate behavior from illumos#16183 PHD: add guest adapter for WS2022 (#607) phd: include error cause chain in failure output (#606) add QEMU pvpanic ISA device (#596) Add crucible-mem backend Make crucible opt parsing more terse in standalone
leftwo added a commit that referenced this issue on Jan 29, 2024, with the same commit message as the Jan 28, 2024 commit (Co-authored-by: Alan Hanson).
I've noticed that occasionally a call to create/start a VM via the Oxide CLI seems to hang and then fail after about a minute. Inspecting the Nexus log, I see that the request timed out:
Looking at the sled agent, I can see it hang after hitting this line:
It just sits there for about a minute, then removes the zone and cleans up. I was able to snag the Propolis logs during this hang, and we see this:
That shows that the `propolis-server` binary was started with a listening address of `"unknown"`, which is obviously wrong. That comes from the default value of the `config/server_addr` SMF property. The sled agent is supposed to be setting that property here to the actual IP address that Nexus provided for that Propolis server. That doesn't seem to always happen.

I modified the sled agent to print the values of that property for both the generic `propolis-server` SMF service, and for the specific instance of that service that it adds. Those seem to always show:

That is, the "generic" SMF service `svc:/system/illumos/propolis-server` seems to get the provided value. The specific instance of the service, in this case `svc:/system/illumos/propolis-server:vm-5d23c6ef-b4c0-4f0a-90de-fb5001133a5c`, has an empty string as its property value. Presumably this is because the instance is supposed to inherit properties from its parent configuration, unless they've been overridden. Unfortunately, those are always the values, whether the instance provision succeeds or hangs.

However, when it hangs, I've also seen that inspecting the property value with `svccfg` directly from the global zone shows that it remains unmodified:

I really don't have much else yet, but it's getting late and I wanted to record my debugging state. I've experimented a bit with setting the property directly on the instance, but that fails with the error: `svccfg: No such property group "config"`. I've also tried calling `svcprop refresh`, but that seems to have no effect.

Overall, it appears there's a race or some other inconsistency in how the actual SMF service instance we're launching gets the value of its property `config/server_addr`. My current steps to repro this are, unfortunately, just starting/stopping an instance a ton of times. I can usually hit this after a median of about 5 retries, but it's extremely variable.
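
To make the failure mode described above easier to catch, here is a minimal Python sketch. It is not code from the sled agent, which is written in Rust. It shows two things: a check that rejects placeholder values such as `"unknown"` before the service is started, and the `svccfg`/`svcadm` invocations that might set the property on the instance itself. It rests on several assumptions: that `config/server_addr` holds an `IP:port` or `[IPv6]:port` string, that the `No such property group "config"` error appears because the `config` group exists only at the service level and has to be created on the instance with `addpg` first, and that the value needs embedded double quotes to survive `svccfg`'s command parsing. The function names and the example address are hypothetical.

```python
import ipaddress

# FMRI of the generic service, as quoted in the report above.
GENERIC_FMRI = "svc:/system/illumos/propolis-server"


def instance_fmri(instance_name):
    """Build the FMRI of one specific instance of the generic service."""
    return f"{GENERIC_FMRI}:{instance_name}"


def is_usable_listen_address(value):
    """Return True only for 'IPv4:port' or '[IPv6]:port' strings.

    Assumption: this is the format the property is expected to hold.
    Placeholders such as "unknown" or an empty string are rejected.
    """
    host, sep, port = value.rpartition(":")
    if not sep or not port.isdecimal() or not 0 < int(port) < 65536:
        return False
    if host.startswith("[") and host.endswith("]"):
        host = host[1:-1]  # drop the brackets around an IPv6 literal
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def set_instance_property_commands(fmri, prop, value):
    """Return argv lists that set `prop` directly on the instance `fmri`.

    Assumption: the property group must exist on the instance before
    `setprop` can succeed, so it is created first with `addpg`.
    `addpg` is expected to fail if the group already exists, so a caller
    would need to tolerate or pre-check that case.
    """
    group = prop.partition("/")[0]
    return [
        ["svccfg", "-s", fmri, "addpg", group, "application"],
        # The embedded quotes are an assumption about svccfg's parsing.
        ["svccfg", "-s", fmri, "setprop", prop, "=", "astring:", f'"{value}"'],
        ["svcadm", "refresh", fmri],
    ]


if __name__ == "__main__":
    addr = "[fd00:1122:3344:101::5]:12400"  # hypothetical address
    fmri = instance_fmri("vm-5d23c6ef-b4c0-4f0a-90de-fb5001133a5c")
    if is_usable_listen_address(addr):
        for argv in set_instance_property_commands(fmri, "config/server_addr", addr):
            print(" ".join(argv))
```

On a sled, each argv list could be handed to `subprocess.run(argv, check=True)`. The sketch only prints the commands, so it can be read and tested without touching SMF. Running the same validity check on the value read back from the instance just before enabling it would turn the minute-long hang into an immediate, explicit error, though it would not explain the underlying race.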