server: move migration failure injection into instance specs #809

Open · wants to merge 2 commits into gjcolombo/target-spec-from-migration-source from gjcolombo/migration-failure-injection-spec

Conversation

gjcolombo (Contributor) commented:

Stacked on #807. Related to #804.

Make the server get migration failure injection device configuration from an instance spec component instead of the config TOML. This is a prelude to a larger change that will make the server stop depending on config TOMLs entirely (in favor of giving clients a library they can use to convert config TOMLs to instance spec components).
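For illustration, the kind of instance spec component this change implies could look roughly like the Rust sketch below. The type name and fields are assumptions for the sake of the example, not necessarily the real propolis-server definitions:

```rust
// Hypothetical sketch of a migration-failure-injection spec component.
// Field and type names are illustrative; the actual definitions in
// propolis-server may differ.
use serde::{Deserialize, Serialize};

/// Tells the server's migration code to fail a configurable number of
/// device-state exports (on the source) or imports (on the target).
#[derive(Clone, Debug, Default, Serialize, Deserialize)]
pub struct MigrationFailureInjector {
    /// Number of device-state exports to fail before succeeding.
    pub fail_exports: u32,
    /// Number of device-state imports to fail before succeeding.
    pub fail_imports: u32,
}
```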

PHD still needs to be able to set different import/export failure settings for migration sources and targets, so add the migration failure device to the set of components that can be replaced during preamble processing.
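As a sketch of what "replaceable during preamble processing" could mean in practice (reusing the hypothetical `MigrationFailureInjector` above; the component map type and key are also assumptions, not the actual spec representation):

```rust
use std::collections::BTreeMap;

// Illustrative only: a stand-in for the spec's component collection.
type ComponentMap = BTreeMap<String, MigrationFailureInjector>;

/// Replace (or insert) the migration failure device while deriving the
/// target's spec from the source's, so a test can configure the target to
/// fail imports even though the source was configured to fail exports.
fn replace_failure_injector(
    components: &mut ComponentMap,
    replacement: MigrationFailureInjector,
) {
    components.insert("migration-failure-injector".to_string(), replacement);
}
```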

Tests: cargo test, PHD.

gjcolombo requested a review from hawkw on November 13, 2024 20:54
gjcolombo (Contributor, Author) commented:

The PHD failure here is weird. The failure is in the running_process::import_failure test, which is what I'd expect this change to break if it broke anything. But the failure mode is odd: the first migration fails as expected, but then the source VM immediately gets a stop request, which it immediately handles, rendering it unable to serve as a migration source for the second migration:

{"msg":"migration out failed, resuming","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434800658Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","error":"RemoteError(Destination, \"failed to migrate device state: failed to apply deserialized device state: you have no chance to survive, make your time\")"}
{"msg":"publishing new instance state","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434818339Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","migration":"InstanceMigrateStatusResponse { migration_in: None, migration_out: Some(InstanceMigrationStatus { id: d5a22278-1907-4188-beb1-dbaab29b33a3, state: Error }) }","state":"Running","gen":12}
{"msg":"state driver handled event","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434839439Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","outcome":"Continue"}
{"msg":"request completed","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.459488929Z","hostname":"buskin","pid":1013,"uri":"/instance","method":"GET","req_id":"2b877526-cd31-4935-81e0-ab86c6740d17","remote_addr":"127.0.0.1:37727","local_addr":"127.0.0.1:9000","latency_us":173,"response_code":"200"}
{"msg":"requested state via API","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460691072Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","state":"Stop"}
{"msg":"Queuing external request","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460710913Z","hostname":"buskin","pid":1013,"component":"request_queue","component":"vm_state_driver","disposition":"Enqueue","request":"Stop"}
{"msg":"state driver handling event","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460729383Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","event":"ExternalRequest(Stop)"}
{"msg":"stopping instance","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460745983Z","hostname":"buskin","pid":1013,"component":"vm_state_driver"}
{"msg":"request completed","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460760814Z","hostname":"buskin","pid":1013,"uri":"/instance/state","method":"PUT","req_id":"a85e15e1-5993-40b9-8fed-5f7ac47337fd","remote_addr":"127.0.0.1:37727","local_addr":"127.0.0.1:9000","latency_us":142,"response_code":"204"}

We log state changes requested through the framework's VM state change functions, as well as attempts to stop a VM when it's being torn down, and I don't see anything especially suspicious preceding the VM stopping. My best guess is that there's a synchronization bug in phd_framework::Framework::wait_for_cleanup_tasks that's allowing a VM teardown task from a previous test to run and affect this one, but I haven't yet found any evidence of this.

I also can't reproduce this locally, so I've queued a rerun to see if it happens again. Could probably use some more instrumentation in the PHD VM cleanup logic in any case.
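To illustrate the kind of synchronization gap being guessed at here, the generic sketch below shows how a "wait for cleanup tasks" helper can silently leave a task running. This is not the actual phd_framework code, just a hypothetical shape of the bug:

```rust
// Purely illustrative: if cleanup tasks are pushed into a shared list and the
// wait helper only drains the tasks present when it is called, a teardown
// task registered just after the drain outlives the test that owned it and
// can stop a VM belonging to the next test.
use std::sync::{Arc, Mutex};
use tokio::task::JoinHandle;

#[derive(Default)]
struct CleanupTasks {
    tasks: Arc<Mutex<Vec<JoinHandle<()>>>>,
}

impl CleanupTasks {
    fn spawn_cleanup<F>(&self, fut: F)
    where
        F: std::future::Future<Output = ()> + Send + 'static,
    {
        self.tasks.lock().unwrap().push(tokio::spawn(fut));
    }

    /// Waits only for the cleanup tasks registered *so far*. Anything pushed
    /// after the drain below is left running -- the kind of gap that could
    /// let one test's VM teardown leak into the next test.
    async fn wait_for_cleanup_tasks(&self) {
        let drained: Vec<_> = self.tasks.lock().unwrap().drain(..).collect();
        for task in drained {
            let _ = task.await;
        }
    }
}
```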

hawkw (Member) left a comment:

looks good to me!

Resolved review threads: bin/propolis-server/src/lib/spec/api_spec_v0.rs (outdated), bin/propolis-server/src/lib/spec/builder.rs
gjcolombo force-pushed the gjcolombo/target-spec-from-migration-source branch from 7db1183 to 650fca9 on December 2, 2024 20:38
gjcolombo force-pushed the gjcolombo/migration-failure-injection-spec branch from 0ac3bf1 to 15ad57c on December 2, 2024 20:40