Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

instance specs could be improved #735

Closed
gjcolombo opened this issue Aug 1, 2024 · 2 comments · Fixed by #767
Closed

instance specs could be improved #735

gjcolombo opened this issue Aug 1, 2024 · 2 comments · Fixed by #767
Assignees
Labels
api Related to the API. server Related specifically to the Propolis server API and its VM management functions.

Comments

@gjcolombo
Copy link
Contributor

The problem(s)

The instance spec types in the propolis_api_types crate are an important conceptual part of the live migration protocol, but there are several things about the way I implemented them that leave things to be desired:

  • Well-formed specs are not necessarily valid

It's easy to construct an instance spec that type-checks but can't be used to start an instance. For example, it's trivial to have a spec with a storage device with no valid backend.

  • Specs have a complex hierarchy of component types

Instance specs are composed of device specs and backend specs, each of which are further divided into their own fields and sub-collections. This makes adding new components while respecting the spec versioning rules something of a hassle (and the extra structure doesn't even guarantee the specs are valid!).

  • Not all components have names

Live migration assumes that all (migratory) VM components have names. This allows Propolis to send a source component's saved state to the correct counterpart on the target. While many spec components are stored in a HashMap<String, Component>, some are not, so Propolis has to generate names for them, which then have to be kept consistent between Propolis versions.

  • There are three different spec builders

One lives in propolis_api_types. propolis-server then wraps that in its own builder, which can also convert config TOML entries and InstanceEnsureRequest components to spec components. Yet a third builder lives in propolis-client; it's basically the same as the builder in propolis_api_types, except that it operates on Progenitor-generated types instead of the raw types.

  • propolis_api_types contains a lot more than just types

For a crate that's ostensibly just a bag of types, this crate has a lot of code, most of which is only ever used by Propolis servers. Other Propolis components that want Just The Types, Please shouldn't need to pull in all the server-only methods.

  • Some component types are conditionally compiled

The SoftNPU component definitions in propolis_api_types are behind Falcon feature guards. This forces us to maintain two separate OpenAPI documents, one with Falcon enabled and one without, which can lead to annoying rake-steps in CI if one forgets to check the build for one of the two variants. (In fact, as I write this, PHD builds are broken when Falcon is enabled because of a missing patch directive in the Falcon client generator...)

Proposed solutions

I propose the following changes to fix these problems:

  • Create an internal Spec type for propolis-server

This type is the VM component manifest type that Propolis server uses internally (in particular during VM setup). This spec isn't Serialize, isn't part of the API, and isn't used in the live migration protocol. Because this spec can be structured however we like without concern for versioning it, it can (and will) be structured so that illegal states are unrepresentable.

  • Flatten propolis_api_types::v0::InstanceSpecV0

Instead of a complex hierarchy of component types, the "on the wire" spec used in the API and LM protocol is just a mainboard and a HashMap<String, Component>. Adding a new component type is very easy: just define a struct for it and add it to enum Component. Because all components are now in a map, they all have names.

  • Define conversions between the two spec types in propolis-server

propolis-server implements From<InternalSpec> for VersionedInstanceSpec> and TryFrom<VersionedInstanceSpec> for InternalSpec and converts to/from its internal type at the API and LM protocol boundaries. The TryFrom impl gives us a "parse, don't validate" discipline for incoming API specs.

  • Just one builder in propolis-server

propolis-server understands how to convert config TOML elements and InstanceEnsureRequest fields into spec elements and feeds them into a single InternalSpec builder. The TryFrom<VersionedInstanceSpec> impl also operates in terms of this builder. The propolis_api_types builder and propolis_client builders are removed. The only user of the latter (PHD) builds API specs simply by adding components to a new V0 spec's map. (This plan assumes there won't eventually be other Propolis clients who want to initialize VMs using raw specs. I had originally planned for the control plane to do this, but have changed my mind and now think it should continue using the existing instance-ensure API. See the headnote to RFD 344 for more details.)

  • Move migration compat checks into propolis-server

The migration compatibility traits that live in propolis_api_types are only useful in the LM protocol. They should live in propolis-server instead (which can then implement them in terms of its internal spec representation, if it's convenient to do so).

  • Remove #[cfg(feature)] guards from API types

Instead of compiling components out and having different clients built with different feature sets, there's just one client that defines all possible component types. If a server doesn't support a component a client has asked for, it rejects the spec (once again made easier by having an internal spec type to parse into--if a feature is compiled out of the server, then the internal type won't have fields for its components).


I have drafted a set of changes that achieves these goals and am pretty happy with how it looks; indeed, I've mostly filed this issue to have a wall of text to link the PRs to. I'll start putting those up for review presently.

Doing this now lays the groundwork for adding CPUID information to instance specs and initializing migration targets from sources' specifications. (It's also much easier to correct foundational mistakes like these while the API and migration protocol can still change without strictly requiring new versions of everything.)

@gjcolombo gjcolombo added api Related to the API. server Related specifically to the Propolis server API and its VM management functions. labels Aug 1, 2024
@gjcolombo gjcolombo self-assigned this Aug 1, 2024
@hawkw
Copy link
Member

hawkw commented Aug 1, 2024

All of the recommendations described here seem good to me!

@gjcolombo
Copy link
Contributor Author

Another improvement for the pile: propolis_api_types contains some non-default Debug impls for some of the component types, such as the blob storage backend1 and Crucible backend. Over in propolis-client, however, we're using the Progenitor patch directive to add Debug impls to several top-level instance spec types, which (I believe) causes Progenitor to derive Debug for all of the possible component types. The upshot of this is that we can't manually derive Debug for the generated types in the client, which can cause their output to be... verbose2. It would be nice to fix this; the simplest thing is likely to be to switch to a Progenitor replace directive once the final component type structure is in place.

Footnotes

  1. base64: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"

  2. AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Related to the API. server Related specifically to the Propolis server API and its VM management functions.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants