Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Testing sagas" is not easy in Omicron, but it should be #1799

Closed
smklein opened this issue Oct 9, 2022 · 7 comments
Closed

"Testing sagas" is not easy in Omicron, but it should be #1799

smklein opened this issue Oct 9, 2022 · 7 comments
Labels
nexus Related to nexus Testing & Analysis Tests & Analyzers

Comments

@smklein
Copy link
Collaborator

smklein commented Oct 9, 2022

Sagas are tricky to write in Nexus right now - writing them not only involves "doing all the operations you need to do", but also:

  • Ensuring that each saga action can be undone by a corresponding saga undo action
  • Ensuring that all the saga actions are idempotent
  • Ensuring that all the saga undo actions are idempotent

So, how do we test sagas today? Mostly through our integration tests, which poke at both the external Nexus API, and internally at the database. This is a pretty coarse-grained mechanism for testing -- it would be much nicer if we could (for example) inject errors directly into individually constructed sagas.

I tried pulling out some saga-constructing goop in the "disk create" saga. My goal was to create a test exercising sagas, and node failure, without requiring most of the rest of the control plane to be up-and-running. Basically, something more akin to a "unit test" than an full "nexus integration test".

The big problem? The SagaContext object we pass to all sagas in Nexus has a reference back to the Nexus struct itself. And initializing the Nexus structure requires bringing up a lot of other services, so we're in a big ball of coupled services.

It would be nice to create some better isolation between the SagaContext object and "all of Nexus" - looser coupling here would make it easier to create fine-grained tests to poke at more saga failure cases.

@smklein smklein changed the title Testing sagas is not easy in Omicron, but it should be "Testing sagas" is not easy in Omicron, but it should be Oct 9, 2022
@smklein
Copy link
Collaborator Author

smklein commented Oct 10, 2022

My first pass at doing this:

  • Making sagas dependent on a well-scoped subset of Nexus actually wasn't particularly difficult. Mostly just a matter of sanitizing what we're doing and where. For example, Nexus shouldn't expose "raw connections to other services" (e.g. sled agent), it should expose more general operations that it can perform (e.g. "set the instance state on a sled"). This makes checking for the right behavior comparatively easy in tests, rather than needing to stand-up mock services.
  • However... The dependency on the "full Nexus" object runs deep, and is really challenging to detangle. For example:
    • SagaContext references the Nexus object directly. If we try to make SagaContext generic over N, which implements the limited interface, then...
    • That bubbles up to impl steno::SagaType for NexusSagaType, so NexusSagaType needs to also be generic over N...
    • This bubbles up to the types ActionRegistry, NexusAction, and NexusActionContext, which also now need to be generic over N...
    • This bubbles up to the actual saga node definitions, which are defined behind lazy_static. These can't be generic, because they're static objects.

@smklein
Copy link
Collaborator Author

smklein commented Oct 10, 2022

02c50b5 has an implementation using a dyn Interface to accomplish this decoupling of sagas and Nexus itself.

Now that I've done that, I'm admittedly not sure if it's the right approach. My real end-goal is to make "the guts of sagas exposed to tests, so I can perform fine-grained operations on them".

Minimizing the interface exposed to Nexus certainly makes this possible, but it doesn't seem like the only approach. It also seems possible to simply add functions to Nexus for tests, exposing more of the internals of saga execution, such that tests can manipulate them.

@davepacheco
Copy link
Collaborator

FWIW, SagaContext is supposed to be the narrower set of interfaces from Nexus that sagas are allowed to use. We could (probably should) remove

pub fn nexus(&self) -> &Arc<Nexus> {
&self.nexus
}
and replace it with whatever more specific interfaces are needed so that you don't need a whole Nexus around to run the saga.

In the limit, saga actions are going to wind up talking to Sled Agent, boundary services, the database -- basically everything. So it's hard to unit test them without mocking those things. And if we do mock those things (as I think we should for most of those), then it's also easy to stand up Nexus, I think.

See also oxidecomputer/steno#31.

@smklein
Copy link
Collaborator Author

smklein commented Oct 10, 2022

FWIW, SagaContext is supposed to be the narrower set of interfaces from Nexus that sagas are allowed to use. We could (probably should) remove

pub fn nexus(&self) -> &Arc<Nexus> {
&self.nexus
}

and replace it with whatever more specific interfaces are needed so that you don't need a whole Nexus around to run the saga.

Yeah, I basically did this in my PR with the dyn NexusForSaga interface, but if that's not the testing boundary, it doesn't actually need to be dynamicallly dispatched.

In the limit, saga actions are going to wind up talking to Sled Agent, boundary services, the database -- basically everything. So it's hard to unit test them without mocking those things. And if we do mock those things (as I think we should for most of those), then it's also easy to stand up Nexus, I think.

This was definitely something I felt ; the distinction of "Nexus' interface for sagas" vs "Nexus' interface for HTTP endpoints" seems very unclear - both have access to:

  • The database
  • External services
  • Starting new sagas

It could be argued that the tight coupling between "Nexus' interface" and "what sagas want to do" means that there isn't much point in trying to isolate them. I don't love that, but I'm not sure the alternative is better.

See also oxidecomputer/steno#31.

Definitely interested here. Even the existing saga_inject_error functionality seems really nice to have, but there is some refactoring in Nexus necessary to expose this to tests.

@smklein smklein added Testing & Analysis Tests & Analyzers nexus Related to nexus labels Nov 16, 2022
@smklein
Copy link
Collaborator Author

smklein commented Dec 13, 2022

#1835 partially worked on this, by testing unwind safety

@gjcolombo
Copy link
Contributor

@smklein Is there any additional saga testing work you were looking to track with this issue? I think the error injection and node replay facilities are pretty robust at this point; it seems like the basic primitives are all there, and we just need to arrange them conveniently (e.g. what I described in #3896) and then make sure the appropriate tests exist for all relevant sagas (like the ones listed in #2052 and #2094).

@smklein
Copy link
Collaborator Author

smklein commented Aug 22, 2023

I think this issue has been appropriately replaced by follow-ups -- I really wanted the idempotency + unwind tests, and now we have both. I'll mark this issue as closed in favor of more recent+specific issues.

@smklein smklein closed this as completed Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nexus Related to nexus Testing & Analysis Tests & Analyzers
Projects
None yet
Development

No branches or pull requests

3 participants