"Testing sagas" is not easy in Omicron, but it should be #1799

smklein · 2022-10-09T23:39:24Z

Sagas are tricky to write in Nexus right now - writing them not only involves "doing all the operations you need to do", but also:

Ensuring that each saga action can be undone by a corresponding saga undo action
Ensuring that all the saga actions are idempotent
Ensuring that all the saga undo actions are idempotent

So, how do we test sagas today? Mostly through our integration tests, which poke at both the external Nexus API, and internally at the database. This is a pretty coarse-grained mechanism for testing -- it would be much nicer if we could (for example) inject errors directly into individually constructed sagas.

I tried pulling out some saga-constructing goop in the "disk create" saga. My goal was to create a test exercising sagas, and node failure, without requiring most of the rest of the control plane to be up-and-running. Basically, something more akin to a "unit test" than an full "nexus integration test".

The big problem? The SagaContext object we pass to all sagas in Nexus has a reference back to the Nexus struct itself. And initializing the Nexus structure requires bringing up a lot of other services, so we're in a big ball of coupled services.

It would be nice to create some better isolation between the SagaContext object and "all of Nexus" - looser coupling here would make it easier to create fine-grained tests to poke at more saga failure cases.

The text was updated successfully, but these errors were encountered:

smklein · 2022-10-10T01:47:48Z

My first pass at doing this:

Making sagas dependent on a well-scoped subset of Nexus actually wasn't particularly difficult. Mostly just a matter of sanitizing what we're doing and where. For example, Nexus shouldn't expose "raw connections to other services" (e.g. sled agent), it should expose more general operations that it can perform (e.g. "set the instance state on a sled"). This makes checking for the right behavior comparatively easy in tests, rather than needing to stand-up mock services.
However... The dependency on the "full Nexus" object runs deep, and is really challenging to detangle. For example:
- SagaContext references the Nexus object directly. If we try to make SagaContext generic over N, which implements the limited interface, then...
- That bubbles up to impl steno::SagaType for NexusSagaType, so NexusSagaType needs to also be generic over N...
- This bubbles up to the types ActionRegistry, NexusAction, and NexusActionContext, which also now need to be generic over N...
- This bubbles up to the actual saga node definitions, which are defined behind lazy_static. These can't be generic, because they're static objects.

smklein · 2022-10-10T16:33:28Z

02c50b5 has an implementation using a dyn Interface to accomplish this decoupling of sagas and Nexus itself.

Now that I've done that, I'm admittedly not sure if it's the right approach. My real end-goal is to make "the guts of sagas exposed to tests, so I can perform fine-grained operations on them".

Minimizing the interface exposed to Nexus certainly makes this possible, but it doesn't seem like the only approach. It also seems possible to simply add functions to Nexus for tests, exposing more of the internals of saga execution, such that tests can manipulate them.

davepacheco · 2022-10-10T18:18:57Z

FWIW, SagaContext is supposed to be the narrower set of interfaces from Nexus that sagas are allowed to use. We could (probably should) remove

omicron/nexus/src/saga_interface.rs

Lines 48 to 50 in fbe8dc5

    
           pub fn nexus(&self) -> &Arc<Nexus> { 
        
               &self.nexus 
        
           }

and replace it with whatever more specific interfaces are needed so that you don't need a whole Nexus around to run the saga.

In the limit, saga actions are going to wind up talking to Sled Agent, boundary services, the database -- basically everything. So it's hard to unit test them without mocking those things. And if we do mock those things (as I think we should for most of those), then it's also easy to stand up Nexus, I think.

See also oxidecomputer/steno#31.

smklein · 2022-10-10T18:24:53Z

FWIW, SagaContext is supposed to be the narrower set of interfaces from Nexus that sagas are allowed to use. We could (probably should) remove

omicron/nexus/src/saga_interface.rs

Lines 48 to 50 in fbe8dc5

pub fn nexus(&self) -> &Arc<Nexus> {

&self.nexus

}

and replace it with whatever more specific interfaces are needed so that you don't need a whole Nexus around to run the saga.

Yeah, I basically did this in my PR with the dyn NexusForSaga interface, but if that's not the testing boundary, it doesn't actually need to be dynamicallly dispatched.

In the limit, saga actions are going to wind up talking to Sled Agent, boundary services, the database -- basically everything. So it's hard to unit test them without mocking those things. And if we do mock those things (as I think we should for most of those), then it's also easy to stand up Nexus, I think.

This was definitely something I felt ; the distinction of "Nexus' interface for sagas" vs "Nexus' interface for HTTP endpoints" seems very unclear - both have access to:

The database
External services
Starting new sagas

It could be argued that the tight coupling between "Nexus' interface" and "what sagas want to do" means that there isn't much point in trying to isolate them. I don't love that, but I'm not sure the alternative is better.

See also oxidecomputer/steno#31.

Definitely interested here. Even the existing saga_inject_error functionality seems really nice to have, but there is some refactoring in Nexus necessary to expose this to tests.

smklein · 2022-12-13T18:31:44Z

#1835 partially worked on this, by testing unwind safety

gjcolombo · 2023-08-18T03:19:43Z

@smklein Is there any additional saga testing work you were looking to track with this issue? I think the error injection and node replay facilities are pretty robust at this point; it seems like the basic primitives are all there, and we just need to arrange them conveniently (e.g. what I described in #3896) and then make sure the appropriate tests exist for all relevant sagas (like the ones listed in #2052 and #2094).

smklein · 2023-08-22T22:11:03Z

I think this issue has been appropriately replaced by follow-ups -- I really wanted the idempotency + unwind tests, and now we have both. I'll mark this issue as closed in favor of more recent+specific issues.

smklein changed the title ~~Testing sagas is not easy in Omicron, but it should be~~ "Testing sagas" is not easy in Omicron, but it should be Oct 9, 2022

smklein added Testing & Analysis Tests & Analyzers nexus Related to nexus labels Nov 16, 2022

smklein mentioned this issue Dec 13, 2022

Tracking issue for saga unwind safety #2052

Open

13 tasks

smklein mentioned this issue Dec 26, 2022

Tracking issue for saga idempotency safety #2094

Open

13 tasks

jordanhendricks mentioned this issue Aug 17, 2023

want generic framework for common saga tests #3896

Closed

smklein closed this as completed Aug 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Testing sagas" is not easy in Omicron, but it should be #1799

"Testing sagas" is not easy in Omicron, but it should be #1799

smklein commented Oct 9, 2022 •

edited

Loading

smklein commented Oct 10, 2022

smklein commented Oct 10, 2022

davepacheco commented Oct 10, 2022

smklein commented Oct 10, 2022

smklein commented Dec 13, 2022

gjcolombo commented Aug 18, 2023

smklein commented Aug 22, 2023

"Testing sagas" is not easy in Omicron, but it should be #1799

"Testing sagas" is not easy in Omicron, but it should be #1799

Comments

smklein commented Oct 9, 2022 • edited Loading

smklein commented Oct 10, 2022

smklein commented Oct 10, 2022

davepacheco commented Oct 10, 2022

smklein commented Oct 10, 2022

smklein commented Dec 13, 2022

gjcolombo commented Aug 18, 2023

smklein commented Aug 22, 2023

smklein commented Oct 9, 2022 •

edited

Loading