[nexus] Make 'disk_create' saga unwind safe, add tests for it #1835

smklein · 2022-10-19T19:29:49Z

What am I trying to do in this PR

Test saga unwind. Sagas are tricky to get right - I want to test that the "undo" actions can actually unwind saga state, and avoid leaving detritus in the system.
Add tests near source code. Sagas are currently tested via integration tests - I want to add a test that can live "as close to the implementation as possible". I'd like to poke at saga-internal details, such as failures at very specific internal nodes, rather than simply relying on the HTTP interface.

What does this PR NOT do

Actually perform unit tests on Sagas. I originally tried to make this PR abstract the interface from "sagas" -> "nexus", so that the interfaces could be mocked out without actually spinning up a full integration test. This turned out to be really gnarly, since the coupling there is so tight. I think this would still be theoretically possible, but it would incur a large interface cost, since the separation between "parts of Nexus used by sagas" and "the rest of Nexus" is ambiguous. EDIT: I actually did make them unit tests; see: e44438d
Test Nexus in isolation. I originally tried to make this PR spin up a Nexus service without using the nexus-test-utils crate - however, this basically meant re-implementing a good chunk of test utilities, for very little benefit.

What does this PR do

The important bits

The "most interesting" parts of this PR exist in nexus/src/app/sagas/disk_create.rs. A test has been added to inject errors in all the saga actions, and observe that the saga can be cleanly unwound. Note - this required adding multiple "undo" actions to pass!

Okay, but other files got updated, why?

nexus/src/app/saga.rs: To be able to inject errors into sagas, the saga DAG creation + execution code needed to be refactored.
nexus/src/app/mod.rs: To expose saga DAGs to tests, allowing for errors to be injected, modules needed to be made public.
nexus/src/db/datastore/mod.rs: To allow saga tests to query the database directly (checking for detritus), access to the connection pool has been made pub.
nexus/test-utils/src/lib.rs: To allow the integration tests in disk_create.rs to safely refer to the same version of Nexus used by nexus-test-utils, the import of the omicron_nexus module is re-exported.

smklein · 2022-10-20T20:56:41Z

I have good news; I'm working on a similar test for the instance_create saga, and I believe I have a way to automate the "failure" injection for each "interesting" node in the saga DAG. I'll post that as a PR once it's ready, but it should be usable for testing all our sagas.

smklein · 2022-10-20T21:54:35Z

I have good news; I'm working on a similar test for the instance_create saga, and I believe I have a way to automate the "failure" injection for each "interesting" node in the saga DAG. I'll post that as a PR once it's ready, but it should be usable for testing all our sagas.

I did this in #1843

davepacheco

Nice! It's great that you're (I think?) fixing a bunch of bugs by adding these undo actions and then testing them.

I did not really dig into the actual undo actions or those tests (focused on the other bits here) -- let me know if you'd like me to do that.

davepacheco · 2022-10-20T22:57:47Z

nexus/src/db/datastore/mod.rs

@@ -122,7 +122,7 @@ impl DataStore {
        self.pool.pool()
    }

-    pub(super) async fn pool_authorized(
+    pub async fn pool_authorized(


I'm a little worried about this...the point of the DataStore is to encapsulate the database with relatively well-defined, safe operations. The visibility change here seems like it would make it very easy for someone to inadvertently grab a database connection and start making raw database queries from the app layer (thinking this is the right thing to do, when it's not).

Would it make sense to add interfaces to the DataStore that expose what you want to use from the tests?

I agree, it only looks like this is used for no_disk_records_exist.

Okay, so this sorta sent me down a rabbit hole.

Between this, and the comment about the weird "sort-of-cyclic" relationship between nexus and nexus-test-utils below, I sorta tried to do a clean-up of test-helper code.

Basically: I wanted to be able to write test code that can do arbitrary DB queries, and call functions that let us poke into parts of Nexus' durable storage trivially. You bring up a great point, this interface being available for non-test purposes would be a clear break of functional boundaries, but IMO, requiring arbitrary, one-off Diesel queries to add boilerplate code to this interface would be a bummer.

So: to reconcile both these things, I'm making this a function that exists only within #[cfg(tests)].

This should be doable. A function that exists for a unit-tests, but which does not exist beyond that functional boundary (so it can't be used incorrectly in a real prod usage).

Should be. Anyway, when I first tried just making that switch, it broke the tests I added to disk_create.rs - they were no longer able to access the test-only function, because they were actually integration tests - the version of omicron_nexus being tested was actually the one exported from nexus-test-utils, which means it's not considered the same as the version of the current crate, also omicron_nexus, but built under #[cfg(test)]. When I originally tried writing the "saga tests" in disk_create.rs, I wanted them to be co-located with the source, to make it easier to eye-ball test coverage, but it was the first test in Nexus to actually use #[nexus_test], which caused the cyclic dependency of nexus-test-utils -> nexus -> nexus-test-utils, which caused the error described here. I wanted to re-use test setup code because my saga tests also wanted to validate state about simulated sled agents (and arguably in the future may want to check metrics information from Clickhouse).

To fix the tests, I had choices:

Make the disk_create.rs test dependencies much smaller, not including some of {CRDB, Clickhouse, Sled Agents}, such that test setup isn't needed. This is possible, but it seems ill-advised -- the population step of Nexus is necessary for authz to function correctly, being able to poke at "sim sled agents" is necessary for all the Nexus HTTP requests to work (unless they could somehow be stubbed out, which seems like a pain), and doing so would also make it more difficult to monitor metrics later, if Clickhouse wasn't up.

Add interfaces to the datastore for test-only queries, keep this test as an integration test. I didn't want to go down this route, because it seems like it would either bloat the interface with lower quality (intended for tests!) non-production code, or raise the bar on the safety/concurrency considerations of test-only queries (which would make tests more difficult to write).

Break the circular dependency the bad way. Duplicate all the code Nexus used to do setup between unit and integration tests by copying everything we need from nexus-test-utils into nexus. This seems doable, but somewhat gnarly, especially since we're dealing with a lot of possible configurations and related services. Additionally, this feels a bit like "cause more problems for future us", which sucks.

Break the circular dependency in a better way. Basically, make nexus-test-utils mostly no longer depend on nexus, such that the test actually can be a unit test, and so we actually can have #[cfg(test)] functions, like this one. I did this.

Anyway, this did end up requiring a fair bit more code changes, but it seems worth it. We can now write unit tests for Nexus that exist side-by-side with complex saga code, and we have the ability to write more powerful unit tests that can access the DB with ease. Furthermore, it's a step in the road for fixing a circular dependency in the build graph.

The change needed to make this all happen is described here: a0e52c9

I'm sorry for sending you down a rabbit hole! Thanks for considering this problem. I think you're right that it's important.

I'm a little on the fence about this solution. This seems like a pretty big change just to be able to get a database handle for the tests. Do I understand right that there's only one impl of the new trait? As a developer I find this pattern kind of annoying -- it feels like a bunch of extra cognitive overhead that's exposed to all the integration tests, plus all the test-utils code. But maybe this is sufficiently hidden by the type definitions you added to each of the integration test files, and hopefully people won't really need to think about the trait very much. I'm curious what anybody else thinks about this. My avoidance of this pattern might be idiosyncratic.

Did you consider the approach of having a DatastoreTestInterfaces trait and putting pool_for_tests into that, similar to what we do for NexusTestInterfaces? I think that trait exists for a very similar purpose and there's some benefit to using the same pattern, I think. Although this is still a trait with only one impl, the only place that needs to know about it are the things that use this very specific interface (pool_for_tests()), as opposed to all the tests.

To be clear, I'm not saying this approach is a non-starter or anything, and if urgency is more important here, we can just go ahead with it!

In multi-crate environments, I think it makes sense to think of NexusServer and NexusServer, but built with #[cfg(test)], as two different structures. From the perspective of omicron-nexus using nexus-test-utils vs the integration tests using nexus-test-utils, those really are two different structures, even with a single "impl" body.

I actually do really like the #[cfg(test)] mechanism of wrapping test helpers, because of a few reasons:

This code is never compiled into the real production binary, so it reduces the impact on the resulting binaries. If we want to add tons of diesel queries to test-only configurations, we 100% can, and the cost on the prod version of nexus should be non-existent.

It makes a distinction between "interfaces usable by unit tests" vs "interfaces usable by integration tests". If integration tests need to be so invasive, they are probably better suited as unit tests.

It makes it impossible to use these test interfaces from a non-test environment.

If we go down the route of "add a DatastoreTestInterfaces, revert e44438d", I think I'll run into a problem:

nexus-test-utils would still have a dependency on omicron-nexus, so there would still be the issue that unit tests for nexus cannot just use crate::... things, they would need to use nexus_test_utils::re_exported_nexus::* to be able to interoperate with the integration test setup.

I think this quirkiness, especially for unit tests, is the most confusing piece here. It threw me off quite a bit when writing these tests, and took a fair amount of digging to sort through.

If there are test interfaces that we want to expose to integration tests, I think we can. Nothing prevents us from both using a generic test framework, and also exposing a DatastoreTestInterfaces trait, but it seems preferable to limit that interface to "thing truly needed by integration tests, not just unit tests".

nexus/Cargo.toml

nexus/src/app/saga.rs

davepacheco · 2022-10-20T23:08:38Z

nexus/test-utils/src/lib.rs

@@ -23,6 +23,9 @@ use std::path::Path;
 use std::time::Duration;
 use uuid::Uuid;

+// Expose the version of Nexus used by the test utils.
+pub use omicron_nexus as nexus;


If the caller depended on omicron_nexus themselves, is there any real chance it could differ (since it would be a path dependency in the same repo)?

Not saying this is a bad pattern... now I wonder: should we do the same thing with sled agent instead of having omicron-nexus have a dev-dep on omicron-sled-agent?

If the caller depended on omicron_nexus themselves, is there any real chance it could differ (since it would be a path dependency in the same repo)?

This is related to my comment here: #1835 (comment)

nexus-test-utils, under nexus/test-utils, has a weird cyclic relationship with the omicron-nexus crate (note: It has this relationship independent of this PR). nexus-test-utils depends on omicron-nexus, but omicron-nexus has a dev-dependency back on nexus-test-utils.

These both appear to be path-based dependencies.

If I don't use this version of nexus, and instead, in the disk_create.rs test, I attempt to use Nexus symbols directly, I get errors like the following:

error[E0308]: mismatched types --> nexus/src/app/sagas/disk_create.rs:662:38 | 662 | let params = new_test_params(&opctx, project_id); | --------------- ^^^^^^ expected struct `context::OpContext`, found struct `omicron_nexus::context::OpContext` | | | arguments to this function are incorrect | = note: struct `omicron_nexus::context::OpContext` and struct `context::OpContext` have similar names, but are actually distinct types note: struct `omicron_nexus::context::OpContext` is defined in crate `omicron_nexus` --> /home/smklein/repos/oxide/omicron/nexus/src/context.rs:223:1 | 223 | pub struct OpContext { | ^^^^^^^^^^^^^^^^^^^^ note: struct `context::OpContext` is defined in the current crate --> nexus/src/context.rs:223:1 | 223 | pub struct OpContext { | ^^^^^^^^^^^^^^^^^^^^ note: function defined here --> nexus/src/app/sagas/disk_create.rs:633:8 | 633 | fn new_test_params(opctx: &OpContext, project_id: Uuid) -> Params { | ^^^^^^^^^^^^^^^ -----------------

So, for whatever reason, they can't be consolidated as a single type.

Not saying this is a bad pattern... now I wonder: should we do the same thing with sled agent instead of having omicron-nexus have a dev-dep on omicron-sled-agent?

This doesn't seem necessary for accessing the omicron-sled-agent functions from the Nexus test. I'm hesitant to start exposing all the internals of nexus-test-utils publicly though -- for any crate used by a Nexus test, that also happens to be in nexus-test-utils, should I re-export it? There's a lot more than just the sled agent here; nexus-test-utils and nexus both have dependencies on:

anyhow

chrono

dropshot

headers

http

omicron-common

oximeter

oximeter-client

oximeter-producer

serde

serde_json

serde_urlencoded

slog

uuid

Should I also be re-exporting all these crates?

It definitely doesn't make sense to me to expose all of those. But does it not feel strange that we have to do this with just this one crate? This sort-of-cyclic relationship seems kind of fraught but I don't have any specific suggestion for improving it so 🤷

See the explanation above, surrounding e44438d . This circular dependency has been... not entirely removed, but reduced enough that I don't need this pub use statement.

jmpesp

This is awesome

nexus/src/app/sagas/disk_create.rs

jmpesp · 2022-10-24T19:27:18Z

nexus/src/db/datastore/mod.rs

@@ -122,7 +122,7 @@ impl DataStore {
        self.pool.pool()
    }

-    pub(super) async fn pool_authorized(
+    pub async fn pool_authorized(


I agree, it only looks like this is used for no_disk_records_exist.

Not actually. There is definitely still a dependency, but the "test setup" part can be re-used with Nexus tests without causing any real circular dependency issues. See #1835 (comment) for an example of one such issue. Frankly the rest of nexus-test-utils probably *should* continue to be refactored such that there is no real dependency on Nexus; doing so will reduce the risk of "mis-matched versions of Nexus" classes of errors in the future. This required the following changes: - Move `nexus`'s config information into `common/src/nexus_config.rs`, so that it can be used by `nexus-test-utils`. - Make `nexus-test-utils` operate on a generic version of `NexusServer`, exposed by a new crate named `nexus-test-interface`. This crate can contain any information used to operate Nexus during shared test setup, without actually requiring a full dependency on `nexus`. - Note, adding this generic required making `ControlPlaneTestContext` generic, which ended up bubbling up to a lot of test setup functions, including the one used in the `#[nexus_test]` macro... - ... So, make it possible to customize which version of `NexusServer` is used by `#[nexus_test]`. This means callers can supply their *own* version of `NexusServer`, either from their own crate (see: Nexus unit tests) or the version from the `omicron_nexus` crate (see: Nexus integration tests).

smklein · 2022-11-02T15:54:50Z

nexus/src/app/sagas/disk_create.rs

+        let params = new_test_params(&opctx, project_id);
+        let dag = create_saga_dag::<SagaDiskCreate>(params).unwrap();
+
+        for node in dag.get_nodes() {


With this latest update, we can simply iterate over the entire dag, rather than enumerating nodes of interest. This should be a lot more future-proof!

davepacheco · 2022-11-02T17:23:11Z

Sorry for not noticing this earlier, but I see that this just undid #1548 / #1549. I'm happy to re-release Steno and then can we get Omicron back on a crates.io dep?

smklein · 2022-11-02T17:37:37Z

Sorry for not noticing this earlier, but I see that this just undid #1548 / #1549. I'm happy to re-release Steno and then can we get Omicron back on a crates.io dep?

Ah, yeah, I'm happy to switch this back to a versioned schema if you could release a new Steno rev?

Builds on #1835 and oxidecomputer/steno#67 Adds a test for the "instance create" saga's unwind safety by: - Using a saga which attempts to create a network interface, allocate an external IP, and attach a disk - Failing after *every single intermediate node* in the saga - Verifying that no intermediate resources remain after any of the attempted executions

smklein added 6 commits October 9, 2022 19:45

VERY wip; basically just refactoring Nexus interfaces for testability

02c50b5

hacks

971f0b8

Merge branch 'main' into disk-create-idempotency

f8081d1

Merge branch 'main' into disk-create-idempotency

7bcf9d6

Add unwind test, add undo actions

31bbc22

Undo the saga interface changes

a8f9e5d

smklein requested review from jmpesp and davepacheco October 19, 2022 19:29

smklein mentioned this pull request Oct 20, 2022

[nexus] Add tests for instance_create unwind safety #1843

Merged

davepacheco reviewed Oct 20, 2022

View reviewed changes

jmpesp approved these changes Oct 24, 2022

View reviewed changes

smklein force-pushed the disk-create-idempotency branch from a0e52c9 to e44438d Compare October 26, 2022 03:07

smklein added 8 commits October 25, 2022 23:17

cargo doc didn't like my ascii art

f210c6d

Add prefix

0d1f8bf

Merge branch 'main' into disk-create-idempotency

3ad8db3

Merge branch 'main' into disk-create-idempotency

35a1b73

Merge branch 'main' into disk-create-idempotency

6cc0e03

Merge branch 'main' into disk-create-idempotency

de670d1

Use new steno API to automatically iterate over all nodes

bd5cd0a

fmt

d63fed2

smklein commented Nov 2, 2022

View reviewed changes

smklein enabled auto-merge (squash) November 2, 2022 15:56

smklein added 2 commits November 2, 2022 12:36

Merge branch 'main' into disk-create-idempotency

5c765fc

Fix steno version

2d06f71

smklein merged commit 04a8182 into main Nov 2, 2022

smklein deleted the disk-create-idempotency branch November 2, 2022 17:21

smklein mentioned this pull request Nov 2, 2022

Use steno 0.3 #1911

Merged

This was referenced Dec 13, 2022

"Testing sagas" is not easy in Omicron, but it should be #1799

Closed

Tracking issue for saga unwind safety #2052

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[nexus] Make 'disk_create' saga unwind safe, add tests for it #1835

[nexus] Make 'disk_create' saga unwind safe, add tests for it #1835

smklein commented Oct 19, 2022 •

edited

Loading

smklein commented Oct 20, 2022

smklein commented Oct 20, 2022

davepacheco left a comment

davepacheco Oct 20, 2022

jmpesp Oct 24, 2022

smklein Oct 26, 2022

davepacheco Oct 28, 2022

davepacheco Oct 28, 2022

smklein Oct 28, 2022 •

edited

Loading

davepacheco Oct 20, 2022

smklein Oct 21, 2022 •

edited

Loading

davepacheco Oct 21, 2022

smklein Oct 26, 2022

jmpesp left a comment

jmpesp Oct 24, 2022

smklein Nov 2, 2022

davepacheco commented Nov 2, 2022

smklein commented Nov 2, 2022

[nexus] Make 'disk_create' saga unwind safe, add tests for it #1835

[nexus] Make 'disk_create' saga unwind safe, add tests for it #1835

Conversation

smklein commented Oct 19, 2022 • edited Loading

What am I trying to do in this PR

What does this PR NOT do

What does this PR do

The important bits

Okay, but other files got updated, why?

smklein commented Oct 20, 2022

smklein commented Oct 20, 2022

davepacheco left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smklein Oct 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smklein Oct 21, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmpesp left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davepacheco commented Nov 2, 2022

smklein commented Nov 2, 2022

smklein commented Oct 19, 2022 •

edited

Loading

smklein Oct 28, 2022 •

edited

Loading

smklein Oct 21, 2022 •

edited

Loading