Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus] Make 'disk_create' saga unwind safe, add tests for it #1835

Merged
merged 17 commits into from
Nov 2, 2022

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Oct 19, 2022

What am I trying to do in this PR

  • Test saga unwind. Sagas are tricky to get right - I want to test that the "undo" actions can actually unwind saga state, and avoid leaving detritus in the system.
  • Add tests near source code. Sagas are currently tested via integration tests - I want to add a test that can live "as close to the implementation as possible". I'd like to poke at saga-internal details, such as failures at very specific internal nodes, rather than simply relying on the HTTP interface.

What does this PR NOT do

  • Actually perform unit tests on Sagas. I originally tried to make this PR abstract the interface from "sagas" -> "nexus", so that the interfaces could be mocked out without actually spinning up a full integration test. This turned out to be really gnarly, since the coupling there is so tight. I think this would still be theoretically possible, but it would incur a large interface cost, since the separation between "parts of Nexus used by sagas" and "the rest of Nexus" is ambiguous. EDIT: I actually did make them unit tests; see: e44438d
  • Test Nexus in isolation. I originally tried to make this PR spin up a Nexus service without using the nexus-test-utils crate - however, this basically meant re-implementing a good chunk of test utilities, for very little benefit.

What does this PR do

The important bits

  • The "most interesting" parts of this PR exist in nexus/src/app/sagas/disk_create.rs. A test has been added to inject errors in all the saga actions, and observe that the saga can be cleanly unwound. Note - this required adding multiple "undo" actions to pass!

Okay, but other files got updated, why?

  • nexus/src/app/saga.rs: To be able to inject errors into sagas, the saga DAG creation + execution code needed to be refactored.
  • nexus/src/app/mod.rs: To expose saga DAGs to tests, allowing for errors to be injected, modules needed to be made public.
  • nexus/src/db/datastore/mod.rs: To allow saga tests to query the database directly (checking for detritus), access to the connection pool has been made pub.
  • nexus/test-utils/src/lib.rs: To allow the integration tests in disk_create.rs to safely refer to the same version of Nexus used by nexus-test-utils, the import of the omicron_nexus module is re-exported.

@smklein smklein requested review from jmpesp and davepacheco October 19, 2022 19:29
@smklein
Copy link
Collaborator Author

smklein commented Oct 20, 2022

I have good news; I'm working on a similar test for the instance_create saga, and I believe I have a way to automate the "failure" injection for each "interesting" node in the saga DAG. I'll post that as a PR once it's ready, but it should be usable for testing all our sagas.

@smklein
Copy link
Collaborator Author

smklein commented Oct 20, 2022

I have good news; I'm working on a similar test for the instance_create saga, and I believe I have a way to automate the "failure" injection for each "interesting" node in the saga DAG. I'll post that as a PR once it's ready, but it should be usable for testing all our sagas.

I did this in #1843

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! It's great that you're (I think?) fixing a bunch of bugs by adding these undo actions and then testing them.

I did not really dig into the actual undo actions or those tests (focused on the other bits here) -- let me know if you'd like me to do that.

@@ -122,7 +122,7 @@ impl DataStore {
self.pool.pool()
}

pub(super) async fn pool_authorized(
pub async fn pool_authorized(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried about this...the point of the DataStore is to encapsulate the database with relatively well-defined, safe operations. The visibility change here seems like it would make it very easy for someone to inadvertently grab a database connection and start making raw database queries from the app layer (thinking this is the right thing to do, when it's not).

Would it make sense to add interfaces to the DataStore that expose what you want to use from the tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it only looks like this is used for no_disk_records_exist.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so this sorta sent me down a rabbit hole.

Between this, and the comment about the weird "sort-of-cyclic" relationship between nexus and nexus-test-utils below, I sorta tried to do a clean-up of test-helper code.

Basically: I wanted to be able to write test code that can do arbitrary DB queries, and call functions that let us poke into parts of Nexus' durable storage trivially. You bring up a great point, this interface being available for non-test purposes would be a clear break of functional boundaries, but IMO, requiring arbitrary, one-off Diesel queries to add boilerplate code to this interface would be a bummer.

So: to reconcile both these things, I'm making this a function that exists only within #[cfg(tests)].

This should be doable. A function that exists for a unit-tests, but which does not exist beyond that functional boundary (so it can't be used incorrectly in a real prod usage).

Should be. Anyway, when I first tried just making that switch, it broke the tests I added to disk_create.rs - they were no longer able to access the test-only function, because they were actually integration tests - the version of omicron_nexus being tested was actually the one exported from nexus-test-utils, which means it's not considered the same as the version of the current crate, also omicron_nexus, but built under #[cfg(test)]. When I originally tried writing the "saga tests" in disk_create.rs, I wanted them to be co-located with the source, to make it easier to eye-ball test coverage, but it was the first test in Nexus to actually use #[nexus_test], which caused the cyclic dependency of nexus-test-utils -> nexus -> nexus-test-utils, which caused the error described here. I wanted to re-use test setup code because my saga tests also wanted to validate state about simulated sled agents (and arguably in the future may want to check metrics information from Clickhouse).

To fix the tests, I had choices:

  • Make the disk_create.rs test dependencies much smaller, not including some of {CRDB, Clickhouse, Sled Agents}, such that test setup isn't needed. This is possible, but it seems ill-advised -- the population step of Nexus is necessary for authz to function correctly, being able to poke at "sim sled agents" is necessary for all the Nexus HTTP requests to work (unless they could somehow be stubbed out, which seems like a pain), and doing so would also make it more difficult to monitor metrics later, if Clickhouse wasn't up.
  • Add interfaces to the datastore for test-only queries, keep this test as an integration test. I didn't want to go down this route, because it seems like it would either bloat the interface with lower quality (intended for tests!) non-production code, or raise the bar on the safety/concurrency considerations of test-only queries (which would make tests more difficult to write).
  • Break the circular dependency the bad way. Duplicate all the code Nexus used to do setup between unit and integration tests by copying everything we need from nexus-test-utils into nexus. This seems doable, but somewhat gnarly, especially since we're dealing with a lot of possible configurations and related services. Additionally, this feels a bit like "cause more problems for future us", which sucks.
  • Break the circular dependency in a better way. Basically, make nexus-test-utils mostly no longer depend on nexus, such that the test actually can be a unit test, and so we actually can have #[cfg(test)] functions, like this one. I did this.

Anyway, this did end up requiring a fair bit more code changes, but it seems worth it. We can now write unit tests for Nexus that exist side-by-side with complex saga code, and we have the ability to write more powerful unit tests that can access the DB with ease. Furthermore, it's a step in the road for fixing a circular dependency in the build graph.

The change needed to make this all happen is described here: a0e52c9

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for sending you down a rabbit hole! Thanks for considering this problem. I think you're right that it's important.

I'm a little on the fence about this solution. This seems like a pretty big change just to be able to get a database handle for the tests. Do I understand right that there's only one impl of the new trait? As a developer I find this pattern kind of annoying -- it feels like a bunch of extra cognitive overhead that's exposed to all the integration tests, plus all the test-utils code. But maybe this is sufficiently hidden by the type definitions you added to each of the integration test files, and hopefully people won't really need to think about the trait very much. I'm curious what anybody else thinks about this. My avoidance of this pattern might be idiosyncratic.

Did you consider the approach of having a DatastoreTestInterfaces trait and putting pool_for_tests into that, similar to what we do for NexusTestInterfaces? I think that trait exists for a very similar purpose and there's some benefit to using the same pattern, I think. Although this is still a trait with only one impl, the only place that needs to know about it are the things that use this very specific interface (pool_for_tests()), as opposed to all the tests.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I'm not saying this approach is a non-starter or anything, and if urgency is more important here, we can just go ahead with it!

Copy link
Collaborator Author

@smklein smklein Oct 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In multi-crate environments, I think it makes sense to think of NexusServer and NexusServer, but built with #[cfg(test)], as two different structures. From the perspective of omicron-nexus using nexus-test-utils vs the integration tests using nexus-test-utils, those really are two different structures, even with a single "impl" body.

I actually do really like the #[cfg(test)] mechanism of wrapping test helpers, because of a few reasons:

  • This code is never compiled into the real production binary, so it reduces the impact on the resulting binaries. If we want to add tons of diesel queries to test-only configurations, we 100% can, and the cost on the prod version of nexus should be non-existent.
  • It makes a distinction between "interfaces usable by unit tests" vs "interfaces usable by integration tests". If integration tests need to be so invasive, they are probably better suited as unit tests.
  • It makes it impossible to use these test interfaces from a non-test environment.

If we go down the route of "add a DatastoreTestInterfaces, revert e44438d", I think I'll run into a problem:

nexus-test-utils would still have a dependency on omicron-nexus, so there would still be the issue that unit tests for nexus cannot just use crate::... things, they would need to use nexus_test_utils::re_exported_nexus::* to be able to interoperate with the integration test setup.

I think this quirkiness, especially for unit tests, is the most confusing piece here. It threw me off quite a bit when writing these tests, and took a fair amount of digging to sort through.

If there are test interfaces that we want to expose to integration tests, I think we can. Nothing prevents us from both using a generic test framework, and also exposing a DatastoreTestInterfaces trait, but it seems preferable to limit that interface to "thing truly needed by integration tests, not just unit tests".

nexus/Cargo.toml Show resolved Hide resolved
nexus/src/app/saga.rs Show resolved Hide resolved
@@ -23,6 +23,9 @@ use std::path::Path;
use std::time::Duration;
use uuid::Uuid;

// Expose the version of Nexus used by the test utils.
pub use omicron_nexus as nexus;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the caller depended on omicron_nexus themselves, is there any real chance it could differ (since it would be a path dependency in the same repo)?

Not saying this is a bad pattern... now I wonder: should we do the same thing with sled agent instead of having omicron-nexus have a dev-dep on omicron-sled-agent?

Copy link
Collaborator Author

@smklein smklein Oct 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the caller depended on omicron_nexus themselves, is there any real chance it could differ (since it would be a path dependency in the same repo)?

This is related to my comment here: #1835 (comment)

  • nexus-test-utils, under nexus/test-utils, has a weird cyclic relationship with the omicron-nexus crate (note: It has this relationship independent of this PR). nexus-test-utils depends on omicron-nexus, but omicron-nexus has a dev-dependency back on nexus-test-utils.
  • These both appear to be path-based dependencies.
  • If I don't use this version of nexus, and instead, in the disk_create.rs test, I attempt to use Nexus symbols directly, I get errors like the following:
error[E0308]: mismatched types                                                                                         
   --> nexus/src/app/sagas/disk_create.rs:662:38                                                                       
    |                                                                                                                  
662 |         let params = new_test_params(&opctx, project_id);                                                        
    |                      --------------- ^^^^^^ expected struct `context::OpContext`, found struct `omicron_nexus::context::OpContext`
    |                      |                                                                                                                                                                                                                   
    |                      arguments to this function are incorrect                                                                                                                                                                            
    |                                                                                                                                                                                                                                          
    = note: struct `omicron_nexus::context::OpContext` and struct `context::OpContext` have similar names, but are actually distinct types
note: struct `omicron_nexus::context::OpContext` is defined in crate `omicron_nexus`                                                                                                                                                           
   --> /home/smklein/repos/oxide/omicron/nexus/src/context.rs:223:1                                                    
    |                                                                                                                  
223 | pub struct OpContext {                                                                                           
    | ^^^^^^^^^^^^^^^^^^^^                                                                                             
note: struct `context::OpContext` is defined in the current crate                                                      
   --> nexus/src/context.rs:223:1                                                                                      
    |                                                                                                                  
223 | pub struct OpContext {                                                                                           
    | ^^^^^^^^^^^^^^^^^^^^                                                                                             
note: function defined here                                                                                            
   --> nexus/src/app/sagas/disk_create.rs:633:8                                                                        
    |                                                                                                                  
633 |     fn new_test_params(opctx: &OpContext, project_id: Uuid) -> Params {                                          
    |        ^^^^^^^^^^^^^^^ -----------------                                                 

So, for whatever reason, they can't be consolidated as a single type.

Not saying this is a bad pattern... now I wonder: should we do the same thing with sled agent instead of having omicron-nexus have a dev-dep on omicron-sled-agent?

This doesn't seem necessary for accessing the omicron-sled-agent functions from the Nexus test. I'm hesitant to start exposing all the internals of nexus-test-utils publicly though -- for any crate used by a Nexus test, that also happens to be in nexus-test-utils, should I re-export it? There's a lot more than just the sled agent here; nexus-test-utils and nexus both have dependencies on:

  • anyhow
  • chrono
  • dropshot
  • headers
  • http
  • omicron-common
  • oximeter
  • oximeter-client
  • oximeter-producer
  • serde
  • serde_json
  • serde_urlencoded
  • slog
  • uuid

Should I also be re-exporting all these crates?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely doesn't make sense to me to expose all of those. But does it not feel strange that we have to do this with just this one crate? This sort-of-cyclic relationship seems kind of fraught but I don't have any specific suggestion for improving it so 🤷

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the explanation above, surrounding e44438d . This circular dependency has been... not entirely removed, but reduced enough that I don't need this pub use statement.

Copy link
Contributor

@jmpesp jmpesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome

nexus/src/app/sagas/disk_create.rs Outdated Show resolved Hide resolved
@@ -122,7 +122,7 @@ impl DataStore {
self.pool.pool()
}

pub(super) async fn pool_authorized(
pub async fn pool_authorized(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it only looks like this is used for no_disk_records_exist.

smklein added a commit that referenced this pull request Oct 26, 2022
Not actually. There is definitely still a dependency, but the "test
setup" part can be re-used with Nexus tests without causing any real
circular dependency issues.

See #1835 (comment)
for an example of one such issue.

Frankly the rest of nexus-test-utils probably *should* continue to be
refactored such that there is no real dependency on Nexus; doing so
will reduce the risk of "mis-matched versions of Nexus" classes of
errors in the future.

This required the following changes:

- Move `nexus`'s config information into `common/src/nexus_config.rs`,
  so that it can be used by `nexus-test-utils`.
- Make `nexus-test-utils` operate on a generic version of `NexusServer`,
  exposed by a new crate named `nexus-test-interface`. This crate can
  contain any information used to operate Nexus during shared test
  setup, without actually requiring a full dependency on `nexus`.
- Note, adding this generic required making `ControlPlaneTestContext`
  generic, which ended up bubbling up to a lot of test setup functions,
  including the one used in the `#[nexus_test]` macro...
- ... So, make it possible to customize which version of `NexusServer`
  is used by `#[nexus_test]`. This means callers can supply their *own*
  version of `NexusServer`, either from their own crate (see: Nexus
  unit tests) or the version from the `omicron_nexus` crate (see: Nexus
  integration tests).
Not actually. There is definitely still a dependency, but the "test
setup" part can be re-used with Nexus tests without causing any real
circular dependency issues.

See #1835 (comment)
for an example of one such issue.

Frankly the rest of nexus-test-utils probably *should* continue to be
refactored such that there is no real dependency on Nexus; doing so
will reduce the risk of "mis-matched versions of Nexus" classes of
errors in the future.

This required the following changes:

- Move `nexus`'s config information into `common/src/nexus_config.rs`,
  so that it can be used by `nexus-test-utils`.
- Make `nexus-test-utils` operate on a generic version of `NexusServer`,
  exposed by a new crate named `nexus-test-interface`. This crate can
  contain any information used to operate Nexus during shared test
  setup, without actually requiring a full dependency on `nexus`.
- Note, adding this generic required making `ControlPlaneTestContext`
  generic, which ended up bubbling up to a lot of test setup functions,
  including the one used in the `#[nexus_test]` macro...
- ... So, make it possible to customize which version of `NexusServer`
  is used by `#[nexus_test]`. This means callers can supply their *own*
  version of `NexusServer`, either from their own crate (see: Nexus
  unit tests) or the version from the `omicron_nexus` crate (see: Nexus
  integration tests).
@smklein smklein force-pushed the disk-create-idempotency branch from a0e52c9 to e44438d Compare October 26, 2022 03:07
let params = new_test_params(&opctx, project_id);
let dag = create_saga_dag::<SagaDiskCreate>(params).unwrap();

for node in dag.get_nodes() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this latest update, we can simply iterate over the entire dag, rather than enumerating nodes of interest. This should be a lot more future-proof!

@smklein smklein enabled auto-merge (squash) November 2, 2022 15:56
@smklein smklein merged commit 04a8182 into main Nov 2, 2022
@smklein smklein deleted the disk-create-idempotency branch November 2, 2022 17:21
@davepacheco
Copy link
Collaborator

Sorry for not noticing this earlier, but I see that this just undid #1548 / #1549. I'm happy to re-release Steno and then can we get Omicron back on a crates.io dep?

@smklein
Copy link
Collaborator Author

smklein commented Nov 2, 2022

Sorry for not noticing this earlier, but I see that this just undid #1548 / #1549. I'm happy to re-release Steno and then can we get Omicron back on a crates.io dep?

Ah, yeah, I'm happy to switch this back to a versioned schema if you could release a new Steno rev?

smklein added a commit that referenced this pull request Nov 2, 2022
Builds on #1835 and oxidecomputer/steno#67

Adds a test for the "instance create" saga's unwind safety by:
- Using a saga which attempts to create a network interface, allocate an external IP, and attach a disk
- Failing after *every single intermediate node* in the saga
- Verifying that no intermediate resources remain after any of the attempted executions
@smklein smklein mentioned this pull request Nov 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants