-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[nexus] Make 'disk_create' saga unwind safe, add tests for it #1835
Conversation
I have good news; I'm working on a similar test for the |
I did this in #1843 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! It's great that you're (I think?) fixing a bunch of bugs by adding these undo actions and then testing them.
I did not really dig into the actual undo actions or those tests (focused on the other bits here) -- let me know if you'd like me to do that.
nexus/src/db/datastore/mod.rs
Outdated
@@ -122,7 +122,7 @@ impl DataStore { | |||
self.pool.pool() | |||
} | |||
|
|||
pub(super) async fn pool_authorized( | |||
pub async fn pool_authorized( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little worried about this...the point of the DataStore is to encapsulate the database with relatively well-defined, safe operations. The visibility change here seems like it would make it very easy for someone to inadvertently grab a database connection and start making raw database queries from the app layer (thinking this is the right thing to do, when it's not).
Would it make sense to add interfaces to the DataStore that expose what you want to use from the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it only looks like this is used for no_disk_records_exist
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so this sorta sent me down a rabbit hole.
Between this, and the comment about the weird "sort-of-cyclic" relationship between nexus
and nexus-test-utils
below, I sorta tried to do a clean-up of test-helper code.
Basically: I wanted to be able to write test code that can do arbitrary DB queries, and call functions that let us poke into parts of Nexus' durable storage trivially. You bring up a great point, this interface being available for non-test purposes would be a clear break of functional boundaries, but IMO, requiring arbitrary, one-off Diesel queries to add boilerplate code to this interface would be a bummer.
So: to reconcile both these things, I'm making this a function that exists only within #[cfg(tests)]
.
This should be doable. A function that exists for a unit-tests, but which does not exist beyond that functional boundary (so it can't be used incorrectly in a real prod usage).
Should be. Anyway, when I first tried just making that switch, it broke the tests I added to disk_create.rs
- they were no longer able to access the test-only function, because they were actually integration tests - the version of omicron_nexus
being tested was actually the one exported from nexus-test-utils
, which means it's not considered the same as the version of the current crate, also omicron_nexus
, but built under #[cfg(test)]. When I originally tried writing the "saga tests" in disk_create.rs
, I wanted them to be co-located with the source, to make it easier to eye-ball test coverage, but it was the first test in Nexus to actually use #[nexus_test]
, which caused the cyclic dependency of nexus-test-utils
-> nexus
-> nexus-test-utils
, which caused the error described here. I wanted to re-use test setup code because my saga tests also wanted to validate state about simulated sled agents (and arguably in the future may want to check metrics information from Clickhouse).
To fix the tests, I had choices:
- Make the
disk_create.rs
test dependencies much smaller, not including some of {CRDB, Clickhouse, Sled Agents}, such that test setup isn't needed. This is possible, but it seems ill-advised -- the population step of Nexus is necessary forauthz
to function correctly, being able to poke at "sim sled agents" is necessary for all the Nexus HTTP requests to work (unless they could somehow be stubbed out, which seems like a pain), and doing so would also make it more difficult to monitor metrics later, if Clickhouse wasn't up. - Add interfaces to the datastore for test-only queries, keep this test as an integration test. I didn't want to go down this route, because it seems like it would either bloat the interface with lower quality (intended for tests!) non-production code, or raise the bar on the safety/concurrency considerations of test-only queries (which would make tests more difficult to write).
- Break the circular dependency the bad way. Duplicate all the code Nexus used to do setup between unit and integration tests by copying everything we need from
nexus-test-utils
intonexus
. This seems doable, but somewhat gnarly, especially since we're dealing with a lot of possible configurations and related services. Additionally, this feels a bit like "cause more problems for future us", which sucks. - Break the circular dependency in a better way. Basically, make
nexus-test-utils
mostly no longer depend onnexus
, such that the test actually can be a unit test, and so we actually can have#[cfg(test)]
functions, like this one. I did this.
Anyway, this did end up requiring a fair bit more code changes, but it seems worth it. We can now write unit tests for Nexus that exist side-by-side with complex saga code, and we have the ability to write more powerful unit tests that can access the DB with ease. Furthermore, it's a step in the road for fixing a circular dependency in the build graph.
The change needed to make this all happen is described here: a0e52c9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for sending you down a rabbit hole! Thanks for considering this problem. I think you're right that it's important.
I'm a little on the fence about this solution. This seems like a pretty big change just to be able to get a database handle for the tests. Do I understand right that there's only one impl of the new trait? As a developer I find this pattern kind of annoying -- it feels like a bunch of extra cognitive overhead that's exposed to all the integration tests, plus all the test-utils code. But maybe this is sufficiently hidden by the type
definitions you added to each of the integration test files, and hopefully people won't really need to think about the trait very much. I'm curious what anybody else thinks about this. My avoidance of this pattern might be idiosyncratic.
Did you consider the approach of having a DatastoreTestInterfaces trait and putting pool_for_tests
into that, similar to what we do for NexusTestInterfaces? I think that trait exists for a very similar purpose and there's some benefit to using the same pattern, I think. Although this is still a trait with only one impl, the only place that needs to know about it are the things that use this very specific interface (pool_for_tests()
), as opposed to all the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I'm not saying this approach is a non-starter or anything, and if urgency is more important here, we can just go ahead with it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In multi-crate environments, I think it makes sense to think of NexusServer
and NexusServer
, but built with #[cfg(test)]
, as two different structures. From the perspective of omicron-nexus
using nexus-test-utils
vs the integration tests using nexus-test-utils
, those really are two different structures, even with a single "impl" body.
I actually do really like the #[cfg(test)]
mechanism of wrapping test helpers, because of a few reasons:
- This code is never compiled into the real production binary, so it reduces the impact on the resulting binaries. If we want to add tons of diesel queries to test-only configurations, we 100% can, and the cost on the prod version of nexus should be non-existent.
- It makes a distinction between "interfaces usable by unit tests" vs "interfaces usable by integration tests". If integration tests need to be so invasive, they are probably better suited as unit tests.
- It makes it impossible to use these test interfaces from a non-test environment.
If we go down the route of "add a DatastoreTestInterfaces
, revert e44438d", I think I'll run into a problem:
nexus-test-utils
would still have a dependency on omicron-nexus
, so there would still be the issue that unit tests for nexus cannot just use crate::...
things, they would need to use nexus_test_utils::re_exported_nexus::*
to be able to interoperate with the integration test setup.
I think this quirkiness, especially for unit tests, is the most confusing piece here. It threw me off quite a bit when writing these tests, and took a fair amount of digging to sort through.
If there are test interfaces that we want to expose to integration tests, I think we can. Nothing prevents us from both using a generic test framework, and also exposing a DatastoreTestInterfaces
trait, but it seems preferable to limit that interface to "thing truly needed by integration tests, not just unit tests".
nexus/test-utils/src/lib.rs
Outdated
@@ -23,6 +23,9 @@ use std::path::Path; | |||
use std::time::Duration; | |||
use uuid::Uuid; | |||
|
|||
// Expose the version of Nexus used by the test utils. | |||
pub use omicron_nexus as nexus; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the caller depended on omicron_nexus themselves, is there any real chance it could differ (since it would be a path dependency in the same repo)?
Not saying this is a bad pattern... now I wonder: should we do the same thing with sled agent instead of having omicron-nexus have a dev-dep on omicron-sled-agent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the caller depended on omicron_nexus themselves, is there any real chance it could differ (since it would be a path dependency in the same repo)?
This is related to my comment here: #1835 (comment)
nexus-test-utils
, undernexus/test-utils
, has a weird cyclic relationship with theomicron-nexus
crate (note: It has this relationship independent of this PR).nexus-test-utils
depends onomicron-nexus
, butomicron-nexus
has adev-dependency
back onnexus-test-utils
.- These both appear to be path-based dependencies.
- If I don't use this version of nexus, and instead, in the
disk_create.rs
test, I attempt to use Nexus symbols directly, I get errors like the following:
error[E0308]: mismatched types
--> nexus/src/app/sagas/disk_create.rs:662:38
|
662 | let params = new_test_params(&opctx, project_id);
| --------------- ^^^^^^ expected struct `context::OpContext`, found struct `omicron_nexus::context::OpContext`
| |
| arguments to this function are incorrect
|
= note: struct `omicron_nexus::context::OpContext` and struct `context::OpContext` have similar names, but are actually distinct types
note: struct `omicron_nexus::context::OpContext` is defined in crate `omicron_nexus`
--> /home/smklein/repos/oxide/omicron/nexus/src/context.rs:223:1
|
223 | pub struct OpContext {
| ^^^^^^^^^^^^^^^^^^^^
note: struct `context::OpContext` is defined in the current crate
--> nexus/src/context.rs:223:1
|
223 | pub struct OpContext {
| ^^^^^^^^^^^^^^^^^^^^
note: function defined here
--> nexus/src/app/sagas/disk_create.rs:633:8
|
633 | fn new_test_params(opctx: &OpContext, project_id: Uuid) -> Params {
| ^^^^^^^^^^^^^^^ -----------------
So, for whatever reason, they can't be consolidated as a single type.
Not saying this is a bad pattern... now I wonder: should we do the same thing with sled agent instead of having omicron-nexus have a dev-dep on omicron-sled-agent?
This doesn't seem necessary for accessing the omicron-sled-agent
functions from the Nexus test. I'm hesitant to start exposing all the internals of nexus-test-utils
publicly though -- for any crate used by a Nexus test, that also happens to be in nexus-test-utils
, should I re-export it? There's a lot more than just the sled agent here; nexus-test-utils and nexus both have dependencies on:
- anyhow
- chrono
- dropshot
- headers
- http
- omicron-common
- oximeter
- oximeter-client
- oximeter-producer
- serde
- serde_json
- serde_urlencoded
- slog
- uuid
Should I also be re-exporting all these crates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely doesn't make sense to me to expose all of those. But does it not feel strange that we have to do this with just this one crate? This sort-of-cyclic relationship seems kind of fraught but I don't have any specific suggestion for improving it so 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the explanation above, surrounding e44438d . This circular dependency has been... not entirely removed, but reduced enough that I don't need this pub use
statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome
nexus/src/db/datastore/mod.rs
Outdated
@@ -122,7 +122,7 @@ impl DataStore { | |||
self.pool.pool() | |||
} | |||
|
|||
pub(super) async fn pool_authorized( | |||
pub async fn pool_authorized( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it only looks like this is used for no_disk_records_exist
.
Not actually. There is definitely still a dependency, but the "test setup" part can be re-used with Nexus tests without causing any real circular dependency issues. See #1835 (comment) for an example of one such issue. Frankly the rest of nexus-test-utils probably *should* continue to be refactored such that there is no real dependency on Nexus; doing so will reduce the risk of "mis-matched versions of Nexus" classes of errors in the future. This required the following changes: - Move `nexus`'s config information into `common/src/nexus_config.rs`, so that it can be used by `nexus-test-utils`. - Make `nexus-test-utils` operate on a generic version of `NexusServer`, exposed by a new crate named `nexus-test-interface`. This crate can contain any information used to operate Nexus during shared test setup, without actually requiring a full dependency on `nexus`. - Note, adding this generic required making `ControlPlaneTestContext` generic, which ended up bubbling up to a lot of test setup functions, including the one used in the `#[nexus_test]` macro... - ... So, make it possible to customize which version of `NexusServer` is used by `#[nexus_test]`. This means callers can supply their *own* version of `NexusServer`, either from their own crate (see: Nexus unit tests) or the version from the `omicron_nexus` crate (see: Nexus integration tests).
Not actually. There is definitely still a dependency, but the "test setup" part can be re-used with Nexus tests without causing any real circular dependency issues. See #1835 (comment) for an example of one such issue. Frankly the rest of nexus-test-utils probably *should* continue to be refactored such that there is no real dependency on Nexus; doing so will reduce the risk of "mis-matched versions of Nexus" classes of errors in the future. This required the following changes: - Move `nexus`'s config information into `common/src/nexus_config.rs`, so that it can be used by `nexus-test-utils`. - Make `nexus-test-utils` operate on a generic version of `NexusServer`, exposed by a new crate named `nexus-test-interface`. This crate can contain any information used to operate Nexus during shared test setup, without actually requiring a full dependency on `nexus`. - Note, adding this generic required making `ControlPlaneTestContext` generic, which ended up bubbling up to a lot of test setup functions, including the one used in the `#[nexus_test]` macro... - ... So, make it possible to customize which version of `NexusServer` is used by `#[nexus_test]`. This means callers can supply their *own* version of `NexusServer`, either from their own crate (see: Nexus unit tests) or the version from the `omicron_nexus` crate (see: Nexus integration tests).
a0e52c9
to
e44438d
Compare
let params = new_test_params(&opctx, project_id); | ||
let dag = create_saga_dag::<SagaDiskCreate>(params).unwrap(); | ||
|
||
for node in dag.get_nodes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this latest update, we can simply iterate over the entire dag, rather than enumerating nodes of interest. This should be a lot more future-proof!
Builds on #1835 and oxidecomputer/steno#67 Adds a test for the "instance create" saga's unwind safety by: - Using a saga which attempts to create a network interface, allocate an external IP, and attach a disk - Failing after *every single intermediate node* in the saga - Verifying that no intermediate resources remain after any of the attempted executions
What am I trying to do in this PR
What does this PR NOT do
Actually perform unit tests on Sagas. I originally tried to make this PR abstract the interface from "sagas" -> "nexus", so that the interfaces could be mocked out without actually spinning up a full integration test. This turned out to be really gnarly, since the coupling there is so tight. I think this would still be theoretically possible, but it would incur a large interface cost, since the separation between "parts of Nexus used by sagas" and "the rest of Nexus" is ambiguous.EDIT: I actually did make them unit tests; see: e44438dnexus-test-utils
crate - however, this basically meant re-implementing a good chunk of test utilities, for very little benefit.What does this PR do
The important bits
nexus/src/app/sagas/disk_create.rs
. A test has been added to inject errors in all the saga actions, and observe that the saga can be cleanly unwound. Note - this required adding multiple "undo" actions to pass!Okay, but other files got updated, why?
nexus/src/app/saga.rs
: To be able to inject errors into sagas, the saga DAG creation + execution code needed to be refactored.nexus/src/app/mod.rs
: To expose saga DAGs to tests, allowing for errors to be injected, modules needed to be made public.nexus/src/db/datastore/mod.rs
: To allow saga tests to query the database directly (checking for detritus), access to the connection pool has been madepub
.nexus/test-utils/src/lib.rs
: To allow the integration tests indisk_create.rs
to safely refer to the same version of Nexus used bynexus-test-utils
, the import of theomicron_nexus
module is re-exported.