Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create common scaffolding for saga undo tests #3904

Merged
merged 9 commits into from
Aug 24, 2023

Conversation

gjcolombo
Copy link
Contributor

Several of our saga undo tests have previously been found to have defects in
which the test verifies that all attempts to execute a saga fail but doesn't
verify that saga executions fail at the expected failure point. This loses test
coverage, since the tests don't actually execute all of the undo actions of the
saga under test.

To try to prevent this problem, add a set of saga test helpers that provide
common logic for writing undo tests. The scaffold functions handle the business
of constructing a DAG, deciding where to inject an error, and verifying that
errors occur in the right places. The callers provide a saga type and factory
functions that run before and after each saga execution to set up state, provide
saga parameters, check invariants, and clean up after each test iteration.

Convert existing tests of these types to use the new scaffolds. This revealed
that the no-pantry version of the disk snapshot test has a similar bug: the saga
requires the disk being snapshotted to be attached to an instance, but the test
wasn't creating an instance, so the saga never got past its "look up the
instance that owns the disk" step. Fix this issue.

Finally, coalesce some common instance operations (start, stop, delete,
simulate) that were used by multiple saga tests into the test helpers.

This is a test-only change, tested both by running cargo test and finding no
errors and by introducing bugs into some sagas and verifying that tests run with
the scaffolds catch those bugs.

Fixes #3896.

The no-pantry flavor of the snapshot unwind test requires the test disk to be
attached to an instance. Make sure that happens.

Also add a bunch of instance manipulation functions to the test helpers to avoid
duplicating code between the migrate and snapshot sagas.
Tested by injecting a bug into the create saga's undo-is-idempotent test and
verifying that the new harness caught it.
@gjcolombo gjcolombo requested review from jmpesp and smklein August 18, 2023 20:49
Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this -- looks great, with a couple comments about generate_params being a little complex below

Comment on lines 171 to 172
initial_params: S::Params,
generate_params: G,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why bother separating these arguments? Couldn't we just generate initial_params using generate_params?

(I can see that the snapshot test looks a bit more complex in this area, but most tests appear to simply call generate_params to create initial_params anyway).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a somewhat clumsy attempt to require at compile time that the DAGs under test are the same as the DAG that was used to figure out how many nodes to iterate over. I haven't yet found an elegant way to do this; they all have some problems:

  1. Create a single DAG in the test and reuse it: this is incorrect if the parameters have to change each time, as in the instance create saga
  2. Create parameters for the "query" DAG using the pre-iteration callback: this is incorrect if the callback has side effects, which is allowed (the snapshot test uses them)
  3. Create a DAG in the caller and provide it to the harness: this lets the caller pass any SagaDag with the underlying saga type erased, though this can be mitigated somewhat by checking that the saga names are the same
  4. Make the caller pass an S::Params that's independent of the ones generated by the pre-iteration callback: this avoids problem (2) but, as you've noted, looks weird; we're creating fake parameters just to throw the generated DAG away

I ended up going with a different kludge in e5030eb:

  1. Take the DAG created by the first test iteration, count its nodes, and use those for loop bounds; in each iteration, assert that the expected node count didn't change

The downside with this method is that we have to do the loop indexing by hand and so are more vulnerable to off-by-one errors than we otherwise would be. But I think I've avoided them, and we get the nice benefits that (a) we know the DAG has the same length (if not the same structure) during each iteration, and (b) we can remove a bunch of useless parameter generation from the callers.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this kludge more! It at least internalizes these details, rather than forcing them on the caller. And IMO the new argument name makes a lot of sense.

>(
nexus,
params,
|| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is documented as:

/// - generate_params: A callback called at the beginning of each loop
/// iteration that returns a future that yields the saga parameters to use for
/// that loop iteration.

This seems to kinda be treading into the realm of after_saga, by actually modifying state.

IMO it's kinda surprising to have a "generate_params" function also act as a "test setup" function -- if we want this, should we update the name accordingly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in e5030eb.

Copy link
Contributor

@jmpesp jmpesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! One suggestion: it's (embarrassingly) occurring to me now that verify_clean_slate is different for each of these tests. I added a no_stuck_sagas check in nexus/src/app/sagas/disk_create.rs, and my suggestion is to move that from there into these new helpers so that all tested sagas are checked to see if unwinding failed.

@gjcolombo
Copy link
Contributor Author

I added a no_stuck_sagas check in nexus/src/app/sagas/disk_create.rs, and my suggestion is to move that from there into these new helpers so that all tested sagas are checked to see if unwinding failed.

Done in 92649a5. Thanks for the suggestion! I also looked at combining the verify_clean_slate functions more generally, but it wasn't obvious to me that that could be done very elegantly, since each module with such a routine has a different set of things that it's checking for.

I did take the liberty of combining all of the existing forward-idempotency tests in 04d3c9c; these are a lot simpler than the undo tests, but there's still no reason to have them copy-pasted into each saga module.

@gjcolombo gjcolombo merged commit 3412cb1 into main Aug 24, 2023
@gjcolombo gjcolombo deleted the gjcolombo/templated-saga-tests branch August 24, 2023 01:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

want generic framework for common saga tests
3 participants