Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally retry failed tests #19760

Merged
merged 15 commits into from
Oct 19, 2023
Merged

Conversation

lilatomic
Copy link
Contributor

@lilatomic lilatomic commented Sep 5, 2023

This MR adds the ability to rerun Tests. The implementation adds a field on Process for the attempt number. This uniquifies the Process instance, which means that the Runtime Graph will not deduplicate it and will run it again. A wrapper RunProcessWithRetries is provided to rerun a normal process a number of attempts. TestResult is modified to have a tuple of FallibleProcessResults to contain the retries. The output of the test goal then aggregates these to create a "succeeded after k attempts" message, like:

✓ src/python/pants/core/goals/testomatic_pass_test.py:batch1 succeeded in 0.56s.
+ src/python/pants/core/goals/testomatic_test.py:batch0 succeeded after 4 attempts in 0.53s.

The implementation requires each backend to opt into retries of the test process. This MR only implements it for the pytest backend. Documentation describing RunProcessWithRetries and how to use it to retry tests is added.

The number of retries can be set with the [test].attempts_default config setting. This is currently global, but we could extend the machinery to act like the timeout field.

This implementation retries whole batches, and does not rebatch only failed targets. It also doesn't aggregate retried process metadata (currently not easy to get total run time, for example).

@stuhood stuhood requested review from huonw and thejcannon September 5, 2023 17:46
@stuhood
Copy link
Member

stuhood commented Sep 5, 2023

Thanks a lot for working on this!

A few thoughts on this approach:

  • Rather than the test goal implementing the loop, I'd like to see it implemented below your unique input type (RunProcWithRetry), because that makes it much more reusable to other callers.
    • So the RunProcWithRetry struct would gain an attempts field to control the number of attempts, and the implementing @rule would execute the loop.
  • I think that a unique output type might make sense too, because it seems very likely to me that some callers will want to inspect the failed runs as well (the test runner will probably want to render all of the attempts, likely by adjusting TestResult to wrap a compound return type, and then render all of the runs). The return type would essentially contain a list of FallibleProcessResults in attempt order.
  • Instead of rendering each attempt on its own line as you show in the description, it would probably be better if the test goal had a unique message for multiple attempts, and used a different sigil (probably sigil_succeeded_with_edits, which is yellow):
    ✓ src/python/pants/core/goals/testomatic_pass_test.py:batch1 succeeded in 0.58s.
    + src/python/pants/core/goals/testomatic_test.py:batch0 succeeded after three attempts in 0.56s.
    
  • Rather than an "extra" metadata field on Process, I would add a typed attempt: usize field. The only thing I can think of that is better off in an untyped string dict would be information which is not actually consumed by the process implementation machinery, and currently the only reason I could imagine adding additional information like that is retry.
  • Regarding the test infrastructure wanting to retry with smaller batches (or using pytest's only re-run failed tests facility): while those are definitely feasible, AFAICT, both of those approaches would result in changed inputs to the second attempt anyway, and so wouldn't need this facility.

Thanks again!

@lilatomic lilatomic force-pushed the feature/retry-tests-2 branch from 3c35c3d to eeb9fd1 Compare September 6, 2023 15:40
@jriddy jriddy added the category:internal CI, fixes for not-yet-released features, etc. label Sep 6, 2023
@lilatomic
Copy link
Contributor Author

All good points!

  • Retry now occurs in run_proc_with_retry (makes total sense)
  • run_proc_with_retry outputs a ProcessResultWithRetries which wraps a list of FallibleProcessResult
  • output groups the attempts. as a bonus this fixes the exit code
    ✓ src/python/pants/core/goals/testomatic_pass_test.py:batch1 succeeded in 0.54s.
    + src/python/pants/core/goals/testomatic_test.py:batch0 succeeded after 2 attempts in 0.51s.
    
  • replaced the dict with an attempts field. I was thinking that there were a few things that seem to only be used in specific circumstances that could go there, like the execution_slot_variable, but this makes sense too
  • The tension for me is about how much opting-in is required. I was hoping that we could automatically retry tests without an opt-in from backends. Although your point that some backends would have their own (more optimised) retry mechanisms makes me think there's a better advantage to having backends opt-in

@lilatomic
Copy link
Contributor Author

Also, I wasn't sure how completely you thought we should roll out TestResult containing all the process results. I thought to start with minimal invasiveness, so there's an extra parameter all_results: List[FallibleProcessResult] and some constructor methods accept that parameter as a option. This way no upstream needs to change/box. LMK if you envisioned the recommended way as supporting retries, so we should implement keeping all results as the default.

@lilatomic lilatomic force-pushed the feature/retry-tests-2 branch 2 times, most recently from 26e3f50 to 7f269d5 Compare September 9, 2023 16:43
@lilatomic lilatomic force-pushed the feature/retry-tests-2 branch from 7f269d5 to 858f0b8 Compare October 18, 2023 18:11
Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

@@ -509,7 +510,11 @@ async def run_python_tests(
setup = await Get(
TestSetup, TestSetupRequest(batch.elements, batch.partition_metadata, is_debug=False)
)
result = await Get(FallibleProcessResult, Process, setup.process)
HARDCODED_RETRY_COUNT = 5 # TODO: get from global option or batch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

And possibly provide it via the PyTestRequest.Batch if we can imagine a case where some batches would be using different numbers of retries?

I expect that retries should be disabled by default, but that we recommend enabling them in CI on https://www.pantsbuild.org/docs/using-pants-in-ci

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it would be great to be able to flag individual test targets as extra-flaky and give them multiple retries. Similar to how timeouts are currently done. Could we defer that to a future MR, maybe at the same time as we add retries to the other test types?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we land a hardcoded retry like this, we would have to fast follow tomorrow (or at least this week...?) to add the option. If you can commit to that, then yea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now it pulls it from the global option attempts_default. I think this thread is tracking an older commit

src/python/pants/backend/python/goals/pytest_runner.py Outdated Show resolved Hide resolved
@@ -143,6 +145,7 @@ def from_fallible_process_result(
xml_results: Snapshot | None = None,
extra_output: Snapshot | None = None,
log_extra_output: bool = False,
all_results: List[FallibleProcessResult] | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, you should go ahead and deprecate the old argument in favor of this one (while updating all built in consumers). That would send a signal to anyone who has implemented test runners off-main, and give them a hint of what to do to add retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, done. So far I've just boxed the results. I think rolling out retries to all the other backends can be deferred.

@@ -85,7 +85,7 @@ class TestResult(EngineAwareReturnType):
# A None result_metadata indicates a backend that performs its own test discovery/selection
# and either discovered no tests, or encounted an error, such as a compilation error, in
# the attempt.
result_metadata: ProcessResultMetadata | None
result_metadata: ProcessResultMetadata | None # TODO: Merge elapsed MS of all subproceses
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it's ok not to do that as long as the text summary of the run makes it clear that the displayed runtime is only for the successful run. Although it could be nice to include both I suppose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's not essential. I'd like to leave this here, I'd like to think about whether we might still want to gather or aggregate the ProcessResultMetadata.

src/python/pants/core/goals/test.py Outdated Show resolved Hide resolved
Comment on lines 1058 to 1059
suffix = f"{elapsed_print}{source_desc}"
return f"{sigil} {result.description} {status}{attempt_msg} {suffix}."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. I'm not sure why the suffix was broken out like this... might as well inline it below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, yeah. I guess because both parts of suffix could be empty, so it might have been intended to omit the space in case there was nothing after it? (although it doesn't do that). Anyhow, easy enough to inline.

src/rust/engine/process_execution/src/lib.rs Show resolved Hide resolved
@lilatomic lilatomic marked this pull request as ready for review October 19, 2023 00:34
Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks. A few nits, then feel free to merge!


### Retries in case of failure

A `Process` can be retried by wrapping it in a `RunProcWithRetry` and requesting a `ProcessResultWithRetries`. The last result, whether succeeded or failed, is available with the `last` parameter. For example, the following will allow for up to 5 attempts at running `my_process`:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Naming wise, I think that this should be named ProcessWithRetry for symmetry with Process.

pants.toml Outdated
@@ -191,6 +191,7 @@ extra_env_vars = [
"RUST_BACKTRACE=1",
]
timeout_default = 60
attempts_default = 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This edit should be in pants.ci.toml rather than pants.toml (so that it only applies in CI).

@@ -96,6 +96,8 @@ class TestResult(EngineAwareReturnType):
extra_output: Snapshot | None = None
# True if the core test rules should log that extra output was written.
log_extra_output: bool = False
# All results including failed attempts
all_results: List[FallibleProcessResult] = field(default_factory=list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be tuple[FallibleProcessResult] for immutability (and the default can just be tuple()).

@stuhood stuhood changed the title [WIP] Retry timed-out tests Optionally retry failed tests Oct 19, 2023
@stuhood
Copy link
Member

stuhood commented Oct 19, 2023

Please also update the PR description before merging.

- tuples instead of lists for immutability
- move retries to ci config
- rename RunProcWithRetry->RunProcessWithRetries
@lilatomic lilatomic force-pushed the feature/retry-tests-2 branch from 3cf8155 to 5c96eb1 Compare October 19, 2023 02:45
@stuhood stuhood enabled auto-merge (squash) October 19, 2023 03:17
@stuhood stuhood added category:new feature and removed category:internal CI, fixes for not-yet-released features, etc. labels Oct 19, 2023
@stuhood stuhood merged commit 4d17647 into pantsbuild:main Oct 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants