Optionally retry failed tests #19760

lilatomic · 2023-09-05T02:05:01Z

This MR adds the ability to rerun Tests. The implementation adds a field on Process for the attempt number. This uniquifies the Process instance, which means that the Runtime Graph will not deduplicate it and will run it again. A wrapper RunProcessWithRetries is provided to rerun a normal process a number of attempts. TestResult is modified to have a tuple of FallibleProcessResults to contain the retries. The output of the test goal then aggregates these to create a "succeeded after k attempts" message, like:

✓ src/python/pants/core/goals/testomatic_pass_test.py:batch1 succeeded in 0.56s.
+ src/python/pants/core/goals/testomatic_test.py:batch0 succeeded after 4 attempts in 0.53s.

The implementation requires each backend to opt into retries of the test process. This MR only implements it for the pytest backend. Documentation describing RunProcessWithRetries and how to use it to retry tests is added.

The number of retries can be set with the [test].attempts_default config setting. This is currently global, but we could extend the machinery to act like the timeout field.

This implementation retries whole batches, and does not rebatch only failed targets. It also doesn't aggregate retried process metadata (currently not easy to get total run time, for example).

stuhood · 2023-09-05T18:05:48Z

Thanks a lot for working on this!

A few thoughts on this approach:

Rather than the test goal implementing the loop, I'd like to see it implemented below your unique input type (RunProcWithRetry), because that makes it much more reusable to other callers.
- So the RunProcWithRetry struct would gain an attempts field to control the number of attempts, and the implementing @rule would execute the loop.
I think that a unique output type might make sense too, because it seems very likely to me that some callers will want to inspect the failed runs as well (the test runner will probably want to render all of the attempts, likely by adjusting TestResult to wrap a compound return type, and then render all of the runs). The return type would essentially contain a list of FallibleProcessResults in attempt order.
Instead of rendering each attempt on its own line as you show in the description, it would probably be better if the test goal had a unique message for multiple attempts, and used a different sigil (probably sigil_succeeded_with_edits, which is yellow):
```
✓ src/python/pants/core/goals/testomatic_pass_test.py:batch1 succeeded in 0.58s.
+ src/python/pants/core/goals/testomatic_test.py:batch0 succeeded after three attempts in 0.56s.
```
Rather than an "extra" metadata field on Process, I would add a typed attempt: usize field. The only thing I can think of that is better off in an untyped string dict would be information which is not actually consumed by the process implementation machinery, and currently the only reason I could imagine adding additional information like that is retry.
Regarding the test infrastructure wanting to retry with smaller batches (or using pytest's only re-run failed tests facility): while those are definitely feasible, AFAICT, both of those approaches would result in changed inputs to the second attempt anyway, and so wouldn't need this facility.

Thanks again!

lilatomic · 2023-09-06T20:25:58Z

All good points!

Retry now occurs in run_proc_with_retry (makes total sense)
run_proc_with_retry outputs a ProcessResultWithRetries which wraps a list of FallibleProcessResult

output groups the attempts. as a bonus this fixes the exit code

✓ src/python/pants/core/goals/testomatic_pass_test.py:batch1 succeeded in 0.54s.
+ src/python/pants/core/goals/testomatic_test.py:batch0 succeeded after 2 attempts in 0.51s.

replaced the dict with an attempts field. I was thinking that there were a few things that seem to only be used in specific circumstances that could go there, like the execution_slot_variable, but this makes sense too
The tension for me is about how much opting-in is required. I was hoping that we could automatically retry tests without an opt-in from backends. Although your point that some backends would have their own (more optimised) retry mechanisms makes me think there's a better advantage to having backends opt-in

lilatomic · 2023-09-06T20:30:34Z

Also, I wasn't sure how completely you thought we should roll out TestResult containing all the process results. I thought to start with minimal invasiveness, so there's an extra parameter all_results: List[FallibleProcessResult] and some constructor methods accept that parameter as a option. This way no upstream needs to change/box. LMK if you envisioned the recommended way as supporting retries, so we should implement keeping all results as the default.

stuhood

Thanks a lot!

stuhood · 2023-10-18T18:37:15Z

src/python/pants/backend/python/goals/pytest_runner.py

@@ -509,7 +510,11 @@ async def run_python_tests(
    setup = await Get(
        TestSetup, TestSetupRequest(batch.elements, batch.partition_metadata, is_debug=False)
    )
-    result = await Get(FallibleProcessResult, Process, setup.process)
+    HARDCODED_RETRY_COUNT = 5  # TODO: get from global option or batch


+1

And possibly provide it via the PyTestRequest.Batch if we can imagine a case where some batches would be using different numbers of retries?

I expect that retries should be disabled by default, but that we recommend enabling them in CI on https://www.pantsbuild.org/docs/using-pants-in-ci

I agree, it would be great to be able to flag individual test targets as extra-flaky and give them multiple retries. Similar to how timeouts are currently done. Could we defer that to a future MR, maybe at the same time as we add retries to the other test types?

If we land a hardcoded retry like this, we would have to fast follow tomorrow (or at least this week...?) to add the option. If you can commit to that, then yea.

now it pulls it from the global option attempts_default. I think this thread is tracking an older commit

src/python/pants/backend/python/goals/pytest_runner.py

stuhood · 2023-10-18T18:39:43Z

src/python/pants/core/goals/test.py

@@ -143,6 +145,7 @@ def from_fallible_process_result(
        xml_results: Snapshot | None = None,
        extra_output: Snapshot | None = None,
        log_extra_output: bool = False,
+        all_results: List[FallibleProcessResult] | None = None,


IMO, you should go ahead and deprecate the old argument in favor of this one (while updating all built in consumers). That would send a signal to anyone who has implemented test runners off-main, and give them a hint of what to do to add retry.

sounds good, done. So far I've just boxed the results. I think rolling out retries to all the other backends can be deferred.

stuhood · 2023-10-18T18:43:21Z

src/python/pants/core/goals/test.py

@@ -85,7 +85,7 @@ class TestResult(EngineAwareReturnType):
    # A None result_metadata indicates a backend that performs its own test discovery/selection
    # and either discovered no tests, or encounted an error, such as a compilation error, in
    # the attempt.
-    result_metadata: ProcessResultMetadata | None
+    result_metadata: ProcessResultMetadata | None  # TODO: Merge elapsed MS of all subproceses


IMO, it's ok not to do that as long as the text summary of the run makes it clear that the displayed runtime is only for the successful run. Although it could be nice to include both I suppose.

I agree it's not essential. I'd like to leave this here, I'd like to think about whether we might still want to gather or aggregate the ProcessResultMetadata.

src/python/pants/core/goals/test.py

stuhood · 2023-10-18T18:46:47Z

src/python/pants/core/goals/test.py

+    suffix = f"{elapsed_print}{source_desc}"
+    return f"{sigil} {result.description} {status}{attempt_msg} {suffix}."


Hm. I'm not sure why the suffix was broken out like this... might as well inline it below?

huh, yeah. I guess because both parts of suffix could be empty, so it might have been intended to omit the space in case there was nothing after it? (although it doesn't do that). Anyhow, easy enough to inline.

src/rust/engine/process_execution/src/lib.rs

stuhood

Looks great, thanks. A few nits, then feel free to merge!

stuhood · 2023-10-19T01:57:59Z

docs/markdown/Writing Plugins/rules-api/rules-api-process.md

+
+### Retries in case of failure
+
+A `Process` can be retried by wrapping it in a `RunProcWithRetry` and requesting a `ProcessResultWithRetries`. The last result, whether succeeded or failed, is available with the `last` parameter. For example, the following will allow for up to 5 attempts at running `my_process`:


Nit: Naming wise, I think that this should be named ProcessWithRetry for symmetry with Process.

stuhood · 2023-10-19T01:58:22Z

pants.toml

@@ -191,6 +191,7 @@ extra_env_vars = [
  "RUST_BACKTRACE=1",
 ]
 timeout_default = 60
+attempts_default = 3


This edit should be in pants.ci.toml rather than pants.toml (so that it only applies in CI).

stuhood · 2023-10-19T01:59:36Z

src/python/pants/core/goals/test.py

@@ -96,6 +96,8 @@ class TestResult(EngineAwareReturnType):
    extra_output: Snapshot | None = None
    # True if the core test rules should log that extra output was written.
    log_extra_output: bool = False
+    # All results including failed attempts
+    all_results: List[FallibleProcessResult] = field(default_factory=list)


This should probably be tuple[FallibleProcessResult] for immutability (and the default can just be tuple()).

stuhood · 2023-10-19T02:01:49Z

Please also update the PR description before merging.

uniquifies processes so they can be rerun

instead of having separate params for the last result and the

- tuples instead of lists for immutability - move retries to ci config - rename RunProcWithRetry->RunProcessWithRetries

stuhood requested review from huonw and thejcannon September 5, 2023 17:46

lilatomic force-pushed the feature/retry-tests-2 branch from 3c35c3d to eeb9fd1 Compare September 6, 2023 15:40

jriddy added the category:internal CI, fixes for not-yet-released features, etc. label Sep 6, 2023

lilatomic force-pushed the feature/retry-tests-2 branch 2 times, most recently from 26e3f50 to 7f269d5 Compare September 9, 2023 16:43

lilatomic force-pushed the feature/retry-tests-2 branch from 7f269d5 to 858f0b8 Compare October 18, 2023 18:11

stuhood approved these changes Oct 18, 2023

View reviewed changes

gauthamnair mentioned this pull request Oct 18, 2023

pants CI test retry #20053

Closed

lilatomic marked this pull request as ready for review October 19, 2023 00:34

stuhood approved these changes Oct 19, 2023

View reviewed changes

stuhood changed the title ~~[WIP] Retry timed-out tests~~ Optionally retry failed tests Oct 19, 2023

lilatomic added 14 commits October 18, 2023 22:45

add machinery for retrying

36205c1

add field for process extras

255a12c

uniquifies processes so they can be rerun

cleanup in retry

821f497

fix tests

c87340a

turn extra into attempt (only key used)

c9aa061

move retry into run_proc_with_retry

6e8225b

TestResult has holder for all test results

3c72e10

printout reports attempts and uses "+" sigil

0eb2f06

cleanup comments

a792905

TestResult smart constructors only accept list of results

8e67a1c

instead of having separate params for the last result and the

bit of doc on retrying processes

c637f10

add global flag for default number of test attempts

e55f539

document test retries for CI

9a8a4a8

rename all_results->process_results

4a8afed

polish

5c96eb1

- tuples instead of lists for immutability - move retries to ci config - rename RunProcWithRetry->RunProcessWithRetries

lilatomic force-pushed the feature/retry-tests-2 branch from 3cf8155 to 5c96eb1 Compare October 19, 2023 02:45

stuhood enabled auto-merge (squash) October 19, 2023 03:17

stuhood added category:new feature and removed category:internal CI, fixes for not-yet-released features, etc. labels Oct 19, 2023

stuhood merged commit 4d17647 into pantsbuild:main Oct 19, 2023

huonw mentioned this pull request Nov 7, 2023

Add generic test retry support #13505

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optionally retry failed tests #19760

Optionally retry failed tests #19760

lilatomic commented Sep 5, 2023 •

edited

Loading

stuhood commented Sep 5, 2023 •

edited

Loading

lilatomic commented Sep 6, 2023

lilatomic commented Sep 6, 2023

stuhood left a comment

stuhood Oct 18, 2023

lilatomic Oct 19, 2023

stuhood Oct 19, 2023

lilatomic Oct 19, 2023

stuhood Oct 18, 2023

lilatomic Oct 19, 2023

stuhood Oct 18, 2023

lilatomic Oct 19, 2023

stuhood Oct 18, 2023

lilatomic Oct 18, 2023

stuhood left a comment

stuhood Oct 19, 2023

stuhood Oct 19, 2023

stuhood Oct 19, 2023

stuhood commented Oct 19, 2023

		suffix = f"{elapsed_print}{source_desc}"
		return f"{sigil} {result.description} {status}{attempt_msg} {suffix}."


		### Retries in case of failure

		A `Process` can be retried by wrapping it in a `RunProcWithRetry` and requesting a `ProcessResultWithRetries`. The last result, whether succeeded or failed, is available with the `last` parameter. For example, the following will allow for up to 5 attempts at running `my_process`:

Optionally retry failed tests #19760

Optionally retry failed tests #19760

Conversation

lilatomic commented Sep 5, 2023 • edited Loading

stuhood commented Sep 5, 2023 • edited Loading

lilatomic commented Sep 6, 2023

lilatomic commented Sep 6, 2023

stuhood left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stuhood left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stuhood commented Oct 19, 2023

lilatomic commented Sep 5, 2023 •

edited

Loading

stuhood commented Sep 5, 2023 •

edited

Loading