Optionally retry failed tests #19760
Conversation
Thanks a lot for working on this! A few thoughts on this approach:
Thanks again!
Force-pushed from 3c35c3d to eeb9fd1.
All good points!
Also, I wasn't sure how completely you thought we should roll out TestResult containing all the process results. I thought to start with minimal invasiveness, so there's an extra parameter.
Force-pushed from 26e3f50 to 7f269d5.
Force-pushed from 7f269d5 to 858f0b8.
Thanks a lot!
@@ -509,7 +510,11 @@ async def run_python_tests(
    setup = await Get(
        TestSetup, TestSetupRequest(batch.elements, batch.partition_metadata, is_debug=False)
    )
    result = await Get(FallibleProcessResult, Process, setup.process)
    HARDCODED_RETRY_COUNT = 5  # TODO: get from global option or batch
+1
And possibly provide it via the PyTestRequest.Batch, if we can imagine a case where some batches would be using different numbers of retries?
I expect that retries should be disabled by default, but that we recommend enabling them in CI on https://www.pantsbuild.org/docs/using-pants-in-ci
I agree, it would be great to be able to flag individual test targets as extra-flaky and give them multiple retries. Similar to how timeouts are currently done. Could we defer that to a future MR, maybe at the same time as we add retries to the other test types?
If we land a hardcoded retry like this, we would have to fast follow tomorrow (or at least this week...?) to add the option. If you can commit to that, then yea.
Now it pulls it from the global option attempts_default. I think this thread is tracking an older commit.
src/python/pants/core/goals/test.py
Outdated
@@ -143,6 +145,7 @@ def from_fallible_process_result(
    xml_results: Snapshot | None = None,
    extra_output: Snapshot | None = None,
    log_extra_output: bool = False,
    all_results: List[FallibleProcessResult] | None = None,
IMO, you should go ahead and deprecate the old argument in favor of this one (while updating all built-in consumers). That would send a signal to anyone who has implemented test runners off-main, and give them a hint of what to do to add retry.
sounds good, done. So far I've just boxed the results. I think rolling out retries to all the other backends can be deferred.
@@ -85,7 +85,7 @@ class TestResult(EngineAwareReturnType):
    # A None result_metadata indicates a backend that performs its own test discovery/selection
    # and either discovered no tests, or encountered an error, such as a compilation error, in
    # the attempt.
    result_metadata: ProcessResultMetadata | None
    result_metadata: ProcessResultMetadata | None  # TODO: Merge elapsed MS of all subprocesses
IMO, it's ok not to do that as long as the text summary of the run makes it clear that the displayed runtime is only for the successful run. Although it could be nice to include both I suppose.
I agree it's not essential. I'd like to leave this here while I think about whether we might still want to gather or aggregate the ProcessResultMetadata.
src/python/pants/core/goals/test.py
Outdated
    suffix = f"{elapsed_print}{source_desc}"
    return f"{sigil} {result.description} {status}{attempt_msg} {suffix}."
Hm. I'm not sure why the suffix was broken out like this... might as well inline it below?
huh, yeah. I guess because both parts of suffix could be empty, so it might have been intended to omit the space in case there was nothing after it? (although it doesn't do that). Anyhow, easy enough to inline.
Looks great, thanks. A few nits, then feel free to merge!
### Retries in case of failure

A `Process` can be retried by wrapping it in a `RunProcWithRetry` and requesting a `ProcessResultWithRetries`. The last result, whether succeeded or failed, is available with the `last` parameter. For example, the following will allow for up to 5 attempts at running `my_process`:
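The example itself is cut off in the snippet above. As a minimal sketch of what such a call might look like, assuming the names used in that sentence (`RunProcWithRetry`, `ProcessResultWithRetries`, `.last`); the import path and constructor signature here are guesses, not the PR's actual API:

```python
from pants.engine.process import FallibleProcessResult, Process
from pants.engine.rules import Get

# Assumed: RunProcWithRetry and ProcessResultWithRetries live alongside Process,
# and the wrapper takes (process, max_attempts).
from pants.engine.process import ProcessResultWithRetries, RunProcWithRetry


async def run_my_process_with_retries(my_process: Process) -> FallibleProcessResult:
    # Allow up to 5 attempts at running `my_process` (call from within a rule body).
    results = await Get(ProcessResultWithRetries, RunProcWithRetry(my_process, 5))
    # `last` is the final attempt, whether it succeeded or failed.
    return results.last
```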
Nit: Naming-wise, I think that this should be named ProcessWithRetry for symmetry with Process.
pants.toml
Outdated
@@ -191,6 +191,7 @@ extra_env_vars = [
    "RUST_BACKTRACE=1",
]
timeout_default = 60
attempts_default = 3
This edit should be in pants.ci.toml rather than pants.toml (so that it only applies in CI).
src/python/pants/core/goals/test.py
Outdated
@@ -96,6 +96,8 @@ class TestResult(EngineAwareReturnType):
    extra_output: Snapshot | None = None
    # True if the core test rules should log that extra output was written.
    log_extra_output: bool = False
    # All results, including failed attempts.
    all_results: List[FallibleProcessResult] = field(default_factory=list)
This should probably be tuple[FallibleProcessResult] for immutability (and the default can just be tuple()).
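Applied to the field shown above, the suggestion would look roughly like this (a sketch, not the actual diff). Keeping the container a tuple also keeps the frozen dataclass hashable, which the engine relies on:

```python
from dataclasses import dataclass

from pants.engine.process import FallibleProcessResult


@dataclass(frozen=True)
class TestResult:
    # ...other fields elided for brevity...
    # All attempts, including failed ones; immutable, with an empty-tuple default.
    all_results: tuple[FallibleProcessResult, ...] = ()
```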
Please also update the PR description before merging.
uniquifies processes so they can be rerun
instead of having separate params for the last result and the
Force-pushed from 3cf8155 to 5c96eb1.
This MR adds the ability to rerun tests. The implementation adds a field on Process for the attempt number. This uniquifies the Process instance, which means that the runtime graph will not deduplicate it and will run it again. A wrapper, RunProcessWithRetries, is provided to rerun a normal process for a number of attempts.

TestResult is modified to hold a tuple of FallibleProcessResults containing the retries. The output of the test goal then aggregates these to create a "succeeded after k attempts" message.

The implementation requires each backend to opt into retries of the test process. This MR only implements it for the pytest backend. Documentation describing RunProcessWithRetries and how to use it to retry tests is added.

The number of retries can be set with the [test].attempts_default config setting. This is currently global, but we could extend the machinery to act like the timeout field.

This implementation retries whole batches, and does not rebatch only failed targets. It also doesn't aggregate retried process metadata (so it is currently not easy to get the total run time, for example).
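To make the mechanism concrete, here is a rough sketch of the retry loop the description outlines. It is not the PR's actual code: the `attempt` field name, the wrapper's shape, and the stop-on-success behaviour are assumptions drawn from the description above.

```python
from dataclasses import replace

from pants.engine.process import FallibleProcessResult, Process
from pants.engine.rules import Get


async def run_process_with_retries(
    process: Process, attempts: int
) -> tuple[FallibleProcessResult, ...]:
    """Rerun `process` up to `attempts` times (call from within a rule body)."""
    results: list[FallibleProcessResult] = []
    for attempt in range(1, attempts + 1):
        # A distinct `attempt` value uniquifies the Process, so the engine does
        # not deduplicate it against earlier attempts. (`attempt` is the field
        # this change is described as adding; assumed here.)
        uniquified = replace(process, attempt=attempt)
        result = await Get(FallibleProcessResult, Process, uniquified)
        results.append(result)
        if result.exit_code == 0:
            break  # Succeeded; no further retries needed.
    return tuple(results)
```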