Retry build when RemoteActionFileSystem encounters a missing digest #25358
base: master
7d8cbef to 9cb500c
@justinhorvitz Could you review the interaction with the action rewinding machinery, including the changes I had to make to
@@ -639,9 +674,13 @@ private SpawnResult handleError(
       status = Status.EXECUTION_FAILED_CATASTROPHICALLY;
       detailedCode = FailureDetails.Spawn.Code.EXECUTION_FAILED;
       catastrophe = true;
-    } else if (remoteCacheFailed) {
+    } else if (BulkTransferException.allCausedByCacheNotFoundException(exception)) {
+      // At this point, cache evictions that affect uploaded inputs have already been handled.
@coeuvre This is a change in behavior, but I think it's for the better as it avoids retries that are very unlikely to succeed.
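For readers unfamiliar with this kind of check, here is a minimal sketch of an "all failures are cache misses" predicate. It is not Bazel's implementation: `CacheMissException`, `BulkFailure`, and `allCausedByCacheMiss` are hypothetical stand-ins, and the sketch assumes the aggregate exception stores per-file failures as suppressed exceptions.

```java
import java.io.IOException;

/** Hypothetical stand-in for a "blob not found in the remote cache" error. */
class CacheMissException extends IOException {
  CacheMissException(String digest) {
    super("missing digest: " + digest);
  }
}

/** Hypothetical aggregate that collects per-file failures as suppressed exceptions. */
class BulkFailure extends IOException {
  BulkFailure(IOException... causes) {
    super("bulk transfer failed");
    for (IOException cause : causes) {
      addSuppressed(cause);
    }
  }

  /** Returns true only if t is a BulkFailure with at least one failure, all of them cache misses. */
  static boolean allCausedByCacheMiss(Throwable t) {
    if (!(t instanceof BulkFailure)) {
      return false;
    }
    Throwable[] suppressed = t.getSuppressed();
    if (suppressed.length == 0) {
      return false;
    }
    for (Throwable s : suppressed) {
      if (!(s instanceof CacheMissException)) {
        return false;
      }
    }
    return true;
  }
}
```

With a predicate like this, a failure that mixes a cache miss with, say, a timeout is not treated as an eviction, so it does not trigger a retry that is unlikely to help.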
# Incremental build in toplevel build triggers remote cache eviction error
# but Bazel doesn't automatically retry the build yet.
# TODO: This documents the current behavior, but it's not intended.
@justinhorvitz To fix this, I would need to thread information about lost inputs obtained in
bazel/src/main/java/com/google/devtools/build/lib/skyframe/CompletionFunction.java
Line 375 in 998e762
ensureToplevelArtifacts(env, importantArtifacts, inputMap);
ImportantOutputHandler. I think I roughly understand what that would require, but the two calls to informImportantOutputHandler further above and the mention of error bubbling tell me that this is probably pretty difficult to get right. Do you have any advice for me?
Thanks for the PR! I like that this makes build rewinding more similar to action rewinding.
If it is not too difficult, I would like you to split this PR into three PRs for easier review:
- A PR that overhauls the build rewinding mechanism.
- A PR that fixes build rewinding for jdeps.
- A PR that contains the remaining changes that don't belong to the above two.
@@ -921,6 +940,18 @@ protected void createHardLink(PathFragment linkPath, PathFragment originalPath)
     localFs.getPath(linkPath).createHardLink(getPath(originalPath));
   }
 
+  public void checkForLostInputs(Action action) throws LostInputsActionExecutionException {
I might be missing some dots, but can you explain why Bazel didn't rewind the build before this PR? (Was the CacheNotFoundError ignored by the call sites?)
Not ignored, it resulted in a build failure due to an IOException but didn't reach any of the special logic for recognizing retryable failures. Since it was difficult to figure out where to catch this, I went for the refactoring into action rewinding concepts.
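To illustrate the idea in general terms, the sketch below records eviction-related read failures as they happen and converts them into a dedicated, recognizable exception in a single check after the action has run. This is only an assumption about the overall shape, not the code in this PR: `EvictedBlobException`, `LostInputsSignal`, and `LostInputTracker` are hypothetical names.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical marker for a read that failed because the remote blob was evicted. */
class EvictedBlobException extends IOException {
  final String path;

  EvictedBlobException(String path) {
    super("evicted: " + path);
    this.path = path;
  }
}

/** Hypothetical dedicated signal that retry logic can recognize, unlike a plain IOException. */
class LostInputsSignal extends Exception {
  final List<String> lostPaths;

  LostInputsSignal(List<String> lostPaths) {
    super("lost inputs: " + lostPaths);
    this.lostPaths = List.copyOf(lostPaths);
  }
}

/** Hypothetical tracker owned by the file system layer. */
class LostInputTracker {
  private final List<String> lost = new ArrayList<>();

  /** Remembers the affected path if the read failure was caused by an eviction. */
  void recordIfEvicted(IOException e) {
    if (e instanceof EvictedBlobException) {
      lost.add(((EvictedBlobException) e).path);
    }
  }

  /** Called once after the action ran: turns recorded evictions into a retryable signal. */
  void checkForLostInputs() throws LostInputsSignal {
    if (!lost.isEmpty()) {
      throw new LostInputsSignal(lost);
    }
  }
}
```

The point of the design is that callers no longer need to find the right place to catch an IOException; a single check after execution surfaces the lost inputs in a form the retry machinery understands.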
7f44f95 to d4de887
I split off #25396 with the refactoring; this PR is now stacked on it.
12b24a3 to f3fbba5
# Conflicts:
#	src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionService.java
f3fbba5 to 29c67f1
Cache evictions encountered during reads of remote files in RemoteActionFileSystem now result in the build being retried when --experimental_remote_cache_eviction_retries is set to a positive value (the default).

This is enabled by implementing the checkForLostInputs method on the RemoteOutputService, building on the refactoring performed in #25396.

This change also adds a test case that demonstrates how top-level artifacts that have been cache evicted result in a build failure that isn't retried, which needs to be fixed by follow-up work.
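As a rough illustration of the retry behavior described above, the following sketch reruns a build a bounded number of times when it fails with an eviction error. It is a simplified model, not Bazel's retry code: `CacheEvictedException`, `BuildRetrier`, and `runWithRetries` are hypothetical, and `maxRetries` merely plays the role of the value passed to --experimental_remote_cache_eviction_retries.

```java
import java.util.concurrent.Callable;

/** Hypothetical signal that a build failed because remote cache entries were evicted. */
class CacheEvictedException extends Exception {
  CacheEvictedException(String digest) {
    super("evicted: " + digest);
  }
}

/** Hypothetical helper that retries a build only for eviction failures. */
class BuildRetrier {
  /**
   * Runs the build and retries it up to maxRetries times if it fails with an eviction.
   * Any other exception propagates immediately without a retry.
   */
  static <T> T runWithRetries(int maxRetries, Callable<T> build) throws Exception {
    int retriesUsed = 0;
    while (true) {
      try {
        return build.call();
      } catch (CacheEvictedException e) {
        if (retriesUsed >= maxRetries) {
          throw e; // Retry budget exhausted (or retries disabled with 0).
        }
        retriesUsed++;
      }
    }
  }
}
```

With `maxRetries` set to 0 the first eviction fails the build, which matches the description that retries only happen for a positive value.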