Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry build when RemoteActionFileSystem encounters a missing digest #25358

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Feb 21, 2025

Cache evictions encountered during reads of remote files in RemoteActionFileSystem now result in the build being retried when --experimental_remote_cache_eviction_retries is set to a positive value (the default).

This is enabled by implementing the checkForLostInputs method on the RemoteOutputService, building on the refactoring performed in #25396.

This change also adds a test case that demonstrates how top-level artifacts that have been cache evicted result in a build failure that isn't retried, which needs to be fixed by follow-up work.

@fmeum fmeum force-pushed the build-rewinding-jdeps branch 4 times, most recently from 7d8cbef to 9cb500c Compare February 24, 2025 12:28
@fmeum fmeum changed the title Build rewinding jdeps Rewind builds when the RemoteActionFileSystem encounters a missing digest Feb 25, 2025
@fmeum fmeum changed the title Rewind builds when the RemoteActionFileSystem encounters a missing digest Retry build when RemoteActionFileSystem encounters a missing digest Feb 25, 2025
@fmeum fmeum marked this pull request as ready for review February 25, 2025 11:56
@fmeum fmeum requested review from a team and lberki as code owners February 25, 2025 11:56
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Rules-Java Issues for Java rules team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Feb 25, 2025
@fmeum fmeum removed the request for review from a team February 25, 2025 11:57
@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Feb 25, 2025
@fmeum fmeum removed the request for review from lberki February 25, 2025 11:57
@fmeum
Copy link
Collaborator Author

fmeum commented Feb 25, 2025

@justinhorvitz Could you review the interaction with the action rewinding machinery, including the changes I had to make to RewindingTest?
@coeuvre Could you review the rest, i.e., the integration with build rewinding and the remote module?

@@ -639,9 +674,13 @@ private SpawnResult handleError(
status = Status.EXECUTION_FAILED_CATASTROPHICALLY;
detailedCode = FailureDetails.Spawn.Code.EXECUTION_FAILED;
catastrophe = true;
} else if (remoteCacheFailed) {
} else if (BulkTransferException.allCausedByCacheNotFoundException(exception)) {
// At this point, cache evictions that affect uploaded inputs have already been handled.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coeuvre This is a change in behavior, but I think it's for the better as it avoids retries that are very unlikely to succeed.


# Incremental build in toplevel build triggers remote cache eviction error
# but Bazel doesn't automatically retry the build yet.
# TODO: This documents the current behavior, but it's not intended.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinhorvitz To fix this, I would need to thread information about lost inputs obtained in

ensureToplevelArtifacts(env, importantArtifacts, inputMap);
through to the ImportantOutputHandler. I think I roughly understand what that would require, but the two calls to informImportantOutputHandler further above and the mentioning of error bubbling tell me that this is probably pretty difficult to get right. Do you have any advice for me?

Copy link
Member

@coeuvre coeuvre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I like the direction that you make build rewinding more similar to action rewinding.

If it is not too difficult to do, I would like you to split this PR into 3 PRs for easier reviews:

  1. A PR that overhauls the build rewinding mechanism.
  2. A PR that fixes build rewinding for jdeps.
  3. A PR that contains remaining changes in this PR that don't belong to above 2.

@@ -921,6 +940,18 @@ protected void createHardLink(PathFragment linkPath, PathFragment originalPath)
localFs.getPath(linkPath).createHardLink(getPath(originalPath));
}

public void checkForLostInputs(Action action) throws LostInputsActionExecutionException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing some dots, but can you explain, before this PR, why Bazel didn't rewind the build? (the CacheNotFoundError was ignored by the call sites?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not ignored, it resulted in a build failure due to an IOException but didn't reach any of the special logic for recognizing retryable failures. Since it was difficult to figure out where to catch this, I went for the refactoring into action rewinding concepts.

@fmeum fmeum force-pushed the build-rewinding-jdeps branch 4 times, most recently from 7f44f95 to d4de887 Compare February 26, 2025 19:50
@fmeum
Copy link
Collaborator Author

fmeum commented Feb 26, 2025

I split off #25396 with the refactoring, this PR is now stacked on it.

@fmeum fmeum requested a review from coeuvre February 26, 2025 19:52
@fmeum fmeum force-pushed the build-rewinding-jdeps branch 3 times, most recently from 12b24a3 to f3fbba5 Compare February 27, 2025 18:17
@fmeum fmeum force-pushed the build-rewinding-jdeps branch from f3fbba5 to 29c67f1 Compare February 27, 2025 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team team-Rules-Java Issues for Java rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants