Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote Builds without the Bytes should support remote output eviction #8250

Closed
buchgr opened this issue May 7, 2019 · 15 comments
Closed

Remote Builds without the Bytes should support remote output eviction #8250

buchgr opened this issue May 7, 2019 · 15 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@buchgr
Copy link
Contributor

buchgr commented May 7, 2019

This is part of #6862

A remote caching or execution system must not evict outputs during a build. That is an output of action 1 must still be available to action 1000. Before this feature Bazel would simply reupload an output that's been evicted from the remote system, but this is no longer possible.

@buchgr buchgr added team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request labels May 7, 2019
@phb
Copy link

phb commented Jul 16, 2019

I'm trying to debug a situation where I keep getting the
"java.io.IOException: Failed to fetch file with hash '0fe19df8b8f8eb5f545f50e11f958bf37d27ab7da5c260a9bd2bc0ff6fb760a4' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds."
message.

I had my cache misconfigured when I started experimenting with this feature and I lost (evicted) a few keys. The message above seems to indicate that restarting a build should be enough to get the missing keys to be uploaded, but I can't get that to work, ie, restarting the build (from scratch or not) always fails the same way.

@buchgr
Copy link
Contributor Author

buchgr commented Jul 16, 2019

@phb it depends on your cache. If the remote cache in general ensures that an action is only served as a cache hit if all outputs exist then restarting the build should do the trick. However, if you use GCS or nginx as a cache that does not understand the relationship between actions and outputs then restarting won't help.

@phb
Copy link

phb commented Jul 16, 2019

@buchgr That's great to know. I'm assuming my cache falls in the dumb-category (I'm using the https://github.com/Asana/bazels3cache, waiting patiently for native support #4889 if that will ever happen)
What cache implementations are aware of the output relationships?

@keith
Copy link
Member

keith commented Jul 17, 2019

Related conversation: #8508 (comment)

@thii
Copy link
Member

thii commented May 27, 2020

If I’m understanding correctly, does this mean that Remote Builds without the Byes still works with a read-only cache? (The cache is populated elsewhere, and we don’t delete anything)

@coeuvre coeuvre added the P2 We'll consider working on this in future. (Assignee optional) label Dec 9, 2020
@coeuvre coeuvre self-assigned this Dec 9, 2020
@axeluhlig
Copy link

Hi @coeuvre any update on this issue? It's blocking us from using builds without the Bytes.

@coeuvre
Copy link
Member

coeuvre commented Mar 24, 2021

Still working on #12665. Maybe I can find time to work on this in Q2.

@axeluhlig
Copy link

Cool, thanks for the update!

@illicitonion
Copy link
Contributor

I had a go at implementing some support for this, specifically by allowing rewinding in more places, but ran into an issue I'd love to get some feedback on.

The issue I was trying to address was runfiles needing to be uploaded after eviction, having not been downloaded. A simple repro is to make a go_binary rule, build it remotely with --remote_download_minimal, flush the remote storage, edit the file being built as the go_binary, and build remotely again. This should result in a failure because Bazel tries to read one of the compiler helpers from the runfiles of the action, and throws a FileNotFoundException because it was never downloaded.

The shape of the fixed I tried to put together was:

  1. When building the Merkel tree if an input was a DerivedArtifact, store that DerivedArtifact in the MerkelTree.
  2. If the bulk upload fails with only FileNotFoundExceptions for artifacts with a known DerivedArtifact, throw a LostInputsExecException instead of a BulkTransferException.
  3. Allow the RemoteRetrier to propagate ExecExceptions.

This appeared to be working fine - the ActionRewindingStrategy kicked in and put together a RewindPlan, but this leads to a bazel crash because of graph inconsistency:

java.lang.IllegalStateException: Unexpected inconsistency: ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//:bin, config=BuildConfigurationValue.Key[9b3dc679644997acb2eb31acfb64ec07bfc5831a68166d756e3e8abdbcd3a8e4]}, actionIndex=0}, [ActionLookupData{actionLookupKey=ConfiguredTargetKey{label=@go_sdk_linux//:builder, config=BuildConfigurationValue.Key[2646f9c701b7ccc168560c0ef59608649f3ee184a5d3b8d438d5cede85e70cc4]}, actionIndex=1}], PARENT_FORCE_REBUILD_OF_CHILD
	at com.google.devtools.build.skyframe.GraphInconsistencyReceiver.lambda$static$0(GraphInconsistencyReceiver.java:41)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator.maybeHandleRestart(AbstractParallelEvaluator.java:872)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator.access$400(AbstractParallelEvaluator.java:71)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:570)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
ERRO exit status 37                               

I couldn't see any GraphInconsistencyRecievers that would allow for this rewinding to happen - any pointers for how to tell SkyFrame that rewinding is ok, or how to handle this, would be much appreciated...

/cc @coeuvre @janakdr

@janakdr
Copy link
Contributor

janakdr commented Jun 22, 2021 via email

@illicitonion
Copy link
Contributor

Thanks - as far as I can tell there's not much documentation for this corner, would it simply have to do nothing in noteInconsistencyAndMaybeThrow (and return true for restartPermitted), or would it need to do something more effect-ful?

@janakdr
Copy link
Contributor

janakdr commented Jun 22, 2021 via email

illicitonion added a commit to illicitonion/bazel that referenced this issue Jun 25, 2021
Steps to reproduce
------------------

1. Make a target like:
```
go_binary(
    name = "bin",
    srcs = ["main.go"],
)
```

2. Build it with remote execution and `--remote_download_toplevel`

3. Flush the remote storage (so that things like the Go builder are
evicted)

4. Modify main.go so that bazel sees a rebuild is needed.

5. Build again with remote execution

You should see a failure whose underlying cause is a
BulkTransferException caused by a FileNotFoundException.

Current status
--------------

This currently semi-works. The second build will fail, but if you do a
third build, it will succeed. I suspect this may be beacuse Skyframe
needs configuring to allow in-build restarts somewhere, but that's still
to be worked out.

There are also a smattering of TODOs across the commit, but none of
those are particularly scary or complicated :)

See bazelbuild#8250 (comment)
@ptarjan
Copy link
Contributor

ptarjan commented Apr 11, 2022

We are having this issue when we try to turn on --remote_download_minimal.

@coeuvre
Copy link
Member

coeuvre commented Jan 23, 2023

I have drafted a design doc and work can be tracked at #16660.

@coeuvre
Copy link
Member

coeuvre commented Jul 28, 2023

Remote output eviction is supported by Bazel@HEAD and will be supported by future releases (Bazel 7+):

  1. Bazel can clear staled state that prevents you from doing an increment build if remote output eviction happened during this build.
  2. --experimental_remote_cache_eviction_retries can be used to ask Bazel automatically retry the build upon remote cache eviction.
  3. --experimental_remote_cache_ttl can be used to tell Bazel the lifetime of a cache entry in the remote cache so that Bazel knows when to invalidate local reference. The default value is 3h.
  4. --experimental_remote_cache_lease_extension can be used to ask Bazel periodically refresh the TTL of cache entries that are still required by Bazel if your remote cache evict entries based on LRU strategy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request
Projects
None yet
Development

No branches or pull requests

9 participants