Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Builds without the Bytes fails on missing AC result #10880

Closed
ulrfa opened this issue Mar 1, 2020 · 20 comments
Closed

Builds without the Bytes fails on missing AC result #10880

ulrfa opened this issue Mar 1, 2020 · 20 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@ulrfa
Copy link
Contributor

ulrfa commented Mar 1, 2020

[Not to be confused with available AC result referencing missing blobs in CAS]

The original proposal is clear about this limitation and calls it "initially acceptable":

It's possible that the remote system and the cached state on the local system become out of sync and the design needs to be robust enough to handle it. The interesting case is when Bazel has metadata about a remote output file being cached but the file was evicted from the remote system. Whether or not a file exists remotely is relevant if an executing action declares it as an input file. The ideal user experience would arguably be for Bazel to re-execute the generating action of the evicted output and transitively all its dependants, however that will be a large change that's out of scope for this proposal and we argue that an acceptable initial behaviour is to fail with a meaningful error message, ask the user to run bazel clean and to re-run the command. We argue that this behaviour is initially acceptable because we expect that output files will only infrequently be evicted from a remote execution system.

However most discussions (e.g. in #8250) focus only on using a remote cache that is not returning AC result if blobs are missing in CAS. But that is NOT enough to adress the following scenario:

genrule(
    name = "a",
    srcs = ["a.in"],
    outs = ["a.out"],
    cmd = "cat $(SRCS) > $@",
)

genrule(
    name = "b",
    srcs = ["a.out", "b.in"],
    outs = ["b.out"],
    cmd = "cat $(SRCS) > $@",
)
# Prepare source files
echo a1 > a.in
echo b1 > b.in

# Populate remote cache (emulated with --disk_cache)
bazel clean;
bazel build :b --disk_cache=./diskcache

# Builds without the bytes is downloading b.out but not a.out.
bazel clean;
bazel build :b --remote_download_outputs=toplevel --experimental_inmemory_jdeps_files --experimental_inmemory_dotd_files --disk_cache=./diskcache

# Remove both CAS and AC result from remote cache
rm -rf diskcache

# Trigger re-build that needs a.out (without bazel clean or bazel shutdown)
echo b2 > b.in

bazel build :b --remote_download_outputs=toplevel --experimental_inmemory_jdeps_files --experimental_inmemory_dotd_files --disk_cache=./diskcache

Results in the following error:

ERROR: /home/ulrik/tmp/bazel_without_all_the_bytes_test/BUILD:10:1: Executing genrule //:b failed due to unexpected I/O exception: Failed to fetch file with hash '0111f7554519f7126c570c154b894f1fbcddf4faa126f6d644b974dab6c77411' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.
java.io.IOException: Failed to fetch file with hash '0111f7554519f7126c570c154b894f1fbcddf4faa126f6d644b974dab6c77411' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.
at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:128)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.prefetchInputs(AbstractSpawnStrategy.java:206)
at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:209)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:89)
at com.google.devtools.build.lib.actions.SpawnActionContext.beginExecution(SpawnActionContext.java:41)
at com.google.devtools.build.lib.exec.ProxySpawnActionContext.beginExecution(ProxySpawnActionContext.java:60)
at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:331)
at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:931)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1070)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1041)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:116)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:77)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:608)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:903)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:297)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:399)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)

Example of scenarios where AC results could go missing:

  • Cache eviction.
  • Local --disk_cache storage cleared.
  • Remote cache down and replaced with other non-fully synchronized instance.
  • Load balancing between alternative cache instances, e.g. with asynhronous replication.
  • Reconfigured sharding.
  • Remote RAM cache restarted.
  • ...

I'm not happy with workarounds involving manual 'bazel clean', since I want users to trust the build system and not fall back to old habits of 'make clean'.

There have been debates about support for rewinding actions to resolve this. Are there any decision @buchgr and @ulfjack?

@irengrig irengrig added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels Mar 6, 2020
@brentleyjones
Copy link
Contributor

brentleyjones commented Mar 25, 2020

We just ran into this as well (for the first time after a while using Bw/oB, so it was a bit confusing).

Edit: For clarification, the way we ran into it was switching from using a disk cache to a remote cache, without running bazel shutdown.

@brentleyjones
Copy link
Contributor

This has now caused us to stop using Bw/oB 😞

@brentleyjones
Copy link
Contributor

This still happens. One fix is running bazel clean before the build, which is unfortunate.

@coeuvre coeuvre self-assigned this Aug 10, 2020
@coeuvre coeuvre added type: bug P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Dec 9, 2020
@ulrfa
Copy link
Contributor Author

ulrfa commented Jan 11, 2021

Thanks for assigning to this @coeuvre! Are there any news?

@coeuvre
Copy link
Member

coeuvre commented Jan 12, 2021

No. I am currently working on other remote execution related things but fixing this issue with action rewinding is on my plan.

@ulrfa
Copy link
Contributor Author

ulrfa commented Jan 12, 2021

Thanks for the update!

@appuser999
Copy link

any update here?

@ulrfa
Copy link
Contributor Author

ulrfa commented Jan 21, 2022

I would be very happy if this can be solved, since it is blocking my organization from using bwtb.

@coeuvre, any news from your side?

@illicitonion, I notice you are working on https://github.com/illicitonion/bazel/commits/rewinding-bulk-upload-exceptions-5.0. What is the current state? Are you planning sending a pull request?

@illicitonion
Copy link
Contributor

@illicitonion, I notice you are working on https://github.com/illicitonion/bazel/commits/rewinding-bulk-upload-exceptions-5.0. What is the current state? Are you planning sending a pull request?

I have a PR open at #14126 that's been pending review for a few months - I will rebase it soon.

@brentleyjones
Copy link
Contributor

This is still a huge issue for using BwtB. Even if a remote cache correctly only returns AC results that have all CAS items in the cache, this can make using a remote cache with BwtB unusable.

@coeuvre
Copy link
Member

coeuvre commented Oct 19, 2022

Even if a remote cache correctly only returns AC results that have all CAS items in the cache,

I didn't get this. Can you share more details?

I still think the ultimate solution for this is action rewinding and it is still working in progress #14126 (comment).

However, if we look closer to the issue, there are two scenarios:

  1. The generating action and the consuming action are executed within the same invocation. i.e. the file is evicted during an invocation.
  2. The generating action was executed in one invocation. The consuming action is executed in a different invocation. i.e. the file is evicted between invocations.

In practice, I think 1 is rare (correct me if I am wrong) imaging how long on average an invocation would be. Remote cache should be able to not evict recently touched files for that long. If that happens, only action rewinding can save us.

For 2, I think there is a workaround. In the later invocations, when checking the skyframe/action/remote cache, Bazel could check whether the referenced files are actually available (via findMissingBlobs, and at least for RBE, this call marks referenced files recently used) and skip the cache if any of them is missing. So the generating action will be re-executed. To avoid doing the check too frequently (because it uses gRPC and involves network I/Os, hence slow), Bazel and remote cache should agree on a minimum lifetime for recently used entries.

@brentleyjones
Copy link
Contributor

brentleyjones commented Oct 19, 2022

It's number 2 we run into. And I 1000% agree Bazel should call fibdMissingBlobs. I believe it's an REAPI spec violation that it's currently not. Having that duration be user flag configurable would be best in my opinion. I know some Buildbarn AC checking optimizations even allow you to set a value the same way, so matching that on the Bazel side would be perfect (so being able to say something like "I know my cache will keep things around for at least 1 hour", or something like that), with a sane but small default value.

@coeuvre
Copy link
Member

coeuvre commented Oct 20, 2022

I will work on the fix for 2 in a few days.

@brentleyjones
Copy link
Contributor

Awesome! Btw, I believe the 6.0 cut is in 4 days.

@brentleyjones
Copy link
Contributor

@coeuvre Gentle ping to see where you are with this 🙏

@coeuvre
Copy link
Member

coeuvre commented Oct 26, 2022

Still exporting some internal code to BwoB and will look into this after that (probably next week).

@brentleyjones
Copy link
Contributor

@coeuvre Friendly ping on this 😊

@tjgq
Copy link
Contributor

tjgq commented Mar 15, 2023

This is a duplicate of #8250.

@tjgq tjgq closed this as not planned Won't fix, can't repro, duplicate, stale Mar 15, 2023
@brentleyjones
Copy link
Contributor

@tjgq Per this:

However most discussions (e.g. in #8250) focus only on using a remote cache that is not returning AC result if blobs are missing in CAS. But that is NOT enough to address the following scenario:

I don't agree that it's a duplicate. It's useful to keep this open, since I've been linking to it in various discussions, and it's only really fixed by the lease service.

@tjgq
Copy link
Contributor

tjgq commented Mar 15, 2023

Yeah, I meant to say that it will be fixed by the lease service, which doesn't really have a canonical issue (#8250 is the closest I could find).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants