Bazel remote cache is not a clear win #7664
Comments
Also 76370d5 has been implemented which could help with such cases. |
Ideally that'd use same/similar logic used by Dynamic Scheduling in Remote Execution https://blog.bazel.build/2019/02/01/dynamic-spawn-scheduler.html |
@Globegitter absolutely. In Bazel we'll probably limit dynamic scheduling to remote caching only (and disallow remote execution for safety reasons). @njlr thanks a lot for doing these benchmarks. We'll probably be landing #6862 in Bazel master this week, and I'd love to run these benchmarks again with that change in. |
Will pick this up soon! |
Fantastic! I would be keen to re-run some benchmarks when you are ready 👍 |
Any updates on this? |
I'm not sure that we have enough information here to decide on a course of action. What's the reason for the cached case to be slower? Is that something that can be fixed? If it's "just" the network round-trip time, then it should be possible to make the lookup async and cancel the action if the lookup is faster (or cancel the lookup if the action is faster). This is similar to how the dynamic strategy works, and would hopefully reuse as much of the infrastructure as possible. However, I'm not sure how to handle cache writes. It is technically possible to make them async, but that'll make it difficult to report errors. |
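To make the idea above concrete, here is a minimal sketch of racing a cache lookup against local execution and cancelling the loser. It is written in Python with `asyncio` purely for illustration and is not Bazel code; `race_lookup_against_build`, `lookup` and `build` are hypothetical names, and a lookup that returns `None` is assumed to mean a cache miss. Cache writes, the open question raised above, are deliberately left out.

```python
import asyncio


async def race_lookup_against_build(lookup, build):
    """Run a cache lookup and a local build concurrently.

    Returns ("cache", result) if the lookup finishes first with a hit,
    otherwise ("local", result) once the build finishes. Whichever task
    is still running when a result is available gets cancelled.
    `lookup` and `build` are hypothetical zero-argument coroutine functions.
    """
    lookup_task = asyncio.create_task(lookup())
    build_task = asyncio.create_task(build())
    pending = {lookup_task, build_task}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task is lookup_task:
                    # A failed lookup or a miss (None) is not fatal:
                    # keep waiting for the local build instead.
                    if task.exception() is None and task.result() is not None:
                        return "cache", task.result()
                else:
                    # Re-raises the build's exception if the build failed.
                    return "local", task.result()
        raise RuntimeError("both tasks finished without producing a result")
    finally:
        # Cancel the loser (a no-op for tasks that already finished).
        for task in (lookup_task, build_task):
            task.cancel()
        await asyncio.gather(lookup_task, build_task, return_exceptions=True)
```

With a slow lookup and a fast build the call returns `("local", ...)`; with a fast hit it returns `("cache", ...)` and the build is cancelled. The sketch only trades CPU for latency, so it helps when spare CPU is available.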
@jmmv has some nice blog posts on this topic. |
Big +1 we would love to see this. I asked @jmmv about this very thing on twitter and he replied:
|
Same here. Huge +1 if we could get dynamic strategy for cache lookup/download. Right now we have to have developers flip on and off their remote cache based on their download speed. For some it changes based on the time of day because of shared internet resources. |
The motivation behind jongerrish@'s request is that our build downloads ~2GB of data from the cache, so build time depends heavily on download speed. We also have a per-action breakdown comparing builds with and without the cache, and on machines with lower network speed it is clearly faster to build locally than to download from the cache. Another interesting data point is that ~3000 actions are <1KB, so once latency is taken into account it's probably not even worth checking whether they are present in the cache
Looking for some implementation guidance for this feature... would it be reasonable to have a new mode where we register a RemoteSpawnStrategy() that takes a new class RemoteCacheSpawnRunner that is more or less just an adapter to a RemoteCache? @ulfjack @philwo @buchgr @jin similar how the existing remote execution strategy is built here: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteActionContextProvider.java#L110 |
I was thinking about that, but I'm not sure how to get the results written back to the cache. Right now, the interface requires both lookup and write to be done in the same, err, context. As much as I like the technical challenge here, can we first confirm that it's due to the lookup overhead? Did you try to increase the number of jobs to see if that helps to hide the latency? |
The CPU, RAM and network are all maxed out during a Bazel run. On my machine it sometimes even dips into swap, so increasing the number of jobs causes an OOM. Alternatively, we can try to experiment with increasing Regarding cache writes: first of all, we don't seed the remote cache from local builds. Instead, we have a stable machine on CI that does that, so for v1 it's probably acceptable not to support uploading to the cache. Unless you're talking about local workspace writes? I didn't check the code, but in general I'd assume only the "winning" SpawnRunner should be responsible for writing to the cache at the very end. |
@nkoroste I'm afraid in that case this won't be much of a win. The proposal here is to trade CPU for latency while using additional threads - that's only going to be an improvement if you have extra CPU, and if you're almost running OOM, your overall build latency might be dominated by gc rather than network round-trip latency. AFAICT, |
Sorry for the delay on this. To add more context and visibility from some of the offline conversations: I'm not suggesting that increasing the number of jobs and max connections will improve anything. In fact, we benchmarked several variations of those two flags, and performance is generally worse when you increase them. All I'm saying is that Bazel produces GBs of data, especially for Android builds, that have to be downloaded from the remote cache. Download time is obviously directly tied to your network's overall speed and latency. During a build with a high cache hit rate (85%+) for a big app, the majority of the time is spent downloading bytes from the cache while most of the machine's CPU and RAM are idle. With a dynamic spawn strategy we can use some of the machine's resources and reduce the number of bytes downloaded from the cache, which will hopefully improve overall build time for developers with a bad network connection. In the meantime, we are trying to improve the Android rules themselves so that they produce less unnecessary data, which will help with this as well. See #11253 for example. |
Our profiling shows that the remote cache latency plays a huge part in build performance. It seems that the cache checking uses the same thread as the action runner, hence, a slow remote cache will slow down all the build actions. A simple approach to this issue is to introduce a dedicated thread pool for all remote cache interaction. |
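As a rough illustration of the dedicated-pool idea in the comment above, the sketch below keeps cache lookups on their own thread pool so that a slow cache never occupies an execution slot. This is a Python sketch under stated assumptions, not how Bazel's spawn runners are implemented; `CacheFirstScheduler`, `cache_lookup`, `execute` and the pool sizes are hypothetical, and a lookup returning `None` is assumed to mean a miss.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class CacheFirstScheduler:
    """Look actions up on an I/O pool; run misses on a separate execution pool."""

    def __init__(self, cache_lookup, execute, io_threads=32, exec_threads=4):
        self._lookup = cache_lookup    # hypothetical: action -> result or None
        self._execute = execute        # hypothetical: action -> result
        self._io = ThreadPoolExecutor(max_workers=io_threads)
        self._cpu = ThreadPoolExecutor(max_workers=exec_threads)

    def submit(self, action):
        """Return a Future resolving to ("cache", result) or ("local", result)."""
        result = Future()

        def run_locally():
            try:
                result.set_result(("local", self._execute(action)))
            except Exception as exc:
                result.set_exception(exc)

        def on_lookup_done(lookup_future):
            try:
                hit = lookup_future.result()
            except Exception:
                hit = None  # treat cache errors as misses
            if hit is not None:
                result.set_result(("cache", hit))
            else:
                # Only now does the action take an execution slot.
                self._cpu.submit(run_locally)

        self._io.submit(self._lookup, action).add_done_callback(on_lookup_done)
        return result
```

The design choice here is that network waits are confined to the I/O pool, so the number of in-flight lookups can be much larger than the number of execution threads. Whether this improves a given build depends on its shape, as the thread goes on to discuss.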
FYI @coeuvre |
@ashi009 We have a thread pool for gRPC calls, but we block waiting on the result inside the spawn runner. One thing we can certainly improve is to make the remote spawn runner non-blocking, but I doubt that will improve overall performance; it depends. If your build doesn't have actions that are waiting for an available action execution thread (defined by That said, can you share your profiling setup and more profiling data? |
Totally missed the notification. Sure thing, but I need to do it privately. Our build target is a huge iOS app with over 10k source files. The majority of actions compile ObjC files, which depend on no preceding actions. The critical path converges at the linking action. We have increased |
Thanks for sharing your build shape. In this case, I think the performance will be improved once the features described by #13632 and #13632 (comment) are implemented. |
Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team ( |
@bazelbuild/triage, I think this is still relevant... can we keep it open? |
Description of the problem / feature request:
Applying a Bazel HTTP remote cache can make the build slower, depending on the project and artefacts being built.
I tried a few projects:
For the cache server, I tried both bazel-remote and my own Node.js server that I cobbled together. Both yielded similar results.
The cache was hosted on a reasonable Digital Ocean box in the same city:
Internet connection speed for the client was around 10 Mbps, with ~20ms latency. Not the fastest, but Bazel should be able to adapt to this.
The Bazel client was running on a fairly high-end laptop:
Suggestion
Perhaps the HTTP cache should record the time it took to build an artefact (according to the client). This would give Bazel enough information to decide if it is better to build or fetch.
Relevant variables:
Currently the server receives minimal metadata from Bazel.
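A minimal sketch of the build-or-fetch decision suggested above, assuming the cache entry carried the artefact size and a client-reported build time. The HTTP cache stores no such metadata today, so `CacheEntry`, `build_seconds` and both functions are hypothetical. The bandwidth and latency are the figures quoted earlier in this report, with 10mbps read as 10 megabits per second.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    size_bytes: int       # size of the cached artefact
    build_seconds: float  # hypothetical: build time reported by the uploading client


def estimated_fetch_seconds(entry, bandwidth_bits_per_s, latency_s):
    """One round trip plus transfer time at the given bandwidth."""
    return latency_s + entry.size_bytes * 8 / bandwidth_bits_per_s


def should_fetch(entry, bandwidth_bits_per_s, latency_s):
    """Fetch only when the estimated download beats rebuilding locally."""
    return estimated_fetch_seconds(entry, bandwidth_bits_per_s, latency_s) < entry.build_seconds


BANDWIDTH = 10_000_000  # 10 Mbps, assumed from the setup described above
LATENCY = 0.020         # ~20 ms

small_and_cheap = CacheEntry(size_bytes=1_000, build_seconds=0.005)
large_and_slow = CacheEntry(size_bytes=1_000_000, build_seconds=30.0)

print(should_fetch(small_and_cheap, BANDWIDTH, LATENCY))  # False: the round trip alone exceeds the build time
print(should_fetch(large_and_slow, BANDWIDTH, LATENCY))   # True: ~0.82 s download vs 30 s build
```

This ignores differences in CPU speed between the client that recorded the build time and the one making the decision, so the recorded time would be a hint rather than a guarantee.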
What operating system are you running Bazel on?
Ubuntu 18.10
What's the output of bazel info release?
Have you found anything relevant by searching the web?
Related discussion: #6091