FR?: ctx.actions.symlink action time scales w/ size of input; should be constant #14125
Comments
Repeating what I said in #10702: Sadly, One possible solution is to generalize 32b0f5a to also short-circuit the filesystem traversal for symlinks in the local execution case. I know that this isn't trivial to do because I tried (and failed) to do that in an earlier iteration of that CL. |
@tjgq: you, @BalestraPatrick and I talked at BazelCon today about this, and you mentioned that we should be able to store the hash in metadata (or use the hash that is already in metadata?) to prevent having to follow the symlink and re-hash it. And if not, I suggested we could associate the hash with inodes, cache that, and do it that way instead. |
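The inode idea above can be sketched roughly as follows. This is a toy illustration in Python, not Bazel code: the class name `InodeDigestCache`, the choice of `(st_dev, st_ino, st_size, st_mtime_ns)` as the key, and the `hash_calls` counter are all assumptions made for the example. Because `os.stat` follows symlinks, a symlink and its target map to the same key, so the second lookup does not re-read the file.

```python
import hashlib
import os


class InodeDigestCache:
    """Toy digest cache keyed by file identity instead of path (hypothetical)."""

    def __init__(self):
        self._digests = {}   # (dev, inode, size, mtime_ns) -> hex SHA-256
        self.hash_calls = 0  # number of real hash computations, for illustration

    @staticmethod
    def _key(path):
        # os.stat follows symlinks, so a symlink shares the key of its target.
        st = os.stat(path)
        return (st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns)

    def digest(self, path, chunk_size=1 << 20):
        key = self._key(path)
        cached = self._digests.get(key)
        if cached is not None:
            return cached
        self.hash_calls += 1
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in chunks so large outputs are never held in memory at once.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        self._digests[key] = h.hexdigest()
        return self._digests[key]
```

Note that a copy of the file has a different inode, so this kind of key would only help for symlinks and hard links, not for copied outputs.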
Yes, my thinking is that #16283 could be adapted to do this - specifically, by having the symlink action itself inject the metadata for the symlink artifact into skyframe, instead of doing it only for BwoB. (I did try this at one point, but ran into some google3-specific issues and would have to look into it again.) |
Hello! Just sending an update here that even in Bazel 6, our incremental build times are very much suffering from this issue. As you can see in the following annotated profile trace, over 50% of the build time is spent in Bazel's various steps:
All of the above steps account for more than 46s of this 90s build. I think this is a very common issue for all projects whose binaries are on the order of hundreds of MBs (for us this binary is ~700MB). This is basically a blocker for adopting Bazel locally due to the overhead compared to other build systems. |
@coeuvre while we are working towards BwoB improvements for the next release, are there any workarounds that can help to improve the situation described by @BalestraPatrick in the last comment? |
@BalestraPatrick Can you share a repro that produces a profile similar to the one you shared above? I know there is a repro in #14125 (comment), but it is only for "actuallyCompleteAction" and you mentioned 2. For "actuallyCompleteAction", I believe most of the time was spent on calculating the digest of outputs (but I could be wrong about your specific case). I do have another "simple" improvement in mind; let's see whether I can create a PR for that. |
@coeuvre There was a repro case here (not sure if you meant to link to that). I updated the same repro case with a new commit to inflate the binary a bit more (still 10x less than our binary though) here. You can use |
I pushed another change to bring the binary up to the right size (over 700MB): https://app.buildbuddy.io/invocation/0374c0e7-c0a8-498c-9730-29e03fb8f920#timing. This clearly illustrates the duplicate time on symlinks. I'm still trying to repro the |
This is an interesting comment: bazel/src/main/java/com/google/devtools/build/lib/actions/ActionCacheChecker.java Lines 439 to 443 in 2ff87be
|
@brentleyjones I have created #17478. With that, the time for |
@coeuvre It works ❤️! Would this be able to get into 6.1 as well? This is such a huge improvement for us. |
I'm glad it works for you! Sure, I believe it's safe to cherrypick it into 6.1. |
@coeuvre Thank you so much! I tested this in our project and indeed the
As you can see, the brown and blue lines disappeared in the second build. Note: I had to cherry-pick your change locally to Bazel 6.0 because latest master causes our linking step to get into an infinite loop. I'll try to bisect exactly what change caused this. Edit: found the cause for that infinite loop. It's 7b4acfe that required this change in rules_apple that we don't have yet, so nothing to worry about. |
Thanks a lot, @coeuvre! works great for our codebase as well... |
The cost of symlink action scales with the size of input because Bazel re-calculates the digest of the output by following the symlink in `actuallyCompleteAction` (#14125). However, the re-calculation is redundant because the digest was already computed by Bazel when checking the outputs of the generating action. Bazel should be smart enough to reuse the result. There is a global cache in Bazel for digest computation. Symlink action didn't make use of the cache because it uses the path of symlink as key to look up the cache. This PR changes to use the path of input file (i.e. target path) to query the cache to avoid recalculation. For a large target (700MB), the time for symlink action is reduced from 2000ms to 1ms. Closes #17478. PiperOrigin-RevId: 509524641 Change-Id: Id3c9dc07d68758770c092f6307e2433dad40ba10
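To make the commit message concrete, here is a minimal Python sketch of a path-keyed digest cache. It is not Bazel's implementation: `PathDigestCache`, the `resolve_symlinks` switch, and the `hash_calls` counter are names invented for this example, and the sketch ignores invalidation when a file changes. With `resolve_symlinks=False` the symlink's own path is the key, so the lookup misses and the target is hashed again; with `resolve_symlinks=True` the key is the target path and the earlier result is reused.

```python
import hashlib
import os


class PathDigestCache:
    """Toy path-keyed digest cache (hypothetical, not Bazel's implementation)."""

    def __init__(self):
        self._digests = {}   # path -> hex SHA-256
        self.hash_calls = 0  # number of real hash computations, for illustration

    def _hash_file(self, path):
        self.hash_calls += 1
        h = hashlib.sha256()
        with open(path, "rb") as f:  # open() follows symlinks
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def digest(self, path, resolve_symlinks):
        # Keying by the symlink's own path misses the entry stored for the
        # target; keying by the resolved target path finds it.
        if resolve_symlinks:
            key = os.path.realpath(path)
        else:
            key = os.path.abspath(path)
        if key not in self._digests:
            self._digests[key] = self._hash_file(key)
        return self._digests[key]
```

Digesting the target and then the symlink costs two full reads in the first mode and one in the second, which mirrors the size-dependent versus constant cost described above.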
For "action dependency checking", it's probably that too many actions are checking the action cache. We have added an option to throttle the action cache check: 3d29b2e. @brentleyjones @BalestraPatrick If you are able to reproduce the "action dependency checking" overhead, can you check whether that option improves it? |
@coeuvre Thank you! For us "action dependency checking" doesn't always happen, but I can try that flag and report back if I see an improvement. The other big bottleneck (and the last one in this series) is the bazel/src/main/java/com/google/devtools/build/lib/skyframe/SkyframeActionExecutor.java Line 1521 in 7f548fb
After adding some logging to that line, I see this:
The first is a binary and the second is a tree artifact with about 1000 files. Do you have any idea how we can either skip or make this part more performant? I see the docs mention this:
Does |
For the tree artifact case, this tree visiting logic is where the most time is spent: bazel/src/main/java/com/google/devtools/build/lib/skyframe/ActionMetadataHandler.java Lines 326 to 349 in 7f548fb
|
@BalestraPatrick Thanks for your investigation! As you may already know, Bazel tracks files by their content, not their path, to ensure correctness. For every output an action generates, Bazel must know its digest to track it correctly. Essentially,
and the time for computing digests scales with the number and size of outputs. Did you use local or remote execution? I am asking because for local execution, there is probably not much we can do except parallelize the computation. For remote execution, on the other hand, when Bazel gets the execution result, it can get the digest for each output from the result (because the remote worker computed them). However, remote execution with the all output mode currently doesn't use that information so For remote execution with toplevel or minimal (i.e. build without the bytes), Bazel already uses the pre-computed digests from the remote worker so there shouldn't be overhead in
|
@coeuvre Unfortunately these specific actions have to always run locally (we use I've also noticed that possibly the digest computation is not reused across actions, is that correct? So for example you can see in my profile that the digest for the binary is computed once already (dark green line which takes 2s), and it likely takes again another 2s as part of the tree artifact computation. |
I see. We have the same issue internally for our iOS builds. I don't have a clear solution yet but I will think about it.
Yes, I believe parallelizing will help. Also, @meisterT and I were profiling the overhead in "action dependency checking" today and found that it could be improved with parallelization as well. Our plan is to bring Loom into Bazel and do the I/O optimization with virtual threads. This is going to happen around Q2 (but I can't promise).
Bazel has a global cache for digest computation and the size is controlled by If Bazel cannot reuse it, something is definitely wrong. Otherwise, I can probably make the cache smarter to evict entries for small files more frequently. |
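The "evict entries for small files more frequently" idea could look something like the sketch below. It is only an illustration under assumptions: `SizeAwareDigestCache` and its `max_entries` bound are hypothetical, and the policy shown (always evict the entry for the smallest file, since it is the cheapest to re-hash) is one possible reading of the suggestion, not what Bazel's cache does.

```python
class SizeAwareDigestCache:
    """Toy bounded digest cache that evicts small files first (hypothetical)."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries = {}  # path -> (size_in_bytes, digest)

    def get(self, path):
        entry = self._entries.get(path)
        return entry[1] if entry is not None else None

    def put(self, path, size_in_bytes, digest):
        self._entries[path] = (size_in_bytes, digest)
        # When over capacity, drop the entry that is cheapest to recompute.
        # This may be the entry that was just inserted if it is the smallest.
        while len(self._entries) > self.max_entries:
            victim = min(self._entries, key=lambda p: self._entries[p][0])
            del self._entries[victim]
```

The trade-off is that a 700MB binary stays cached even when many small outputs churn through the cache, at the cost of re-hashing small files more often.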
Ahh, |
@coeuvre Looks like |
Yes, the key is based on the path. I re-read your case above, IIUC, the binary is generated by one action, and then is copied to the tree artifact by another action -- in this case, the cache won't help because the paths are different. |
I spent some time on the idea of parallelizing digest computation. For my prototype, I was able to reduce the time of For computing digest for large files, if I read the code correctly, Bazel is using Java code to do the SHA256. One optimization could be using native code instead and utilizing SIMD (e.g. use |
For large file digests, it might be worth taking a look at the conversation over at bazelbuild/remote-apis#235, which I think @EdSchouten is planning to implement in Bazel after the PR is merged over at remote-apis. |
@sluongng Thanks for the pointer! I didn't know PSHA2 had already been made public. So my last point about large file digests is basically what is already discussed there. |
* Fix symlink file creation overhead The cost of symlink action scales with the size of input because Bazel re-calculates the digest of the output by following the symlink in `actuallyCompleteAction` (#14125). However, the re-calculation is redundant because the digest was already computed by Bazel when checking the outputs of the generating action. Bazel should be smart enough to reuse the result. There is a global cache in Bazel for digest computation. Symlink action didn't make use of the cache because it uses the path of symlink as key to look up the cache. This PR changes to use the path of input file (i.e. target path) to query the cache to avoid recalculation. For a large target (700MB), the time for symlink action is reduced from 2000ms to 1ms. Closes #17478. PiperOrigin-RevId: 509524641 Change-Id: Id3c9dc07d68758770c092f6307e2433dad40ba10 * Update ActionMetadataHandler.java * Create OutputPermissions.java --------- Co-authored-by: Chi Wang Co-authored-by: keertk
Seems the overhead might not be fully fixed @coeuvre? https://bazelbuild.slack.com/archives/C01E7TH8XK9/p1677829662738059 |
To close the loop, the overhead for symlink is fixed (and verified) in 6.1. |
Linking #17009 for tracking the effect of parallelizing digest computation for tree artifacts. |
This makes it possible to use every core available for checksumming, which makes a huge difference for large tree artifacts. Fixes #17009. RELNOTES: None. PiperOrigin-RevId: 525085502 Change-Id: I2a995d3445940333c21eeb89b4ba60887f99e51b
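As a rough illustration of checksumming a tree artifact in parallel, here is a Python sketch. It is not the Bazel change itself: `sha256_file`, `digest_tree`, and the `max_workers` parameter are names chosen for this example, and a thread pool is used on the assumption that CPython's `hashlib` releases the GIL while hashing large buffers, so several files can be hashed at the same time.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def sha256_file(path, chunk_size=1 << 20):
    """Returns the hex SHA-256 digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def digest_tree(root, max_workers=None):
    """Returns {relative_path: hex digest} for every file under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    paths.sort()  # deterministic order, independent of directory listing order
    # Hash files concurrently; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        digests = list(pool.map(sha256_file, paths))
    return {os.path.relpath(p, root): d for p, d in zip(paths, digests)}
```

For a tree with about 1000 files, as in the case discussed earlier in this thread, the work is spread across files; a single very large file inside the tree would still be hashed by one thread.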
The symlink issue has been fixed since 6.1. Parallelizing digest computation is submitted. Closing. |
@chiragramani @BalestraPatrick Can y'all verify that 6.2.0 fully fixes this? |
I can bump our project to a version of Bazel that contains this change and report back. |
Description of the problem / feature request:
Action 1 generates a 2 GiB output, `2gb.out`. Creating the output takes .75s, but postprocessing (checksumming?) via `actuallyCompleteAction` takes 5s
Action 2 is a `ctx.actions.symlink` with `target_file = ctx.file.2gb_out`. Creating the output is seemingly instantaneous, but postprocessing via `actuallyCompleteAction` costs another 5s.
Feature requests: what underlying problem are you trying to solve with this feature?
rules_pkg's pkg_zip deprecated the `out` attr and manages outputs internally w/ an implicit `ctx.actions.symlink`. That call to `symlink` adds 15s to my build's critical path.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
defs.bzl
BUILD
What operating system are you running Bazel on?
Linux, CentOS 8
What's the output of `bazel info release`?
`release 4.2.1` and `release 5.0.0-pre.20210929.1` (via `USE_BAZEL_VERSION=rolling`)
If `bazel info release` returns "development version" or "(@non-git)", tell us how you built Bazel.
n/a
What's the output of `git remote get-url origin ; git rev-parse master ; git rev-parse HEAD`?
n/a
Have you found anything relevant by searching the web?
No.
(Edit: Found issue #12158 and discovered undocumented options `digest_function` and `unix_digest_hash_attribute_name`. Perhaps these options or code adjacent to them could be leveraged for this?)
Any other information, logs, or outputs that you want to share?
`actuallyCompleteAction` boils down to checksumming the target of the output symlink. Since Bazel already has a checksum for the input, I would hope that this value could be reused.
`copy_file` is optionally a wrapper for `ctx.actions.symlink`, so more users may encounter this than you'd think.