Use git hash as hashing function #12158
Are you saying that for a single running Bazel server you're seeing mutable inodes for the same files? Or are you taking a memory snapshot of a running Bazel server and deploying it elsewhere? It would be helpful if you could post the sequence of steps that you perform and point out which one is slow.
Each night we bake a new node image for our VMs, which handle the build and test phases with Bazel. In our CI, each time we create a commit, a build job starts, allocates such a pre-baked VM, and builds and tests the product. Once a job starts, it first does a git checkout of the commit to build; the pre-baked image already contains a warm /tmp/bazel, but the freshly checked-out files get new inodes, so Bazel rehashes everything. Is this correct?
Oh, I think I understand now. The pre-baked /tmp/bazel includes Bazel's action cache, and the metadata there for your input artifacts includes the inodes from the original run of Bazel. Yes, this is a limitation of using the inode as a contents proxy. But I think the ctime in FileContentsProxy will also be an issue, right? And there's a big risk of incorrectly not re-executing an action that needs to be re-run.

The way this is normally worked around is by making a (SHA256) "fast digest" of the file accessible via an xattr. Because you're already doing a git checkout, the digests are presumably already known to git (`git ls-files -s` prints them), so you could set them as xattrs on the checked-out files.

This would be lying to Bazel, because the "git hash" isn't a hashing scheme that Bazel knows about, so it would not work with remote execution. But it should work with remote caching, if all clients are using this hashing scheme, since Bazel would never notice.

If setting such a hash does work except for remote execution, I would be interested in seeing if the remote-exec folks would want to add this hashing scheme, because it seems like it would be very efficient for the common use case of checking out a git commit.
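For concreteness, a minimal sketch of that xattr idea (my illustration, not from the thread), assuming a Linux filesystem with user xattrs enabled; the attribute name `user.checksum` is made up, and whether the value should be a hex string or raw digest bytes is an assumption to verify:

```sh
# Stamp every checked-out file with its git blob hash via a user xattr.
# Caveat: plain `read` mishandles paths with unusual characters; a robust
# version would use `git ls-files -s -z` and NUL-delimited parsing.
git ls-files -s | while read -r mode hash stage path; do
  setfattr -n user.checksum -v "$hash" "$path"  # hex string assumed here
done
```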
Oh yes, true. I was not aware that git does not preserve modification time.
Presumably during the bazel analysis phase, right?
Would there be no other option to decide if files need rehashing, one that doesn't break remote execution? If not, we could perhaps run `touch` on every file after checkout. What do you think?
I'd be much more receptive to a feature request to broaden Bazel's library of useful hashes than to one that makes Bazel vulnerable to correctness bugs. Running `touch` on every file seems very hacky. Would you like to repurpose this FR into one that allows Bazel to use the "git hash" as its hashing function?
That sounds awesome! Yes!
Remote exec team, what do you think of adding the hashing function used by git to the hashing functions supported by Bazel? (I know nothing about this process, so perhaps my question doesn't even make sense; apologies if so.) https://stackoverflow.com/questions/460297/git-finding-the-sha1-of-an-individual-file-in-the-index claims that it is just SHA1, computed over the file contents with a `blob <size>\0` header prepended.
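To illustrate (this example is mine, not from the thread), the formula can be reproduced from the shell without any git plumbing and compared against `git hash-object`, here on an arbitrary WORKSPACE file:

```sh
# git's blob hash is sha1("blob " + <size in bytes> + "\0" + <file contents>).
$ printf 'blob %d\0' "$(wc -c < WORKSPACE)" | cat - WORKSPACE | sha1sum
$ git hash-object WORKSPACE   # prints the same digest
```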
Since users often have the git hashes available in constant time on their system, this might be quite helpful? They could set xattrs on their sources themselves and point Bazel to them using the new flag that reads file digests from an xattr.
Both remote cache and remote executor check the hashing function. IIUC, git is using SHA1, which should be supported by Bazel. I didn't test it, but combining the two might just work. It's also fine to add a special "git hashing function".
If you look at the link I provided, it says that it's SHA1, but computed on the file with a `blob <size>\0` prefix. You can see the hash git stores for a file with:

```sh
$ git ls-files -s WORKSPACE
```

So it would be a simple hash to add, but it's not exactly SHA1.

@coeuvre do I need to do anything besides adding the hash as a permitted value of the digest function option?

@coeuvre can you explain how remote caching knows what hashing function local Bazel used? If Bazel reports to the remote cache that an action with a given key produced certain outputs, how does the cache know which hashing function produced that key?
For gRPC remote cache and remote execution, we check the remote server's supported hashing functions by making a GetCapabilities call. If the hashing function used is not supported by the remote server, we exit abruptly. For DiskCache and HttpCache, we don't (can't) check the capabilities.

A remote cache may support multiple hashing functions simultaneously. However, when uploading to the remote cache, we only send the hashes, without saying which hashing function was used; it's an implementation detail of the remote server how to check the correctness of the hashes. IMO, the API should allow us to specify which hashing function was used to calculate the hash, so the server can check these effectively.

Remote execution only supports a single hashing function. It's not easy to support another hashing function with the current design of the REAPI.
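(As an aside, not from the thread: if the server exposes gRPC reflection, that capability check can be reproduced by hand with grpcurl; the endpoint and instance name below are placeholders.)

```sh
# The same REAPI call Bazel issues at startup to discover supported digests.
$ grpcurl -d '{"instanceName": "main"}' cache.example.com:8980 \
    build.bazel.remote.execution.v2.Capabilities/GetCapabilities
```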
How can the cache server check hashes effectively? All it can do is (maybe) verify that this is a valid SHA1 hash. That wouldn't be able to differentiate between the "git hash" and actual SHA1. The server doesn't have access to any actual input files, right?
So you're saying that any remote execution server would have to separately add git-hash capabilities to work with this? If git-hash would work with remote caching, it sounds like it would still be useful enough in this situation to implement. And once that's done, perhaps some remote execution servers would start supporting it as well, as a separate effort.
We upload BOTH the hash and the content to the cache server, so it can recalculate the hash from the content. The problem with the current API is that we don't specify the hashing function used to calculate that hash -- the cache server has to guess which one was used.
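For example (my illustration, with a placeholder URL): in Bazel's HTTP cache protocol, the digest appears only in the URL path of the PUT, so nothing in the request says how it was computed:

```sh
# Upload an object to the CAS; the server sees only a bare hex digest.
$ curl -X PUT --data-binary @hello.o \
    "http://cache.example.com/cas/$(sha256sum hello.o | cut -d' ' -f1)"
```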
Yes, it is limited by the specification, but a server can provide a different remote executor endpoint for each supported hashing function if it really wants to.
The workflow for remote caching is roughly:

1. Bazel computes the digests of an action and its inputs locally.
2. It asks the cache whether a result for that action digest exists.
3. On a hit, it downloads the outputs from the CAS; on a miss, it executes the action locally and uploads the outputs and the action result.
For HTTP/disk remote caching to work with the "git hash", I think adding the hash as a permitted value of the digest function option should be enough. For gRPC remote caching to work, the "git hash" must also be included in the cache server's advertised capabilities. To let a cache server that supports multiple hashing functions work effectively, the API would additionally have to carry which hashing function produced each hash.
Adding multiple hashing function support is more complex and can be done later.
@mihaigalos it sounds like if you're using an HTTP or disk-based remote cache, you should be able to prototype my suggestion by doing the following: after the git checkout, read the blob hashes with `git ls-files -s` and set them as an xattr on each source file, then point Bazel at that attribute with the xattr digest flag.

Then you can run a build against your cache and see whether you get hits without rehashing. However, if your remote cache is gRPC, we'd have to add support to it for git-hash before doing this, which I'm not qualified to do. You could still test the toy example by running Bazel without a remote cache and seeing if it can use these hashes.
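A toy version of that experiment might look like the following, reusing the xattr-stamping loop sketched earlier; the flag spelling `--unix_digest_hash_attribute_name`, its placement as a startup option, and the hex encoding of the attribute value are my assumptions to verify, not confirmed by this thread:

```sh
# 1. Check out sources and stamp each file with its git blob hash (see above).
git checkout "$COMMIT"
git ls-files -s | while read -r mode hash stage path; do
  setfattr -n user.checksum -v "$hash" "$path"
done

# 2. Build against a disk cache, telling Bazel to read digests from the xattr.
#    Flag name/position assumed; Bazel may expect raw digest bytes, not hex.
bazel --unix_digest_hash_attribute_name=user.checksum \
  build --disk_cache=/tmp/bazel-disk-cache //...
```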
Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 2+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant.
This issue has been automatically closed due to inactivity. If you're still interested in pursuing this, please reach out to the triage team (@bazelbuild/triage).
Original title: No evaluation of inode equality in FileContentsProxy