
Use git hash as hashing function #12158

Closed

mihaigalos opened this issue Sep 23, 2020 · 15 comments
Labels
P2 We'll consider working on this in future. (Assignee optional)
stale Issues or PRs that are stale (no activity for 30 days)
team-Core Skyframe, bazel query, BEP, options parsing, bazelrc
team-Remote-Exec Issues and PRs for the Execution (Remote) team
type: feature request

Comments

@mihaigalos

mihaigalos commented Sep 23, 2020

Original title: No evaluation of inode equality in FileContentsProxy

Description of the problem / feature request:

No evaluation of inode equality in FileContentsProxy.

Feature requests: what underlying problem are you trying to solve with this feature?

Speed up the analysis phase by not adding the whole workspace to the second-level digest check (described here).

What operating system are you running Bazel on?

Ubuntu 18.04/20.04.

What's the output of bazel info release?

release 2.2.0

Any other information, logs, or outputs that you want to share?

We are compiling a large repo (several MLoC) and the analysis phase takes 1-2 minutes.
We are using several layers of caching, including the Bazel remote cache. To reduce traffic to the remote cache, we also reuse .cache/bazel intermediate files across runs (baked into a node image deployed onto VMs).
Because the inodes of files after a git pull and checkout are not the same across runs, the analysis phase takes very long: files are effectively rehashed.

@oquenchil oquenchil added team-Remote-Exec Issues and PRs for the Execution (Remote) team type: support / not a bug (process) untriaged labels Oct 5, 2020
@coeuvre coeuvre added team-Core Skyframe, bazel query, BEP, options parsing, bazelrc and removed team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Dec 9, 2020
@janakdr
Contributor

janakdr commented Dec 9, 2020

Are you saying that for a single running Bazel server you're seeing mutable inodes for the same files? Or are you taking a memory snapshot of a running Bazel server and deploying it elsewhere? It would be helpful if you could post a sequence of steps that you perform and what the slow one is.

@mihaigalos
Author

mihaigalos commented Dec 10, 2020

Each night we bake a new node image for our VMs, which handle the build and test phases with Bazel.
We do this by running bazel --output_base=/tmp/bazel build and then putting that output base into the generated VM image.
The goal is to reduce the network traffic to the Bazel remote caches: an artifact found in /tmp/bazel is not fetched from the remote cache.

In our CI, each time we create a commit, a build job starts, allocates such a prebaked VM, and builds and tests the product.

Once a job starts, it first does a git pull and git checkout before building and testing.
I assume this step changes the inode of each file, since the allocated inode does not match the one recorded when producing the /tmp/bazel output base (we're talking about different VMs here).

Is this correct?
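
For reference, one quick way to check this across the two VMs (a sketch using GNU stat; WORKSPACE is just an example file):

$ stat -c 'inode=%i ctime=%z' WORKSPACE    # on the VM that baked the image
$ git pull && git checkout <commit>        # what each CI job does first
$ stat -c 'inode=%i ctime=%z' WORKSPACE    # on the allocated VM; differing values force a rehash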

@janakdr
Contributor

janakdr commented Dec 10, 2020

Oh, I think I understand now. The pre-baked /tmp/bazel includes Bazel's action cache, and the metadata there for your input artifacts includes the inodes from the original run of Bazel.

Yes, this is a limitation of using the inode as a contents proxy. But I think the ctime in FileContentsProxy will also be an issue, right? And there's a big risk of incorrectly not re-executing an action that needs to be re-run.

The way this is normally worked around is by making a (SHA256) "fast digest" of the file accessible via an xattr.

Because you're already doing a git checkout, the digests are presumably known to git via git ls-files. I wonder if you could set the xattr of each file to the git hash when you're doing the checkout.

This would be lying to Bazel because the "git hash" isn't a hashing scheme that Bazel knows about, so it would not work with remote execution. But it should work with remote caching, if all clients are using this hashing scheme, since Bazel would never notice.

If setting such a hash does work except for remote execution, I would be interested in seeing if the Remote-exec folks would want to add this hashing scheme, because it seems like it would be very efficient for the common use case of checking out a git commit.
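
A rough sketch of that checkout-time step (assuming GNU attr's setfattr is available; note I haven't verified whether Bazel expects the raw digest bytes or the hex string in the xattr):

git ls-files -s | while read -r mode hash stage path; do
  # ls-files -s prints "<mode> <blob-hash> <stage>\t<path>"; attach the blob hash to the file
  setfattr -n user.digest.git_hash -v "$hash" "$path"
done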

@mihaigalos
Author

Yes, this is a limitation of using the inode as a contents proxy. But I think the ctime in FileContentsProxy will also be an issue, right?

Oh yes, true. I was not aware that git does not preserve modification time.
How about resetting the timestamps of the files with the script presented here?
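
Roughly this kind of thing, I imagine (a sketch assuming GNU touch; %cI is git's strict ISO 8601 committer date):

git ls-files | while read -r path; do
  # reset each file's mtime to its last commit time;
  # note touch can set mtime but not ctime, which FileContentsProxy also checks
  touch -d "$(git log -1 --format=%cI -- "$path")" "$path"
done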

I wonder if you could set the xattr of each file to the git hash when you're doing the checkout.

Presumably during the bazel analysis phase, right?

This would be lying to Bazel because the "git hash" isn't a hashing scheme that Bazel knows about, so it would not work with remote execution.

Would there be no other option to decide if files need rehashing? One that doesn't break remote execution?
ctime+inode is very good for local development but clearly has its limitations, particularly in CI.

If the proposal of using touch -d described above is worthwhile, I guess it would make sense to have a flag for disabling the inode check when invoking bazel.

What do you think?

@janakdr
Contributor

janakdr commented Dec 10, 2020 via email

@mihaigalos
Author

That sounds awesome! Yes!

@mihaigalos mihaigalos changed the title from "No evaluation of inode equality in FileContentsProxy" to "Use git hash as hashing function" Dec 10, 2020
@janakdr janakdr added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Dec 10, 2020
@janakdr
Contributor

janakdr commented Dec 10, 2020

Remote exec team, what do you think of adding the hashing function used by git to the hashing functions supported by Bazel? (I know nothing about this process, so perhaps my question doesn't even make sense, apologies if so.)

https://stackoverflow.com/questions/460297/git-finding-the-sha1-of-an-individual-file-in-the-index claims that it is just

(echo -ne "blob `wc -c < $file`\0"; cat $file) | sha1sum

Since users often have the git hashes available in constant time on their system, this might be quite helpful? They could set xattrs on their sources themselves and point Bazel to them using the new --unix_digest_hash_attribute_name startup option.
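
One can sanity-check that formula against git hash-object on a scratch file:

$ echo 'hello' > /tmp/f
$ git hash-object /tmp/f
ce013625030ba8dba906f756967f9e9ca394464a
$ (echo -ne "blob `wc -c < /tmp/f`\0"; cat /tmp/f) | sha1sum
ce013625030ba8dba906f756967f9e9ca394464a  -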

@coeuvre
Member

coeuvre commented Dec 14, 2020

This would be lying to Bazel because the "git hash" isn't a hashing scheme that Bazel knows about, so it would not work with remote execution. But it should work with remote caching, if all clients are using this hashing scheme, since Bazel would never notice.

Both the remote cache and the remote executor check the hashing function.

IIUC, git uses SHA1, which should be supported by Bazel. I haven't tested it, but combining --unix_digest_hash_attribute_name with --hash_function=SHA1 should do the trick.

It's also fine to add a special "git hashing function".

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) type: feature request and removed type: support / not a bug (process) untriaged labels Dec 14, 2020
@janakdr
Contributor

janakdr commented Dec 14, 2020

If you look at the link I provided, it says that it's SHA1, but computed on the file with "blob <size in bytes>\0" prepended. For example:

$ git ls-files -s WORKSPACE
100644 a5f019caa2d7c6f654cd3264140bef359cc196ed 0 WORKSPACE
$ sha1sum WORKSPACE
4ea6257c2f760ef5c1a32cdd0f743d0cd7c55ad8 WORKSPACE
$ file=WORKSPACE ; (echo -ne "blob $(wc -c < $file)\0"; cat $file) | sha1sum
a5f019caa2d7c6f654cd3264140bef359cc196ed -

So it would be a simple hash to add, but it's not exactly SHA1.

@coeuvre do I need to do anything besides adding the hash as a permitted value of the --hash_function flag? How does remote execution become aware of it?

@coeuvre can you explain how remote caching knows what hashing function local Bazel used? If Bazel reports to the remote cache that action with key abc operated on file with hash def to produce output file foo with some contents, how does the remote cache check that the input file really had hash def?

@coeuvre
Member

coeuvre commented Dec 14, 2020

For the gRPC remote cache and remote execution, we check the remote server's supported hashing functions by making a GetCapabilities call. If the hashing function in use is not supported by the remote server, we exit abruptly. For DiskCache and HttpCache, we don't (can't) check the capabilities.

The remote cache may support multiple hashing functions simultaneously. However, when uploading to the remote cache, we only send the hashes, without saying which hashing function was used. How to check the correctness of the hashes is an implementation detail of the remote server. IMO, the API should allow us to specify which hashing function was used to calculate each hash so the server can check them effectively.

Remote execution only supports a single hashing function.

It's not easy to support other hashing functions with the current design of REAPI.

@janakdr
Contributor

janakdr commented Dec 14, 2020

The remote cache may support multiple hashing functions simultaneously. However, when uploading to the remote cache, we only send the hashes, without saying which hashing function was used. How to check the correctness of the hashes is an implementation detail of the remote server. IMO, the API should allow us to specify which hashing function was used to calculate each hash so the server can check them effectively.

How can the cache server check hashes effectively? All it can do is (maybe) verify that this is a valid SHA1 hash. That wouldn't be able to differentiate between the "git hash" and actual SHA1. The server doesn't have access to any actual input files, right?

Remote execution only supports a single hashing function.
It's not easy to support other hashing functions with the current design of REAPI.

So you're saying that any remote execution server would have to separately add git-hash capabilities to work with this?

If git-hash would work with remote caching, it sounds like it would still be useful enough in this situation to implement. And once that's done, perhaps some remote execution servers would start supporting it as well, as a separate effort.

@coeuvre
Member

coeuvre commented Dec 15, 2020

The server doesn't have access to any actual input files, right?

We upload BOTH the hash and the content to the cache server, and it can recalculate the hash based on the content. The problem with the current API is that we don't specify the hashing function used to calculate that hash; the cache server has to guess which one was used.

So you're saying that any remote execution server would have to separately add git-hash capabilities to work with this?

Yes, it is limited by the specification, but a server can provide a different remote executor endpoint for each supported hashing function if it really wants to.

If git-hash would work with remote caching, it sounds like it would still be useful enough in this situation to implement. And once that's done, perhaps some remote execution servers would start supporting it as well, as a separate effort.

The workflow for the remote caching is:

  1. Determine the hashing function.
    • Users specify hashing function with --digest_function on startup.
    • The default one is SHA256.
  2. Get the digest of a file using the hashing function.
    • Users can inject digest into xattrs with --unix_digest_hash_attribute_name.
  3. Upload files along with hashes to remote cache server.
    • For gRPC remote cache, Bazel will first check whether the hashing function is supported by the cache server.
    • For HTTP/Disk remote cache, there are no checks.

For HTTP/Disk remote caching to work with "git hash", I think adding the hash as a permitted value of the --digest_function flag is enough. Just make sure all the Bazel instances accessing that cache use the same --digest_function value.
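
As a sketch (with a hypothetical GIT_SHA1 flag value and a made-up cache URL), every client sharing the cache would run something like:

bazel --digest_function=GIT_SHA1 build --remote_cache=http://cache.example.com //...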

For gRPC remote caching to work, the "git hash" should be included in the cache server's capabilities.

To let a cache server that supports multiple hashing functions work effectively, either:

  1. the cache server provides different cache endpoints for each hashing function, where one endpoint supports only one corresponding hashing function,
  2. or change REAPI to allow the client to specify which hashing function was used to calculate the hash.

Adding support for multiple hashing functions is more complex and can be done later.

@janakdr
Contributor

janakdr commented Dec 21, 2020

@mihaigalos it sounds like if you're using an HTTP or disk-based remote cache, you should be able to prototype my suggestion by doing the following: use git ls-files -s to construct a file that is valid output of getfattr -R -d and consequently valid input for setfattr --restore, like:

# file: first/file
user.digest.git_hash=<hash-of-first-file>
# file: second/file
user.digest.git_hash=<hash-of-second-file>

Then you can run setfattr --restore=<file> and that should give all of your files the proper xattr. Finally, you can run bazel --unix_digest_hash_attribute_name=user.digest.git_hash --hash_function=SHA1 test ....
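
Putting that together, a minimal end-to-end sketch (assuming GNU attr tools, and paths without newlines or other characters that would need escaping):

# build a getfattr-style dump from the git index, then restore it as xattrs
git ls-files -s | while read -r mode hash stage path; do
  printf '# file: %s\nuser.digest.git_hash="%s"\n\n' "$path" "$hash"
done > xattr.dump
setfattr --restore=xattr.dump
bazel --unix_digest_hash_attribute_name=user.digest.git_hash --hash_function=SHA1 test //...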

However, if your remote cache is GRPC, we'd have to add support to it for git-hash before doing this, which I'm not qualified to do. You could still test the toy example by running Bazel without a remote cache and seeing if it can use these hashes.

@github-actions

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 2+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label Apr 28, 2023
@github-actions

This issue has been automatically closed due to inactivity. If you're still interested in pursuing this, please reach out to the triage team (@bazelbuild/triage). Thanks!

@github-actions github-actions bot closed this as not planned May 13, 2023