Improve caching by comparing file hashes as fallback for mtime and size #3821

Merged: 8 commits, Aug 19, 2023

cdce8p
Contributor

@cdce8p cdce8p commented Jul 29, 2023

Description

Rewrite and improve the caching implementation to use file hashes as a fallback for mtime and size comparisons. Especially on CI systems, comparing based on mtime alone makes caching practically useless: the mtime changes with each new run and git checkout, so the cache misses even if the file content didn't change.

To resolve that, this PR adds a fallback that compares file hashes if the mtime changed. It's not as fast, but still much faster than formatting the file outright. This approach is used by other tools as well, like mypy.
https://github.com/python/mypy/blob/v1.4.1/mypy/fswatcher.py#L80-L88
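
A minimal sketch of the fallback described above. The names `FileData`, `get_file_data`, and `is_changed` echo identifiers mentioned in this thread, but the bodies, the `hash_digest` helper, and the function signatures are assumptions made for illustration, not the code from this PR.

```python
import hashlib
from pathlib import Path
from typing import NamedTuple, Optional


class FileData(NamedTuple):
    # Field layout is an assumption for this sketch.
    st_mtime: float
    st_size: int
    hash: str


def hash_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def get_file_data(path: Path) -> FileData:
    """Record what is needed to decide later whether the file changed."""
    st = path.stat()
    return FileData(st.st_mtime, st.st_size, hash_digest(path))


def is_changed(path: Path, cached: Optional[FileData]) -> bool:
    """Return True if the file needs to be processed again."""
    if cached is None:
        return True  # never seen before
    st = path.stat()
    if st.st_size != cached.st_size:
        return True  # different size means different content; skip hashing
    if st.st_mtime == cached.st_mtime:
        return False  # fast path: size and mtime both match
    # mtime differs (for example after a fresh git checkout):
    # fall back to comparing content hashes.
    return hash_digest(path) != cached.hash
```

The hash is only computed when the cheap checks are inconclusive, so an unchanged local tree still takes the fast path.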

For the initial caching implementation, comparing hashes was dismissed because the benefit was not seen as worth it (at least at first): #109 (comment). However, as mentioned above and later in issue #109 (comment), it is necessary for CI systems.

A quick performance comparison for https://github.com/home-assistant/core, roughly 10,000 files, run with pre-commit and GitHub Actions:

  • 23.7.0: ~3:30 to 4:00 min
  • With PR: ~10s

Checklist - did you ...

  • Add an entry in CHANGES.md if necessary?
  • Add / update tests if necessary?
  • Add new / update outdated documentation?

@github-actions

github-actions bot commented Jul 29, 2023

diff-shades reports zero changes comparing this PR (6e1a57b) to main (793c2b5).



@JelleZijlstra JelleZijlstra self-requested a review July 30, 2023 03:08
Collaborator

@hauntsaninja hauntsaninja left a comment

Thanks, this is a nice feature!

src/black/cache.py (outdated)

    with open(path, "rb") as fp:
        data = fp.read()
    return hashlib.sha256(data).hexdigest()
Collaborator

nit: I think it should be fine to use sha1 here, which is about 1.4x faster for me

Contributor Author

sha256 works well for mypy. I would stick with it in this case.
https://github.com/python/mypy/blob/v1.5.1/mypy/util.py#L501-L510
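
For anyone who wants to check the sha1 versus sha256 difference on their own machine, a rough timing sketch follows. The helper name `time_digest` and the synthetic payload are assumptions for this example; actual ratios depend on the CPU and on how Python's hashlib was built, so the 1.4x figure above may not reproduce.

```python
import hashlib
import timeit


def time_digest(algorithm: str, data: bytes, repeat: int = 20) -> float:
    """Return the total seconds spent hashing `data` `repeat` times."""
    return timeit.timeit(
        lambda: hashlib.new(algorithm, data).hexdigest(), number=repeat
    )


if __name__ == "__main__":
    # Synthetic stand-in for source files; real files will differ in size.
    payload = b"x = 1\n" * 200_000
    for name in ("sha1", "sha256"):
        print(f"{name}: {time_digest(name, payload):.4f}s")
```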

* Skip hashing if file sizes don't match
* Use is_changed in filtered_cached
* Combine update and write
Collaborator

@hauntsaninja hauntsaninja left a comment

Thanks for the updates! Looks good, but I'd like for Jelle to take a look before merging

) as write_cache:
    cmd = [str(src), "--diff"]
    if color:
        cmd.append("--color")
    invokeBlack(cmd)
    cache_file = get_cache_file(mode)
    assert cache_file.exists() is False
    read_cache.assert_called_once()
Collaborator

This is a change in behaviour from eagerly reading the cache. Doesn't seem like a big deal

Contributor Author

Yeah, this was a result of initializing the cache with cache = Cache.read(mode). Technically it did one unnecessary read in some cases.

Changing it wasn't too difficult. I just pushed 85b4a91 to delay the cache read.

Collaborator

@hauntsaninja hauntsaninja left a comment

Hmm, I'm not sure I like this impl of delayed cache read, it seems too easy to forget to call cache.read(). I think I'd prefer one of:
a) make file_data a property that reads the cache file if self._file_data is not set
b) initialise file_data to None, so we get errors if we try to use the Cache without read-ing first
c) revert to before the last commit
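
To make options (a) and (b) concrete, here is a small sketch. The class names `LazyCache` and `StrictCache`, the `load_from_disk` helper, and the `is_cached` method are hypothetical and exist only for this illustration; they are not the classes in this PR.

```python
from typing import Dict, Optional


def load_from_disk() -> Dict[str, str]:
    """Hypothetical stand-in for reading the cache file."""
    return {"a.py": "deadbeef"}


class LazyCache:
    """Option (a): read the cache on first access to `file_data`."""

    def __init__(self) -> None:
        self._file_data: Optional[Dict[str, str]] = None

    @property
    def file_data(self) -> Dict[str, str]:
        if self._file_data is None:
            self._file_data = load_from_disk()
        return self._file_data


class StrictCache:
    """Option (b): start with None so use before read() fails loudly."""

    def __init__(self) -> None:
        self.file_data: Optional[Dict[str, str]] = None

    def read(self) -> None:
        self.file_data = load_from_disk()

    def is_cached(self, name: str) -> bool:
        # The explicit check is an addition of this sketch; with a plain
        # None, `name in self.file_data` would raise TypeError instead.
        if self.file_data is None:
            raise RuntimeError("read() must be called before using the cache")
        return name in self.file_data
```

Option (a) hides the I/O behind attribute access, while option (b) keeps the read explicit and turns a forgotten call into an immediate error.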

@cdce8p
Contributor Author

cdce8p commented Aug 15, 2023

> Hmm, I'm not sure I like this impl of delayed cache read, it seems too easy to forget to call cache.read().

True, even though that is how the current implementation (on main) does it: it always initializes cache = {} and calls read_cache(mode) only if necessary. But I agree, it's easy to accidentally forget the read.

> I think I'd prefer one of:
> a) make file_data a property that reads the cache file if self._file_data is not set
> b) initialise file_data to None, so we get errors if we try to use the Cache without read-ing first
> c) revert to before the last commit

I don't like (a). Doing the read when accessing file_data is not something one would expect. It should either be in the constructor or a separate method. That's one thing I liked about the initial implementation: Cache.read(mode) as a classmethod was quite explicit, and you immediately knew what would happen.
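
A rough sketch of the explicit classmethod style referred to here. This is an illustration only: the thread's `Cache.read` takes a `mode`, whereas this example takes a file path, and the JSON storage format and field names are assumptions chosen to keep the example self-contained.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict


@dataclass
class Cache:
    cache_file: Path
    file_data: Dict[str, str] = field(default_factory=dict)

    @classmethod
    def read(cls, cache_file: Path) -> "Cache":
        """Build a Cache, loading `cache_file` if it exists and parses."""
        if not cache_file.exists():
            return cls(cache_file)
        try:
            data = json.loads(cache_file.read_text())
        except (OSError, ValueError):
            return cls(cache_file)  # unreadable or corrupt: start empty
        if not isinstance(data, dict):
            return cls(cache_file)  # unexpected shape: start empty
        return cls(cache_file, data)
```

With this shape, `cache = Cache.read(path)` is the only way callers obtain a populated instance, so the read cannot be forgotten, at the cost of reading eagerly.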

I reverted the commit. Long term, it might make sense to use the cache in --diff mode as well, in which case the read is always needed anyway.

@hauntsaninja
Collaborator

@JelleZijlstra this looks good to me, but I'd like to wait until you have a chance to take a look!

src/black/cache.py (outdated)

        return hashlib.sha256(data).hexdigest()

    @staticmethod
    def get_file_data(path: Path) -> FileData:
Collaborator

Why not a global function? staticmethods often feel a bit useless.

Contributor Author

I like it here as it helps to group these methods nicely. Obviously personal preference. Though, if you want me to change it, I can do that too.

Collaborator

Let's just leave it as is, thanks!
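
For readers following the staticmethod question, the two forms behave identically at the call site and differ only in namespace. The sketch below uses a hypothetical `hash_digest` that takes bytes directly; it is not the signature used in this PR.

```python
import hashlib


def hash_digest(data: bytes) -> str:
    """Module-level form."""
    return hashlib.sha256(data).hexdigest()


class Cache:
    @staticmethod
    def hash_digest(data: bytes) -> str:
        """Same logic, grouped under the class namespace."""
        return hashlib.sha256(data).hexdigest()
```

Both `hash_digest(b"...")` and `Cache.hash_digest(b"...")` return the same digest, so the choice is mainly about where the helper is easiest to find.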

@hauntsaninja
Collaborator

Hmm we'll see if #3843 fixes CI

@hauntsaninja hauntsaninja merged commit c6a031e into psf:main Aug 19, 2023
29 checks passed
@cdce8p cdce8p deleted the improve-caching branch August 19, 2023 07:56
@cdce8p
Contributor Author

cdce8p commented Sep 7, 2023

Any indication when the next release will be? I wouldn't ask normally, but the change will improve CI times drastically for larger projects when combined with caching. Would love to start using it.

@JelleZijlstra
Collaborator

It's been about two months, so I'll start the release process over the next few days.

hauntsaninja added a commit to hauntsaninja/black that referenced this pull request Dec 25, 2023
Fixes psf#4116

This logic was introduced in psf#3821, I believe as a result of copying
logic inside mypy that I think isn't relevant to Black
hauntsaninja added a commit that referenced this pull request Dec 28, 2023
Fixes #4116

This logic was introduced in #3821, I believe as a result of copying
logic inside mypy that I think isn't relevant to Black