-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add experimental flags to save and load the rule index between runs #1182
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@linzhp nudged me a couple times to review this. Sorry for the long delay. I'm not an active maintainer (for now, maybe again soon) so take my feedback with a grain of salt.
I think we should land this PR (modulo comments). I'm generally cautious about code that relies on saved state that may be out of sync with reality. Gazelle treads a fine line here (build files of course may be out of sync with reality and Gazelle reads and writes them). That said, a feature like this is needed: I've heard several people say they've run into scaling problems with Gazelle because it takes a long time to index large repos.
I don't know exactly what the UX should look like. There's a good chance you'd want CI to generate the index files, triggered by CI, and Gazelle might need to fetch those. Or maybe it should query a service instead of dealing with JSON. goimports
inside Google does something like that.
In any case, I think it's a good idea to start out with JSON files and flags clearly marked as experimental so we can get a feel for it.
729449b
to
9257a30
Compare
9257a30
to
36023d0
Compare
@fmeum I finally got around to resurrecting this quite-old PR, which adds experimental support for saving and loading the rule index. This is intended to serve as a building block for other tooling (either company-specific or in an improved version of autogazelle) that understands what files have changed and executes Gazelle on only those directories. Solving that second problem is tricky especially in a fully general and provably correct sense, so I think it makes sense to leave it out of Gazelle for now. I left the new changes in a separate commit for easy reviewing, since @jayconrod looked at an earlier draft of this pretty thoroughly a year ago. |
FWIW, there was an excellent talk on BUILD file generation at Google at Bazelcon 2023 https://youtu.be/IXimf4DCAoY?t=8020. Might be worth a watch. |
@jayconrod Would you be available to give this a final look? I'm not too familiar with this part of the code, but can look into it if you are busy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the slow response, I'm finding new responsibilities lately that involve a lot more meetings :) Couple comments and thoughts below. Generally I'm in favor of this though, especially as it's marked experimental and can change.
|
||
ruleIndex := resolve.NewRuleIndex(mrslv.Resolver, exts...) | ||
if uc.readIndexFile != "" { | ||
exclude := walk.NewUpdateFilter(c.RepoRoot, uc.dirs, uc.walkMode).ShouldIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand why we're excluding parts of the index when reading.
Previously, we decided whether to index directories based on a mix of -r
, -index
, and the command line arguments. If you wanted to update build files in a specific directory, you'd still usually need to walk the entire repo in order to build the index in memory (unless -index=false
was explicitly set).
I think a major advantage of having an index file is that you can do fast updates in specific directories, for example, when saving a file. You can still have the index for the entire repo but you don't have to walk the full directory tree. This seems though like we're only loading parts of the index for directories being updated.
// | ||
// This file is not expected to have a stable format between Gazelle versions, but | ||
// a best-effort attempt is made to keep it sorted so it _should_ remain stable | ||
// between runs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's comment for both WriteToFile
and ReadToFile
that these functions are experimental, and users outside Gazelle should not rely on them.
I'd also suggest not committing to JSON in the doc comment. This may not scale for very large repos: we might not want to read the file into memory. We might want to switch to SQLite or something similar. JSON is good for initial experimentation obviously though.
We have a "incremental index" mode in Uber: when users run "gazelle ", Gazelle updates (with help of a global index file) and indexes all build targets in and its subdirectories. All the entries in the global index file under is also updated. So both index and build files can be incrementally updated. We never need to run Gazelle from root to populate the index. In fact, that's not possible in Git sparse checkout. Our "index file" is actually a long list of |
@eric-skydio has any more thought been put into this? Are you using this in your repo or is this poc abandoned? |
**What type of PR is this?** refactor **What package or component does this PR mostly affect?** all **What does this PR do? Why is it needed?** Refactor working toward serializing+caching of the rule index. The `ruleRecord` should be as close as possible to what a language returns. All computed data such as that for indexing should be computed from the `ruleRecord` and persisted elsewhere. **Which issues(s) does this PR fix?** Ref: #1181 Some ideas are similar to #1182 while trying to reduce the scope of the change and diff. **Other notes for review** IMO this is a reasonable cleanup on its own, without even taking serializing+caching into account 🤷
What type of PR is this?
Feature
What package or component does this PR mostly affect?
resolve/index.go
What does this PR do? Why is it needed?
This PR implements experimental flags to save and load the Gazelle rule index between runs. When Gazelle is given an index file to load, it skips traversing the entire repo, and instead uses the cached data in the provided index.
To make this work, a few things needed to change:
buildUpdateRelMap
from walk.go has been refactored into a helper object, so that it can also be used to filter the index as it is loaded.Which issues(s) does this PR fix?
Fixes #1181
Other notes for review
Removing the.Finish()
method from the RuleIndex is technically a breaking change for plugin authors, although a pretty minor one that will probably only affect test code. We could leave it in and make it a no-op if this is a concern.index.go
cannot depend onwalk.go
due to forming an import cycle, so the exclude list is passed as a function. This seems to work fine.