Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add experimental flags to save and load the rule index between runs #1182

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

eric-skydio
Copy link
Contributor

@eric-skydio eric-skydio commented Feb 25, 2022

What type of PR is this?

Feature

What package or component does this PR mostly affect?

resolve/index.go

What does this PR do? Why is it needed?

This PR implements experimental flags to save and load the Gazelle rule index between runs. When Gazelle is given an index file to load, it skips traversing the entire repo, and instead uses the cached data in the provided index.

To make this work, a few things needed to change:

  • The index is now structured with (closer to) one entry per indexed rule, and doesn't retain references to the File or Rule object that created them. For this to work, resolution of embeds is performed lazily at index query time. This also has the effect of removing the need to explicitly finalize the index.
  • buildUpdateRelMap from walk.go has been refactored into a helper object, so that it can also be used to filter the index as it is loaded.
  • index.go now has a unit test.

Which issues(s) does this PR fix?

Fixes #1181

Other notes for review

  • Removing the .Finish() method from the RuleIndex is technically a breaking change for plugin authors, although a pretty minor one that will probably only affect test code. We could leave it in and make it a no-op if this is a concern.
  • index.go cannot depend on walk.go due to forming an import cycle, so the exclude list is passed as a function. This seems to work fine.

resolve/index_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@jayconrod jayconrod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@linzhp nudged me a couple times to review this. Sorry for the long delay. I'm not an active maintainer (for now, maybe again soon) so take my feedback with a grain of salt.

I think we should land this PR (modulo comments). I'm generally cautious about code that relies on saved state that may be out of sync with reality. Gazelle treads a fine line here (build files of course may be out of sync with reality and Gazelle reads and writes them). That said, a feature like this is needed: I've heard several people say they've run into scaling problems with Gazelle because it takes a long time to index large repos.

I don't know exactly what the UX should look like. There's a good chance you'd want CI to generate the index files, triggered by CI, and Gazelle might need to fetch those. Or maybe it should query a service instead of dealing with JSON. goimports inside Google does something like that.

In any case, I think it's a good idea to start out with JSON files and flags clearly marked as experimental so we can get a feel for it.

cmd/gazelle/fix-update.go Outdated Show resolved Hide resolved
resolve/index.go Show resolved Hide resolved
resolve/index.go Outdated Show resolved Hide resolved
resolve/index.go Outdated Show resolved Hide resolved
resolve/index.go Show resolved Hide resolved
resolve/index_test.go Show resolved Hide resolved
resolve/index_test.go Outdated Show resolved Hide resolved
resolve/index_test.go Outdated Show resolved Hide resolved
resolve/index_test.go Show resolved Hide resolved
resolve/index_test.go Show resolved Hide resolved
@HALtheWise HALtheWise force-pushed the offline_index branch 4 times, most recently from 729449b to 9257a30 Compare November 1, 2023 15:20
@eric-skydio
Copy link
Contributor Author

@fmeum I finally got around to resurrecting this quite-old PR, which adds experimental support for saving and loading the rule index. This is intended to serve as a building block for other tooling (either company-specific or in an improved version of autogazelle) that understands what files have changed and executes Gazelle on only those directories.

Solving that second problem is tricky especially in a fully general and provably correct sense, so I think it makes sense to leave it out of Gazelle for now.

I left the new changes in a separate commit for easy reviewing, since @jayconrod looked at an earlier draft of this pretty thoroughly a year ago.

@sluongng
Copy link
Contributor

sluongng commented Nov 1, 2023

FWIW, there was an excellent talk on BUILD file generation at Google at Bazelcon 2023 https://youtu.be/IXimf4DCAoY?t=8020. Might be worth a watch.

@fmeum
Copy link
Member

fmeum commented Nov 3, 2023

@jayconrod Would you be available to give this a final look? I'm not too familiar with this part of the code, but can look into it if you are busy.

Copy link
Contributor

@jayconrod jayconrod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the slow response, I'm finding new responsibilities lately that involve a lot more meetings :) Couple comments and thoughts below. Generally I'm in favor of this though, especially as it's marked experimental and can change.


ruleIndex := resolve.NewRuleIndex(mrslv.Resolver, exts...)
if uc.readIndexFile != "" {
exclude := walk.NewUpdateFilter(c.RepoRoot, uc.dirs, uc.walkMode).ShouldIndex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand why we're excluding parts of the index when reading.

Previously, we decided whether to index directories based on a mix of -r, -index, and the command line arguments. If you wanted to update build files in a specific directory, you'd still usually need to walk the entire repo in order to build the index in memory (unless -index=false was explicitly set).

I think a major advantage of having an index file is that you can do fast updates in specific directories, for example, when saving a file. You can still have the index for the entire repo but you don't have to walk the full directory tree. This seems though like we're only loading parts of the index for directories being updated.

//
// This file is not expected to have a stable format between Gazelle versions, but
// a best-effort attempt is made to keep it sorted so it _should_ remain stable
// between runs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's comment for both WriteToFile and ReadToFile that these functions are experimental, and users outside Gazelle should not rely on them.

I'd also suggest not committing to JSON in the doc comment. This may not scale for very large repos: we might not want to read the file into memory. We might want to switch to SQLite or something similar. JSON is good for initial experimentation obviously though.

@linzhp
Copy link
Contributor

linzhp commented Nov 30, 2023

I don't know exactly what the UX should look like. There's a good chance you'd want CI to generate the index files, triggered by CI, and Gazelle might need to fetch those.

We have a "incremental index" mode in Uber: when users run "gazelle ", Gazelle updates (with help of a global index file) and indexes all build targets in and its subdirectories. All the entries in the global index file under is also updated. So both index and build files can be incrementally updated. We never need to run Gazelle from root to populate the index. In fact, that's not possible in Git sparse checkout.

Our "index file" is actually a long list of gazelle:resolve at the root BUILD.bazel file, which is not ideal, but I hope this PR would preserve the "incremental index" feature.

@jbedard
Copy link
Contributor

jbedard commented Aug 12, 2024

@eric-skydio has any more thought been put into this? Are you using this in your repo or is this poc abandoned?

fmeum pushed a commit that referenced this pull request Aug 18, 2024
**What type of PR is this?**
refactor

**What package or component does this PR mostly affect?**
all

**What does this PR do? Why is it needed?**
Refactor working toward serializing+caching of the rule index.

The `ruleRecord` should be as close as possible to what a language
returns. All computed data such as that for indexing should be computed
from the `ruleRecord` and persisted elsewhere.

**Which issues(s) does this PR fix?**

Ref: #1181
Some ideas are similar to #1182 while trying to reduce the scope of the
change and diff.

**Other notes for review**

IMO this is a reasonable cleanup on its own, without even taking
serializing+caching into account 🤷
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support saving and loading the rule index between Gazelle runs
6 participants