Add experimental flags to save and load the rule index between runs #1182

eric-skydio · 2022-02-25T08:15:16Z

What type of PR is this?

Feature

What package or component does this PR mostly affect?

resolve/index.go

What does this PR do? Why is it needed?

This PR implements experimental flags to save and load the Gazelle rule index between runs. When Gazelle is given an index file to load, it skips traversing the entire repo, and instead uses the cached data in the provided index.

To make this work, a few things needed to change:

The index is now structured with (closer to) one entry per indexed rule, and doesn't retain references to the File or Rule object that created them. For this to work, resolution of embeds is performed lazily at index query time. This also has the effect of removing the need to explicitly finalize the index.
buildUpdateRelMap from walk.go has been refactored into a helper object, so that it can also be used to filter the index as it is loaded.
index.go now has a unit test.

Which issues(s) does this PR fix?

Fixes #1181

Other notes for review

Removing the .Finish() method from the RuleIndex is technically a breaking change for plugin authors, although a pretty minor one that will probably only affect test code. We could leave it in and make it a no-op if this is a concern.
index.go cannot depend on walk.go due to forming an import cycle, so the exclude list is passed as a function. This seems to work fine.

resolve/index_test.go

jayconrod

@linzhp nudged me a couple times to review this. Sorry for the long delay. I'm not an active maintainer (for now, maybe again soon) so take my feedback with a grain of salt.

I think we should land this PR (modulo comments). I'm generally cautious about code that relies on saved state that may be out of sync with reality. Gazelle treads a fine line here (build files of course may be out of sync with reality and Gazelle reads and writes them). That said, a feature like this is needed: I've heard several people say they've run into scaling problems with Gazelle because it takes a long time to index large repos.

I don't know exactly what the UX should look like. There's a good chance you'd want CI to generate the index files, triggered by CI, and Gazelle might need to fetch those. Or maybe it should query a service instead of dealing with JSON. goimports inside Google does something like that.

In any case, I think it's a good idea to start out with JSON files and flags clearly marked as experimental so we can get a feel for it.

cmd/gazelle/fix-update.go

resolve/index.go

resolve/index_test.go

eric-skydio · 2023-11-01T15:32:24Z

@fmeum I finally got around to resurrecting this quite-old PR, which adds experimental support for saving and loading the rule index. This is intended to serve as a building block for other tooling (either company-specific or in an improved version of autogazelle) that understands what files have changed and executes Gazelle on only those directories.

Solving that second problem is tricky especially in a fully general and provably correct sense, so I think it makes sense to leave it out of Gazelle for now.

I left the new changes in a separate commit for easy reviewing, since @jayconrod looked at an earlier draft of this pretty thoroughly a year ago.

sluongng · 2023-11-01T18:43:07Z

FWIW, there was an excellent talk on BUILD file generation at Google at Bazelcon 2023 https://youtu.be/IXimf4DCAoY?t=8020. Might be worth a watch.

fmeum · 2023-11-03T09:22:20Z

@jayconrod Would you be available to give this a final look? I'm not too familiar with this part of the code, but can look into it if you are busy.

jayconrod

Apologies for the slow response, I'm finding new responsibilities lately that involve a lot more meetings :) Couple comments and thoughts below. Generally I'm in favor of this though, especially as it's marked experimental and can change.

jayconrod · 2023-11-08T17:46:46Z

cmd/gazelle/fix-update.go

+
+	ruleIndex := resolve.NewRuleIndex(mrslv.Resolver, exts...)
+	if uc.readIndexFile != "" {
+		exclude := walk.NewUpdateFilter(c.RepoRoot, uc.dirs, uc.walkMode).ShouldIndex


Not sure I understand why we're excluding parts of the index when reading.

Previously, we decided whether to index directories based on a mix of -r, -index, and the command line arguments. If you wanted to update build files in a specific directory, you'd still usually need to walk the entire repo in order to build the index in memory (unless -index=false was explicitly set).

I think a major advantage of having an index file is that you can do fast updates in specific directories, for example, when saving a file. You can still have the index for the entire repo but you don't have to walk the full directory tree. This seems though like we're only loading parts of the index for directories being updated.

jayconrod · 2023-11-08T17:49:24Z

resolve/index.go

+//
+// This file is not expected to have a stable format between Gazelle versions, but
+// a best-effort attempt is made to keep it sorted so it _should_ remain stable
+// between runs.


Let's comment for both WriteToFile and ReadToFile that these functions are experimental, and users outside Gazelle should not rely on them.

I'd also suggest not committing to JSON in the doc comment. This may not scale for very large repos: we might not want to read the file into memory. We might want to switch to SQLite or something similar. JSON is good for initial experimentation obviously though.

linzhp · 2023-11-30T22:58:36Z

I don't know exactly what the UX should look like. There's a good chance you'd want CI to generate the index files, triggered by CI, and Gazelle might need to fetch those.

We have a "incremental index" mode in Uber: when users run "gazelle ", Gazelle updates (with help of a global index file) and indexes all build targets in and its subdirectories. All the entries in the global index file under is also updated. So both index and build files can be incrementally updated. We never need to run Gazelle from root to populate the index. In fact, that's not possible in Git sparse checkout.

Our "index file" is actually a long list of gazelle:resolve at the root BUILD.bazel file, which is not ideal, but I hope this PR would preserve the "incremental index" feature.

jbedard · 2024-08-12T19:19:20Z

@eric-skydio has any more thought been put into this? Are you using this in your repo or is this poc abandoned?

**What type of PR is this?** refactor **What package or component does this PR mostly affect?** all **What does this PR do? Why is it needed?** Refactor working toward serializing+caching of the rule index. The `ruleRecord` should be as close as possible to what a language returns. All computed data such as that for indexing should be computed from the `ruleRecord` and persisted elsewhere. **Which issues(s) does this PR fix?** Ref: #1181 Some ideas are similar to #1182 while trying to reduce the scope of the change and diff. **Other notes for review** IMO this is a reasonable cleanup on its own, without even taking serializing+caching into account 🤷

This was referenced Feb 25, 2022

Support saving and loading the rule index between Gazelle runs #1181

Open

Autogazelle: Add option to pass -index=true to Gazelle #761

Closed

resolve: support cross-language dependency resolution in extensions #771

Closed

linzhp requested a review from blico February 28, 2022 21:24

sluongng reviewed Nov 20, 2022

View reviewed changes

resolve/index_test.go Outdated Show resolved Hide resolved

jayconrod reviewed Nov 23, 2022

View reviewed changes

Add experimental flags to save and load the rule index between runs

6009d60

HALtheWise force-pushed the offline_index branch 4 times, most recently from 729449b to 9257a30 Compare November 1, 2023 15:20

Resolve review comments and add integration test

36023d0

HALtheWise force-pushed the offline_index branch from 9257a30 to 36023d0 Compare November 1, 2023 15:26

jayconrod reviewed Nov 8, 2023

View reviewed changes

jbedard mentioned this pull request Aug 16, 2024

refactor: compute transitive embeds during indexing #1868

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add experimental flags to save and load the rule index between runs #1182

Add experimental flags to save and load the rule index between runs #1182

eric-skydio commented Feb 25, 2022 •

edited

Loading

jayconrod left a comment

eric-skydio commented Nov 1, 2023

sluongng commented Nov 1, 2023

fmeum commented Nov 3, 2023

jayconrod left a comment

jayconrod Nov 8, 2023

jayconrod Nov 8, 2023

linzhp commented Nov 30, 2023

jbedard commented Aug 12, 2024

Add experimental flags to save and load the rule index between runs #1182

Are you sure you want to change the base?

Add experimental flags to save and load the rule index between runs #1182

Conversation

eric-skydio commented Feb 25, 2022 • edited Loading

jayconrod left a comment

Choose a reason for hiding this comment

eric-skydio commented Nov 1, 2023

sluongng commented Nov 1, 2023

fmeum commented Nov 3, 2023

jayconrod left a comment

Choose a reason for hiding this comment

jayconrod Nov 8, 2023

Choose a reason for hiding this comment

jayconrod Nov 8, 2023

Choose a reason for hiding this comment

linzhp commented Nov 30, 2023

jbedard commented Aug 12, 2024

eric-skydio commented Feb 25, 2022 •

edited

Loading