Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of Owners rule #16563

Merged

Conversation

Eric-Arellano
Copy link
Contributor

@Eric-Arellano Eric-Arellano commented Aug 17, 2022

Possibly will close #16122.

@stuhood found in #16122 (comment) that we were spending substantial time in Rust parsing the same globs in the stateless function matches_filespec.

This improves the situation by introducing FilespecsMatcher, which allows us to persist the globs parsing over time. Crucially, FilespecsMatcher eagerly parses the globs in its constructor, unlike PathGlobs currently being lazy. Then, we use a @memoized_method on each SourcesField instance to avoid ever creating the matcher more than necessary.

Before:

❯ hyperfine -w 2 -r 10 './pants --no-pantsd --no-buildsense-enable --no-anonymous-telemetry-enabled dependencies ::'
  Time (mean ± σ):      6.562 s ±  0.153 s    [User: 7.354 s, System: 1.516 s]
  Range (min … max):    6.420 s …  6.878 s    10 runs

After:

❯ hyperfine -w 2 -r 10 './pants --no-pantsd --no-buildsense-enable --no-anonymous-telemetry-enabled dependencies ::'
  Time (mean ± σ):      6.460 s ±  0.065 s    [User: 7.301 s, System: 1.447 s]
  Range (min … max):    6.340 s …  6.576 s    10 runs

(Note, we expect the performance to be better in repos with the shape described at #16122 (comment))

Relationship to PathGlobs

We also considered changing PathGlobs to eagerly parse, and then defining PathGlobs.matches(), rather than introducing the new type FilespecsMatcher.

I believe FilespecsMatcher is a clearer API: it is only used for in-memory checks if globs match paths, whereas PathGlobs is used for on-disk.

Notably, the Rust code only overlaps in how the exclude patterns are ignored. So, making this split allows us to simplify PathGlobs.

A downside of having separate types is it's harder to share the duplicate parsing of the exclude patterns. This seems fine, given that the include patterns are separate code paths.

(See #16564 for improving PathGlobs to also eagerly parse.)

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
[ci skip-build-wheels]

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]
@Eric-Arellano Eric-Arellano force-pushed the improve-perf-of-owners-filespec branch from 0356b15 to aa03cf3 Compare August 17, 2022 23:01
Comment on lines -286 to -288
fn parse_patterns_from_include(
include: &[PathGlobIncludeEntry],
) -> Result<Vec<glob::Pattern>, String> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inlined below

/// to check if it may optimize the snapshot subset operation on this tree,
/// or need to check for excluded files/directories.
///
pub fn maybe_is_parent_of_ignored_path(&self, path: &Path) -> bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused

Comment on lines -388 to -389
impl PyPathGlobs {
fn parse(self) -> PyResult<PreparedPathGlobs> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now unused

.0
.include_globs()
.iter()
.map(|pattern| format!("{pattern}"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stuhood is there a better way to call Display on the value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pattern.to_string() calls Display.

}

fn matches(&self, paths: Vec<String>, py: Python) -> PyResult<Vec<String>> {
py.allow_threads(|| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stuhood has suggested it may be more efficient to remove this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would need to do some microbenchmarking, yea.

The two cases are:

  1. taking paths: Vec<&str>, which would avoid copying the inputs, but would require us to hold the GIL, or
  2. keeping as is

You could maybe temporarily add a loop around the match calls in src/python/pants/source/filespec_test.py assert_rule_matches, and then compare that way. It's not a completely realistic use case, but it would be straightforward. I expect that it's not worth doing anything more involved though.

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines 354 to 356
/// owners detection of deleted files (see #6790 and #5636 for more info). The lazy filesystem
/// traversal in expand is (currently) too expensive to use for that in-memory matching (such as
/// via MemFS).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preserving the explanation of why we have two implementations of matching (because the lazy impl is too slow for this usecase) would be good. It may not be true forever.

.0
.include_globs()
.iter()
.map(|pattern| format!("{pattern}"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pattern.to_string() calls Display.

}

fn matches(&self, paths: Vec<String>, py: Python) -> PyResult<Vec<String>> {
py.allow_threads(|| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would need to do some microbenchmarking, yea.

The two cases are:

  1. taking paths: Vec<&str>, which would avoid copying the inputs, but would require us to hold the GIL, or
  2. keeping as is

You could maybe temporarily add a loop around the match calls in src/python/pants/source/filespec_test.py assert_rule_matches, and then compare that way. It's not a completely realistic use case, but it would be straightforward. I expect that it's not worth doing anything more involved though.

Comment on lines +339 to +349
let includes = includes
.iter()
.map(|glob| {
PathGlob::normalize_pattern(glob).and_then(|components| {
let normalized_pattern: PathBuf = components.into_iter().collect();
Pattern::new(normalized_pattern.to_str().unwrap())
.map_err(|e| format!("Could not parse {:?} as a glob: {:?}", glob, e))
})
})
.collect::<Result<Vec<_>, String>>()?;
let excludes = GitignoreStyleExcludes::create(excludes)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these should be executed with the GIL released I think.

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@Eric-Arellano Eric-Arellano merged commit d55493f into pantsbuild:main Aug 18, 2022
@Eric-Arellano Eric-Arellano deleted the improve-perf-of-owners-filespec branch August 18, 2022 22:47
Eric-Arellano added a commit that referenced this pull request Aug 19, 2022
Eric-Arellano added a commit to Eric-Arellano/pants that referenced this pull request Aug 20, 2022
Possibly will close pantsbuild#16122.

@stuhood found in pantsbuild#16122 (comment) that we were spending substantial time in Rust parsing the same globs in the stateless function `matches_filespec`.

This improves the situation by introducing `FilespecsMatcher`, which allows us to persist the globs parsing over time. Crucially, `FilespecsMatcher` eagerly parses the globs in its constructor, unlike `PathGlobs` currently being lazy. Then, we  use a `@memoized_method` on each `SourcesField` instance to avoid ever creating the matcher more than necessary.

Before:

```
❯ hyperfine -w 2 -r 10 './pants --no-pantsd --no-buildsense-enable --no-anonymous-telemetry-enabled dependencies ::'
  Time (mean ± σ):      6.562 s ±  0.153 s    [User: 7.354 s, System: 1.516 s]
  Range (min … max):    6.420 s …  6.878 s    10 runs
```

After:

```
❯ hyperfine -w 2 -r 10 './pants --no-pantsd --no-buildsense-enable --no-anonymous-telemetry-enabled dependencies ::'
  Time (mean ± σ):      6.460 s ±  0.065 s    [User: 7.301 s, System: 1.447 s]
  Range (min … max):    6.340 s …  6.576 s    10 runs
```

(Note, we expect the performance to be better in repos with the shape described at pantsbuild#16122 (comment))

## Relationship to `PathGlobs`

We also considered changing `PathGlobs` to eagerly parse, and then defining `PathGlobs.matches()`, rather than introducing the new type `FilespecsMatcher`.

I believe `FilespecsMatcher` is a clearer API: it is only used for in-memory checks if globs match paths, whereas `PathGlobs` is used for on-disk.

Notably, the Rust code only overlaps in how the exclude patterns are ignored. So, making this split allows us to simplify `PathGlobs`.

A downside of having separate types is it's harder to share the duplicate parsing of the exclude patterns. This seems fine, given that the include patterns are separate code paths.

(See pantsbuild#16564 for improving `PathGlobs` to also eagerly parse.)
# Conflicts:
#	docs/markdown/Writing Plugins/common-plugin-tasks/plugin-upgrade-guide.md
Eric-Arellano added a commit to Eric-Arellano/pants that referenced this pull request Aug 22, 2022
Possibly will close pantsbuild#16122.

@stuhood found in pantsbuild#16122 (comment) that we were spending substantial time in Rust parsing the same globs in the stateless function `matches_filespec`.

This improves the situation by introducing `FilespecsMatcher`, which allows us to persist the globs parsing over time. Crucially, `FilespecsMatcher` eagerly parses the globs in its constructor, unlike `PathGlobs` currently being lazy. Then, we  use a `@memoized_method` on each `SourcesField` instance to avoid ever creating the matcher more than necessary.

Before:

```
❯ hyperfine -w 2 -r 10 './pants --no-pantsd --no-buildsense-enable --no-anonymous-telemetry-enabled dependencies ::'
  Time (mean ± σ):      6.562 s ±  0.153 s    [User: 7.354 s, System: 1.516 s]
  Range (min … max):    6.420 s …  6.878 s    10 runs
```

After:

```
❯ hyperfine -w 2 -r 10 './pants --no-pantsd --no-buildsense-enable --no-anonymous-telemetry-enabled dependencies ::'
  Time (mean ± σ):      6.460 s ±  0.065 s    [User: 7.301 s, System: 1.447 s]
  Range (min … max):    6.340 s …  6.576 s    10 runs
```

(Note, we expect the performance to be better in repos with the shape described at pantsbuild#16122 (comment))

We also considered changing `PathGlobs` to eagerly parse, and then defining `PathGlobs.matches()`, rather than introducing the new type `FilespecsMatcher`.

I believe `FilespecsMatcher` is a clearer API: it is only used for in-memory checks if globs match paths, whereas `PathGlobs` is used for on-disk.

Notably, the Rust code only overlaps in how the exclude patterns are ignored. So, making this split allows us to simplify `PathGlobs`.

A downside of having separate types is it's harder to share the duplicate parsing of the exclude patterns. This seems fine, given that the include patterns are separate code paths.

(See pantsbuild#16564 for improving `PathGlobs` to also eagerly parse.)

[ci skip-build-wheels]

[ci skip-rust]

[ci skip-build-wheels]
Eric-Arellano added a commit that referenced this pull request Aug 22, 2022
Possibly will close #16122.

@stuhood found in #16122 (comment) that we were spending substantial time in Rust parsing the same globs in the stateless function `matches_filespec`.

This improves the situation by introducing `FilespecsMatcher`, which allows us to persist the globs parsing over time. Crucially, `FilespecsMatcher` eagerly parses the globs in its constructor, unlike `PathGlobs` currently being lazy. Then, we  use a `@memoized_method` on each `SourcesField` instance to avoid ever creating the matcher more than necessary.

Before:

```
❯ hyperfine -w 2 -r 10 './pants --no-pantsd --no-buildsense-enable --no-anonymous-telemetry-enabled dependencies ::'
  Time (mean ± σ):      6.562 s ±  0.153 s    [User: 7.354 s, System: 1.516 s]
  Range (min … max):    6.420 s …  6.878 s    10 runs
```

After:

```
❯ hyperfine -w 2 -r 10 './pants --no-pantsd --no-buildsense-enable --no-anonymous-telemetry-enabled dependencies ::'
  Time (mean ± σ):      6.460 s ±  0.065 s    [User: 7.301 s, System: 1.447 s]
  Range (min … max):    6.340 s …  6.576 s    10 runs
```

(Note, we expect the performance to be better in repos with the shape described at #16122 (comment))

## Relationship to `PathGlobs`

We also considered changing `PathGlobs` to eagerly parse, and then defining `PathGlobs.matches()`, rather than introducing the new type `FilespecsMatcher`.

I believe `FilespecsMatcher` is a clearer API: it is only used for in-memory checks if globs match paths, whereas `PathGlobs` is used for on-disk.

Notably, the Rust code only overlaps in how the exclude patterns are ignored. So, making this split allows us to simplify `PathGlobs`.

A downside of having separate types is it's harder to share the duplicate parsing of the exclude patterns. This seems fine, given that the include patterns are separate code paths.

(See #16564 for improving `PathGlobs` to also eagerly parse.)
cczona pushed a commit to cczona/pants that referenced this pull request Sep 1, 2022
Possibly will close pantsbuild#16122.

@stuhood found in pantsbuild#16122 (comment) that we were spending substantial time in Rust parsing the same globs in the stateless function `matches_filespec`.

This improves the situation by introducing `FilespecsMatcher`, which allows us to persist the globs parsing over time. Crucially, `FilespecsMatcher` eagerly parses the globs in its constructor, unlike `PathGlobs` currently being lazy. Then, we  use a `@memoized_method` on each `SourcesField` instance to avoid ever creating the matcher more than necessary.

Before:

```
❯ hyperfine -w 2 -r 10 './pants --no-pantsd --no-buildsense-enable --no-anonymous-telemetry-enabled dependencies ::'
  Time (mean ± σ):      6.562 s ±  0.153 s    [User: 7.354 s, System: 1.516 s]
  Range (min … max):    6.420 s …  6.878 s    10 runs
```

After:

```
❯ hyperfine -w 2 -r 10 './pants --no-pantsd --no-buildsense-enable --no-anonymous-telemetry-enabled dependencies ::'
  Time (mean ± σ):      6.460 s ±  0.065 s    [User: 7.301 s, System: 1.447 s]
  Range (min … max):    6.340 s …  6.576 s    10 runs
```

(Note, we expect the performance to be better in repos with the shape described at pantsbuild#16122 (comment))

## Relationship to `PathGlobs`

We also considered changing `PathGlobs` to eagerly parse, and then defining `PathGlobs.matches()`, rather than introducing the new type `FilespecsMatcher`.

I believe `FilespecsMatcher` is a clearer API: it is only used for in-memory checks if globs match paths, whereas `PathGlobs` is used for on-disk. 

Notably, the Rust code only overlaps in how the exclude patterns are ignored. So, making this split allows us to simplify `PathGlobs`.

A downside of having separate types is it's harder to share the duplicate parsing of the exclude patterns. This seems fine, given that the include patterns are separate code paths.

(See pantsbuild#16564 for improving `PathGlobs` to also eagerly parse.)
cczona pushed a commit to cczona/pants that referenced this pull request Sep 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pants dependency calculation performance regression on 2.13.x
2 participants