Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug with linter targets being skipped #10974

Merged
merged 2 commits into from
Oct 16, 2020

Conversation

gshuflin
Copy link
Contributor

Problem

We noticed an issue where, when running the ./pants lint command on a large number of targets in a repository, some targets were being completely skipped by the flake8 process, resulting in the flake8 linter output falsely reporting all good, when there were actually files in the repo with linter errors.

The problem turned out to lie in the group_field_sets_by_constraints method. This method takes as its input an unsorted collection of field sets corresponding to the input targets, and groups them by their python interpreter contraint. This method is used as part of the pipeline for running the flake8 process on python source files.

Internally, this method calls the python standard library itertools.groupby method. It turns out that groupby does not work as expected with unsorted input data - it generates a new sub-iterable every time the sorting key changes (in this case, the interpreter constraint), rather than creating as many sub-iterables as there were distinct sorting keys in the input data. Because we were taking the output of this method and using it in a dictionary comprehension, we were accidentally overwriting dictionary values in a non-deterministic way, resulting in some filed sets getting skipped before the flake8 process could run on them.

Solution

group_field_sets_by_constraints was rewritten to avoid using itertools.groupby altogether, so we no longer skip inputs; and a test was added to make sure that we handle unsorted field set inputs to this method correctly.

[ci skip-rust]

[ci skip-build-wheels]
[ci skip-rust]

[ci skip-build-wheels]
@coveralls
Copy link

Coverage Status

Coverage remained the same at 0.0% when pulling b9ba825 on gshuflin:debugging-flake8-2 into b32f1d1 on pantsbuild:master.

Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Greg!

@gshuflin gshuflin merged commit 1143af4 into pantsbuild:master Oct 16, 2020
Eric-Arellano pushed a commit to Eric-Arellano/pants that referenced this pull request Oct 16, 2020
### Problem

We noticed an issue where, when running the `./pants lint` command on a large number of targets in a repository, some targets were being completely skipped by the flake8 process, resulting in the flake8 linter output falsely reporting all good, when there were actually files in the repo with linter errors.

The problem turned out to lie in the `group_field_sets_by_constraints` method. This method takes as its input an unsorted collection of field sets corresponding to the input targets, and groups them by their python interpreter contraint. This method is used as part of the pipeline for running the flake8 process on python source files.

Internally, this method calls the python standard library `itertools.groupby` method. It turns out that `groupby` does not work as expected with unsorted input data - it generates a new sub-iterable every time the sorting key changes (in this case, the interpreter constraint), rather than creating as many sub-iterables as there were distinct sorting keys in the input data. Because we were taking the output of this method and using it in a dictionary comprehension, we were accidentally overwriting dictionary values in a non-deterministic way, resulting in some filed sets getting skipped before the flake8 process could run on them.

### Solution

`group_field_sets_by_constraints` was rewritten to avoid using `itertools.groupby` altogether, so we no longer skip inputs; and a test was added to make sure that we handle unsorted field set inputs to this method correctly.
@gshuflin gshuflin deleted the debugging-flake8-2 branch October 16, 2020 02:44
gshuflin added a commit that referenced this pull request Oct 16, 2020
### Problem

We noticed an issue where, when running the `./pants lint` command on a large number of targets in a repository, some targets were being completely skipped by the flake8 process, resulting in the flake8 linter output falsely reporting all good, when there were actually files in the repo with linter errors.

The problem turned out to lie in the `group_field_sets_by_constraints` method. This method takes as its input an unsorted collection of field sets corresponding to the input targets, and groups them by their python interpreter contraint. This method is used as part of the pipeline for running the flake8 process on python source files.

Internally, this method calls the python standard library `itertools.groupby` method. It turns out that `groupby` does not work as expected with unsorted input data - it generates a new sub-iterable every time the sorting key changes (in this case, the interpreter constraint), rather than creating as many sub-iterables as there were distinct sorting keys in the input data. Because we were taking the output of this method and using it in a dictionary comprehension, we were accidentally overwriting dictionary values in a non-deterministic way, resulting in some filed sets getting skipped before the flake8 process could run on them.

### Solution

`group_field_sets_by_constraints` was rewritten to avoid using `itertools.groupby` altogether, so we no longer skip inputs; and a test was added to make sure that we handle unsorted field set inputs to this method correctly.

Co-authored-by: gshuflin <[email protected]>
Eric-Arellano added a commit that referenced this pull request Oct 16, 2020
We discovered in #10974 that `itertools.groupby()` requires you to pre-sort the data to work properly:

From https://docs.python.org/3/library/itertools.html#itertools.groupby:

> The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.

[ci skip-rust]
[ci skip-build-wheels]
Eric-Arellano added a commit to Eric-Arellano/pants that referenced this pull request Oct 16, 2020
We discovered in pantsbuild#10974 that `itertools.groupby()` requires you to pre-sort the data to work properly:

From https://docs.python.org/3/library/itertools.html#itertools.groupby:

> The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.

[ci skip-rust]
[ci skip-build-wheels]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants