Fix bug with linter targets being skipped #10974

gshuflin · 2020-10-16T00:06:13Z

Problem

We noticed that when running the ./pants lint command on a large number of targets in a repository, some targets were being skipped entirely by the flake8 process. As a result, flake8 falsely reported that everything was fine when there were actually files in the repo with linter errors.

The problem turned out to lie in the group_field_sets_by_constraints method. This method takes as its input an unsorted collection of field sets corresponding to the input targets, and groups them by their Python interpreter constraint. It is used as part of the pipeline for running the flake8 process on Python source files.

Internally, this method calls the Python standard library function itertools.groupby. It turns out that groupby does not work as expected with unsorted input data: it generates a new sub-iterable every time the sorting key changes (in this case, the interpreter constraint), rather than creating one sub-iterable per distinct sorting key in the input data. Because we were taking the output of groupby and using it in a dictionary comprehension, a later group replaced any earlier group with the same key. We were accidentally overwriting dictionary values in a non-deterministic way, resulting in some field sets getting skipped before the flake8 process could run on them.
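The failure mode described above can be reproduced in a few lines. This is only an illustrative sketch: the (file name, constraint) tuples and the constraint_of helper are hypothetical stand-ins, not the actual Pants field set types.

```python
from itertools import groupby

# Hypothetical stand-ins for field sets: (file name, interpreter constraint).
# Note the input is NOT sorted by constraint.
field_sets = [
    ("a.py", "CPython>=3.6"),
    ("b.py", "CPython==2.7.*"),
    ("c.py", "CPython>=3.6"),
]


def constraint_of(field_set):
    """Grouping key: the interpreter constraint (hypothetical helper)."""
    return field_set[1]


# groupby starts a new group every time the key changes, so the two
# "CPython>=3.6" entries end up in separate groups.
groups = [
    (key, [name for name, _ in members])
    for key, members in groupby(field_sets, key=constraint_of)
]

# Feeding the same groups into a dict comprehension keeps only the last
# group seen for each key, so "a.py" is silently dropped.
buggy = {
    key: [name for name, _ in members]
    for key, members in groupby(field_sets, key=constraint_of)
}

print(len(groups))  # 3 groups for only 2 distinct constraints
print(buggy)
```

With this input, "a.py" never reaches the linter, which mirrors the skipped targets described in the problem statement.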

Solution

group_field_sets_by_constraints was rewritten to avoid using itertools.groupby altogether, so we no longer skip inputs; and a test was added to make sure that we handle unsorted field set inputs to this method correctly.
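One common way to group without itertools.groupby is to accumulate into a collections.defaultdict, which does not depend on input order. The sketch below is an assumption about the general shape of such a rewrite, not the actual Pants code; group_by_constraint and the tuple-based field sets are hypothetical.

```python
from collections import defaultdict


def group_by_constraint(field_sets):
    """Group (file name, constraint) pairs by constraint, in any input order.

    Hypothetical helper for illustration; the real method operates on
    Pants field set objects rather than tuples.
    """
    grouped = defaultdict(list)
    for name, constraint in field_sets:
        # Appending never overwrites earlier entries for the same key.
        grouped[constraint].append(name)
    return dict(grouped)


unsorted_field_sets = [
    ("a.py", "CPython>=3.6"),
    ("b.py", "CPython==2.7.*"),
    ("c.py", "CPython>=3.6"),
]
result = group_by_constraint(unsorted_field_sets)
print(result)
```

A test for unsorted input, like the one this change adds, can then assert that every input appears in exactly one group.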

[ci skip-rust] [ci skip-build-wheels]

coveralls · 2020-10-16T00:46:36Z

Coverage remained the same at 0.0% when pulling b9ba825 on gshuflin:debugging-flake8-2 into b32f1d1 on pantsbuild:master.

Eric-Arellano

Thank you Greg!


We discovered in #10974 that `itertools.groupby()` requires you to pre-sort the data to work properly.

From https://docs.python.org/3/library/itertools.html#itertools.groupby:

> The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.

[ci skip-rust] [ci skip-build-wheels]
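Where `groupby()` is kept, the usual remedy is to sort with the same key function first. The following is a minimal sketch of that pattern under assumed data; the `records` list is hypothetical and this is not the code from the follow-up change.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (file name, interpreter constraint) records, unsorted by key.
records = [
    ("a.py", "CPython>=3.6"),
    ("b.py", "CPython==2.7.*"),
    ("c.py", "CPython>=3.6"),
]

by_constraint = itemgetter(1)

# Sorting by the same key makes equal keys adjacent, so groupby yields
# exactly one group per distinct key. sorted() is stable, so the original
# relative order within each group is preserved.
grouped = {
    key: [name for name, _ in members]
    for key, members in groupby(sorted(records, key=by_constraint), key=by_constraint)
}
print(grouped)
```

This requires the keys to be orderable; when they are not, accumulating into a dictionary is the simpler option.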


gshuflin added 2 commits October 15, 2020 15:50

Fix flake8 bug

1f3187b

[ci skip-rust] [ci skip-build-wheels]

Test

b9ba825

[ci skip-rust] [ci skip-build-wheels]

gshuflin requested a review from Eric-Arellano October 16, 2020 00:06

Eric-Arellano approved these changes Oct 16, 2020


gshuflin merged commit 1143af4 into pantsbuild:master Oct 16, 2020

gshuflin deleted the debugging-flake8-2 branch October 16, 2020 02:44

Eric-Arellano mentioned this pull request Oct 16, 2020

Fix several bad usages of itertools.groupby() #10976

Merged
