Skip panda rules if panda module hasn't been seen #14671

MichaReiser · 2024-11-29T07:33:31Z

Summary

Issues about PD011 and other panda rules triggering on non-panda code are frequent:

There are probably more. At this point, I consider the rule harmful to the majority of users.

This PR adds a `seen_module` check for all pandas rules. This likely results in more false negatives, but I consider
this the better outcome, considering how many users the rules have confused in the past.

A proper fix requires type inference.
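
To make the trade-off concrete, here is a minimal Python sketch of an import-gated check. It is not Ruff's implementation (Ruff is written in Rust and tracks seen modules in its own semantic model). The helper names `has_seen_module` and `values_attribute_lines` are hypothetical, and the `.values` detection is a deliberately simplified stand-in for PD011.

```python
import ast


def has_seen_module(source: str, module: str = "pandas") -> bool:
    """Return True if `source` imports `module` or one of its submodules."""
    prefix = module + "."
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(a.name == module or a.name.startswith(prefix) for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            if node.module == module or node.module.startswith(prefix):
                return True
    return False


def values_attribute_lines(source: str) -> list:
    """Hypothetical, simplified PD011 stand-in: lines with a `.values` attribute access.

    Reports nothing unless the file imports pandas.
    """
    if not has_seen_module(source):
        return []
    return sorted(
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Attribute) and node.attr == "values"
    )


polars_src = "import polars as pl\ndf = pl.DataFrame()\nx = df.values\n"
pandas_src = "import pandas as pd\ndf = pd.DataFrame()\nx = df.values\n"
untyped_src = "def f(df):\n    return df.values\n"
```

In this sketch `polars_src` is no longer flagged (a removed false positive), while `untyped_src` is also skipped even if `df` happens to be a pandas DataFrame at runtime (a new false negative).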

Test Plan

I verified that the following code snippet no longer triggers a pandas violation unless I change the import to pandas:

import polars as pl

pl_df = pl.DataFrame([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
pl_df.pivot(on="a", values="b")

github-actions · 2024-11-29T07:39:58Z

`ruff-ecosystem` results

Linter (stable)

ℹ️ ecosystem check detected linter changes. (+0 -98 violations, +0 -0 fixes in 4 projects; 51 projects unchanged)

apache/airflow (+0 -27 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --no-preview --select ALL

- providers/src/airflow/providers/google/cloud/hooks/vertex_ai/generative_model.py:192:16: PD011 Use `.to_numpy()` instead of `.values`
- providers/src/airflow/providers/google/cloud/hooks/vertex_ai/generative_model.py:348:16: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/drill/hooks/test_drill.py:103:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/druid/hooks/test_druid.py:456:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/impala/hooks/test_impala.py:120:33: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/impala/hooks/test_impala.py:121:33: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/pinot/hooks/test_pinot.py:274:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/elasticsearch/hooks/test_elasticsearch.py:118:37: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/elasticsearch/hooks/test_elasticsearch.py:119:37: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:112:28: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:134:25: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:162:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:192:31: PD011 Use `.to_numpy()` instead of `.values`
... 14 additional changes omitted for project

apache/superset (+0 -11 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --no-preview --select ALL

- superset/db_engine_specs/clickhouse.py:174:23: PD011 Use `.to_numpy()` instead of `.values`
- tests/integration_tests/model_tests.py:363:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:366:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:373:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:376:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:491:22: PD011 Use `.to_numpy()` instead of `.values`
- tests/integration_tests/model_tests.py:493:22: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/columnar_reader_test.py:123:21: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/columnar_reader_test.py:196:12: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/csv_reader_test.py:255:21: PD011 Use `.to_numpy()` instead of `.values`
... 2 additional changes omitted for rule PD011
... 1 additional changes omitted for project

bokeh/bokeh (+0 -56 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --no-preview --select ALL

- examples/plotting/periodic_shells.py:46:29: PD011 Use `.to_numpy()` instead of `.values`
- examples/topics/stats/density.py:29:12: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:517:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:518:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:519:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:520:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:521:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:533:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:544:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:545:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:546:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:547:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:548:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:565:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:585:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:586:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:587:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:588:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:589:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:601:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:617:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:618:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:619:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:620:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:621:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:633:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:644:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:645:17: PD011 Use `.to_numpy()` instead of `.values`
... 28 additional changes omitted for project

latchbio/latch (+0 -4 violations, +0 -0 fixes)

- src/latch/registry/record.py:306:12: PD011 Use `.to_numpy()` instead of `.values`
- src/latch/registry/record.py:309:16: PD011 Use `.to_numpy()` instead of `.values`
- src/latch/registry/utils.py:438:19: PD011 Use `.to_numpy()` instead of `.values`
- src/latch_cli/services/get_params.py:330:37: PD011 Use `.to_numpy()` instead of `.values`

Changes by rule (2 rules affected)

code	total	+ violation	- violation	+ fix	- fix
PD011	94	0	94	0	0
PD009	4	0	4	0	0

Linter (preview)

ℹ️ ecosystem check detected linter changes. (+0 -98 violations, +0 -0 fixes in 4 projects; 51 projects unchanged)

apache/airflow (+0 -27 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL

- providers/src/airflow/providers/google/cloud/hooks/vertex_ai/generative_model.py:192:16: PD011 Use `.to_numpy()` instead of `.values`
- providers/src/airflow/providers/google/cloud/hooks/vertex_ai/generative_model.py:348:16: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/drill/hooks/test_drill.py:103:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/druid/hooks/test_druid.py:456:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/impala/hooks/test_impala.py:120:33: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/impala/hooks/test_impala.py:121:33: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/pinot/hooks/test_pinot.py:274:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/elasticsearch/hooks/test_elasticsearch.py:118:37: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/elasticsearch/hooks/test_elasticsearch.py:119:37: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:112:28: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:134:25: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:162:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:192:31: PD011 Use `.to_numpy()` instead of `.values`
... 14 additional changes omitted for project

apache/superset (+0 -11 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL

- superset/db_engine_specs/clickhouse.py:174:23: PD011 Use `.to_numpy()` instead of `.values`
- tests/integration_tests/model_tests.py:363:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:366:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:373:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:376:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:491:22: PD011 Use `.to_numpy()` instead of `.values`
- tests/integration_tests/model_tests.py:493:22: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/columnar_reader_test.py:123:21: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/columnar_reader_test.py:196:12: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/csv_reader_test.py:255:21: PD011 Use `.to_numpy()` instead of `.values`
... 2 additional changes omitted for rule PD011
... 1 additional changes omitted for project

bokeh/bokeh (+0 -56 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL

- examples/plotting/periodic_shells.py:46:29: PD011 Use `.to_numpy()` instead of `.values`
- examples/topics/stats/density.py:29:12: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:517:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:518:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:519:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:520:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:521:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:533:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:544:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:545:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:546:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:547:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:548:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:565:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:585:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:586:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:587:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:588:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:589:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:601:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:617:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:618:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:619:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:620:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:621:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:633:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:644:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:645:17: PD011 Use `.to_numpy()` instead of `.values`
... 28 additional changes omitted for project

latchbio/latch (+0 -4 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

- src/latch/registry/record.py:306:12: PD011 Use `.to_numpy()` instead of `.values`
- src/latch/registry/record.py:309:16: PD011 Use `.to_numpy()` instead of `.values`
- src/latch/registry/utils.py:438:19: PD011 Use `.to_numpy()` instead of `.values`
- src/latch_cli/services/get_params.py:330:37: PD011 Use `.to_numpy()` instead of `.values`

Changes by rule (2 rules affected)

code	total	+ violation	- violation	+ fix	- fix
PD011	94	0	94	0	0
PD009	4	0	4	0	0

MichaReiser · 2024-11-29T07:48:54Z

As expected, this increases false negatives but also reduces false positives. The airflow and bokeh changes are a mix of new false negatives and removed false positives. The superset changes are almost exclusively new false negatives.

sbrugman · 2024-11-29T11:51:12Z

As someone who works a lot with pandas, Spark, Polars, etc., I think this is the right trade-off for now.

False positives are a reason to disable the rules completely.

False negatives are mostly in test files or "glue code" where type annotations are missing. In production code it's more common to write pure functions that take DataFrames as input and output, and if these functions are typed then the rules apply.

The rules can be expanded again when inference of DataFrame and Series types is possible.

MichaReiser · 2024-11-29T11:54:47Z

> False negatives are mostly in test files or "glue code" where type annotations are missing. In production code it's more common to write pure functions that take DataFrames as input and output, and if these functions are typed then the rules apply.

I'm just pointing out that this is currently not the case. It would require adding type inference support for pandas data frames.

sbrugman · 2024-11-29T12:26:45Z

Ah sorry, I meant that when functions are typed, pandas is also imported and thus the "has seen module" logic applies.

AlexWaygood · 2024-11-29T12:42:08Z

It's possible to think of heuristics we could use to avoid some of the new false negatives. For example, this function is pretty clearly using pandas because the function name has pandas in it. Using heuristics like that has the downside that it makes the rule harder to understand for users, however.
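
As a rough illustration of the kind of name-based heuristic described above, the following Python sketch lists functions whose name contains `pandas`. It is only a sketch under stated assumptions: the helper `functions_mentioning_pandas` and the sample source are hypothetical, and nothing like this is part of the PR.

```python
import ast


def functions_mentioning_pandas(source: str) -> list:
    """Return the names of functions in `source` whose name contains 'pandas'."""
    return sorted(
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and "pandas" in node.name.lower()
    )


# Hypothetical glue code with no pandas import in the file.
sample_src = (
    "def get_pandas_df(hook, sql):\n"
    "    return hook.run(sql)\n"
    "\n"
    "def get_records(hook, sql):\n"
    "    return hook.run(sql)\n"
)
```

A linter could treat the bodies of such functions as pandas code even without an import, at the cost of behavior that is harder to explain to users.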

MichaReiser · 2024-11-29T12:48:25Z

@AlexWaygood I don't think this is worth special casing because it only helps in a few limited cases.

AlexWaygood · 2024-11-29T12:58:15Z

> @AlexWaygood I don't think this is worth special casing because it only helps in a few limited cases.

It would actually get rid of seven new false negatives on airflow highlighted by the ecosystem check:

But no others. So, agreed that it probably isn't worth special-casing.

AlexWaygood

This change makes sense to me.

There are also some other libraries that the pandas docs specifically call out as either being pandas plugins/extensions, or as making heavy use of pandas. Possibly we could also emit these lints if one of those has been imported: https://pandas.pydata.org/community/ecosystem.html
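
A minimal Python sketch of what such an extended gate could look like, assuming a configurable set of trigger modules. The set contents are illustrative picks rather than a list taken from this PR or from Ruff, and the helper names are hypothetical.

```python
import ast

# Illustrative only: which modules should count is an assumption, not part of this PR.
PANDAS_LIKE_MODULES = frozenset({"pandas", "geopandas"})


def imported_top_level_modules(source: str) -> set:
    """Collect the top-level package names of all absolute imports in `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules.add(node.module.split(".")[0])
    return modules


def should_run_pandas_rules(source: str, trigger_modules=PANDAS_LIKE_MODULES) -> bool:
    """Run the pandas rules only if at least one trigger module is imported."""
    return not imported_top_level_modules(source).isdisjoint(trigger_modules)
```

Keeping the set small limits the number of modules to maintain, which is the concern raised in the reply below.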

MichaReiser · 2024-11-29T21:32:46Z

I like the idea but I'll leave it as is for now. I'm not sure how I feel about adding that many modules. Let's see if it comes up in user reports.

Skip panda rules if panda module hasn't been seen

b704cad

MichaReiser added the `rule` (Implementing or modifying a lint rule) and `bug` (Something isn't working) labels Nov 29, 2024

MichaReiser requested a review from charliermarsh November 29, 2024 07:48

AlexWaygood approved these changes Nov 29, 2024

MichaReiser merged commit 90487b8 into main Nov 29, 2024
21 checks passed

MichaReiser deleted the micha/disable-pd-rules-for-non-pd-code branch November 29, 2024 21:32

MichaReiser mentioned this pull request Dec 2, 2024

Revert: [pyflakes] Avoid false positives in @no_type_check contexts (F821, F722) (#14615) #14726

Merged

BrewTestBot mentioned this pull request Dec 5, 2024

ruff 0.8.2 Homebrew/homebrew-core#200187

Merged

MichaReiser mentioned this pull request Dec 21, 2024

False PD013 with pymc #15092

Closed

dhruvmanila mentioned this pull request Jan 29, 2025

PD rules trigger on non-Pandas DataFrames #6432

Open
