Skip panda rules if panda module hasn't been seen #14671

Merged
merged 1 commit into from
Nov 29, 2024

@MichaReiser MichaReiser commented Nov 29, 2024

Summary

Issues about PD011 and other pandas rules triggering on non-pandas code are frequent:

There are probably more. At this point, I consider the rule harmful to the majority of users.

This PR adds a `seen_module` check for all pandas rules. This likely results in more false negatives, but I consider this the better outcome, considering how many users the rules have confused in the past.

A proper fix requires type inference.
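
To illustrate the idea behind the `seen_module` gate, here is a minimal Python sketch. It is not Ruff's implementation (Ruff is written in Rust, and its code is not shown in this thread). `imports_module` and `find_values_access` are hypothetical names, and the `.values` check is only a rough stand-in for PD011.

```python
import ast


def imports_module(tree: ast.AST, module: str) -> bool:
    """Return True if `module` (or one of its submodules) is imported anywhere in the tree."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == module for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) never name the top-level package.
            if node.level == 0 and (node.module or "").split(".")[0] == module:
                return True
    return False


def find_values_access(source: str) -> list[tuple[int, int]]:
    """Hypothetical PD011-like check: report `.values` attribute reads.

    The check is gated on a pandas import. Without the import, nothing is
    reported, which trades false positives for false negatives.
    """
    tree = ast.parse(source)
    if not imports_module(tree, "pandas"):
        return []  # gate closed: the file never imported pandas

    # Attribute nodes used as the callee of a call, e.g. `d.values()`, are skipped.
    called = {id(n.func) for n in ast.walk(tree) if isinstance(n, ast.Call)}
    return [
        (node.lineno, node.col_offset)
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute)
        and node.attr == "values"
        and isinstance(node.ctx, ast.Load)
        and id(node) not in called
    ]
```

In this sketch, a file that only imports polars yields no findings no matter how many `.values` reads it contains, while the same file with `import pandas` reports each of them. Because the gate is per file and not per expression, it cannot tell a pandas object from any other object once pandas has been imported. That is why a proper fix still needs type inference.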

Test Plan

I verified that the following code snippet doesn't trigger a pandas violation anymore, unless I change the import to pandas:

import polars as pl

pl_df = pl.DataFrame([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
pl_df.pivot(on="a", values="b")

@MichaReiser MichaReiser added the rule (Implementing or modifying a lint rule) and bug (Something isn't working) labels Nov 29, 2024
Contributor

ruff-ecosystem results

Linter (stable)

ℹ️ ecosystem check detected linter changes. (+0 -98 violations, +0 -0 fixes in 4 projects; 51 projects unchanged)

apache/airflow (+0 -27 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --no-preview --select ALL

- providers/src/airflow/providers/google/cloud/hooks/vertex_ai/generative_model.py:192:16: PD011 Use `.to_numpy()` instead of `.values`
- providers/src/airflow/providers/google/cloud/hooks/vertex_ai/generative_model.py:348:16: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/drill/hooks/test_drill.py:103:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/druid/hooks/test_druid.py:456:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/impala/hooks/test_impala.py:120:33: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/impala/hooks/test_impala.py:121:33: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/pinot/hooks/test_pinot.py:274:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/elasticsearch/hooks/test_elasticsearch.py:118:37: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/elasticsearch/hooks/test_elasticsearch.py:119:37: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:112:28: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:134:25: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:162:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:192:31: PD011 Use `.to_numpy()` instead of `.values`
... 14 additional changes omitted for project

apache/superset (+0 -11 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --no-preview --select ALL

- superset/db_engine_specs/clickhouse.py:174:23: PD011 Use `.to_numpy()` instead of `.values`
- tests/integration_tests/model_tests.py:363:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:366:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:373:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:376:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:491:22: PD011 Use `.to_numpy()` instead of `.values`
- tests/integration_tests/model_tests.py:493:22: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/columnar_reader_test.py:123:21: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/columnar_reader_test.py:196:12: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/csv_reader_test.py:255:21: PD011 Use `.to_numpy()` instead of `.values`
... 2 additional changes omitted for rule PD011
... 1 additional changes omitted for project

bokeh/bokeh (+0 -56 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --no-preview --select ALL

- examples/plotting/periodic_shells.py:46:29: PD011 Use `.to_numpy()` instead of `.values`
- examples/topics/stats/density.py:29:12: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:517:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:518:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:519:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:520:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:521:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:533:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:544:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:545:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:546:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:547:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:548:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:565:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:585:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:586:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:587:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:588:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:589:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:601:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:617:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:618:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:619:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:620:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:621:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:633:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:644:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:645:17: PD011 Use `.to_numpy()` instead of `.values`
... 28 additional changes omitted for project

latchbio/latch (+0 -4 violations, +0 -0 fixes)

- src/latch/registry/record.py:306:12: PD011 Use `.to_numpy()` instead of `.values`
- src/latch/registry/record.py:309:16: PD011 Use `.to_numpy()` instead of `.values`
- src/latch/registry/utils.py:438:19: PD011 Use `.to_numpy()` instead of `.values`
- src/latch_cli/services/get_params.py:330:37: PD011 Use `.to_numpy()` instead of `.values`

Changes by rule (2 rules affected)

| code | total | + violation | - violation | + fix | - fix |
| --- | --- | --- | --- | --- | --- |
| PD011 | 94 | 0 | 94 | 0 | 0 |
| PD009 | 4 | 0 | 4 | 0 | 0 |

Linter (preview)

ℹ️ ecosystem check detected linter changes. (+0 -98 violations, +0 -0 fixes in 4 projects; 51 projects unchanged)

apache/airflow (+0 -27 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL

- providers/src/airflow/providers/google/cloud/hooks/vertex_ai/generative_model.py:192:16: PD011 Use `.to_numpy()` instead of `.values`
- providers/src/airflow/providers/google/cloud/hooks/vertex_ai/generative_model.py:348:16: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/drill/hooks/test_drill.py:103:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/druid/hooks/test_druid.py:456:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/impala/hooks/test_impala.py:120:33: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/impala/hooks/test_impala.py:121:33: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/pinot/hooks/test_pinot.py:274:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/elasticsearch/hooks/test_elasticsearch.py:118:37: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/elasticsearch/hooks/test_elasticsearch.py:119:37: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:112:28: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:134:25: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:162:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:192:31: PD011 Use `.to_numpy()` instead of `.values`
... 14 additional changes omitted for project

apache/superset (+0 -11 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL

- superset/db_engine_specs/clickhouse.py:174:23: PD011 Use `.to_numpy()` instead of `.values`
- tests/integration_tests/model_tests.py:363:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:366:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:373:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:376:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:491:22: PD011 Use `.to_numpy()` instead of `.values`
- tests/integration_tests/model_tests.py:493:22: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/columnar_reader_test.py:123:21: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/columnar_reader_test.py:196:12: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/csv_reader_test.py:255:21: PD011 Use `.to_numpy()` instead of `.values`
... 2 additional changes omitted for rule PD011
... 1 additional changes omitted for project

bokeh/bokeh (+0 -56 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL

- examples/plotting/periodic_shells.py:46:29: PD011 Use `.to_numpy()` instead of `.values`
- examples/topics/stats/density.py:29:12: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:517:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:518:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:519:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:520:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:521:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:533:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:544:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:545:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:546:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:547:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:548:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:565:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:585:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:586:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:587:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:588:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:589:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:601:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:617:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:618:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:619:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:620:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:621:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:633:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:644:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:645:17: PD011 Use `.to_numpy()` instead of `.values`
... 28 additional changes omitted for project

latchbio/latch (+0 -4 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

- src/latch/registry/record.py:306:12: PD011 Use `.to_numpy()` instead of `.values`
- src/latch/registry/record.py:309:16: PD011 Use `.to_numpy()` instead of `.values`
- src/latch/registry/utils.py:438:19: PD011 Use `.to_numpy()` instead of `.values`
- src/latch_cli/services/get_params.py:330:37: PD011 Use `.to_numpy()` instead of `.values`

Changes by rule (2 rules affected)

| code | total | + violation | - violation | + fix | - fix |
| --- | --- | --- | --- | --- | --- |
| PD011 | 94 | 0 | 94 | 0 | 0 |
| PD009 | 4 | 0 | 4 | 0 | 0 |

@MichaReiser
Member Author

As expected, this increases false negatives but also reduces false positives. The airflow and bokeh results are a mix of new false negatives and fewer false positives. The superset results are almost exclusively new false negatives.

@sbrugman
Contributor

As someone who works a lot with pandas, Spark, Polars, etc., I think this is the right trade-off for now.

False positives are a reason to disable the rules completely.

False negatives are mostly in test files or "glue code" where type annotations are missing. In production code it's more common to write pure functions that take DataFrames as input and output, and if these functions are typed then the rules apply.

The rules can be expanded again when inference of DataFrame and Series types is possible.

@MichaReiser
Member Author

> False negatives are mostly in test files or "glue code" where type annotations are missing. In production code it's more common to write pure functions that take DataFrames as input and output, and if these functions are typed then the rules apply.

I'm just pointing out that this is currently not the case. It would require adding type inference support for pandas data frames.

@sbrugman
Contributor

sbrugman commented Nov 29, 2024

Ah sorry, I meant that when functions are typed, pandas is also imported and thus the "has seen module" logic applies.
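
The point above can be sketched in a few lines of Python. This is an illustration only, not Ruff's code: `has_seen_pandas` is a hypothetical helper, and the thread does not confirm whether Ruff's actual check treats imports under `TYPE_CHECKING` the same way this sketch does.

```python
import ast


def has_seen_pandas(source: str) -> bool:
    """Return True if the file imports pandas anywhere, including nested blocks."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == "pandas" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.level == 0 and (node.module or "").split(".")[0] == "pandas":
                return True
    return False


# A typed function has to import pandas to spell its annotations,
# so a per-file "has seen module" check passes for this file.
TYPED = """
import pandas as pd

def top_rows(df: pd.DataFrame) -> pd.DataFrame:
    return df.head()
"""

# Untyped glue code can handle a DataFrame without ever importing pandas,
# so the same check fails and the rules stay silent (a false negative).
UNTYPED_GLUE = """
def top_rows(df):
    return df.head()
"""
```

Here `has_seen_pandas(TYPED)` is true and `has_seen_pandas(UNTYPED_GLUE)` is false. The annotation itself plays no role in the decision. Only the import does, which is consistent with the earlier remark that annotation-based detection would need type inference.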

@AlexWaygood
Member

It's possible to think of heuristics we could use to avoid some of the new false negatives. For example, this function is pretty clearly using pandas because the function name has pandas in it. Using heuristics like that has the downside that it makes the rule harder to understand for users, however.

@MichaReiser
Member Author

@AlexWaygood I don't think this is worth special casing because it only helps in a few limited cases.

@AlexWaygood AlexWaygood left a comment

This change makes sense to me.

There are also some other libraries that the pandas docs specifically call out as either being pandas plugins/extensions, or as making heavy use of pandas. Possibly we could also emit these lints if one of those has been imported: https://pandas.pydata.org/community/ecosystem.html

@MichaReiser
Member Author

I like the idea but I'll leave it as is for now. I'm not sure how I feel about adding that many modules. Let's see if it comes up in user reports.

@MichaReiser MichaReiser merged commit 90487b8 into main Nov 29, 2024
21 checks passed
@MichaReiser MichaReiser deleted the micha/disable-pd-rules-for-non-pd-code branch November 29, 2024 21:32