Skip panda rules if panda module hasn't been seen #14671
Linter (preview)
ℹ️ ecosystem check detected linter changes. (+0 -98 violations, +0 -0 fixes in 4 projects; 51 projects unchanged)
apache/airflow (+0 -27 violations, +0 -0 fixes)
ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL
- providers/src/airflow/providers/google/cloud/hooks/vertex_ai/generative_model.py:192:16: PD011 Use `.to_numpy()` instead of `.values`
- providers/src/airflow/providers/google/cloud/hooks/vertex_ai/generative_model.py:348:16: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/drill/hooks/test_drill.py:103:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/druid/hooks/test_druid.py:456:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/impala/hooks/test_impala.py:120:33: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/impala/hooks/test_impala.py:121:33: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/apache/pinot/hooks/test_pinot.py:274:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/elasticsearch/hooks/test_elasticsearch.py:118:37: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/elasticsearch/hooks/test_elasticsearch.py:119:37: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:112:28: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:134:25: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:162:31: PD011 Use `.to_numpy()` instead of `.values`
- providers/tests/google/suite/hooks/test_sheets.py:192:31: PD011 Use `.to_numpy()` instead of `.values`
... 14 additional changes omitted for project
apache/superset (+0 -11 violations, +0 -0 fixes)
ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL
- superset/db_engine_specs/clickhouse.py:174:23: PD011 Use `.to_numpy()` instead of `.values`
- tests/integration_tests/model_tests.py:363:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:366:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:373:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:376:20: PD009 Use `.iloc` instead of `.iat`. If speed is important, use NumPy.
- tests/integration_tests/model_tests.py:491:22: PD011 Use `.to_numpy()` instead of `.values`
- tests/integration_tests/model_tests.py:493:22: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/columnar_reader_test.py:123:21: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/columnar_reader_test.py:196:12: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit_tests/commands/databases/csv_reader_test.py:255:21: PD011 Use `.to_numpy()` instead of `.values`
... 2 additional changes omitted for rule PD011
... 1 additional changes omitted for project
bokeh/bokeh (+0 -56 violations, +0 -0 fixes)
ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL
- examples/plotting/periodic_shells.py:46:29: PD011 Use `.to_numpy()` instead of `.values`
- examples/topics/stats/density.py:29:12: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:517:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:518:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:519:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:520:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:521:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:533:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:544:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:545:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:546:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:547:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:548:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:565:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:585:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:586:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:587:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:588:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:589:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:601:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:617:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:618:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:619:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:620:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:621:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:633:53: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:644:17: PD011 Use `.to_numpy()` instead of `.values`
- tests/unit/bokeh/test_client_server.py:645:17: PD011 Use `.to_numpy()` instead of `.values`
... 28 additional changes omitted for project
latchbio/latch (+0 -4 violations, +0 -0 fixes)
ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview
- src/latch/registry/record.py:306:12: PD011 Use `.to_numpy()` instead of `.values`
- src/latch/registry/record.py:309:16: PD011 Use `.to_numpy()` instead of `.values`
- src/latch/registry/utils.py:438:19: PD011 Use `.to_numpy()` instead of `.values`
- src/latch_cli/services/get_params.py:330:37: PD011 Use `.to_numpy()` instead of `.values`
Changes by rule (2 rules affected)
code | total | + violation | - violation | + fix | - fix |
---|---|---|---|---|---|
PD011 | 94 | 0 | 94 | 0 | 0 |
PD009 | 4 | 0 | 4 | 0 | 0 |
As expected, this increases false negatives but also reduces false positives. The airflow and bokeh changes are a mix of new false negatives and fewer false positives. The superset changes are almost exclusively new false negatives.
As someone who works a lot with pandas, Spark, Polars, etc., I think this is the right trade-off for now. False positives are a reason to disable the rules completely. False negatives are mostly in test files or "glue code" where type annotations are missing. In production code it's more common to write pure functions that take DataFrames as input and output, and if these functions are typed then the rules apply. The rules can be expanded again when inference of DataFrame and Series types is possible.
I'm just pointing out that this is currently not the case. It would require adding type inference support for pandas DataFrames.
Ah sorry, I meant that when functions are typed,
It's possible to think of heuristics we could use to avoid some of the new false negatives. For example, this function is pretty clearly using pandas because the function name has
@AlexWaygood I don't think this is worth special-casing because it only helps in a few limited cases.
This change makes sense to me.
There are also some other libraries that the pandas docs specifically call out as either being pandas plugins/extensions, or as making heavy use of pandas. Possibly we could also emit these lints if one of those has been imported: https://pandas.pydata.org/community/ecosystem.html
I like the idea but I'll leave it as is for now. I'm not sure how I feel about adding that many modules. Let's see if it comes up in user reports.
Summary
Issues about `PD011` and other pandas rules triggering on non-pandas code are frequent:
- `.isnull()`, not necessarily on a pandas dataframe #11235
- `PD011` and Django's TextChoices #11858
- `PD011` when `pd` does not exist in module #3807
- `polars` dataframes #14301

There are probably more. At this point, I consider the rule harmful to the majority of users.
This PR adds a `seen_module` check for all pandas rules. This likely results in more false negatives, but I consider this the better outcome, considering how many users the rules have confused in the past.
A proper fix requires type inference.
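To make the gating idea concrete, here is a minimal Python sketch of a "only lint if the module has been imported" check. It is not ruff's implementation (ruff is written in Rust, and its rules use more heuristics than this); the names `imports_pandas`, `check`, and `PANDAS_ATTR_RULES` are hypothetical, and the toy only looks at the two attribute names that appear in the ecosystem report (`.values` for PD011, `.iat` for PD009).

```python
import ast

# Hypothetical mapping for this sketch: attribute name -> rule code.
PANDAS_ATTR_RULES = {"values": "PD011", "iat": "PD009"}


def imports_pandas(tree: ast.AST) -> bool:
    """Return True if the module contains `import pandas...` or `from pandas... import ...`."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == "pandas" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # level == 0 excludes relative imports such as `from .pandas import x`.
            if node.level == 0 and (node.module or "").split(".")[0] == "pandas":
                return True
    return False


def check(source: str) -> list[tuple[int, str]]:
    """Report (line, rule code) pairs, but only when pandas has been imported."""
    tree = ast.parse(source)
    if not imports_pandas(tree):
        return []  # module never seen: skip every pandas rule
    return sorted(
        (node.lineno, PANDAS_ATTR_RULES[node.attr])
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute) and node.attr in PANDAS_ATTR_RULES
    )


# These strings are only parsed, never executed, so nothing needs to be installed.
no_pandas = "choices = get_choices()\nprint(choices.values)\n"
with_pandas = "import pandas as pd\ndf = pd.DataFrame({'a': [1]})\nprint(df.values)\n"

print(check(no_pandas))    # []
print(check(with_pandas))  # [(3, 'PD011')]
```

The trade-off discussed in this PR is visible in the sketch: a file that receives a DataFrame from elsewhere without importing pandas itself is silently skipped, which is the new class of false negatives.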
Test Plan
I verified that the following code snippet doesn't trigger a pandas violation anymore, unless I change the import to `pandas`
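The snippet itself is not preserved on this page. As a hypothetical stand-in (not the author's test case), the following module has no pandas import but reads a `.values` attribute, which is the shape of code the linked issues report as being flagged. `Color` and `ChoiceSet` are made-up names for illustration.

```python
# Hypothetical stand-in for the PR's test snippet; the original is not shown here.
# Note that nothing in this module imports pandas.
from enum import Enum


class Color(Enum):
    RED = "r"
    GREEN = "g"


class ChoiceSet:
    """Plain container exposing a `.values` attribute (unrelated to pandas)."""

    def __init__(self, members):
        self.values = [member.value for member in members]


choices = ChoiceSet(Color)
labels = choices.values  # same attribute name that PD011 looks for
print(labels)
```

With the `seen_module` check described in the summary, a file like this should be skipped by the pandas rules because pandas is never imported.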