-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[
pandas-vet
] series constant series (#5802)
## Summary Implementation for #5588 Q1: are there any additional semantic helpers that could be used to guard this rule? Which existing rules should be similar in that respect? Can we at least check if `pandas` is imported (any pointers welcome)? Currently, the rule flags: ```python data = {"a": "b"} data.nunique() == 1 ``` Q2: Any pointers on naming of the rule and selection of the code? It was proposed, but not replied to/implemented in the upstream. `pandas` did accept a PR to update their cookbook to reflect this rule though. ## Test Plan TODO: - [X] Checking for ecosystem CI results - [x] Test on selected [real-world cases](https://github.com/search?q=%22nunique%28%29+%3D%3D+1%22+language%3APython+&type=code) - [x] https://github.com/sdv-dev/SDMetrics - [x] https://github.com/google-research/robustness_metrics - [x] https://github.com/soft-matter/trackpy - [x] https://github.com/microsoft/FLAML/ - [ ] Add guarded test cases
- Loading branch information
Showing
8 changed files
with
287 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
import pandas as pd | ||
|
||
|
||
data = pd.Series(range(1000)) | ||
|
||
# PD101 | ||
data.nunique() <= 1 | ||
data.nunique(dropna=True) <= 1 | ||
data.nunique(dropna=False) <= 1 | ||
data.nunique() == 1 | ||
data.nunique(dropna=True) == 1 | ||
data.nunique(dropna=False) == 1 | ||
data.nunique() != 1 | ||
data.nunique(dropna=True) != 1 | ||
data.nunique(dropna=False) != 1 | ||
data.nunique() > 1 | ||
data.dropna().nunique() == 1 | ||
data[data.notnull()].nunique() == 1 | ||
|
||
# No violation of this rule | ||
data.nunique() == 0 # empty | ||
data.nunique() >= 1 # not-empty | ||
data.nunique() < 1 # empty | ||
data.nunique() == 2 # not constant | ||
data.unique() == 1 # not `nunique` | ||
|
||
{"hello": "world"}.nunique() == 1 # no pd.Series |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
122 changes: 122 additions & 0 deletions
122
crates/ruff/src/rules/pandas_vet/rules/nunique_constant_series_check.rs
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
use num_traits::One; | ||
use rustpython_parser::ast::{self, CmpOp, Constant, Expr, Ranged}; | ||
|
||
use ruff_diagnostics::Diagnostic; | ||
use ruff_diagnostics::Violation; | ||
use ruff_macros::{derive_message_formats, violation}; | ||
|
||
use crate::checkers::ast::Checker; | ||
use crate::rules::pandas_vet::helpers::{test_expression, Resolution}; | ||
|
||
/// ## What it does | ||
/// Check for uses of `.nunique()` to check if a Pandas Series is constant | ||
/// (i.e., contains only one unique value). | ||
/// | ||
/// ## Why is this bad? | ||
/// `.nunique()` is computationally inefficient for checking if a Series is | ||
/// constant. | ||
/// | ||
/// Consider, for example, a Series of length `n` that consists of increasing | ||
/// integer values (e.g., 1, 2, 3, 4). The `.nunique()` method will iterate | ||
/// over the entire Series to count the number of unique values. But in this | ||
/// case, we can detect that the Series is non-constant after visiting the | ||
/// first two values, which are non-equal. | ||
/// | ||
/// In general, `.nunique()` requires iterating over the entire Series, while a | ||
/// more efficient approach allows short-circuiting the operation as soon as a | ||
/// non-equal value is found. | ||
/// | ||
/// Instead of calling `.nunique()`, convert the Series to a NumPy array, and | ||
/// check if all values in the array are equal to the first observed value. | ||
/// | ||
/// ## Example | ||
/// ```python | ||
/// import pandas as pd | ||
/// | ||
/// data = pd.Series(range(1000)) | ||
/// if data.nunique() <= 1: | ||
/// print("Series is constant") | ||
/// ``` | ||
/// | ||
/// Use instead: | ||
/// ```python | ||
/// import pandas as pd | ||
/// | ||
/// data = pd.Series(range(1000)) | ||
/// array = data.to_numpy() | ||
/// if array.shape[0] == 0 or (array[0] == array).all(): | ||
/// print("Series is constant") | ||
/// ``` | ||
/// | ||
/// ## References | ||
/// - [Pandas Cookbook: "Constant Series"](https://pandas.pydata.org/docs/user_guide/cookbook.html#constant-series) | ||
/// - [Pandas documentation: `nunique`](https://pandas.pydata.org/docs/reference/api/pandas.Series.nunique.html) | ||
#[violation] | ||
pub struct PandasNuniqueConstantSeriesCheck; | ||
|
||
impl Violation for PandasNuniqueConstantSeriesCheck { | ||
#[derive_message_formats] | ||
fn message(&self) -> String { | ||
format!("Using `series.nunique()` for checking that a series is constant is inefficient") | ||
} | ||
} | ||
|
||
/// PD101 | ||
pub(crate) fn nunique_constant_series_check( | ||
checker: &mut Checker, | ||
expr: &Expr, | ||
left: &Expr, | ||
ops: &[CmpOp], | ||
comparators: &[Expr], | ||
) { | ||
let ([op], [right]) = (ops, comparators) else { | ||
return; | ||
}; | ||
|
||
// Operators may be ==, !=, <=, >. | ||
if !matches!(op, CmpOp::Eq | CmpOp::NotEq | CmpOp::LtE | CmpOp::Gt,) { | ||
return; | ||
} | ||
|
||
// Right should be the integer 1. | ||
if !is_constant_one(right) { | ||
return; | ||
} | ||
|
||
// Check if call is `.nuniuqe()`. | ||
let Expr::Call(ast::ExprCall { func, .. }) = left else { | ||
return; | ||
}; | ||
|
||
let Expr::Attribute(ast::ExprAttribute { value, attr, .. }) = func.as_ref() else { | ||
return; | ||
}; | ||
|
||
if attr.as_str() != "nunique" { | ||
return; | ||
} | ||
|
||
// Avoid flagging on non-Series (e.g., `{"a": 1}.at[0]`). | ||
if !matches!( | ||
test_expression(value, checker.semantic()), | ||
Resolution::RelevantLocal | ||
) { | ||
return; | ||
} | ||
|
||
checker.diagnostics.push(Diagnostic::new( | ||
PandasNuniqueConstantSeriesCheck, | ||
expr.range(), | ||
)); | ||
} | ||
|
||
/// Return `true` if an [`Expr`] is a constant `1`. | ||
fn is_constant_one(expr: &Expr) -> bool { | ||
match expr { | ||
Expr::Constant(constant) => match &constant.value { | ||
Constant::Int(int) => int.is_one(), | ||
_ => false, | ||
}, | ||
_ => false, | ||
} | ||
} |
122 changes: 122 additions & 0 deletions
122
...s/ruff/src/rules/pandas_vet/snapshots/ruff__rules__pandas_vet__tests__PD101_PD101.py.snap
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
--- | ||
source: crates/ruff/src/rules/pandas_vet/mod.rs | ||
--- | ||
PD101.py:7:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
6 | # PD101 | ||
7 | data.nunique() <= 1 | ||
| ^^^^^^^^^^^^^^^^^^^ PD101 | ||
8 | data.nunique(dropna=True) <= 1 | ||
9 | data.nunique(dropna=False) <= 1 | ||
| | ||
|
||
PD101.py:8:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
6 | # PD101 | ||
7 | data.nunique() <= 1 | ||
8 | data.nunique(dropna=True) <= 1 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ PD101 | ||
9 | data.nunique(dropna=False) <= 1 | ||
10 | data.nunique() == 1 | ||
| | ||
|
||
PD101.py:9:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
7 | data.nunique() <= 1 | ||
8 | data.nunique(dropna=True) <= 1 | ||
9 | data.nunique(dropna=False) <= 1 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ PD101 | ||
10 | data.nunique() == 1 | ||
11 | data.nunique(dropna=True) == 1 | ||
| | ||
|
||
PD101.py:10:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
8 | data.nunique(dropna=True) <= 1 | ||
9 | data.nunique(dropna=False) <= 1 | ||
10 | data.nunique() == 1 | ||
| ^^^^^^^^^^^^^^^^^^^ PD101 | ||
11 | data.nunique(dropna=True) == 1 | ||
12 | data.nunique(dropna=False) == 1 | ||
| | ||
|
||
PD101.py:11:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
9 | data.nunique(dropna=False) <= 1 | ||
10 | data.nunique() == 1 | ||
11 | data.nunique(dropna=True) == 1 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ PD101 | ||
12 | data.nunique(dropna=False) == 1 | ||
13 | data.nunique() != 1 | ||
| | ||
|
||
PD101.py:12:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
10 | data.nunique() == 1 | ||
11 | data.nunique(dropna=True) == 1 | ||
12 | data.nunique(dropna=False) == 1 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ PD101 | ||
13 | data.nunique() != 1 | ||
14 | data.nunique(dropna=True) != 1 | ||
| | ||
|
||
PD101.py:13:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
11 | data.nunique(dropna=True) == 1 | ||
12 | data.nunique(dropna=False) == 1 | ||
13 | data.nunique() != 1 | ||
| ^^^^^^^^^^^^^^^^^^^ PD101 | ||
14 | data.nunique(dropna=True) != 1 | ||
15 | data.nunique(dropna=False) != 1 | ||
| | ||
|
||
PD101.py:14:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
12 | data.nunique(dropna=False) == 1 | ||
13 | data.nunique() != 1 | ||
14 | data.nunique(dropna=True) != 1 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ PD101 | ||
15 | data.nunique(dropna=False) != 1 | ||
16 | data.nunique() > 1 | ||
| | ||
|
||
PD101.py:15:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
13 | data.nunique() != 1 | ||
14 | data.nunique(dropna=True) != 1 | ||
15 | data.nunique(dropna=False) != 1 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ PD101 | ||
16 | data.nunique() > 1 | ||
17 | data.dropna().nunique() == 1 | ||
| | ||
|
||
PD101.py:16:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
14 | data.nunique(dropna=True) != 1 | ||
15 | data.nunique(dropna=False) != 1 | ||
16 | data.nunique() > 1 | ||
| ^^^^^^^^^^^^^^^^^^ PD101 | ||
17 | data.dropna().nunique() == 1 | ||
18 | data[data.notnull()].nunique() == 1 | ||
| | ||
|
||
PD101.py:17:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
15 | data.nunique(dropna=False) != 1 | ||
16 | data.nunique() > 1 | ||
17 | data.dropna().nunique() == 1 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ PD101 | ||
18 | data[data.notnull()].nunique() == 1 | ||
| | ||
|
||
PD101.py:18:1: PD101 Using `series.nunique()` for checking that a series is constant is inefficient | ||
| | ||
16 | data.nunique() > 1 | ||
17 | data.dropna().nunique() == 1 | ||
18 | data[data.notnull()].nunique() == 1 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ PD101 | ||
19 | | ||
20 | # No violation of this rule | ||
| | ||
|
||
|
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.