ENH: Add a Series method which checks whether a Series is constant #58806
Comments
I don't think we should create a function that can be achieved by 1 line of code just because that line of code is not readable.
    if v.shape[0] != 0:
        is_constant = (s[0] == s).all()
    else:
        is_constant = True
There was an issue suggesting the same feature (#54033), but it was closed without any discussion; we can continue the discussion here. I'm ok with adding this after reading @sbrugman's valid points in the original issue.
This will be a great addition
The proposed
    array = s.dropna().values
    is_constant = array.shape[0] == 0 or (array[0] == array).all()
I posted my finding here: astral-sh/ruff#11910. However, this solution is still O(N) and not short-circuiting. For large Series that are non-constant with high likelihood, naive Python code can be orders of magnitude faster.
    import pandas as pd
    import numpy as np

    def is_constant(array):
        if len(array) <= 1:
            return True
        first = array[0]
        return all(item == first for item in array)

    const = pd.Series(np.ones(1_000_000)).values
    irreg = pd.Series(np.random.randn(1_000_000)).values
    %timeit is_constant(const)         # 72.7 ms ± 1.66 ms
    %timeit (const[0] == const).all()  # 144 µs ± 2.3 µs
    %timeit is_constant(irreg)         # 968 ns ± 6.7 ns
    %timeit (irreg[0] == irreg).all()  # 129 µs ± 132 ns
With numba-jit we can further drastically improve the performance
    import pandas as pd
    import numpy as np
    import numba

    @numba.njit
    def is_constant(array):
        if len(array) <= 1:
            return True
        first = array[0]
        for item in array:
            if item != first:
                return False
        return True

    const = pd.Series(np.ones(1_000_000)).values
    irreg = pd.Series(np.random.randn(1_000_000)).values
    %timeit is_constant(const)         # 457 µs ± 5.42 µs (instead of 72 ms)
    %timeit (const[0] == const).all()  # 136 µs ± 311 ns
    %timeit is_constant(irreg)         # 242 ns ± 1.52 ns (instead of 968 ns)
    %timeit (irreg[0] == irreg).all()  # 128 µs ± 2.15 µs
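For readers without numba, one possible middle ground between the fully vectorized comparison and the element-by-element loop is to compare in chunks and stop at the first chunk containing a mismatch. The following is only a sketch: the name is_constant_chunked and the chunk_size default are assumptions made for illustration, nothing like it exists in pandas or NumPy, and no timings are claimed for it. It also assumes missing values were already dropped, since NaN never compares equal to itself.

```python
import numpy as np


def is_constant_chunked(array: np.ndarray, chunk_size: int = 4096) -> bool:
    """Return True if every element equals the first one.

    Hypothetical helper for illustration; not part of pandas or NumPy.
    Assumes missing values have already been dropped.
    """
    if array.shape[0] <= 1:
        return True
    first = array[0]
    for start in range(0, array.shape[0], chunk_size):
        # Vectorized comparison inside one chunk; return as soon as a
        # chunk contains a value that differs from the first element.
        if not (array[start:start + chunk_size] == first).all():
            return False
    return True


const = np.ones(1_000_000)
irreg = np.random.default_rng(0).standard_normal(1_000_000)
print(is_constant_chunked(const))  # True
print(is_constant_chunked(irreg))  # False
```

For a constant array this still visits every element, so the worst case stays O(N); the early exit only pays off when a mismatch is likely to appear near the start.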
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
In the cookbook, a recipe is given for checking that a Series only contains constant values in a performant way:
https://pandas.pydata.org/docs/user_guide/cookbook.html#constant-series
is_constant = v.shape[0] == 0 or (s[0] == s).all()
To me, this has poor readability and is difficult to learn as an idiom because it requires the programmer to remember to check the edge case of .shape[0] == 0, and to remember to check the cases of missing values / NaN values, which need to be handled differently (as explained in the cookbook).
Feature Description
It would be nice to have a convenience function which provides a performant is_constant check on a Series. It could have optional arguments to configure how missing values are handled.
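As a rough illustration of what such a convenience function might look like, here is a sketch written as a free function. The name is_constant and the dropna parameter are hypothetical (pandas does not provide them); the dropna semantics are assumed to mirror .nunique(dropna=...) <= 1, and the comparisons follow the cookbook idiom.

```python
import pandas as pd


def is_constant(s: pd.Series, dropna: bool = True) -> bool:
    """Hypothetical helper; pandas does not provide this function.

    dropna=True:  missing values are ignored.
    dropna=False: missing values count as a value of their own, so an
                  all-NaN Series is constant, while a Series mixing NaN
                  with other values is not.
    """
    if dropna:
        v = s.dropna().to_numpy()
        return bool(v.shape[0] == 0 or (v[0] == v).all())
    v = s.to_numpy()
    if v.shape[0] == 0:
        return True
    missing = pd.isna(v)
    if missing.any():
        # NaN != NaN, so decide this case from the missing-value mask.
        return bool(missing.all())
    return bool((v[0] == v).all())


print(is_constant(pd.Series([1.0, 1.0, None])))                # True
print(is_constant(pd.Series([1.0, 1.0, None]), dropna=False))  # False
print(is_constant(pd.Series([], dtype="float64")))             # True
```

Keeping the missing-value policy in a single keyword argument means callers do not have to remember the empty-Series and NaN edge cases themselves.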
Alternative Solutions
The alternative is just to require the user to detect the poorly performing code, possibly automatically with a linter (see below), and come up with a performant solution for their case, possibly using the cookbook. Otherwise, the simple .nunique(dropna=...) <= 1 solution is convenient enough when performance is not a concern.
Additional Context
I came across this when using a pandas-vet rule via ruff: PD101
I like the linter to detect performance issues like this one, but I prefer that they don't harm readability if possible.