ENH: Add a Series method which checks whether a Series is constant #58806

Open
1 of 3 tasks
nathanjmcdougall opened this issue May 22, 2024 · 4 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action

Comments

@nathanjmcdougall

nathanjmcdougall commented May 22, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

In the cookbook, a recipe is given for checking that a Series only contains constant values in a performant way:

https://pandas.pydata.org/docs/user_guide/cookbook.html#constant-series

v = s.to_numpy()  # v is the Series' underlying NumPy array
is_constant = v.shape[0] == 0 or (s[0] == s).all()

To me, this has poor readability and is difficult to learn as an idiom, because the programmer has to remember to check the edge case of .shape[0] == 0 and to handle missing / NaN values, which need different treatment (as explained in the cookbook).
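The edge cases mentioned above can be illustrated with a small sketch. The function name `cookbook_check` is hypothetical, and the sketch compares against `v[0]` rather than `s[0]` so that it does not depend on the Series having a label `0`; that substitution is an assumption made here, not part of the recipe as quoted.

```python
import numpy as np
import pandas as pd


def cookbook_check(s: pd.Series) -> bool:
    """Hypothetical wrapper around the quoted recipe (positional variant)."""
    v = s.to_numpy()
    # The emptiness test must come first: v[0] would raise on an empty array.
    return v.shape[0] == 0 or bool((v[0] == v).all())


empty_result = cookbook_check(pd.Series([], dtype=float))   # empty counts as constant
plain_result = cookbook_check(pd.Series([1.0, 1.0, 1.0]))   # all values equal
nan_result = cookbook_check(pd.Series([np.nan, np.nan]))    # NaN != NaN, so not "constant"
```

Without extra handling, an all-NaN Series is reported as non-constant because NaN never compares equal to itself, which is the case the idiom's user has to remember.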

Feature Description

It would be nice to have a convenience method that provides a performant is_constant check on a Series.

It could have optional arguments to configure how missing values are handled.
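As a rough sketch of what such a helper might look like, assuming a free function rather than a real Series method: the name `is_constant` and the `dropna` parameter are hypothetical and are not an existing pandas API. With `dropna=False`, this sketch treats all missing values as one shared value, mirroring `.nunique(dropna=False) <= 1`; other conventions are possible.

```python
import numpy as np
import pandas as pd


def is_constant(s: pd.Series, dropna: bool = True) -> bool:
    """Hypothetical helper, not part of pandas."""
    if dropna:
        s = s.dropna()
    v = s.to_numpy()
    if v.shape[0] == 0:
        return True  # an empty Series is treated as constant
    if not dropna:
        missing = pd.isna(v)
        if missing.any():
            # Assumption: missing values count as a single value, so the
            # Series is constant only if every entry is missing.
            return bool(missing.all())
    return bool((v[0] == v).all())


is_constant(pd.Series([1.0, np.nan, 1.0]))                # missing values ignored
is_constant(pd.Series([1.0, np.nan, 1.0]), dropna=False)  # NaN counts as a second value
```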

Alternative Solutions

The alternative is to leave it to the user to detect the slow code, possibly automatically with a linter (see below), and to come up with a performant solution for their case, possibly using the cookbook. Otherwise, the simple .nunique(dropna=...) <= 1 solution is convenient enough when performance is not a concern.
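For reference, the simple `nunique` variant behaves as follows; the example Series are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 2.0, np.nan])

ignoring_na = s.nunique(dropna=True) <= 1    # only the value 2.0 is counted
counting_na = s.nunique(dropna=False) <= 1   # NaN is counted as an extra value
empty_case = pd.Series([], dtype=float).nunique() <= 1  # zero unique values
```

Here `ignoring_na` and `empty_case` evaluate to True, while `counting_na` evaluates to False.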

Additional Context

I came across this when using a pandas-vet rule via ruff: PD101

I like the linter to detect performance issues like this one, but I would prefer that the suggested fixes don't harm readability if possible.

@nathanjmcdougall nathanjmcdougall added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 22, 2024
@Aloqeely
Member

I don't think we should create a function that can be achieved by 1 line of code just because that line of code is not readable.
Code readability is subjective, but you can use an if statement to make it more readable (although it's a bit redundant):

if v.shape[0] != 0:
    is_constant = (s[0] == s).all()
else:
    is_constant = True

There was an issue suggesting the same feature (#54033), but it got closed without any discussion, so we can continue the discussion here. I'm ok with adding this after reading @sbrugman's valid points in the original issue.

@Aloqeely Aloqeely added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 23, 2024
@PushpitSB

This will be a great addition

@miguelpgarcia

miguelpgarcia commented Jun 16, 2024

I don't think we should create a function that can be achieved by 1 line of code just because that line of code is not readable. Code readability is subjective, but you can use an if statement to make it more readable (although it's a bit redundant):

if v.shape[0] != 0:
    is_constant = (s[0] == s).all()
else:
    is_constant = True

There was an issue suggesting the same feature (#54033), but it got closed without any discussion, so we can continue the discussion here. I'm ok with adding this after reading @sbrugman's valid points in the original issue.

The is_unique function is also concise, consisting of just one line of code. Adding this as an official method, rather than leaving it as a recipe, may enhance code consistency (having both is_unique and is_constant methods) and guide users towards a more performant option.

@randolf-scholz
Contributor

randolf-scholz commented Jun 17, 2024

The proposed (s[0] == s).all() is error-prone in edge cases (What if s[0] is NaN? What if s is empty?), and it is actually slower for small Series. Going this route, one should call .dropna() and take .values/.array first.

array = s.dropna().values
is_constant = array.shape[0] == 0 or (array[0] == array).all()

I posted my findings here: astral-sh/ruff#11910. However, this solution is still O(N) and does not short-circuit. For large Series that are non-constant with high likelihood, naive Python code can be orders of magnitude faster.

import pandas as pd
import numpy as np

def is_constant(array):
    if len(array) <= 1:
        return True
    first = array[0]
    return all(item == first for item in array)

# 1,000,000-element test arrays: one constant, one random (almost surely non-constant)
const = pd.Series(np.ones(1_000_000)).values
irreg = pd.Series(np.random.randn(1_000_000)).values

# Timings taken with the IPython %timeit magic
%timeit is_constant(const)         # 72.7 ms ± 1.66 ms
%timeit (const[0] == const).all()  # 144 µs ± 2.3 µs
%timeit is_constant(irreg)         # 968 ns ± 6.7 ns
%timeit (irreg[0] == irreg).all()  # 129 µs ± 132 ns

With numba-jit we can drastically improve the performance further:

import pandas as pd
import numpy as np
import numba

@numba.njit
def is_constant(array):
    if len(array) <= 1:
        return True
    first = array[0]
    for item in array:
        if item != first:
            return False
    return True

const = pd.Series(np.ones(1_000_000)).values
irreg = pd.Series(np.random.randn(1_000_000)).values

%timeit is_constant(const)         # 457 µs ± 5.42 µs  (instead of 72 ms)
%timeit (const[0] == const).all()  # 136 µs ± 311 ns
%timeit is_constant(irreg)         # 242 ns ± 1.52 ns  (instead of 968 ns)
%timeit (irreg[0] == irreg).all()  # 128 µs ± 2.15 µs
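A possible middle ground that avoids the numba dependency is to compare in fixed-size chunks, so that each comparison is vectorized but the loop can still stop at the first mismatching chunk. This is only a sketch: the function name `is_constant_chunked` and the default `chunk_size` are assumptions, it was not benchmarked in this thread, and like the functions above it does not treat NaN specially.

```python
import numpy as np


def is_constant_chunked(array: np.ndarray, chunk_size: int = 4096) -> bool:
    """Hypothetical chunked check: vectorized per chunk, short-circuits between chunks."""
    n = array.shape[0]
    if n <= 1:
        return True
    first = array[0]
    for start in range(0, n, chunk_size):
        # Compare one chunk at a time and stop at the first chunk with a mismatch.
        if not (array[start:start + chunk_size] == first).all():
            return False
    return True


const = np.ones(10_000)
irreg = np.random.default_rng(0).standard_normal(10_000)

const_result = is_constant_chunked(const)
irreg_result = is_constant_chunked(irreg)
```

The chunk size trades per-call overhead against how much work is wasted after the first mismatch; a suitable value would need to be measured.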

5 participants