Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in .ww.describe() that stems from the computation of percentile: will either give KeyError or return the wrong value. #1872

Open
JanetMatsen opened this issue Sep 12, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@JanetMatsen
Copy link

The problem

There is a .iat missing in the definition of percentile that is used by .ww.describe() (see here).

Reproducible example, using woodwork==0.31.0

This demonstrates the bug:

import pandas as pd
df = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df['new_index'] = [3, 4, 5, 6, 7]     #  Note that for this example, the 1st index value can't be 1.  Use any other integer. 
df.set_index('new_index', inplace=True)
import woodwork as ww
df.ww.init()
df.ww.describe()

The error is KeyError: 1 because it's trying to look up index 1 in the a Series. (Here I've set it to 3).

The bug comes from the woodwork definition of percentile for a pandas series:

import math
def percentile(N, percent, count):                                                                     
    """                                                                                                
    Source: https://stackoverflow.com/questions/2374640/how-do-i-calculate-percentiles-with-python-nump
                                                                                                       
    Find the percentile of a list of values.                                                           
                                                                                                       
    Args:                                                                                              
        N (pd.Series): Series is a list of values. Note N MUST BE already sorted.                      
        percent (float): float value from 0.0 to 1.0.                                                  
        count (int): Count of values in series                                                         
                                                                                                       
    Returns:                                                                                           
        Value at percentile.                                                                           
    """                                                                                                
    k = (count - 1) * percent                                                                          
    f = math.floor(k)                                                                                  
    c = math.ceil(k)                                                                                   
    if f == c:                                                                                         
        return N[int(k)]        # <-- the stack overflow question was written for lists, and this line needs to be modified for pd.Series.                                                           
    d0 = N.iat[int(f)] * (c - k)                                                                       
    d1 = N.iat[int(c)] * (k - f)                                                                       
    return d0 + d1  

Run this example by setting series = df['a'] and running percentile() directly.

series = df['a']
percentile(N=series, percent=0.25, count=len(series))

Result:

In [3]: series = df['a']
   ...: percentile(N=series, percent=0.25, count=len(series))
   ...:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.pyenv/versions/3.11.9/envs/240912_woodwork_bug_minimal_example/lib/python3.11/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File ~/.pyenv/versions/3.11.9/envs/240912_woodwork_bug_minimal_example/lib/python3.11/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File ~/.pyenv/versions/3.11.9/envs/240912_woodwork_bug_minimal_example/lib/python3.11/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:2606, in pandas._libs.hashtable.Int64HashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:2630, in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 1

Note that in cases when int(k) happens to be a value in the index it will give you an answer, but there is no reason it will actually correspond to the intended position!

If I'd had

df2 = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df2['new_index'] = [3, 4, 5, 6, 1] 
df2.set_index('new_index', inplace=True) 

then percentile(N=df2['a'], percent=0.25, count=len(series)) gives 5 for the 25th percentile 😱

In [9]: df2 = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
   ...: df2['new_index'] = [3, 4, 5, 6, 1]
   ...: df2.set_index('new_index', inplace=True)

In [10]: df2
Out[10]:
           a
new_index
3          1
4          2
5          3
6          4
1          5

In [11]: percentile(N=df2['a'], percent=0.25, count=len(series))
Out[11]: 5

The fix

The fix is quite simple. Add a .iat like the lines below have.

import math 

# Proposed fix (add .iat to the if f==c case):
def percentile(N, percent, count):                                                                     
    """                                                                                                
    Source: https://stackoverflow.com/questions/2374640/how-do-i-calculate-percentiles-with-python-nump
                                                                                                       
    Find the percentile of a list of values.                                                           
                                                                                                       
    Args:                                                                                              
        N (pd.Series): Series is a list of values. Note N MUST BE already sorted.                      
        percent (float): float value from 0.0 to 1.0.                                                  
        count (int): Count of values in series                                                         
                                                                                                       
    Returns:                                                                                           
        Value at percentile.                                                                           
    """                                                                                                
    k = (count - 1) * percent                                                                          
    f = math.floor(k)                                                                                  
    c = math.ceil(k)                                                                                   
    if f == c:                                                                                         
        return N.iat[int(k)]      # Fix is just addint .iat                                                                         
    d0 = N.iat[int(f)] * (c - k)                                                                       
    d1 = N.iat[int(c)] * (k - f)                                                                       
    return d0 + d1  

Now it works:

In [5]: series = df['a']
   ...: percentile(N=series, percent=0.25, count=len(series))
Out[5]: 2
@JanetMatsen JanetMatsen added the bug Something isn't working label Sep 12, 2024
@JanetMatsen
Copy link
Author

I tried to push a (one-line) fix to this, but don't seem to have permission to push.

(venv) ➜  woodwork git:(issue1872-bugfix_in_describe_percentile) ✗ 
git push origin issue1872-bugfix_in_describe_percentile
ERROR: Permission to alteryx/woodwork.git denied to JanetMatsen.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

@JanetMatsen
Copy link
Author

Hey y'all. I'm hitting this again. Any chance someone can add this tiny fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant