bug when filling missing values with transform? #14274

Closed
randomgambit opened this issue Sep 21, 2016 · 16 comments
Labels
Bug Error Reporting Incorrect or improved errors from pandas good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions

Comments

@randomgambit

randomgambit commented Sep 21, 2016

Hello there,

Consider this

df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'B': [np.nan, np.nan, np.nan, -4, -2, 5, 8, 7],
                   'C': [-5, 5, -20, 0, np.nan, 5, 4, -4]})

df
Out[13]: 
     B     C group
0  NaN  -5.0     A
1  NaN   5.0     A
2  NaN -20.0     A
3 -4.0   0.0     B
4 -2.0   NaN     B
5  5.0   5.0     B
6  8.0   4.0     B
7  7.0  -4.0     B

Now I want to fill forward the missing values in C for each group


df.groupby('group').C.fillna(method ='ffill')
Out[11]: 
0    -5.0
1     5.0
2   -20.0
3     0.0
4     0.0
5     5.0
6     4.0
7    -4.0
Name: C, dtype: float64

df.groupby('group').C.transform('ffill')
Out[12]: 
0   -5.0
1   -5.0
2   -5.0
3    5.0
4    5.0
5    5.0
6    5.0
7    5.0
dtype: float64

The transform output is wrong.
Is that expected? I am on pandas 0.18.1.

@jreback
Contributor

jreback commented Sep 21, 2016

Yes, this is being hit as a fast transformer. The issue is that the function called internally
returns a valid result that is not one scalar per group, so this should raise.

@jreback jreback added Bug Groupby Error Reporting Incorrect or improved errors from pandas Difficulty Intermediate labels Sep 21, 2016
@jreback jreback added this to the Next Major Release milestone Sep 21, 2016
@jreback
Contributor

jreback commented Sep 21, 2016

Some groupby methods by definition give a transformed output (i.e. they are not reducers), so these shouldn't be allowed in .transform. Examples: reindex, ffill, bfill, nth (if a list is supplied, though you can't pass that directly anyhow), cum*, diff, shift, head, tail, rank (and maybe more).

e.g.

In [4]: g.diff()
Out[4]: 
0     NaN
1    10.0
2   -25.0
3     NaN
4     NaN
5     NaN
6    -1.0
7    -8.0
Name: C, dtype: float64

In [5]: g.transform('diff')
Out[5]: 
0     NaN
1     NaN
2     NaN
3    10.0
4    10.0
5    10.0
6    10.0
7    10.0
Name: C, dtype: float64
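
The comment above does not show how `g` was built. The sketch below assumes it is the grouped `C` column of the frame from the original report, i.e. `df.groupby('group').C`. Under that assumption, the per-group `diff` reproduces the values shown in `Out[4]`:

```python
import numpy as np
import pandas as pd

# Frame from the original report.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "B": [np.nan, np.nan, np.nan, -4, -2, 5, 8, 7],
    "C": [-5, 5, -20, 0, np.nan, 5, 4, -4],
})

# Assumption: `g` in the comment is the grouped column C.
g = df.groupby("group")["C"]

# diff() is computed within each group, so the first row of every group
# is NaN, and any difference that touches a missing value is NaN as well.
per_group_diff = g.diff()
print(per_group_diff)
```

The result has one row per input row, which is why `diff` counts as a transforming method rather than a reducer.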

@jreback
Contributor

jreback commented Sep 21, 2016

xref #9235

@randomgambit
Author

randomgambit commented Sep 21, 2016

Thanks Jeff! Although, if I remember correctly, you suggested in another post that I use
df.groupby('group').mycol.transform('shift',1), and that worked well (faster than with apply). Are you saying that I should not use transform with shift and should use a good old apply instead?

@randomgambit
Author

randomgambit commented Sep 22, 2016

here. #4095

but what worries me is that you are saying in this thread that transform should not be used with shift or other functions. What should I do?

@jreback
Contributor

jreback commented Sep 22, 2016

.apply('shift', -1) is the correct idiom; I was mistaken before.

@randomgambit
Author

randomgambit commented Sep 22, 2016

ok got it... damn I need to make so many changes in my code... :(
so to sum up:

ONLY USE TRANSFORM WITH FUNCTIONS THAT REDUCE THE DATA

such as first, mean, size, etc

otherwise use apply

Is that correct?

Thx
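
To make the "reducing function" case concrete, here is a minimal sketch using the frame from the original report. It only illustrates what `transform` does with a reducer such as `mean`; the column name `C_centered` is made up for the example.

```python
import numpy as np
import pandas as pd

# Frame from the original report.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "B": [np.nan, np.nan, np.nan, -4, -2, 5, 8, 7],
    "C": [-5, 5, -20, 0, np.nan, 5, 4, -4],
})

# A reducer yields one scalar per group; transform repeats that scalar on
# every row of the group, so the result is aligned with the original index.
group_mean = df.groupby("group")["C"].transform("mean")

# Typical use: combine the broadcast value with the original column.
# `C_centered` is a hypothetical column name used only for illustration.
df["C_centered"] = df["C"] - group_mean
print(df[["group", "C", "C_centered"]])
```

Because the output has the same length and index as the input, it can be assigned straight back to the frame without a merge.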

@chris-b1
Contributor

@jreback - I think it may make sense to allow non-reducing transforms from strings? Right now it already works with the cythonized methods, e.g. this works

df.groupby('group').C.transform('shift')

Just would need to catch the other "transforming" methods like ffill, bfill, etc.

@randomgambit
Author

randomgambit commented Sep 25, 2016

@chris-b1 I agree, and I also noticed that shift works. I think, for end users like me, the main issue is that the documentation about the differences between apply, transform, aggregate and filter is sometimes not very clear. That is why I want to write up a tutorial specifically about groupby.

@randomgambit
Author

randomgambit commented Sep 28, 2016

OK @jreback @chris-b1 now I am confused


df = pd.DataFrame({'group1': ['A', 'A', 'A', 'A',
                              'B', 'B', 'B', 'B'],
                   'group2': ['C', 'C', 'C', 'D',
                              'E', 'E', 'F', 'F'],
                   'B': ['one', np.nan, np.nan, np.nan,
                         np.nan, 'two', np.nan, np.nan],
                   'C': [np.nan, 1, np.nan, np.nan,
                         np.nan, np.nan, np.nan, 4],
                   'D': [1, 2, 3, 4, 5, 6, 7, 8]})

df
Out[17]: 
     B    C  D group1 group2
0  one  NaN  1      A      C
1  NaN  1.0  2      A      C
2  NaN  NaN  3      A      C
3  NaN  NaN  4      A      D
4  NaN  NaN  5      B      E
5  two  NaN  6      B      E
6  NaN  NaN  7      B      F
7  NaN  4.0  8      B      F

df['lag_C1']=df.groupby('group1').C.apply('shift',1)                        
df['lag_C2']=df.groupby('group1').C.transform('shift',1)                        
df['lag_C3']=df.groupby('group1').C.apply(lambda x: x.shift(1))   

the first version with apply('shift') recommended by @jreback above fails. What is the preferred way to shift a Series fast?

  File "<ipython-input-19-4c47c00fdeb6>", line 1, in <module>
    df['lag_C1']=df.groupby('group1').C.apply('shift',1)

  File "C:\Users\m1hxb02\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 645, in apply
    @wraps(func)

  File "C:\Users\m1hxb02\AppData\Local\Continuum\Anaconda2\lib\functools.py", line 33, in update_wrapper
    setattr(wrapper, attr, getattr(wrapped, attr))

AttributeError: 'str' object has no attribute '__module__'

Thanks!

@chris-b1
Contributor

df.groupby('group1').C.transform('shift',1)
and
df.groupby('group1').C.shift(1)

Are equivalent (first calls the second) and both take an optimized path.
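
A quick way to check this equivalence on an installed pandas version is sketched below, using only the `group1` and `C` columns of the frame from the comment above. The variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Only the columns needed for the comparison, taken from the frame above.
df = pd.DataFrame({
    "group1": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "C": [np.nan, 1, np.nan, np.nan, np.nan, np.nan, np.nan, 4],
})

# String form: extra positional arguments are forwarded to the named method.
via_transform = df.groupby("group1")["C"].transform("shift", 1)

# Direct call of the groupby method.
via_method = df.groupby("group1")["C"].shift(1)

# Series.equals treats NaN values in the same positions as equal.
print(via_transform.equals(via_method))
```

Both results keep the original index, so either one can be assigned directly to a new column such as `lag_C2`.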

@randomgambit
Author

thanks @chris-b1 ! so yet another version of the same transformation :)

is there a list somewhere of all the legit cythonized functions available in transform and aggregate? That would help reduce the uncertainty as in my post above, where naively using fillna caused terrible damage to my data ;-)

@rhshadrach
Member

The output of transform now agrees with fillna on master.
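
As a rough way to confirm this on an installed version, the sketch below rebuilds the frame from the original report and compares the two spellings. It uses the groupby `ffill()` method in place of `fillna(method='ffill')`, on the assumption that the two are interchangeable here; newer pandas releases deprecate the `method=` form.

```python
import numpy as np
import pandas as pd

# Frame from the original report.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "B": [np.nan, np.nan, np.nan, -4, -2, 5, 8, 7],
    "C": [-5, 5, -20, 0, np.nan, 5, 4, -4],
})

grouped_c = df.groupby("group")["C"]

# Forward fill within each group, called directly on the groupby object.
direct = grouped_c.ffill()

# The same operation requested by name through transform.
via_transform = grouped_c.transform("ffill")

# Compare values only, so a difference in the Series name does not matter.
print(direct.tolist() == via_transform.tolist())
```

A comparison like this is essentially what a regression test for this issue would assert.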

@rhshadrach rhshadrach added Needs Tests Unit test(s) needed to prevent regressions good first issue labels Nov 7, 2020
@geoffrey-eisenbarth
Contributor

@rhshadrach I'm interested in helping out with other issues now that my feet are wet.

Are you saying that the only thing needed to close this issue is a test that the fillna and transform operations used in the initial comment produce the same output?

@rhshadrach
Member

rhshadrach commented May 15, 2021

@geoffrey-eisenbarth It appears so. I'd also recommend searching the groupby tests for fillna used in transform - perhaps one already exists and this issue wasn't known about.

@mroeschke
Member

Looks like #24211 is the same issue and has a unit test so I think we are safe to close
