bug when filling missing values with transform? #14274

randomgambit · 2016-09-21T19:52:48Z

Hello there,

Consider this

import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'B': [np.nan, np.nan, np.nan, -4, -2, 5, 8, 7],
                   'C': [-5, 5, -20, 0, np.nan, 5, 4, -4]})

df
Out[13]: 
     B     C group
0  NaN  -5.0     A
1  NaN   5.0     A
2  NaN -20.0     A
3 -4.0   0.0     B
4 -2.0   NaN     B
5  5.0   5.0     B
6  8.0   4.0     B
7  7.0  -4.0     B

Now I want to forward-fill the missing values in C within each group:


df.groupby('group').C.fillna(method ='ffill')
Out[11]: 
0    -5.0
1     5.0
2   -20.0
3     0.0
4     0.0
5     5.0
6     4.0
7    -4.0
Name: C, dtype: float64

df.groupby('group').C.transform('ffill')
Out[12]: 
0   -5.0
1   -5.0
2   -5.0
3    5.0
4    5.0
5    5.0
6    5.0
7    5.0
dtype: float64

The transform output is wrong: it does not respect the groups and does not match the fillna result.
Is that expected? I am on pandas 0.18.1.
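
For reference, a minimal sketch of per-group forward filling, assuming a recent pandas release where the groupby `ffill()` method is available (the names `filled` and `via_lambda` are only illustrative). Passing an explicit callable to transform avoids the string-dispatch path discussed in this thread, though it is usually slower:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'B': [np.nan, np.nan, np.nan, -4, -2, 5, 8, 7],
    'C': [-5, 5, -20, 0, np.nan, 5, 4, -4],
})

# Forward-fill C within each group; the result keeps the original row index.
filled = df.groupby('group')['C'].ffill()

# Same operation through transform with an explicit callable.
via_lambda = df.groupby('group')['C'].transform(lambda s: s.ffill())

print(filled.tolist())
# [-5.0, 5.0, -20.0, 0.0, 0.0, 5.0, 4.0, -4.0]
```

Row 4 is the only missing value in C, and it is filled from row 3 because both rows belong to group B.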


jreback · 2016-09-21T22:21:05Z

yeah, this is being hit as a fast transformer; the issue is that the function that is called internally
returns a valid result that is not one scalar per group, so this should raise.

jreback · 2016-09-21T22:25:42Z

some groupby methods by definition give a transformed output (i.e. they are not reducers). So these shouldn't be allowed in .transform: e.g. things like reindex, ffill, bfill, nth (if a list is supplied, though you can't pass that directly anyhow), cum*, diff, shift, head, tail, rank (and maybe more)

e.g. with g = df.groupby('group').C:

In [4]: g.diff()
Out[4]: 
0     NaN
1    10.0
2   -25.0
3     NaN
4     NaN
5     NaN
6    -1.0
7    -8.0
Name: C, dtype: float64

In [5]: g.transform('diff')
Out[5]: 
0     NaN
1     NaN
2     NaN
3    10.0
4    10.0
5    10.0
6    10.0
7    10.0
Name: C, dtype: float64
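
To make the reducer/transformer distinction concrete, here is a small sketch, assuming `g` is `df.groupby('group')['C']` on the frame from the first comment (the names `reduced`, `broadcast` and `stepwise` are illustrative). A reducer such as `mean` yields one scalar per group, which `transform` broadcasts back onto the original rows, whereas `diff` already returns one value per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'C': [-5, 5, -20, 0, np.nan, 5, 4, -4],
})
g = df.groupby('group')['C']

reduced = g.mean()               # one row per group, indexed by group label
broadcast = g.transform('mean')  # group mean repeated on every original row
stepwise = g.diff()              # already row-aligned, nothing to broadcast

print(len(reduced), len(broadcast), len(stepwise))  # 2 8 8
```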

jreback · 2016-09-21T22:31:59Z

xref #9235

randomgambit · 2016-09-21T22:57:24Z

thanks jeff! although, if I remember correctly, you suggested in another post that I use
df.groupby('group').mycol.transform('shift',1) and that worked well (faster than with apply). Are you saying that I should not use transform with shift, and should use a good old apply instead?

randomgambit · 2016-09-22T00:16:26Z

here. #4095

but what worries me is that you are saying in this thread that transform should not be used with shift or other functions. What should I do?

jreback · 2016-09-22T10:41:18Z

.apply('shift', -1) is the correct idiom; I was mistaken before

randomgambit · 2016-09-22T11:58:45Z

ok got it... damn I need to make so many changes in my code... :(
so to sum up:

ONLY USE TRANSFORM WITH FUNCTIONS THAT REDUCE THE DATA

such as first, mean, size, etc

otherwise use apply

Is that correct?

Thx

chris-b1 · 2016-09-25T14:35:54Z

@jreback - I think it may make sense to allow non-reducing transforms from strings? Right now it already works with the cythonized methods, e.g. this works

df.groupby('group').C.transform('shift')

Just would need to catch the other "transforming" methods like ffill, bfill, etc.

randomgambit · 2016-09-25T17:32:00Z

@chris-b1 I agree, and I also noticed shift would work. I think, for end-users like me, the main issue is that the documentation on the differences between apply, transform, aggregate and filter is sometimes not very clear. That is why I wanna write up a tutorial specifically about groupby

randomgambit · 2016-09-28T19:59:32Z

OK @jreback @chris-b1 now I am confused


df = pd.DataFrame({'group1': ['A', 'A', 'A', 'A',
                              'B', 'B', 'B', 'B'],
                   'group2': ['C', 'C', 'C', 'D',
                              'E', 'E', 'F', 'F'],
                   'B': ['one', np.nan, np.nan, np.nan,
                         np.nan, 'two', np.nan, np.nan],
                   'C': [np.nan, 1, np.nan, np.nan,
                         np.nan, np.nan, np.nan, 4],
                   'D': [1, 2, 3, 4, 5, 6, 7, 8]})

df
Out[17]: 
     B    C  D group1 group2
0  one  NaN  1      A      C
1  NaN  1.0  2      A      C
2  NaN  NaN  3      A      C
3  NaN  NaN  4      A      D
4  NaN  NaN  5      B      E
5  two  NaN  6      B      E
6  NaN  NaN  7      B      F
7  NaN  4.0  8      B      F

df['lag_C1']=df.groupby('group1').C.apply('shift',1)                        
df['lag_C2']=df.groupby('group1').C.transform('shift',1)                        
df['lag_C3']=df.groupby('group1').C.apply(lambda x: x.shift(1))

the first version with apply('shift') recommended by @jreback above fails. What is the preferred way to shift a Series fast?

  File "<ipython-input-19-4c47c00fdeb6>", line 1, in <module>
    df['lag_C1']=df.groupby('group1').C.apply('shift',1)

  File "C:\Users\m1hxb02\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 645, in apply
    @wraps(func)

  File "C:\Users\m1hxb02\AppData\Local\Continuum\Anaconda2\lib\functools.py", line 33, in update_wrapper
    setattr(wrapper, attr, getattr(wrapped, attr))

AttributeError: 'str' object has no attribute '__module__'

Thanks!

chris-b1 · 2016-09-28T21:14:32Z

df.groupby('group1').C.transform('shift',1)
and
df.groupby('group1').C.shift(1)

are equivalent (the first calls the second), and both take an optimized path.
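
A quick way to check this equivalence on your own pandas version is sketched below, using a trimmed copy of the frame above (only `group1` and `C` are kept; `direct` and `via_transform` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'C': [np.nan, 1, np.nan, np.nan, np.nan, np.nan, np.nan, 4],
})

# Shift C down by one row within each group.
direct = df.groupby('group1')['C'].shift(1)

# Same shift requested by name through transform; extra positional
# arguments are forwarded to the named groupby method.
via_transform = df.groupby('group1')['C'].transform('shift', 1)

# Series.equals treats NaNs in matching positions as equal.
print(direct.equals(via_transform))
```

Both results are aligned to the original index, so either can be assigned straight back as a new column.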

randomgambit · 2016-09-28T21:20:18Z

thanks @chris-b1 ! so yet another version of the same transformation :)

is there a list somewhere of all the legit cythonized functions available in transform and aggregate? That would help reduce the uncertainty as in my post above, where naively using fillna caused terrible damage to my data ;-)

rhshadrach · 2020-11-07T15:41:06Z

The output of transform now agrees with fillna on master.

geoffrey-eisenbarth · 2021-05-15T00:44:20Z

@rhshadrach I'm interested in helping out with other issues now that my feet are wet.

Are you saying that the only thing needed to close this issue is a test that the fillna and transform operations used in the initial comment produce the same output?

rhshadrach · 2021-05-15T02:22:45Z

@geoffrey-eisenbarth It appears so. I'd also recommend searching the groupby tests for fillna used in transform - perhaps one already exists and this issue wasn't known about.
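
Such a regression test might look like the sketch below. The test name is hypothetical and this is not the test that exists in the pandas suite; it only assumes `pandas.testing.assert_series_equal` and the C column from the initial comment.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def test_transform_ffill_matches_groupby_ffill():
    # Hypothetical test name; data taken from the initial comment.
    df = pd.DataFrame({
        'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
        'C': [-5, 5, -20, 0, np.nan, 5, 4, -4],
    })
    gb = df.groupby('group')['C']

    result = gb.transform('ffill')
    expected = pd.Series(
        [-5.0, 5.0, -20.0, 0.0, 0.0, 5.0, 4.0, -4.0], name='C'
    )

    # Both the string-dispatched transform and the direct method
    # should match the hand-computed per-group forward fill.
    tm.assert_series_equal(result, expected)
    tm.assert_series_equal(gb.ffill(), expected)
```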

mroeschke · 2021-10-31T02:29:16Z

Looks like #24211 is the same issue and has a unit test so I think we are safe to close

jreback added the Bug, Groupby, Error Reporting (Incorrect or improved errors from pandas), and Difficulty Intermediate labels Sep 21, 2016

jreback added this to the Next Major Release milestone Sep 21, 2016

ghost mentioned this issue Jun 14, 2019

Wrong output of GroupBy transform with string input (e.g., transform('rank')) #22509

Closed

ghost mentioned this issue Jul 14, 2019

Discuss: transformation vs. aggregation in agg vs. transform #27389

Closed

ghost mentioned this issue Jul 25, 2019

BUG: groupby.transform(name) validates name is an aggregation #27597

Closed

jbrockmendel removed the Difficulty Intermediate label Oct 21, 2019

rhshadrach added the Needs Tests (Unit test(s) needed to prevent regressions) and good first issue labels Nov 7, 2020

mroeschke closed this as completed Oct 31, 2021
