API/BUG: .apply will correctly infer output shape when axis=1 #18577

jreback · 2017-11-30T14:08:17Z

closes #16353
closes #17348
closes #17437
closes #18573
closes #17970
closes #17892
closes #17602
closes #15628
closes #18775
closes #18901
closes #18919

This fixes apply to work correctly when the returned shape mismatches the original. It will try to set the indices if it possible. Setting to a list-like with axis=1 is now disallowed (but still possible if you operate row-wise). We were applying this inconsitently. This is of course a discouraged practice anyhow.

Prob should add some examples / update the doc-string a bit.

codecov · 2017-11-30T15:18:06Z

Codecov Report

Merging #18577 into master will decrease coverage by 0.02%.
The diff coverage is 80%.

@@            Coverage Diff             @@
##           master   #18577      +/-   ##
==========================================
- Coverage   91.45%   91.43%   -0.03%     
==========================================
  Files         157      157              
  Lines       51378    51392      +14     
==========================================
+ Hits        46987    46988       +1     
- Misses       4391     4404      +13

Flag	Coverage Δ
#multiple	`89.29% <80%> (-0.01%)`	⬇️
#single	`40.56% <36%> (-0.12%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.63% <80%> (-0.28%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c40c8f8...dce186a. Read the comment docs.

codecov · 2017-11-30T15:18:22Z

Codecov Report

Merging #18577 into master will decrease coverage by <.01%.
The diff coverage is 94.21%.

@@            Coverage Diff             @@
##           master   #18577      +/-   ##
==========================================
- Coverage    91.6%    91.6%   -0.01%     
==========================================
  Files         150      150              
  Lines       48750    48793      +43     
==========================================
+ Hits        44656    44695      +39     
- Misses       4094     4098       +4

Flag	Coverage Δ
#multiple	`89.97% <94.21%> (ø)`	⬆️
#single	`41.74% <47.93%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/formats/style.py	`96.22% <100%> (ø)`	⬆️
pandas/core/frame.py	`97.43% <100%> (ø)`	⬆️
pandas/core/sparse/frame.py	`94.82% <100%> (-0.03%)`	⬇️
pandas/core/apply.py	`96.77% <93.75%> (-2.66%)`	⬇️
pandas/util/testing.py	`83.85% <0%> (+0.2%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 983d71f...1d93380. Read the comment docs.

jreback · 2017-12-01T11:31:08Z

cc @TomAugspurger @jorisvandenbossche @chris-b1
cc @jschendel

jreback · 2017-12-07T11:23:01Z

@TomAugspurger @jorisvandenbossche if you have a chance

chris-b1 · 2017-12-07T15:05:48Z

Unfortunately I'm guessing this will break someone's code in an subtle way, but current behavior is obviously bad, so seems like a net positive.

I've never used it so not sure I fully understand the itention, but this does seem to break the reduce keyword, docstring says "If reduce is True a Series will always be returned, and if False a DataFrame will always be returned."

In [23]: pd.__version__
Out[23]: '0.22.0.dev0+310.gf6f0371'

In [24]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=False)
Out[24]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [25]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=True)
Out[25]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3


In [92]: pd.__version__
Out[92]: '0.21.0'

In [93]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=False)
Out[93]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [94]: df.apply(lambda x: (1, 2, 3), axis=1, reduce=True)
Out[94]: 
0    (1, 2, 3)
1    (1, 2, 3)
2    (1, 2, 3)
3    (1, 2, 3)
4    (1, 2, 3)
5    (1, 2, 3)
dtype: object

jorisvandenbossche · 2017-12-07T15:12:16Z

@jreback I will try to review this tomorrow

jreback · 2017-12-07T18:50:29Z

i think i can simply deprecate reduce

let me see

jreback · 2017-12-08T02:15:01Z

reduce= is deprecated, but I think we should actually figure out the use case and fix it. IIRC its if the passed apply function returns an empty, what do you return (a Series or a DataFrame). but need to sort thru some issues. maybe I will remove this and we can try later. (its the last commit and orthogonal).

jschendel · 2017-12-08T08:16:51Z

doc/source/whatsnew/v0.22.0.txt

+.. ipython:: python
+
+   df = pd.DataFrame([[1,2], [1,2]], columns=['a','b'])
+


Would be nice to display df here to show what it is prior to the apply, similar to what was done in the example above.

jschendel · 2017-12-08T08:21:15Z

doc/source/whatsnew/v0.22.0.txt

+    df.apply(lambda x: [1, 2, 3], axis=1)
+    df.apply(lambda x: [1, 2], axis=1)
+
+The returned input will also *not* return a Series with the list-wrapper as previously.


Could use code formatting here: ``Series``

jschendel · 2017-12-08T08:22:32Z

doc/source/whatsnew/v0.22.0.txt

+   dtype: object
+
+
+New Behavior. The behaviour is consistent.


Nitpick: The second "behaviour" is using British spelling, but the first isn't.

jschendel · 2017-12-08T08:23:18Z

doc/source/whatsnew/v0.22.0.txt

+   dtype: object
+
+
+New Behaviour


Nitpick: British spelling

jschendel · 2017-12-08T08:24:37Z

pandas/core/frame.py

@@ -4818,6 +4818,8 @@ def apply(self, func, axis=0, broadcast=False, raw=False, reduce=None,
            while guessing, exceptions raised by func will be ignored). If
            reduce is True a Series will always be returned, and if False a
            DataFrame will always be returned.
+
+            .. deprecated:: 0.22.0


Should this be mentioned in the whatsnew?

jreback · 2017-12-10T16:26:54Z

I reverted the deprecation off reduce, I can deprecate but its complicated to actually replace it.

jreback · 2017-12-10T16:28:42Z

thanks for comments @jschendel

jorisvandenbossche · 2017-12-11T13:14:55Z

I didn't look at any code yet, but some general comments from experimenting with the branch:

I am big +1 on making those things more consistent, but I think we should provide an alternative to achieve the original (in the cases where it changes), i.e. have a way to specify you want no inference on the result, just put the result as a scalar in a single cell.
We should also add documentation that explains the rules here to make this more predictable.
Column names of the result: using the example of the whatsnew, there is still an inconsistency in the resulting column names:
```
In [1]: df = pd.DataFrame(np.random.randn(4, 3), columns=['A', 'B', 'C'])

In [2]: df.apply(lambda x: [1, 2, 3], axis=1)
Out[2]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3

In [3]: df.apply(lambda x: [1, 2], axis=1)
Out[3]: 
   0  1
0  1  2
1  1  2
2  1  2
3  1  2
```
So depending on the length of the list, it reuses the original column names or not. I would personally be in favor of fixing this inconsistency to not reuse the column names (although that would be an additional breaking change for the first case).
Related, when returning a Series it no longer uses the explicit given names of the result, if the length of the result matches that of the original frame:
```
In [10]: df.apply(lambda x: pd.Series([1, 2], index=['test', 'other']), 1)
Out[10]: 
   test  other
0     1      2
1     1      2
2     1      2
3     1      2

In [11]: df.apply(lambda x: pd.Series([1, 2, 3], index=['test', 'other', 'cols']), 1)
Out[11]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
```
So this last example ([11]) is a regression, as before it correctly returned the specified new names.

raw=True seems to change some of the inference, while, based on the explanation in the docstring, it should only change the values passed to the function and not the inference afterwards on the result of the function.

In [24]: df.apply(lambda x: (1, 2), axis=1)
Out[24]: 
   0  1
0  1  2
1  1  2
2  1  2
3  1  2

In [25]: df.apply(lambda x: (1, 2), axis=1, raw=True)
Out[25]: 
0    (1, 2)
1    (1, 2)
2    (1, 2)
3    (1, 2)
dtype: object

Regarding the example of @chris-b1 above (API/BUG: .apply will correctly infer output shape when axis=1 #18577 (comment)), so for tuples this change is breaking the behaviour always (regardless of the length of the tuple). Before, tuples were always regarded as 'scalars', unless you specified reduce=False.
I am not fully sure I like this change, and as said above I think we need an alternative if keeping to this change.

An illustration of where this "now always try to unpack iterable things like tuples into columns" would cause a problem in a real world use case:

This branch:

In [53]: from shapely.geometry import MultiLineString

In [54]: def f(row):
    ...:     return MultiLineString([[(1,2), (3,4)], [(3,4), (4,2)]])

In [55]: df.apply(f, axis=1)
Out[55]: 
                       0                      1
0  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
1  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
2  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)
3  LINESTRING (1 2, 3 4)  LINESTRING (3 4, 4 2)

with 0.21.1 it preserves the returned objects in a single series (the desired result):

In [94]: df.apply(f, axis=1)
Out[94]: 
0    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
1    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
2    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
3    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
4    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
5    (LINESTRING (1 2, 3 4), LINESTRING (3 4, 4 2))
dtype: object

I also noticed that the behaviour when returning arrays changed (to be consistent with lists, so probably the change is good), but just noticing it to make sure it is tested (previously it raised an error when the shape didn't match (eg lambda x: np.array([1, 2]) with the example above), now it behaves like a list).

jorisvandenbossche · 2017-12-11T13:21:51Z

Contemplating my above comments, I think I would be in favor of (1) "no inference" for general iterable things (objects, tuples, ..), and (2) a clear way to enable / disable inference.

The switch to enable/disable inference could have a default depending on the result type, eg by default infer (so not putting the result as a scalar in a single cell) for Series objects and maybe arrays, and by default no inference for tuples, dicts, generic objects, and probably lists as well. When doing this, I think the main breakage is the case of a "list with length equal to number of columns of original frame", but in that case we could even catch this case and raise a FutureWarning.

Note I didn't look at all examples from the issues yet, so I might be missing some cases.

jreback · 2017-12-11T14:40:24Z

so we need some sort of a parameter to allow inference on the returned result. This is analgous to specifying an output dtype & (shape) in np.vectorize maybe something like:

output=infer|scalar|reduce (and can remove the reduce= & raw= kwargs as well

jorisvandenbossche · 2017-12-14T12:23:03Z

(and can remove the reduce= & raw= kwargs as well

I think raw is still orthogonal (it is to determine which type if passed to the function, not related to what is returned by the function).
But broadcast could maybe be integrated.

jreback · 2018-02-06T23:21:06Z

If we decide to not broadcast a Series to the original layout, I think we should at least raise an error for that case, and not silently ignore the result_type keyword you passed.
But it is a breaking change compared to 0.22 (although, I suppose this is rather a corner case .. returning series and using broadcast=True)

this already raises (an incorrect shape whether the result is a list of Series), added a test. (matches 0.22.0)

jorisvandenbossche · 2018-02-07T10:37:10Z

Yes, if it has an incorrect shape, then it already raises (just like lists). But my example has a correct length, only different index names.

Using your example above:

In [1]: df = pd.DataFrame(np.tile(np.arange(3), 6).reshape(6, -1) + 1, columns=['A', 'B', 'C'])
   ...: 

In [2]: df.apply(lambda x: [1, 2, 3], axis=1, result_type='broadcast')
Out[2]: 
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

In [3]: df.apply(lambda x: Series([1, 2, 3], index=list('abc')), axis=1, result_type='broadcast')
Out[3]: 
   a  b  c
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
4  1  2  3
5  1  2  3

I think the above is not the desired result. IMO one of the goals of result_type='broadcast should be to ensure that you can be 100% sure what the result will look like (i.e. dataframe with the original column names).
I don't really care about this case, so raising an error instead of preserving the original column names when the returned Series has different index names (the current behaviour on master for broadcast=True) is fine for me as well. I just think we should not silently ignore result_type='broadcast'. If you want the result as shown above, that is what the default of result_type=None already does, as well as the explicit result_type='expand'.

closes pandas-dev#16353 closes pandas-dev#17348 closes pandas-dev#17437 closes pandas-dev#18573 closes pandas-dev#17970 closes pandas-dev#17892 closes pandas-dev#17602 closes pandas-dev#18775 closes pandas-dev#18901 closes pandas-dev#18919

jreback · 2018-02-07T10:55:42Z

you think that [3] should actually be [2], IOW we ignore if the resulting index (if any) on a broadcastable? ok I think that's fine.

jreback · 2018-02-07T11:00:07Z

ok fixed up.

jorisvandenbossche · 2018-02-07T13:07:48Z

I think that is our record of closed issues in one go :-)

Thanks @jreback !

jorisvandenbossche · 2018-02-07T14:20:44Z

I opened some follow-up issues: #19570, #19571
And a PR with some more doc clean-ups: #19573

…-dev#18577) closes pandas-dev#16353 closes pandas-dev#17348 closes pandas-dev#17437 closes pandas-dev#18573 closes pandas-dev#17970 closes pandas-dev#17892 closes pandas-dev#17602 closes pandas-dev#18775 closes pandas-dev#18901 closes pandas-dev#18919

…ndas-dev#19573)

jreback added API Design Apply Apply, Aggregate, Transform, Map Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 30, 2017

jreback added this to the 0.22.0 milestone Nov 30, 2017

jreback force-pushed the apply branch 3 times, most recently from c369687 to f6f0371 Compare December 7, 2017 11:19

jreback force-pushed the apply branch from f6f0371 to 7105339 Compare December 8, 2017 00:20

jschendel reviewed Dec 8, 2017

View reviewed changes

doc/source/whatsnew/v0.22.0.txt Outdated

dtype: object

New Behaviour

Copy link

Member

jschendel Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: British spelling

jschendel reviewed Dec 8, 2017

View reviewed changes

jreback force-pushed the apply branch from 7105339 to 38c40f4 Compare December 10, 2017 16:26

jreback mentioned this pull request Dec 13, 2017

CLN: factor apply out of frame.py #18754

Merged

jreback added 11 commits February 7, 2018 05:53

tests for dicts and result_type='infer'

534e65d

clean up broadcast tests

24c1b6e

rename infer to expand

4789970

deprecate reduce

914e136

validate result_type kwarg

ad9cbd9

docs

6e1819a

doc

1363c03

broadcast should preserve like-columns

dc6ed74

update doc-string

69084ea

test for Series with broadcast=True, raises when passed incorrect shape

8a1837c

jreback force-pushed the apply branch from 922bbc4 to 1d93380 Compare February 7, 2018 10:59

broadcast always returns original column names

1d93380

jorisvandenbossche merged commit 6b0c7e7 into pandas-dev:master Feb 7, 2018

This was referenced Feb 7, 2018

API: should apply also follow result_type for axis=0 ? #19570

Open

API: should apply(..., result_type='reduce') be honored for Series return value #19571

Open

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Feb 7, 2018

DOC: some clean-up of the apply docs (follow-up pandas-dev#18577)

552cc43

jorisvandenbossche mentioned this pull request Feb 7, 2018

DOC: some clean-up of the apply docs (follow-up #18577) #19573

Merged

jreback pushed a commit that referenced this pull request Feb 8, 2018

DOC: some clean-up of the apply docs (follow-up #18577) (#19573)

d0c13a3

jorisvandenbossche mentioned this pull request Feb 23, 2018

API: update Series.apply to follow changes in DataFrame.apply #19866

Open

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018

DOC: some clean-up of the apply docs (follow-up pandas-dev#18577) (pa…

a10f2e0

…ndas-dev#19573)

jbrockmendel mentioned this pull request Oct 15, 2019

DEPR: remove previously-deprecated broadcast/reduce kwargs from DataFrame.apply #29017

Merged

jreback mentioned this pull request Nov 20, 2019

DEPR: deprecations log for removed issues #13777

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API/BUG: .apply will correctly infer output shape when axis=1 #18577

API/BUG: .apply will correctly infer output shape when axis=1 #18577

jreback commented Nov 30, 2017 •

edited

Loading

codecov bot commented Nov 30, 2017 •

edited

Loading

codecov bot commented Nov 30, 2017 •

edited

Loading

jreback commented Dec 1, 2017

jreback commented Dec 7, 2017

chris-b1 commented Dec 7, 2017

jorisvandenbossche commented Dec 7, 2017

jreback commented Dec 7, 2017

jreback commented Dec 8, 2017

jschendel Dec 8, 2017

jschendel Dec 8, 2017

jschendel Dec 8, 2017 •

edited

Loading

jschendel Dec 8, 2017

jschendel Dec 8, 2017

jreback commented Dec 10, 2017

jreback commented Dec 10, 2017

jorisvandenbossche commented Dec 11, 2017

jorisvandenbossche commented Dec 11, 2017

jreback commented Dec 11, 2017 •

edited

Loading

jorisvandenbossche commented Dec 14, 2017

jreback commented Feb 6, 2018 •

edited

Loading

jorisvandenbossche commented Feb 7, 2018

jreback commented Feb 7, 2018

jreback commented Feb 7, 2018

jorisvandenbossche commented Feb 7, 2018

jorisvandenbossche commented Feb 7, 2018

		.. ipython:: python

		df = pd.DataFrame([[1,2], [1,2]], columns=['a','b'])

API/BUG: .apply will correctly infer output shape when axis=1 #18577

API/BUG: .apply will correctly infer output shape when axis=1 #18577

Conversation

jreback commented Nov 30, 2017 • edited Loading

codecov bot commented Nov 30, 2017 • edited Loading

Codecov Report

codecov bot commented Nov 30, 2017 • edited Loading

Codecov Report

jreback commented Dec 1, 2017

jreback commented Dec 7, 2017

chris-b1 commented Dec 7, 2017

jorisvandenbossche commented Dec 7, 2017

jreback commented Dec 7, 2017

jreback commented Dec 8, 2017

jschendel Dec 8, 2017

Choose a reason for hiding this comment

jschendel Dec 8, 2017

Choose a reason for hiding this comment

jschendel Dec 8, 2017 • edited Loading

Choose a reason for hiding this comment

jschendel Dec 8, 2017

Choose a reason for hiding this comment

jschendel Dec 8, 2017

Choose a reason for hiding this comment

jreback commented Dec 10, 2017

jreback commented Dec 10, 2017

jorisvandenbossche commented Dec 11, 2017

jorisvandenbossche commented Dec 11, 2017

jreback commented Dec 11, 2017 • edited Loading

jorisvandenbossche commented Dec 14, 2017

jreback commented Feb 6, 2018 • edited Loading

jorisvandenbossche commented Feb 7, 2018

jreback commented Feb 7, 2018

jreback commented Feb 7, 2018

jorisvandenbossche commented Feb 7, 2018

jorisvandenbossche commented Feb 7, 2018

jreback commented Nov 30, 2017 •

edited

Loading

codecov bot commented Nov 30, 2017 •

edited

Loading

codecov bot commented Nov 30, 2017 •

edited

Loading

jschendel Dec 8, 2017 •

edited

Loading

jreback commented Dec 11, 2017 •

edited

Loading

jreback commented Feb 6, 2018 •

edited

Loading