API: make the func in Series.apply always operate on the Series #52140
Comments
Somewhat related: #49673. But I think these changes make sense in combination with that issue.

I think this sounds great - I've always disliked the difference between `apply` and `map`. If we were to implement #35725 (comment), then I think there would be no difference between
Is this necessary? I think this is just

This isn't necessary, but we could also rename

I suspect the deprecation of this will not be straightforward because |
Yes, my thought is that this is related to #49673, but not the same. Both will make the code base clearer. |
I don't have a super strong opinion about this parameter, maybe it's superfluous, it's just that if we keep it, it fits better in
If the reception to this PR was positive, I was actually planning on following up with proposing to change
Possibly yeah. But it could also go the other way (fixing this would make #49673 easier:-)). Both work towards the same goal (simpler code paths) from different directions, so there could be mutual benefits in both directions. |
@pandas-dev/pandas-core, any comments/objections? I'll start on this soon and would appreciate some early comments, if you see things differently than me... |
This seems nice from a technical perspective but I don't know that a lot of users know/care much about I'm guessing apply is used more often. Not blocking any progression on this, but I have a concern about setting expectations that end users have some knowledge of the internal value storage |
Thanks @WillAyd - I missed this. Shouldn't |
You are right @WillAyd, I just checked for My intention was to make |
I think this warrants a PDEP. Right now, |
The deprecation process would of course need to be clear in this regard, so that will be part of the PR. |
That's why I think this should be a PDEP. I think part of the issue here is that the string version of the arguments (e.g., "sum", "min") operate differently than the ufuncs ('np.sum', 'np.min') |
@Dr-Irv - can you give a bit more detail here? What about the deprecation process makes you think this? Is it just that expectation that |
Yes, that is my concern. This is a proposed change in behavior, and I think it warrants a PDEP because of that. Most of the way the issue was written at the top would form the basis of the PDEP, but then we can have a more formal process for discussion and approval. |
Could you give an example where a user should do ser.apply(func) instead of func(ser) in the target state? |
I don't believe there is a case where there is a difference in behavior. However I think we should have similar operations on Series, DataFrame, SeriesGroupBy, DataFrameGroupBy where it makes sense, and there is one case where I think it does help users (albeit minor):
|
I've updated the proposal to address @WillAyd's comments, so that the proposal is now that

IMO if the concern is the deprecation process and not the change itself, a PDEP is too much, because we will be adding deprecation warnings, and nothing will break in 2.x. So the users can change their code at their own pace (likely when they look through the log and see the warning). |
Given that you are proposing a change in behavior (which would require a deprecation cycle anyway), I believe that a PDEP is warranted. |
This is way too broad a standard for when a PDEP is needed. |
Please see the proposed change to PDEP-1, starting at line 132 here: https://github.com/pandas-dev/pandas/pull/51417/files There are a lot of examples there, and the particular case described here falls into a case where I think a PDEP is warranted. I would ask others involved in the governance discussions (@jorisvandenbossche and @MarcoGorelli ) to weigh in with their opinion as well. |
Something to consider: if

On a DataFrame I see the benefit because it provides a convenient way to apply it across either of the two axes. But on a Series, with a single axis,

You can argue that |
Hi @matteosantama,

Yeah, originally my idea was that

However, there is still a significant difference to |
I do agree with @Dr-Irv that it would be nice to have this in the PDEP format. There is likely to be feedback / revisions to the proposal, and having it in the PDEP format makes that easier to track and iterate on than editing the OP |
The proposed change here is that in pandas 3.0, instead of

```python
def apply_standard(self) -> Series:
    if isinstance(f, np.ufunc):
        with np.errstate(all="ignore"):
            return f(obj)
    mapped = obj._map_values(mapper=f)
    return obj._constructor(mapped, index=obj.index).__finalize__(obj, method="apply")
```

it will be:

```python
def apply_standard(self) -> Series:
    with np.errstate(all="ignore"):
        mapped = f(obj)
    return obj._constructor(mapped, index=obj.index).__finalize__(obj, method="apply")
```

Are we sure this warrants a PDEP? I can write up a PDEP if that's required, but it seems maybe heavy to me process-wise.

BTW, I've added a new example in the list above (at 1.5, because it's a bit similar to point 1) discussing callables compared to lists/dicts of callables. |
IMHO, it requires a PDEP because you are changing the behavior of the API, not just the implementation. For example, you wrote: This is a big change in behavior that may affect many users, and that's why I think a PDEP is warranted. |
I agree with all the points mentioned in the OP in the I do not agree this requires a PDEP as it only affects the |
This proposal will actually not change |
I've added an additional example of inconsistencies (see example 7): giving dictlikes vs. listlikes as the argument to `Series.apply`. |
First, thanks for thinking about this in such detail

Second, sorry for not having joined the conversation til now

But third, I very much think this requires a PDEP. It's a huge change, even if it only affects the APIs of two functions
quick example (there's literally tonnes more... it's almost harder to find a kaggle notebook which doesn't have this kind of pattern...)

```python
import re

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

df['text'] = df['text'].apply(lambda x: remove_URL(x))
```

So yes, +1 for a PDEP, a change this big needs visibility |
Personally, I don’t think that this needs a PDEP, the problem is very simple, but I think that this needs a vote on whether we want to deprecate apply operating elemwise. |
I use that pattern all the time. I think part of the confusion here is the difference between

Maybe what is really needed here is to introduce new names to clarify behavior. We should have one name that means "apply_function_to_each_element_row". We should have another name that means "apply_function_to_the_series". Leave the existing names ( |
I am -1 on that, too noisy for users, I don’t think that it’s worth it |
A few comments from me about use cases:

Regarding the example from @MarcoGorelli, it is the same as using

Also, IMO there is not a need to duplicate the functionality of

In my suggestion the deprecation message will direct users to use

Also, something I haven't highlighted enough is the current difference between

Inversely, if a function has been developed for

So my suggestion will also streamline
Lines 956 to 980 in fc30823
Note especially line 973. If you think I'm wrong here, can you please explain? |
I agree. Also, this issue has been hanging for quite a while, so a vote would be nice in order to get a decision on this issue. We could include a vote on whether a PDEP needs to be written, if that's desired (though I don't know if we have procedures for multiple choice votes?). |
-0.15 on this needing a pdep, +1 on the change @rhshadrach described at the sprint |
@rhshadrach, could you maybe summarize the conclusions from the sprint wrt. this issue? |
Internally, yes, but I would argue that semantically there is another way to think about it.

```python
>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df
   x  y
0  1  4
1  2  5
2  3  6
>>> df["x"].apply(lambda v: 10*v)
0    10
1    20
2    30
Name: x, dtype: int64
>>> df.apply(lambda r: r["x"] + r["y"], axis=1)
0    5
1    7
2    9
dtype: int64
```

The pattern shown in

Put on the hat of a user. As a user, I find this way of looking at

I go back to my earlier suggestion. If the goal is to make the semantic meanings clear, keep the current behavior, and introduce new methods that use the new behavior with clear documentation as to these differences. |
For what it's worth, although Styler does not deal with Series, it takes pains to document that

This is one of the more frequent errors on stack overflow in questions. I think it helps that now:

```python
df = pd.DataFrame([1, 2])
df.style.map(lambda v: "color: red;" if v > 1 else "")        # works
df.style.apply(lambda s: np.where(s > 1, "color: red;", ""))  # works
df.style.apply(lambda v: "color: red;" if v > 1 else "")      # fails. The truth value of a Series is ambiguous.
```

I support the change because I think trying to unify the mindset and function use of users is overall positive. +0 on requiring a PDEP, but the initial reaction from more than a couple of members for its need suggests it should be considered? |
I too support the change, I'd just like it to have a little more visibility - if it's a pdep, then it's more likely to be shared / talked about at conferences. If others don't want to (which is absolutely fine!), I could do the work of writing up the document |
But those are not the same and have the same general issues I've lined up, i.e. the

@MarcoGorelli: I've made a sketch for a PDEP and will push it today. |
I think of it that

In thinking about this more, because the behavior of

The advantage of totally deprecating |
I've suggested something similar here - shall we move the conversation there, now that there's a pdep? |
Ok for me. @jbrockmendel mentioned though in a comment a discussion about this at the sprint. I wasn't at the sprint, so am not up to date with that discussion. Could someone talk a bit about that, especially if there was a consensus about a path forward? |
I think the change is what's described in the pdep |
During the sprint, I described the issues with the current behavior of DataFrame.apply, Series.apply, DataFrame.agg, and Series.agg and how the code was intertwined. The consensus I took away was that this is an issue we'd like to fix but we are concerned with how noisy of a change this would be. |
I've lately worked on making `Series.map` simpler as part of implementing the `na_action` parameter on all `ExtensionArray.map` methods. As part of that, I made #52033. That PR (and the current `SeriesApply.apply_standard` more generally) very clearly shows how `Series.apply` & `Series.map` are very similar, but different enough for it to be confusing when it's a good idea to use one over the other, and when `Series.apply` especially is a bad idea to use.

I propose doing some changes in how `Series.apply` works when given a single callable. This change is somewhat fundamental, so I understand that this can be controversial, but I believe that this change will be for the better for Pandas. I'm of course ready for discussion and possibly (but hopefully not 😄) disagreement. We'll see.

I'll show the proposal below. First I'll show what the similarities and differences are between the two methods, then what the problem is in my view with the current API, and then my proposed solution.
Similarities and differences between `Series.apply` and `Series.map`

The similarity between the methods is especially that they both fall back to use `Series._map_values` and there use `algorithms.map_array` or `ExtensionArray.map` as relevant.

The differences are many, but each one is relatively minor:

- `Series.apply` has a `convert_dtype` parameter, which `Series.map` doesn't
- `Series.map` has a `na_action` parameter, which `Series.apply` doesn't
- `Series.apply` can take advantage of numpy ufuncs, which `Series.map` can't
- `Series.apply` can take `args` and `**kwargs`, which `Series.map` can't
- `Series.apply` will return a DataFrame if its result is a listlike of Series, which `Series.map` won't
- `Series.apply` is more general and can take a string, e.g. `"sum"`, or lists or dicts of inputs, which `Series.map` can't

Also, `Series.apply` is a bit of a parent method of `Series.agg` & `Series.transform`.

The problems

The above similarities and many minor differences make for (IMO) confusing and too complex rules for when it's a good idea to use `.apply` over `.map` to do operations, and vice versa. I will show some examples below.

First some setup:
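The examples below assume a small series and a large series along these lines (a minimal sketch; the exact data is not important, only that `large_ser` is big enough for performance differences to show):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3])
large_ser = pd.Series(np.arange(1_000_000))
```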
1: string vs numpy funcs in `Series.apply`

It will surprise new users that these two give different results. Also, anyone using the second pattern is probably making a mistake.
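A minimal sketch of the difference (behavior as of pandas 2.0; `ser` as in the setup above):

```python
ser.apply("sum")   # -> 6: the string alias aggregates the whole Series
ser.apply(np.sum)  # -> Series([1, 2, 3]): np.sum is not a ufunc, so it is called once per element
```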
Note that giving `np.sum` to `DataFrame.apply` aggregates properly:
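A sketch of the `DataFrame.apply` behavior referred to above (each column is passed whole to `np.sum`):

```python
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.apply(np.sum)
# a     6
# b    15
# dtype: int64
```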
1.5 Callables vs. list/dict of callables (added 2023-04-07)

Also with non-numpy callables:
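A sketch of the kind of difference meant here (behavior as of pandas 2.0; the particular lambda is an assumption, chosen so the element-wise path fails loudly):

```python
ser.apply(lambda x: x.median())    # element-wise: fails, np.int64 has no .median attribute
ser.apply([lambda x: x.median()])  # wrapped in a list: the lambda receives the whole Series -> 2.0
```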
In both cases above the difference is that `Series.apply` operates element-wise if given a callable, but series-wise if given a list/dict of callables.

2. Functions in `Series.apply` (& `Series.transform`)

The `Series.apply` doc string has examples using lambdas, but lambdas in `Series.apply` are a bad practice because of bad performance:
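An illustrative comparison (no timings shown; `large_ser` as in the setup sketch — the point is that the lambda is invoked once per element):

```python
large_ser.apply(lambda x: x + 1)  # slow: one Python-level call per element
large_ser + 1                     # fast: a single vectorized operation
```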
Currently, `Series` does not have a method that makes a callable operate on a series' data. Instead users need to use `Series.pipe` for that operation in order for the operation to be efficient:
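A sketch of the `pipe` pattern meant here (the callable receives the whole Series, so the addition stays vectorized):

```python
large_ser.pipe(lambda x: x + 1)  # fast: equivalent to large_ser + 1
```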
(The reason for the above performance difference is that `apply` gets called on each single element, while `pipe` calls `x.__add__(1)`, which operates on the whole array.)

Note also that `.pipe` operates on the `Series` while `apply` currently operates on each element in the data, so there are some differences that may have consequences in some cases.

Also notice that `Series.transform` has the same performance problems:
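A sketch of the same problem with `Series.transform` (as of pandas 2.0 a plain callable ends up going through the element-wise apply path):

```python
large_ser.transform(lambda x: x + 1)  # slow: the lambda is called once per element
```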
3. ufuncs in `Series.apply` vs. in `Series.map`

Performance-wise, ufuncs are fine in `Series.apply`, but not in `Series.map`:
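An illustrative pair (`np.exp` is a ufunc; behavior as of pandas 2.0):

```python
large_ser.apply(np.exp)  # fast: apply hands the whole array to the ufunc, i.e. np.exp(large_ser)
large_ser.map(np.exp)    # slow: map calls np.exp once per element
```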
It's difficult for users to understand why one is fast and the other slow (answer: only `apply` correctly works with ufuncs).

It is also difficult to understand why ufuncs are fast in `apply`, while other callables are slow in `apply` (answer: it's because ufuncs operate on the whole array, while other callables operate elementwise).

4. callables in `Series.apply` are bad, callables in `SeriesGroupby.apply` are fine

I showed above that using (non-ufunc) callables in `Series.apply` is bad performance-wise. OTOH using them in `SeriesGroupby.apply` is fine:
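An illustrative groupby counterpart (the grouping key here is arbitrary; in `SeriesGroupby.apply` the callable receives each group as a Series, so a vectorized function runs once per group rather than once per element):

```python
large_ser.groupby(large_ser % 10).apply(lambda x: x + 1)  # x is a whole group (a Series)
large_ser.apply(lambda x: x + 1)                          # x is a single element
```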
Note that most of the time in the groupby was used doing groupby ops, so the actual difference in the `apply` op is much larger, and similar to example 2 above.

Having callables be OK to use in the `SeriesGroupby.apply` method, but not in `Series.apply`, is confusing IMO.

5: callables in `Series.apply` that return Series transform data to a DataFrame

`Series.apply` has an exception that if the callable returns a list-like of Series, the Series will be concatenated to a DataFrame. This is a very slow operation and hence generally a bad idea:
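A small sketch of the concatenation behavior described above (as of pandas 2.0):

```python
pd.Series([1, 2, 3]).apply(lambda x: pd.Series({"a": x, "b": x + 1}))
#    a  b
# 0  1  2
# 1  2  3
# 2  3  4
```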
It's probably never a good idea to use this pattern; `.pipe` is much faster, e.g. `large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x + 1}))`. If we really do need to operate on single elements in that fashion it is still possible using `pipe`, e.g. `large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x.map(some_func)}))`, and also just directly `pd.DataFrame({"a": large_ser, "b": large_ser.map(some_func)})`.

So giving callables that return `Series` to `Series.apply` is a bad pattern and should be discouraged. (If users really want to do that pattern, they should build the list of Series themselves and take responsibility for the slowdown.)
6. `Series.apply` vs. `Series.agg`

The doc string for `Series.agg` says about the method's `func` parameter: "If a function, must ... work when passed ... to Series.apply". But compare these:
You could argue the doc string is correct (it doesn't raise...), but you could also argue it isn't (because the results are different). I'd personally expect "must work when passed to Series.apply" to mean "gives the same result when passed to `agg` and to `apply`".

7. dictlikes vs. listlikes in `Series.apply` (added 2023-06-04)

Giving a list of transforming arguments to `Series.apply` returns a `DataFrame`:
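A sketch with a single transforming string argument (as of pandas 2.0; `ser` as in the setup above):

```python
ser.apply(["cumsum"])
#    cumsum
# 0       1
# 1       3
# 2       6
```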
But giving a dict of transforming arguments returns a `Series` with a `MultiIndex`:
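The dict counterpart of the same sketch (as of pandas 2.0):

```python
ser.apply({"x": "cumsum"})
# x  0    1
#    1    3
#    2    6
# dtype: int64
```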
These two should give same-shaped output for consistency. Using `Series.transform` instead of `Series.apply`, it returns a `DataFrame` in both cases, and I think the dictlike example above should return a `DataFrame` similar to the listlike example.

Minor additional info: listlikes and dictlikes of aggregation arguments do behave the same, so this is only a problem with dictlikes of transforming arguments when using `apply`.

Proposal
With the above in mind, I propose that:
- `Series.apply` takes callables that always operate on the series, i.e. let `series.apply(func)` be similar to `func(series)` + the needed additional functionality.
- `Series.map` takes callables that operate on each element individually, i.e. `series.map(func)` will be similar to the current `series._map_values(func)` + the needed additional functionality.
- `convert_dtype` will be deprecated in `Series.apply` (already done in DEPR: Deprecate the convert_dtype param in Series.Apply #52257). `convert_dtype` will NOT be added to `Series.map` (see the comment by @rhshadrach).
- The ability of `Series.apply` to convert a `list[Series]` to a DataFrame will be deprecated (already done in DEPR: Deprecate returning a DataFrame in SeriesApply.apply_standard #52123). Converting a `list[Series]` to a DataFrame will NOT be added to `Series.map`.
- The changes to `Series.apply` will propagate to `Series.agg` and `Series.transform`.

The difference between `Series.apply()` & `Series.map()` will then be that:

- `Series.apply()` makes the passed-in callable operate on the series, similarly to how `(DataFrame|SeriesGroupby|DataFrameGroupBy).apply` operate on series. This is very fast and can do almost anything.
- `Series.map()` makes the passed-in callable operate on each of the series' data elements individually. This is very flexible, but can be very slow, so it should only be used if `Series.apply` can't do it.
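A small sketch of the proposed target state (this is the proposal, not current pandas behavior):

```python
ser.apply(lambda s: s + 1)  # proposed: the callable receives the whole Series -> vectorized
ser.map(lambda x: x + 1)    # the callable receives each element -> flexible but slow
```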
So, IMO, this API change will help make the pandas `Series.(apply|map)` API simpler without losing functionality, and let their functionality be explainable in a simple manner, which would be a win for Pandas.

Deprecation process
The cumbersome part of the deprecation process will be to change `Series.apply` to only work array-wise, i.e. to do `func(series._values)` always. This can be done by adding an `array_ops_only` parameter to `Series.apply`, so:
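A hypothetical sketch of that transition (the `array_ops_only` name comes from the text above; the default and the warning behavior shown in the comments are assumptions):

```python
func = lambda s: s + 1

ser.apply(func)                       # 2.x: keeps the old element-wise behavior, with a deprecation warning
ser.apply(func, array_ops_only=True)  # 2.x: opts in to the future behavior, i.e. func(ser._values)
```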
and then change the meaning of that parameter in pandas v3.0 again to make people remove it from their code.

The other changes are easier: `convert_dtype` in `Series.apply` will be deprecated just like you would normally for method parameters. The ability to convert a list of Series to a DataFrame will emit a deprecation warning when that code path is encountered.
will be deprecated just like you would normally for method parameters. The ability to convert a list of Series to a DataFrame will emit a deprecation warning, when that code path is encountered.The text was updated successfully, but these errors were encountered: