
API: make the func in Series.apply always operate on the Series #52140

Open
topper-123 opened this issue Mar 23, 2023 · 42 comments
Labels
Apply Apply, Aggregate, Transform, Map

Comments

@topper-123
Contributor

topper-123 commented Mar 23, 2023

I've lately been working on making Series.map simpler as part of implementing na_action on all ExtensionArray.map methods. As part of that, I made #52033. That PR (and the current SeriesApply.apply_standard more generally) shows very clearly that Series.apply & Series.map are very similar, yet different enough that it is confusing when it's a good idea to use one over the other, and especially when Series.apply is a bad idea to use.

I propose some changes to how Series.apply works when given a single callable. This change is somewhat fundamental, so I understand that it may be controversial, but I believe it will be for the better for pandas. I'm of course ready for discussion and possibly (but hopefully not 😄 ) disagreement. We'll see.

I'll show the proposal below. First I'll show the similarities and differences between the two methods, then what I see as the problems with the current API, and then my proposed solution.

Similarities and differences between Series.apply and Series.map

The main similarity between the methods is that they both fall back to Series._map_values, which in turn uses algorithms.map_array or ExtensionArray.map as relevant.

The differences are many, but each one is relatively minor:

  1. Series.apply has a convert_dtype parameter, which Series.map doesn't
  2. Series.map has a na_action parameter, which Series.apply doesn't
  3. Series.apply can take advantage of numpy ufuncs, which Series.map can't
  4. Series.apply can take args and **kwargs, which Series.map can't
  5. Series.apply will return a DataFrame if its result is a list-like of Series, which Series.map won't
  6. Series.apply is more general and can take a string, e.g. "sum", or lists or dicts of inputs, which Series.map can't.

Also, Series.apply is a bit of a parent method of Series.agg & Series.transform.
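Differences 2 and 4 above can be seen in a small sketch (assuming current pandas 2.x behavior; illustrative values only):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])

# Difference 2: Series.map has na_action, Series.apply does not.
# With na_action="ignore", the NaN is passed through untouched.
mapped = ser.map(lambda x: x + 1, na_action="ignore")

# Difference 4: Series.apply accepts args/**kwargs for the callable.
applied = ser.apply(lambda x, inc: x + inc, args=(1,))
```

Both calls produce [2.0, NaN, 4.0] here; the point is which knobs each method exposes, not the result.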

The problems

The above similarities and many minor differences make for (IMO) confusing and overly complex rules for when it's a good idea to use .apply over .map, and vice versa. I will show some examples below.

First some setup:

>>> import numpy as np
>>> import pandas as pd 
>>>
>>> small_ser = pd.Series([1, 2, 3])
>>> large_ser = pd.Series(range(100_000))

1: string vs numpy funcs in Series.apply

>>> small_ser.apply("sum")
6
>>> small_ser.apply(np.sum)
0    1
1    2
2    3
dtype: int64

It will surprise new users that these two give different results. Also, anyone using the second pattern is probably making a mistake.

Note that giving np.sum to DataFrame.apply aggregates properly:

>>> small_ser.to_frame().apply(np.sum)
0    6
dtype: int64

1.5 Callables vs. list/dict of callables (added 2023-04-07)

>>> small_ser.apply(np.sum)
0    1
1    2
2    3
dtype: int64
>>> small_ser.apply([np.sum])
sum    6
dtype: int64

Also with non-numpy callables:

>>> small_ser.apply(lambda x: x.sum())
AttributeError: 'int' object has no attribute 'sum'
>>> small_ser.apply([lambda x: x.sum()])
<lambda>    6
dtype: int64

In both cases above, the difference is that Series.apply operates element-wise if given a callable, but series-wise if given a list/dict of callables.

2. Functions in Series.apply (& Series.transform)

The Series.apply doc string has examples using lambdas, but lambdas in Series.apply are a bad practice because of bad performance:

>>> %timeit large_ser.apply(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop

Currently, Series does not have a method that makes a callable operate on a series' data. Instead, users need to use Series.pipe for that operation in order for it to be efficient:

>>> %timeit large_ser.pipe(lambda x: x + 1)
44 µs ± 363 ns per loop

(The reason for the above performance difference is that apply calls the function on each single element, while pipe calls x.__add__(1), which operates on the whole array.)

Note also that .pipe operates on the Series while apply currently operates on each element in the data, so there are some differences that may have consequences in some cases.
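For a simple elementwise function the two patterns give identical results; only the mechanism (and hence the speed) differs. A quick check, not a benchmark:

```python
import pandas as pd

ser = pd.Series(range(1000))

elementwise = ser.apply(lambda x: x + 1)   # callable invoked once per element
whole_series = ser.pipe(lambda s: s + 1)   # callable invoked once, on the Series

assert elementwise.equals(whole_series)
```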

Also notice that Series.transform has the same performance problems:

>>> %timeit large_ser.transform(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop

3. ufuncs in Series.apply vs. in Series.map

Performance-wise, ufuncs are fine in Series.apply, but not in Series.map:

>>> %timeit large_ser.apply(np.sqrt)
71.6 µs ± 1.17 µs per loop
>>> %timeit large_ser.map(np.sqrt)
63.9 ms ± 69.5 µs per loop

It's difficult for users to understand why one is fast and the other slow (answer: only apply works correctly with ufuncs).

It is also difficult to understand why ufuncs are fast in apply, while other callables are slow in apply (answer: it's because ufuncs operate on the whole array, while other callables operate elementwise).
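The mechanism can be shown without timings (a sketch, assuming current pandas 2.x behavior, where apply special-cases np.ufunc inputs):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 4.0, 9.0])

# apply hands the whole Series to the ufunc: one vectorized call...
via_apply = ser.apply(np.sqrt)
# ...equivalent to calling the ufunc directly on the Series:
direct = np.sqrt(ser)
# map instead calls np.sqrt once per element, hence the slowdown.
via_map = ser.map(np.sqrt)

assert via_apply.equals(direct)
```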

4. callables in Series.apply are bad, callables in SeriesGroupBy.apply are fine

I showed above that using (non-ufunc) callables in Series.apply is bad performance-wise. OTOH, using them in SeriesGroupBy.apply is fine:

>>> %timeit large_ser.apply(lambda x: x + 1)
24.3 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit large_ser.groupby(large_ser > 50_000).apply(lambda x: x + 1)
11.3 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that most of the time in the groupby example is spent doing groupby ops, so the actual difference in the apply op is much larger, and similar to example 2 above.

Having callables be OK to use in the SeriesGroupBy.apply method, but not in Series.apply, is confusing IMO.

5: callables in Series.apply that return Series transform data to a DataFrame

Series.apply has an exception: if the callable returns a list-like of Series, the Series will be concatenated into a DataFrame. This is a very slow operation and hence generally a bad idea:

>>> small_ser.apply(lambda x: pd.Series([x, x+1], index=["a", "b"]))
   a  b
0  1  2
1  2  3
2  3  4
>>> %timeit large_ser.apply(lambda x: pd.Series([x, x+1]))
# timing takes too long to measure

It's probably never a good idea to use this pattern, and e.g. .pipe is much faster, so e.g. large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x+1})) will be much faster. If we really do need to operate on single elements in that fashion, it is still possible using pipe, e.g. large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x.map(some_func)})), or just directly pd.DataFrame({"a": large_ser, "b": large_ser.map(some_func)}).

So giving callables that return Series to Series.apply is a bad pattern and should be discouraged. (If users really want that pattern, they should build the list of Series themselves and take responsibility for the slowdown.)
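The faster whole-array alternative mentioned above, spelled out (a sketch with an illustrative elementwise increment rather than the earlier some_func):

```python
import pandas as pd

large_ser = pd.Series(range(100_000))

# Build the DataFrame in two whole-array operations instead of
# constructing one tiny Series per element:
df = large_ser.pipe(lambda s: pd.DataFrame({"a": s, "b": s + 1}))
```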

6. Series.apply vs. Series.agg

The doc string for Series.agg says about the method's func parameter: "If a function, must ... work when passed ... to Series.apply". But compare these:

>>> small_ser.apply(np.sum)
0    1
1    2
2    3
dtype: int64
>>> small_ser.agg(np.sum)
6

You could argue the doc string is correct (it doesn't raise...), but you could also argue it isn't (because the results are different). I'd personally expect "must work when passed to Series.apply" to mean "gives the same result when passed to agg and to apply".

7. dictlikes vs. listlikes in Series.apply (added 2023-06-04)

Giving a list of transforming arguments to Series.apply returns a DataFrame:

>>> small_ser.apply(["sqrt", np.abs])
       sqrt  absolute
0  1.000000         1
1  1.414214         2
2  1.732051         3

But giving a dict of transforming arguments returns a Series with a MultiIndex:

>>> small_ser.apply({"sqrt": "sqrt", "abs": np.abs})
sqrt  0    1.000000
      1    1.414214
      2    1.732051
abs   0    1.000000
      1    2.000000
      2    3.000000
dtype: float64

These two should give same-shaped output for consistency. Series.transform, in contrast, returns a DataFrame in both cases, and I think the dict-like example above should return a DataFrame similar to the list-like example.

Minor additional info: list-likes and dict-likes of aggregation arguments do behave the same, so this is only a problem with dict-likes of transforming arguments when using apply.
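For comparison, Series.transform already gives same-shaped output for both input shapes (a sketch, assuming current pandas 2.x behavior):

```python
import numpy as np
import pandas as pd

small_ser = pd.Series([1, 2, 3])

# Both of these return a DataFrame with one column per function.
from_list = small_ser.transform([np.sqrt, np.abs])
from_dict = small_ser.transform({"sqrt": np.sqrt, "abs": np.abs})
```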

Proposal

With the above in mind, I propose that:

  1. Series.apply takes callables that always operate on the series. I.e. let series.apply(func) be similar to func(series) + the needed additional functionality.
  2. Series.map takes callables that operate on each element individually. I.e. series.map(func) will be similar to the current series._map_values(func) + the needed additional functionality.
  3. The parameter convert_dtype will be deprecated in Series.apply (already done in DEPR: Deprecate the convert_dtype param in Series.Apply #52257).
  4. A parameter convert_dtype will NOT be added to Series.map (comment by @rhshadrach).
  5. The ability in Series.apply to convert a list[Series] to a DataFrame will be deprecated (already done in DEPR: Deprecate returning a DataFrame in SeriesApply.apply_standard #52123).
  6. The ability to convert a list[Series] to a DataFrame will NOT be added to Series.map.
  7. The changes made to Series.apply will propagate to Series.agg and Series.transform.

The difference between Series.apply() & Series.map() will then be that:

  • Series.apply() makes the passed-in callable operate on the series, similarly to how (DataFrame|SeriesGroupBy|DataFrameGroupBy).apply operate on series. This is very fast and can do almost anything,
  • Series.map() makes the passed-in callable operate on each of the series' data elements individually. This is very flexible, but can be very slow, so it should only be used if Series.apply can't do it.

So, IMO, this API change will make the pandas Series.(apply|map) API simpler without losing functionality, and let their functionality be explained in a simple manner, which would be a win for pandas.
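In the proposed target state, the single-callable case reduces to roughly the following (a hypothetical sketch, not the actual implementation; apply_proposed and map_proposed are illustrative names, and the wrapping functionality is omitted):

```python
import pandas as pd

def apply_proposed(ser, func, *args, **kwargs):
    # Proposed Series.apply: hand the whole Series to func (fast).
    return func(ser, *args, **kwargs)

def map_proposed(ser, func):
    # Proposed Series.map: call func once per element (flexible, slow).
    return ser.map(func)

ser = pd.Series([1, 2, 3])

# Elementwise and series-wise increments agree for this callable...
assert apply_proposed(ser, lambda s: s + 1).equals(map_proposed(ser, lambda x: x + 1))
# ...and aggregating callables now work in apply as well.
assert apply_proposed(ser, lambda s: s.sum()) == 6
```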

Deprecation process

The cumbersome part of the deprecation process will be to change Series.apply to only work array-wise, i.e. to always do func(series._values). This can be done by adding an array_ops_only parameter to Series.apply, so:

def apply(self, ..., array_ops_only: bool | NoDefault = no_default, ...):
    if array_ops_only is no_default:
        warn("....")
        array_ops_only = False
    ...

and then change the meaning of that parameter again in pandas v3.0, so people remove it from their code.

The other changes are easier: convert_dtype in Series.apply will be deprecated just as you normally would for method parameters. The ability to convert a list of Series to a DataFrame will emit a deprecation warning when that code path is encountered.

@rhshadrach
Member

Somewhat related: #49673. But I think these changes make sense in combination with that issue.

I think this sounds great - I've always disliked the difference between SeriesGroupBy.apply and Series.apply and this also clears that up.

If we were to implement #35725 (comment), then I think there would be no difference between Series.agg and Series.apply (but I'm not certain of this). I don't think that's an issue - I'd rather have consistent behavior even if the implementations collide in the 1-d case.

4. A parameter convert_dtype will be added to Series.map.

Is this necessary? I think this is just ser.map(...).convert_dtypes(), which is not much more verbose than using an argument.

This isn't necessary, but we could also rename DataFrame.applymap to DataFrame.map as this would have a similar meaning to Series.map (act on values individually).

I suspect the deprecation of this will not be straightforward because apply internally sometimes uses agg and agg sometimes uses apply. This would be fixed by #49673.

@jbrockmendel jbrockmendel added the Apply Apply, Aggregate, Transform, Map label Mar 24, 2023
@topper-123
Contributor Author

Yes, my thought is that this is related to #49673, but not the same. Both will make the code base clearer.

4. A parameter convert_dtype will be added to Series.map.

Is this necessary? I think this is just ser.map(...).convert_dtypes(), which is not much more verbose than using an argument.

I don't have a super strong opinion about this parameter, maybe it's superfluous; it's just that if we keep it, it fits better in Series.map than in Series.apply in the new setup. But notice also that .apply(convert_dtype=...) is not related to ser.apply().convert_dtypes(); they're entirely different concepts. series.apply(convert_dtype=False) just means that pandas won't try to convert the internal ndarray(dtype=object) to the dtype of the calling series.

This isn't necessary, but we could also rename DataFrame.applymap to DataFrame.map as this would have a similar meaning to Series.map (act on values individually).

If the reception to this PR was positive, I was actually planning on following up with proposing to change DataFrame.applymap -> DataFrame.map. Nice that you agree with that.

I suspect the deprecation of this will not be straightforward because apply internally sometimes uses agg and agg sometimes uses apply. This would be fixed by #49673.

Possibly yeah. But it could also go the other way (fixing this would make #49673 easier:-)). Both work towards the same goal (simpler code paths) from different directions, so there could be mutual benefits in both directions.

@topper-123
Contributor Author

@pandas-dev/pandas-core, any comments/objections? I'll start on this soon and would appreciate some early comments, if you see things differently than me...

@WillAyd
Member

WillAyd commented Mar 28, 2023

  1. Series.apply takes callables that always operate on the array. I.e. let series.apply(func) be similar to func(series._values)

This seems nice from a technical perspective but I don't know that a lot of users know/care much about ._values; I'm personally not even sure of all the rules for what ._values brings.

I'm guessing apply is used more often. Not blocking any progression on this, but I have a concern about setting expectations that end users have some knowledge of the internal value storage

@rhshadrach
Member

Thanks @WillAyd - I missed this. Shouldn't series.apply(func) operate as func(series)? I believe all other methods like this - apply (in groupby), agg, transform - act on pandas objects (Series / DataFrame).

@topper-123
Contributor Author

topper-123 commented Mar 29, 2023

You are right @WillAyd, I just checked for DataFrame.apply and it does indeed work on series and not series data.

My intention was to make apply operate the same way for Series as for the other data structures, and I somehow thought the others operated on values. So of course, if the others operate on Series, IMO this one should too. This will still achieve the goal of having .apply operate on the whole series (i.e. it's fast), while .map operates elementwise (i.e. is potentially slow).

@Dr-Irv
Contributor

Dr-Irv commented Mar 29, 2023

@pandas-dev/pandas-core, any comments/objections? I'll start on this soon and would appreciate some early comments, if you see things differently than me...

I think this warrants a PDEP. Right now, Series.apply(lambda x: x+1) applies the function to each element of a series. This can be very useful. If you change it so that the callable is applied to entire series, then the result could be ambiguous.

@topper-123
Contributor Author

Series.apply(lambda x: x+1) applies the function to each element of a series. This can be very useful. If you change it so that the callable is applied to entire series, then the result could be ambiguous.

Series.apply(lambda x: x + 1) is already synonymous with Series.map(lambda x: x + 1) today, so we'd just direct users to use Series.map if they want to operate on each element. In practice, they'll probably actually want the series operation, because it is so much faster while giving the same result in many (most?) cases.

The deprecation process would of course need to be clear in this regard, so that will be part of the PR.
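That the two elementwise behaviors coincide today can be checked directly:

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# For a scalar -> scalar callable, apply and map agree element for element.
assert ser.apply(lambda x: x + 1).equals(ser.map(lambda x: x + 1))
```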

@Dr-Irv
Contributor

Dr-Irv commented Mar 29, 2023

The deprecation process would of course need to be clear in this regard, so that will be part of the PR.

That's why I think this should be a PDEP.

I think part of the issue here is that the string versions of the arguments (e.g. "sum", "min") operate differently than the ufuncs (np.sum, np.min)

@rhshadrach
Member

rhshadrach commented Mar 30, 2023

That's why I think this should be a PDEP.

@Dr-Irv - can you give a bit more detail here? What about the deprecation process makes you think this? Is it just that expectation that Series.apply is likely used by many users?

@Dr-Irv
Contributor

Dr-Irv commented Mar 30, 2023

That's why I think this should be a PDEP.

@Dr-Irv - can you give a bit more detail here? What about the deprecation process makes you think this? Is it just that expectation that Series.apply is likely used by many users?

Yes, that is my concern. This is a proposed change in behavior, and I think it warrants a PDEP because of that.

Most of the way the issue was written at the top would form the basis of the PDEP, but then we can have a more formal process for discussion and approval.

@phofl
Member

phofl commented Mar 30, 2023

Could you give an example where a user should do ser.apply(func) instead of func(ser) in the target state?

@rhshadrach
Member

I don't believe there is a case where there is a difference in behavior. However I think we should have similar operations on Series, DataFrame, SeriesGroupBy, DataFrameGroupBy where it makes sense, and there is one case where I think it does help users (albeit minor):

def foo(obj: pd.Series | pd.DataFrame):
    result = obj.apply(bar)
    return result

@topper-123
Contributor Author

topper-123 commented Mar 31, 2023

I've updated the proposal to address @WillAyd's comments, so that the proposal is now that Series.apply will operate on the Series instead of Series._values.

IMO if the concern is the deprecation process and not the change itself, a PDEP is too much, because we will be adding deprecation warnings, and nothing will break in 2.x. So the users can change their code at their own pace (likely when they look through the log and see the warning).

@Dr-Irv
Contributor

Dr-Irv commented Mar 31, 2023

IMO if the concern is the deprecation process and not the change itself, a PDEP is too much, because we will be adding deprecation warnings, and nothing will break in 2.x. So the users can change their code at their own pace (likely when they look through the log and see the warning).

Given that you are proposing a change in behavior (which would require a deprecation cycle anyway), I believe that a PDEP is warranted.

@jbrockmendel
Member

Given that you are proposing a change in behavior (which would require a deprecation cycle anyway), I believe that a PDEP is warranted.

This is way too broad a standard for when a PDEP is needed.

@Dr-Irv
Contributor

Dr-Irv commented Mar 31, 2023

Given that you are proposing a change in behavior (which would require a deprecation cycle anyway), I believe that a PDEP is warranted.

This is way too broad a standard for when a PDEP is needed.

Please see the proposed change to PDEP-1, starting at line 132 here: https://github.com/pandas-dev/pandas/pull/51417/files

There are a lot of examples there, and the particular case described here falls into a case where I think a PDEP is warranted.

I would ask others involved in the governance discussions (@jorisvandenbossche and @MarcoGorelli ) to weigh in with their opinion as well.

@matteosantama
Contributor

Something to consider: if series.apply(func) effectively becomes func(series), wouldn't it be identical to series.pipe(func)? In which case, I wonder why keep series.apply() around at all?

On a DataFrame I see the benefit because it provides a convenient way to apply it across either of the two axes. But on a Series, with a single axis, .apply() seems redundant.

You can argue that Series.apply() allows for string arguments like "sum" but I think Series.agg() is more appropriate here anyway.

@topper-123
Contributor Author

Hi @matteosantama,

Yeah, originally my idea was that series.apply(func) would be equivalent to func(series._values) + wrapping functionality, but as pointed out by @WillAyd, it is more consistent to have it mean func(series) + wrapping functionality. That makes it very close to Series.pipe, and maybe even equivalent if there won't be any wrapping functionality.

However, there is still a significant difference from Series.pipe in that Series.apply will do many other things than apply a single callable, for example take strings, or lists and dicts, e.g. Series.apply([np.sum, "mean"]), so IMO this overlap with Series.pipe is not too bad, because we want the method to be consistent with DataFrame.apply.

@WillAyd
Member

WillAyd commented Apr 3, 2023

I do agree with @Dr-Irv that it would be nice to have this in the PDEP format. There is likely to be feedback / revisions to the proposal, and having it in the PDEP format makes that easier to track and iterate on than editing the OP

@topper-123
Contributor Author

topper-123 commented Apr 7, 2023

The proposed change here is that in pandas 3.0, instead of SeriesApply.apply_standard being:

def apply_standard(self) -> Series:
    if isinstance(f, np.ufunc):
        with np.errstate(all="ignore"):
            return f(obj)
    mapped = obj._map_values(mapper=f)
    return obj._constructor(mapped, index=obj.index).__finalize__(obj, method="apply")

it will be:

def apply_standard(self) -> Series:
    with np.errstate(all="ignore"):
        mapped = f(obj)
    return obj._constructor(mapped, index=obj.index).__finalize__(obj, method="apply")

Are we sure this warrants a PDEP? I can write up a PDEP if that's required, but seems maybe heavy to me process-wise.

I've BTW added a new example in the list above (at 1.5 because it's a bit similar to point 1) discussing callables compared to list/dicts of callables.

@Dr-Irv
Contributor

Dr-Irv commented Apr 7, 2023

Are we sure this warrants a PDEP? I can write up a PDEP if that's required, but seems maybe heavy to me process-wise.

IMHO, it requires a PDEP because you are changing the behavior of the API, not just the implementation. For example, you wrote:
"Series.apply takes callables that always operate on the series. I.e. let series.apply(func) be similar to func(series) + the needed additional functionality."

This is a big change in behavior that may affect many users, and that's why I think a PDEP is warranted.

@mroeschke
Member

I agree with all the points mentioned in the OP in the Proposal section.

I do not agree this requires a PDEP as it only affects the Series.apply and Series.map APIs. If it affected all apply and map APIs I would think it would require a PDEP

@topper-123
Contributor Author

topper-123 commented Apr 7, 2023

This proposal will actually not change Series.map. It was originally point 4 in the proposal to add a parameter convert_dtype to Series.map, but that was dropped after comments.

@topper-123
Contributor Author

I've added an additional example of inconsistencies (see example 7), this time about giving dict-likes vs. list-likes as argument to Series.apply.

@MarcoGorelli
Member

First, thanks for thinking about this in such detail

Second, sorry for not having joined the conversation til now

But third, I very much think this requires a PDEP. It's a huge change, even if it only affects the APIs of two functions

series.apply(lambda x: elementwise_function(x)) is an incredibly common pattern.

quick example (there's literally tonnes more...it's almost harder to find a kaggle notebook which doesn't have this kind of pattern...)

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

df['text']=df['text'].apply(lambda x : remove_URL(x))

So yes, +1 for a PDEP, a change this big needs visibility

@phofl
Member

phofl commented Aug 10, 2023

Personally, I don’t think that this needs a PDEP, the problem is very simple, but I think that this needs a vote on whether we want to deprecate apply operating elemwise.

@Dr-Irv
Contributor

Dr-Irv commented Aug 10, 2023

But third, I very much think this requires a PDEP. It's a huge change, even if it only affects the APIs of two functions

series.apply(lambda x: elementwise_function(x)) is an incredibly common pattern.

I use that pattern all the time.

I think part of the confusion here is the difference between DataFrame.apply(func, axis=1) and DataFrame.apply(func, axis=0). I use the former a lot to have a function apply to each row of a DataFrame. So when I do Series.apply(), I'm thinking of it element by element. The question is whether Series.apply(func) should be like DataFrame.apply(func, axis=1) (my preference) versus DataFrame.apply(func, axis=0)

Maybe what is really needed here is to introduce new names to clarify behavior. We should have one name that means "apply_function_to_each_element_row". We should have another name that means "apply_function_to_the_series". Leave the existing names (apply and map) with the existing behavior, and promote usage of the new names. Then deprecate the old names. Make the naming consistent between DataFrame, Series, SeriesGroupBy and DataFrameGroupBy so that the intentions are clear.

@phofl
Member

phofl commented Aug 11, 2023

I am -1 on that, too noisy for users, I don’t think that it’s worth it

@topper-123
Contributor Author

topper-123 commented Aug 15, 2023

A few comments from me about use cases:

Regarding the example from @MarcoGorelli: it is the same as using map (i.e. df['text'] = df['text'].map(remove_URL)), because Series.apply currently falls back to using Series.map.

Also, IMO there is no need to duplicate the functionality of Series.map in Series.apply, and it will be clearer for users to only have one method that maps a callable to each element. An advantage that Series.map has over Series.apply is the na_action parameter, so users can do the common pattern df['text'].map(remove_URL, na_action="ignore"), which is very often useful and not available when using Series.apply.

In my suggestion the deprecation message will direct users to use Series.map if they want to operate elementwise.
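Applied to the earlier remove_URL example, the map version with na_action might look like this (a sketch; remove_URL is as defined in @MarcoGorelli's comment, and the example series is invented for illustration):

```python
import re

import numpy as np
import pandas as pd

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

ser = pd.Series(["see https://example.com now", np.nan])

# na_action="ignore" passes missing values through instead of
# handing NaN to remove_URL (which would raise a TypeError).
cleaned = ser.map(remove_URL, na_action="ignore")
```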

Also, something I haven't highlighted enough is the current difference between Series.apply and DataFrame.apply: a very frustrating situation IMO happens when users develop something to work with Series.apply and then, when using it in DataFrame.apply, it works differently or fails (because Series.apply operates elementwise and DataFrame.apply operates on the whole series).

Conversely, if a function has been developed for DataFrame.apply and users give it to Series.apply, pandas will sometimes fail/raise an error, or give the correct result but much slower than when using DataFrame.apply.

So my suggestion will also streamline Series.apply to work the same way DataFrame.apply currently works.

@Dr-Irv:

I think part of the confusion here is the difference between DataFrame.apply(func, axis=1) and DataFrame.apply(func, axis=0). I use the former a lot to have a function apply to each row of a DataFrame. So when I do Series.apply(), I'm thinking of it element by element. The question is whether Series.apply(func) should be like DataFrame.apply(func, axis=1) (my preference) versus DataFrame.apply(func, axis=0)

DataFrame.apply(func, axis=0) & DataFrame.apply(func, axis=1) both operate on the whole series, see:

pandas/pandas/core/apply.py

Lines 956 to 980 in fc30823

def apply_standard(self):
    results, res_index = self.apply_series_generator()

    # wrap results
    return self.wrap_results(results, res_index)

def apply_series_generator(self) -> tuple[ResType, Index]:
    assert callable(self.func)

    series_gen = self.series_generator
    res_index = self.result_index

    results = {}

    with option_context("mode.chained_assignment", None):
        for i, v in enumerate(series_gen):
            # ignore SettingWithCopy here in case the user mutates
            results[i] = self.func(v, *self.args, **self.kwargs)
            if isinstance(results[i], ABCSeries):
                # If we have a view on v, we need to make a copy because
                # series_generator will swap out the underlying data
                results[i] = results[i].copy(deep=False)

    return results, res_index

Note especially line 973 (results[i] = self.func(v, *self.args, **self.kwargs)). If you think I'm wrong here, can you please explain?

@topper-123
Contributor Author

Personally, I don’t think that this needs a PDEP, the problem is very simple, but I think that this needs a vote on whether we want to deprecate apply operating elemwise.

I agree. Also, this issue has been hanging for quite a while, so a vote would be nice in order to get a decision on this issue. We could include a vote on whether a PDEP needs to be written, if that's desired (though I don't know if we have procedures for multiple-choice votes?).

@jbrockmendel
Member

-0.15 on this needing a pdep, +1 on the change @rhshadrach described at the sprint

@topper-123
Contributor Author

@rhshadrach, could you maybe summarize the conclusions from the sprint wrt. this issue?

@Dr-Irv
Contributor

Dr-Irv commented Aug 16, 2023

I think part of the confusion here is the difference between DataFrame.apply(func, axis=1) and DataFrame.apply(func, axis=0). I use the former a lot to have a function apply to each row of a DataFrame. So when I do Series.apply(), I'm thinking of it element by element. The question is whether Series.apply(func) should be like DataFrame.apply(func, axis=1) (my preference) versus DataFrame.apply(func, axis=0)

DataFrame.apply(func, axis=0) & DataFrame.apply(func, axis=1) both operate on the whole series, see:

Internally, yes, but I would argue that semantically there is another way to think about it.

>>> df = pd.DataFrame({"x":[1,2,3],"y":[4,5,6]})
>>> df
   x  y
0  1  4
1  2  5
2  3  6
>>> df["x"].apply(lambda v: 10*v)
0    10
1    20
2    30
Name: x, dtype: int64
>>> df.apply(lambda r: r["x"] + r["y"], axis=1)
0    5
1    7
2    9
dtype: int64

The pattern shown in df["x"].apply(lambda v: 10*v) is applying an operation to each element of the series df["x"].
Another interpretation is that the pattern is applying that operation to each ROW of the series df["x"].
Similarly, the pattern df.apply(lambda r: r["x"] + r["y"], axis=1) is applying that operation to each ROW of the DataFrame df

Put on the hat of a user. As a user, I find this way of looking at Series.apply(func) and df.apply(func, axis=1) being similar to be very convenient.

I go back to my earlier suggestion. If the goal is to make the semantic meanings clear, keep the current behavior, and introduce new methods that use the new behavior with clear documentation as to these differences.

@attack68
Contributor

For what it's worth, although Styler does not deal with Series, it takes pains to document that map is element-wise and apply is column-wise or row-wise, and that the return value of each user function must be either a single value or a sequence. (https://pandas.pydata.org/docs/dev/user_guide/style.html#Acting-on-Data)
This is relevant because styling functions are often some form of conditional or categorisation.

This is one of the more frequent errors in Stack Overflow questions. I think it helps that applymap is now map, since it provides more of a distinction.

df = pd.DataFrame([1,2])
df.style.map(lambda v: "color: red;" if v > 1 else "")  # works
df.style.apply(lambda s: np.where(s>1, "color: red;", "")) # works
df.style.apply(lambda v: "color: red;" if v > 1 else "")  # fails. The truth value of a Series is ambiguous. 
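The same scalar-vs-Series mismatch can be reproduced outside Styler with plain pandas and NumPy. This is a minimal sketch (the reproduction outside Styler is my own, not from the thread) of the failure mode described above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2])

# element-wise: the function receives scalars, so `if v > 1` is fine
elementwise = s.map(lambda v: "color: red;" if v > 1 else "")

# the same scalar-style conditional applied to a whole Series raises,
# because `if s > 1` asks for the truth value of a boolean Series
try:
    _ = "color: red;" if s > 1 else ""
except ValueError as exc:
    err = str(exc)  # "The truth value of a Series is ambiguous. ..."

# the series-wise version must be vectorized, e.g. with np.where
serieswise = np.where(s > 1, "color: red;", "")
```

The styling functions in the Styler snippet above hit exactly this distinction: map wants the first (scalar) form, apply wants the last (vectorized) form.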

I support the change because I think trying to unify the mindset and function use of users is overall positive.

+0 on requiring a PDEP, but the initial reaction from more than a couple of members suggests it should be considered?

@MarcoGorelli
Member

I too support the change; I'd just like it to have a little more visibility - if it's a PDEP, then it's more likely to be shared / talked about at conferences. If others don't want to (which is absolutely fine!), I could do the work of writing up the document.

@topper-123
Contributor Author

@Dr-Irv

I find this way of looking at Series.apply(func) and df.apply(func, axis=1) being similar to be very convenient.

But those are not the same, and they have the same general issues I've outlined, i.e. Series.apply operates element-wise, while df.apply(..., axis=1) operates series-wise.

@MarcoGorelli: I've made a sketch for a PDEP and will push it today.

@Dr-Irv
Contributor

Dr-Irv commented Aug 25, 2023

@Dr-Irv

I find this way of looking at Series.apply(func) and df.apply(func, axis=1) being similar to be very convenient.

But those are not the same and have the same general issues I've lined up, i.e. the Series.apply operates element-wise, while df.apply(..., axis=1) operates series-wise.

I think of it that Series.apply() operates element-wise, which can be thought of as operating on every "row" of a Series, while DataFrame.apply(..., axis=1) operates on every row of a DataFrame. The difference is that the function sfunc() passed to Series.apply(sfunc) takes a scalar as input, while the function dfunc() passed to DataFrame.apply(dfunc, axis=1) takes a Series as input.
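A small sketch of that distinction (the names sfunc/dfunc come from the comment above; the sample data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

def sfunc(v):
    # Series.apply hands the function one scalar element at a time
    assert not isinstance(v, pd.Series)
    return v * 10

def dfunc(r):
    # DataFrame.apply(..., axis=1) hands the function one row (a Series) at a time
    assert isinstance(r, pd.Series)
    return r["x"] + r["y"]

out_elems = df["x"].apply(sfunc)      # element-wise over df["x"]
out_rows = df.apply(dfunc, axis=1)    # row-wise over df
```

Under the proposal in this issue, Series.apply would instead pass the whole Series to the function, like DataFrame.apply(..., axis=0) passes whole columns.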

Thinking about this more: because the behavior of Series.apply() depends on the function passed in, I'd suggest deprecating the use of apply() throughout the entire API, and creating a new method name, maybe invoke(), that does what you are proposing apply() should do. Then we tell people to stop using apply() altogether, because the word is ambiguous and the behavior is inconsistent.

The advantage of totally deprecating apply() is that people will eventually be forced to change their code, whereas if you just deprecate the behavior, it won't be as clear as to the cases where you need to change your code, versus leave it alone.

@MarcoGorelli
Member

we tell people to stop using apply() altogether because the word is ambiguous

I've suggested something similar here - shall we move the conversation there, now that there's a pdep?

@topper-123
Contributor Author

shall we move the conversation there, now that there's a pdep?

Ok for me.

@jbrockmendel mentioned though in a comment a discussion about this at the sprint. I wasn't at the sprint, so am not up to date with that discussion. Could someone talk a bit about that, especially if there was a consensus about a path forward?

@MarcoGorelli
Member

I think the change is what's described in the pdep

@topper-123 topper-123 changed the title API: make the func in Series.apply always operate on the array API: make the func in Series.apply always operate on the Series Aug 31, 2023
@rhshadrach
Member

@rhshadrach, could you maybe summarize the conclusions from the sprint wrt. this issue?

During the sprint, I described the issues with the current behavior of DataFrame.apply, Series.apply, DataFrame.agg, and Series.agg and how the code was intertwined. The consensus I took away was that this is an issue we'd like to fix but we are concerned with how noisy of a change this would be.
