API: make the func in Series.apply always operate on the Series #52140
Comments
Somewhat related: #49673. But I think these changes make sense in combination with that issue.

I think this sounds great - I've always disliked the difference between `apply` and `map`. If we were to implement #35725 (comment), then I think there would be no difference between
Is this necessary? I think this is just

This isn't necessary, but we could also rename

I suspect the deprecation of this will not be straightforward because |
Yes, my thought is that this is related to #49673, but not the same. Both will make the code base clearer. |
I don't have a super strong opinion about this parameter, maybe it's superfluous, it's just that if we keep it, it fits better in
If the reception to this PR was positive, I was actually planning on following up with proposing to change
Possibly yeah. But it could also go the other way (fixing this would make #49673 easier:-)). Both work towards the same goal (simpler code paths) from different directions, so there could be mutual benefits in both directions. |
@pandas-dev/pandas-core, any comments/objections? I'll start on this soon and would appreciate some early comments, if you see things differently than me... |
This seems nice from a technical perspective but I don't know that a lot of users know/care much about I'm guessing apply is used more often. Not blocking any progression on this, but I have a concern about setting expectations that end users have some knowledge of the internal value storage |
Thanks @WillAyd - I missed this. Shouldn't |
You are right @WillAyd, I just checked for My intention was to make |
I think this warrants a PDEP. Right now, |
The deprecation process would of course need to be clear in this regard, so that will be part of the PR. |
That's why I think this should be a PDEP. I think part of the issue here is that the string version of the arguments (e.g., "sum", "min") operate differently than the ufuncs ('np.sum', 'np.min') |
@Dr-Irv - can you give a bit more detail here? What about the deprecation process makes you think this? Is it just that expectation that |
Yes, that is my concern. This is a proposed change in behavior, and I think it warrants a PDEP because of that. Most of the way the issue was written at the top would form the basis of the PDEP, but then we can have a more formal process for discussion and approval. |
Could you give an example where a user should do ser.apply(func) instead of func(ser) in the target state? |
I don't believe there is a case where there is a difference in behavior. However I think we should have similar operations on Series, DataFrame, SeriesGroupBy, DataFrameGroupBy where it makes sense, and there is one case where I think it does help users (albeit minor):
|
I've updated the proposal to address @WillAyd's comments, so that the proposal is now that

IMO if the concern is the deprecation process and not the change itself, a PDEP is too much, because we will be adding deprecation warnings, and nothing will break in 2.x. So the users can change their code at their own pace (likely when they look through the log and see the warning). |
Given that you are proposing a change in behavior (which would require a deprecation cycle anyway), I believe that a PDEP is warranted. |
This is way too broad a standard for when a PDEP is needed. |
Please see the proposed change to PDEP-1, starting at line 132 here: https://github.com/pandas-dev/pandas/pull/51417/files There are a lot of examples there, and the particular case described here falls into a case where I think a PDEP is warranted. I would ask others involved in the governance discussions (@jorisvandenbossche and @MarcoGorelli ) to weigh in with their opinion as well. |
Something to consider: if

On a DataFrame I see the benefit because it provides a convenient way to apply it across either of the two axes. But on a Series, with a single axis,

You can argue that |
Hi @matteosantama,

Yeah, originally my idea was that

However, there is still a significant difference to |
I do agree with @Dr-Irv that it would be nice to have this in the PDEP format. There is likely to be feedback / revisions to the proposal, and having it in the PDEP format makes that easier to track and iterate on than editing the OP |
The proposed change here is that in pandas 3.0, instead of

```python
def apply_standard(self) -> Series:
    if isinstance(f, np.ufunc):
        with np.errstate(all="ignore"):
            return f(obj)
    mapped = obj._map_values(mapper=f)
    return obj._constructor(mapped, index=obj.index).__finalize__(obj, method="apply")
```

it will be:

```python
def apply_standard(self) -> Series:
    with np.errstate(all="ignore"):
        mapped = f(obj)
    return obj._constructor(mapped, index=obj.index).__finalize__(obj, method="apply")
```

Are we sure this warrants a PDEP? I can write up a PDEP if that's required, but it seems maybe heavy to me process-wise.

BTW, I've added a new example in the list above (at 1.5, because it's a bit similar to point 1) discussing callables compared to lists/dicts of callables. |
IMHO, it requires a PDEP because you are changing the behavior of the API, not just the implementation. For example, you wrote: This is a big change in behavior that may affect many users, and that's why I think a PDEP is warranted. |
I agree with all the points mentioned in the OP in the I do not agree this requires a PDEP as it only affects the |
This proposal will actually not change |
I've added an additional example of inconsistencies (see example 7): giving dictlikes vs. listlikes as the argument to `Series.apply`. |
First, thanks for thinking about this in such detail

Second, sorry for not having joined the conversation til now

But third, I very much think this requires a PDEP. It's a huge change, even if it only affects the APIs of two functions
quick example (there's literally tonnes more... it's almost harder to find a kaggle notebook which doesn't have this kind of pattern...)

```python
import re

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

df['text'] = df['text'].apply(lambda x: remove_URL(x))
```

So yes, +1 for a PDEP, a change this big needs visibility |
Personally, I don’t think that this needs a PDEP, the problem is very simple, but I think that this needs a vote on whether we want to deprecate apply operating elemwise. |
I use that pattern all the time. I think part of the confusion here is the difference between

Maybe what is really needed here is to introduce new names to clarify behavior. We should have one name that means "apply_function_to_each_element_row". We should have another name that means "apply_function_to_the_series". Leave the existing names ( |
I am -1 on that, too noisy for users, I don’t think that it’s worth it |
A few comments from me about use cases:

Regarding the example from @MarcoGorelli, it is the same as using

Also, IMO there is not a need to duplicate the functionality of

In my suggestion the deprecation message will direct users to use

Also, something I haven't highlighted enough is the current difference between

Inversely, if a function has been developed for

So my suggestion will also streamline
Lines 956 to 980 in fc30823
Note especially line 973. If you think I'm wrong here, can you please explain? |
I agree. Also, this issue has been hanging for quite a while, so a vote would be nice in order to get a decision on this issue. We could include a vote on whether a PDEP needs to be written, if that's desired (though I don't know if we have procedures for multiple choice votes?). |
-0.15 on this needing a pdep, +1 on the change @rhshadrach described at the sprint |
@rhshadrach, could you maybe summarize the conclusions from the sprint wrt. this issue? |
Internally, yes, but I would argue that semantically there is another way to think about it.

```python
>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df
   x  y
0  1  4
1  2  5
2  3  6
>>> df["x"].apply(lambda v: 10*v)
0    10
1    20
2    30
Name: x, dtype: int64
>>> df.apply(lambda r: r["x"] + r["y"], axis=1)
0    5
1    7
2    9
dtype: int64
```

The pattern shown in

Put on the hat of a user. As a user, I find this way of looking at

I go back to my earlier suggestion. If the goal is to make the semantic meanings clear, keep the current behavior, and introduce new methods that use the new behavior with clear documentation as to these differences. |
For what it's worth, although Styler does not deal with Series, it takes pains to document that

This is one of the more frequent errors on stack overflow in questions. I think it helps that now:

```python
df = pd.DataFrame([1, 2])
df.style.map(lambda v: "color: red;" if v > 1 else "")        # works
df.style.apply(lambda s: np.where(s > 1, "color: red;", ""))  # works
df.style.apply(lambda v: "color: red;" if v > 1 else "")      # fails. The truth value of a Series is ambiguous.
```

I support the change because I think trying to unify the mindset and function use of users is overall positive. +0 on requiring a PDEP, but the initial reaction from more than a couple of members for its need suggests it should be considered? |
I too support the change, I'd just like it to have a little more visibility - if it's a pdep, then it's more likely to be shared / talked about at conferences. If others don't want to (which is absolutely fine!), I could do the work of writing up the document |
But those are not the same and have the same general issues I've lined up, i.e. the

@MarcoGorelli: I've made a sketch for a PDEP and will push it today. |
I think of it that

In thinking about this more, because the behavior of

The advantage of totally deprecating |
I've suggested something similar here - shall we move the conversation there, now that there's a pdep? |
Ok for me. @jbrockmendel mentioned though in a comment a discussion about this at the sprint. I wasn't at the sprint, so am not up to date with that discussion. Could someone talk a bit about that, especially if there was a consensus about a path forward? |
I think the change is what's described in the pdep |
During the sprint, I described the issues with the current behavior of DataFrame.apply, Series.apply, DataFrame.agg, and Series.agg and how the code was intertwined. The consensus I took away was that this is an issue we'd like to fix but we are concerned with how noisy of a change this would be. |
I've lately worked on making `Series.map` simpler as part of implementing the `na_action` parameter on all `ExtensionArray.map` methods. As part of that, I made #52033. That PR (and the current `SeriesApply.apply_standard` more generally) very clearly shows how `Series.apply` & `Series.map` are very similar, but different enough for it to be confusing when it's a good idea to use one over the other, and when `Series.apply` especially is a bad idea to use.

I propose doing some changes in how `Series.apply` works when given a single callable. This change is somewhat fundamental, so I understand that this can be controversial, but I believe that this change will be for the better for Pandas. I'm of course ready for discussion and possibly (but hopefully not 😄) disagreement. We'll see.

I'll show the proposal below. First I'll show what the similarities and differences are between the two methods, then what the problem is in my view with the current API, and then my proposed solution.
Similarities and differences between `Series.apply` and `Series.map`

The similarity between the methods is especially that they both fall back to use `Series._map_values` and there use `algorithms.map_array` or `ExtensionArray.map` as relevant.

The differences are many, but each one is relatively minor:

- `Series.apply` has a `convert_dtype` parameter, which `Series.map` doesn't
- `Series.map` has a `na_action` parameter, which `Series.apply` doesn't
- `Series.apply` can take advantage of numpy ufuncs, which `Series.map` can't
- `Series.apply` can take `args` and `**kwargs`, which `Series.map` can't
- `Series.apply` will return a DataFrame if its result is a listlike of Series, which `Series.map` won't
- `Series.apply` is more general and can take a string, e.g. `"sum"`, or lists or dicts of inputs, which `Series.map` can't

Also, `Series.apply` is a bit of a parent method of `Series.agg` & `Series.transform`.

The problems

The above similarities and many minor differences make for (IMO) confusing and too complex rules for when it's a good idea to use `.apply` over `.map` to do operations, and vice versa. I will show some examples below.

First some setup:
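The examples below assume a small series and a large series along these lines (a minimal sketch; the exact data is not important, only that `large_ser` is big enough for performance differences to show):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3])
large_ser = pd.Series(np.arange(1_000_000))
```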
1: string vs numpy funcs in `Series.apply`

It will surprise new users that these two give different results. Also, anyone using the second pattern is probably making a mistake.
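A minimal sketch of the difference (behavior as of pandas 2.0; `ser` as in the setup above):

```python
ser.apply("sum")   # -> 6: the string alias aggregates the whole Series
ser.apply(np.sum)  # -> Series([1, 2, 3]): np.sum is not a ufunc, so it is called once per element
```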
Note that giving `np.sum` to `DataFrame.apply` aggregates properly:
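A sketch of the `DataFrame.apply` behavior referred to above (each column is passed whole to `np.sum`):

```python
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.apply(np.sum)
# a     6
# b    15
# dtype: int64
```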
1.5 Callables vs. list/dict of callables (added 2023-04-07)

Also with non-numpy callables:
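A sketch of the kind of difference meant here (behavior as of pandas 2.0; the particular lambda is an assumption, chosen so the element-wise path fails loudly):

```python
ser.apply(lambda x: x.median())    # element-wise: fails, np.int64 has no .median attribute
ser.apply([lambda x: x.median()])  # wrapped in a list: the lambda receives the whole Series -> 2.0
```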
In both cases above the difference is that `Series.apply` operates element-wise if given a callable, but series-wise if given a list/dict of callables.

2. Functions in `Series.apply` (& `Series.transform`)

The `Series.apply` doc string has examples using lambdas, but lambdas in `Series.apply` are a bad practice because of bad performance:
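An illustrative comparison (no timings shown; `large_ser` as in the setup sketch — the point is that the lambda is invoked once per element):

```python
large_ser.apply(lambda x: x + 1)  # slow: one Python-level call per element
large_ser + 1                     # fast: a single vectorized operation
```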
Currently, `Series` does not have a method that makes a callable operate on a series' data. Instead users need to use `Series.pipe` for that operation in order for the operation to be efficient:
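A sketch of the `pipe` pattern meant here (the callable receives the whole Series, so the addition stays vectorized):

```python
large_ser.pipe(lambda x: x + 1)  # fast: equivalent to large_ser + 1
```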
(The reason for the above performance difference is that `apply` gets called on each single element, while `pipe` calls `x.__add__(1)`, which operates on the whole array.)

Note also that `.pipe` operates on the `Series` while `apply` currently operates on each element in the data, so there are some differences that may have consequences in some cases.

Also notice that `Series.transform` has the same performance problems:
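A sketch of the same problem with `Series.transform` (as of pandas 2.0 a plain callable ends up going through the element-wise apply path):

```python
large_ser.transform(lambda x: x + 1)  # slow: the lambda is called once per element
```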
3. ufuncs in `Series.apply` vs. in `Series.map`

Performance-wise, ufuncs are fine in `Series.apply`, but not in `Series.map`:
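An illustrative pair (`np.exp` is a ufunc; behavior as of pandas 2.0):

```python
large_ser.apply(np.exp)  # fast: apply hands the whole array to the ufunc, i.e. np.exp(large_ser)
large_ser.map(np.exp)    # slow: map calls np.exp once per element
```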
It's difficult for users to understand why one is fast and the other slow (answer: only `apply` correctly works with ufuncs).

It is also difficult to understand why ufuncs are fast in `apply`, while other callables are slow in `apply` (answer: it's because ufuncs operate on the whole array, while other callables operate elementwise).

4. callables in `Series.apply` are bad, callables in `SeriesGroupby.apply` are fine

I showed above that using (non-ufunc) callables in `Series.apply` is bad performance-wise. OTOH using them in `SeriesGroupby.apply` is fine:
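An illustrative groupby counterpart (the grouping key here is arbitrary; in `SeriesGroupby.apply` the callable receives each group as a Series, so a vectorized function runs once per group rather than once per element):

```python
large_ser.groupby(large_ser % 10).apply(lambda x: x + 1)  # x is a whole group (a Series)
large_ser.apply(lambda x: x + 1)                          # x is a single element
```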
Note that most of the time in the groupby was used doing groupby ops, so the actual difference in the `apply` op is much larger, and similar to example 2 above.

Having callables be OK to use in the `SeriesGroupby.apply` method, but not in `Series.apply`, is confusing IMO.

5: callables in `Series.apply` that return Series transform data to a DataFrame

`Series.apply` has an exception that if the callable returns a list-like of Series, the Series will be concatenated to a DataFrame. This is a very slow operation and hence generally a bad idea:
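A small sketch of the concatenation behavior described above (as of pandas 2.0):

```python
pd.Series([1, 2, 3]).apply(lambda x: pd.Series({"a": x, "b": x + 1}))
#    a  b
# 0  1  2
# 1  2  3
# 2  3  4
```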
It's probably never a good idea to use this pattern; `.pipe` is much faster, e.g. `large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x + 1}))`. If we really do need to operate on single elements in that fashion it is still possible using `pipe`, e.g. `large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x.map(some_func)}))`, and also just directly `pd.DataFrame({"a": large_ser, "b": large_ser.map(some_func)})`.

So giving callables that return `Series` to `Series.apply` is a bad pattern and should be discouraged. (If users really want to do that pattern, they should build the list of Series themselves and take responsibility for the slowdown.)
6. `Series.apply` vs. `Series.agg`

The doc string for `Series.agg` says about the method's `func` parameter: "If a function, must ... work when passed ... to Series.apply". But compare these:
You could argue the doc string is correct (it doesn't raise...), but you could also argue it isn't (because the results are different). I'd personally expect "must work when passed to Series.apply" to mean "gives the same result when passed to `agg` and to `apply`".

7. dictlikes vs. listlikes in `Series.apply` (added 2023-06-04)

Giving a list of transforming arguments to `Series.apply` returns a `DataFrame`:
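A sketch with a single transforming string argument (as of pandas 2.0; `ser` as in the setup above):

```python
ser.apply(["cumsum"])
#    cumsum
# 0       1
# 1       3
# 2       6
```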
But giving a dict of transforming arguments returns a `Series` with a `MultiIndex`:
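The dict counterpart of the same sketch (as of pandas 2.0):

```python
ser.apply({"x": "cumsum"})
# x  0    1
#    1    3
#    2    6
# dtype: int64
```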
These two should give same-shaped output for consistency. Using `Series.transform` instead of `Series.apply`, it returns a `DataFrame` in both cases, and I think the dictlike example above should return a `DataFrame` similar to the listlike example.

Minor additional info: listlikes and dictlikes of aggregation arguments do behave the same, so this is only a problem with dictlikes of transforming arguments when using `apply`.

Proposal
With the above in mind, I propose that:
- `Series.apply` takes callables that always operate on the series, i.e. let `series.apply(func)` be similar to `func(series)` + the needed additional functionality.
- `Series.map` takes callables that operate on each element individually, i.e. `series.map(func)` will be similar to the current `series._map_values(func)` + the needed additional functionality.
- `convert_dtype` will be deprecated in `Series.apply` (already done in DEPR: Deprecate the convert_dtype param in Series.Apply #52257). `convert_dtype` will NOT be added to `Series.map` (see the comment by @rhshadrach).
- The ability of `Series.apply` to convert a `list[Series]` to a DataFrame will be deprecated (already done in DEPR: Deprecate returning a DataFrame in SeriesApply.apply_standard #52123). Converting a `list[Series]` to a DataFrame will NOT be added to `Series.map`.
- The changes to `Series.apply` will propagate to `Series.agg` and `Series.transform`.

The difference between `Series.apply()` & `Series.map()` will then be that:

- `Series.apply()` makes the passed-in callable operate on the series, similarly to how `(DataFrame|SeriesGroupby|DataFrameGroupBy).apply` operate on series. This is very fast and can do almost anything.
- `Series.map()` makes the passed-in callable operate on each of the series' data elements individually. This is very flexible, but can be very slow, so it should only be used if `Series.apply` can't do it.
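A small sketch of the proposed target state (this is the proposal, not current pandas behavior):

```python
ser.apply(lambda s: s + 1)  # proposed: the callable receives the whole Series -> vectorized
ser.map(lambda x: x + 1)    # the callable receives each element -> flexible but slow
```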
So, IMO, this API change will help make the pandas `Series.(apply|map)` API simpler without losing functionality, and let their functionality be explainable in a simple manner, which would be a win for Pandas.

Deprecation process
The cumbersome part of the deprecation process will be to change `Series.apply` to only work array-wise, i.e. to do `func(series._values)` always. This can be done by adding an `array_ops_only` parameter to `Series.apply`, so:
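A hypothetical sketch of that transition (the `array_ops_only` name comes from the text above; the default and the warning behavior shown in the comments are assumptions):

```python
func = lambda s: s + 1

ser.apply(func)                       # 2.x: keeps the old element-wise behavior, with a deprecation warning
ser.apply(func, array_ops_only=True)  # 2.x: opts in to the future behavior, i.e. func(ser._values)
```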
and then change the meaning of that parameter in pandas v3.0 again to make people remove it from their code.

The other changes are easier: `convert_dtype` in `Series.apply` will be deprecated just like you would normally for method parameters. The ability to convert a list of Series to a DataFrame will emit a deprecation warning when that code path is encountered.
will be deprecated just like you would normally for method parameters. The ability to convert a list of Series to a DataFrame will emit a deprecation warning, when that code path is encountered.The text was updated successfully, but these errors were encountered: