-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDEP-8: Inplace methods in pandas #51466
base: main
Are you sure you want to change the base?
Conversation
Co-Authored-By: Joris Van den Bossche <[email protected]> Co-Authored-By: Patrick Hoefler <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this is well-written!
- For example, the ``fillna`` method would retain its ``inplace`` keyword, while ``dropna`` (potentially shrinks the | ||
length of a ``DataFrame``/``Series``) and ``rename`` (alters labels not values) would lose their ``inplace`` | ||
keywords |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to clarify - even with inplace
, fillna
isn't necessarily inplace, right? might be worth clarifying that it's an operation which has the possibility of operating inplace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep correct, example see below
- What should happen when ``inplace=True`` but the original pandas object cannot be operated inplace on because it | ||
shares its values with another object? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would be an example, something like
In [11]: lst = [1,2,3]
In [12]: ser = pd.Series([lst])
In [13]: ser
Out[13]:
0 [1, 2, 3]
dtype: object
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, nested data are not covered by CoW right now, it's only about arrays and pandas objects. An example would be the following:
ser = Series([1, 2, 3])
view = ser[:]
# Both share the same data since [:] creates a view
ser.replace(1, 5, inplace=True) # makes a copy of the data because we don't want to update view
|:--------------| | ||
| ``insert`` | | ||
| ``pop`` | | ||
| ``update`` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is update
always inplace? e.g. in
In [43]: df1 = pd.DataFrame({'numbers': [1, np.nan]})
In [44]: df2 = pd.DataFrame({'numbers': ['foo', 'bar']})
In [45]: df1.update(df2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It always operates inplace on the pandas object, but not necessarily on the underlying data/array. Your example would update df1, but the array would be copied before the update actually happens. Same as with replace and friends when the value to set is not compatible with the array dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure but then should it belong in group 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no inplace keyword, so we don't have to change anything, same with insert and pop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the title then
methods that always operate inplace
As in, methods that always operate inplace on the pandas object?
Because in another part of the tutorial the wording
Some of the methods with an
inplace
keyword can actually work inplace
is used - I find it confusing when you're talking about "actually working inplace" and "modifying the pandas object"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would make it clearer, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we change the keywords do we just go all in and make those copy_pandas_object and copy_underlying_data? Or something similar.
Obviously more verbose but yea I agree with @MarcoGorelli sentiment that inplace is vague to the point of misinterpretation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy_pandas_object
won't be necessary with CoW. We will only need a keyword where we actually can modify the underlying data inplace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd to be clear, my comment above was about terminology to use in the text explaining this, not for actual keyword names (from your comment I get the impression you are talking about actual keywords)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably way too abstracted from all of this, but find the terminology confusing. I didn't realize that CoW only ever referred to the pandas "container" object and didn't take a point of view on the underlying data. So if that' is the long term commitment then yea copy_pandas_object
is a bit redundant, although I wonder if there is other terminology we need to cover if/when CoW extends to the underlying data
Generally it feels like there is a lot to be gained from a more explicit keyword than inplace
when we want to modify the underlying data; I think is in line with @Dr-Irv points
I would say this proposal has a fatal flaw. What's to stop inplace / copy keywords being added back going forwards? Further the notion of an inplace method returning self is just very confusing in itself. Thus i would be -1 on a partial change as this PDEP proposes. That said I would be +1 on removing all uses of inplace (to the extent a copy keyword is needed for CoW compat is needed that is unavoidable) Reasons are similar for what you are proposing.
|
Adding back inplace in other methods than suggested in the PDEP is useless, because with CoW all these methods already return views if possible, so no benefit of adding inplace/copy back in. The main issue with methods in group 2 is that they will always create copies under CoW because modifying the underlying array inplace violates CoW principles when returning a new object.
If we modify the array without copying, we would always change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some suggestions for edits in response to the review comments. Planning on committing this in a couple of days if no comments.
I'm also wondering if we should add an inplace keyword to EA methods as well. Currently |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think this PDEP is too big, and a bit difficult to read and discuss. Feels like we could have a much simpler PDEP only for groups 3 and 4, see if there is consensus and try to get that approved, and then take the rest separately (in one or even more specific PRs).
Of course a bit of context is missed, but personally feels like it's not a problem to address the removal of the inplace parameter where it's not needed, and then discuss the cases where it can make more sense.
If Copy-on-Write is not made the default in the future what are the implications for this PDEP? |
actual" shallow copy, but protected under Copy-on-Write), but those behaviour changes are covered by the Copy-on-Write | ||
proposal[^1]. | ||
|
||
## Alternatives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe call this section 'Rejected Ideas'.
maybe worth trying to avoid repetition to reduce the word count. For instance, where there are bullet points, be more concise where the detail is expanded on later. |
Note: there are also operations (not methods) that work inplace in pandas, such as indexing ( | ||
e.g. `df.loc[0, "col"] = val`) or inplace operators (e.g. `df += 1`). This is out of scope for this PDEP, as we focus on | ||
the inplace behaviour of DataFrame and Series _methods_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a subsection for out of scope and pull down the out of scope related items from the abstract
could the Open Questions section be completely removed and maybe open issues instead? |
@datapythonista If possible to be more specific, are there specific sections you find difficult to read that we would improve, or something in the general flow / structure of the document that is confusing? Yes, the text is quite long, but that's also because we wanted to big thorough and complete and clear (but for sure we can improve on that based on feedback). If we want to reduce the size, I would personally remove most of the discussion of the |
From past discussions, I think it should be easy to reach consensus to remove Mixing all together with the different categories of functions and methods is what I had in mind when I said it's a bit difficult to read (or in my mind, more difficult than two separate PDEPs). The part that I found confusing is the last part of In any case, excellent work with the content of the PDEP. To be clear, my comments are only related to possible ways to make this PDEP easier to understand and discuss. The analysis is great, and I think this PDEP is very valuable. We've been discussing about this since before pandas 1.0... :) |
| `reindex_like` | `copy` | | ||
| `truncate` | `copy` | | ||
|
||
Although all of these methods either `inplace` or `copy`, they can never operate inplace because the nature of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"these methods have either the inplace
or copy
keywords, they can never..."
The original object would be modified before the reference goes out of scope. | ||
|
||
To avoid triggering a copy when a value would actually get replaced, we will keep the `inplace` argument for those | ||
methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the only group of methods that retain the inplace
keyword, can we then change the return type to be the object itself instead of None
? I see you address this below, but maybe make a mention here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is one of the ideas we had how to make this more compatible with the rest
If it were always clear whether an operation can be done inplace before hand, I wouldn't have too much opposition. However, I don't believe that's the case. I'd be in favor of a warning if we implement #55385.
Where is this convention written down? |
I wasn't aware of #55385. An opt-in performance warning is by far the best option. However, if #55385 is not implemented, raising in case of a copy with
This behavior is explicitly described in the official documentation. See for instance https://docs.python.org/3/faq/design.html#why-doesn-t-list-sort-return-the-sorted-list |
While I agree this is an argument for not returning an instance of |
Thanks for the feedback, @arnaudlegout!
I was also going to suggest whether a warning would be sufficient for this use case. |
My proposition was not to prevent method chaining, but to make it explicit (when Another option is to explain in the documentation of each method with an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with the proposal. Suggested some grammatical and wording changes for clarificaiton.
e.g. `df.loc[0, "col"] = val`) or inplace operators (e.g. `df += 1`). This is out of scope for this PDEP, as we focus on | ||
the inplace behaviour of DataFrame and Series _methods_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better to say "as we focus on the inplace behavior of DataFrame
and Series
methods, whether or not those methods have an inplace
keyword."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true we also list the methods that don't have an inplace
keyword, but there is not any actual change for those included in the PDEP (I think we list them more for completeness), so not sure it is needed to focus on that aspect here.
Yes, if we do this, we will certainly have to document this very well (personally, given this is not the default behaviour and so most users won't run into this, I think I am personally fine with the atypical behaviour of returning self) |
I did think of a place where it might be surprising. If you use pandas in a notebook (or even in just interactive python), and you had a cell in a notebook where the last line uses the I still like the idea of returning |
Co-authored-by: Irv Lustig <[email protected]>
@Dr-Irv thanks for the review! Committed your grammatical suggestions (I clearly need to learn the proper usage of either / neither ;)) |
@pandas-dev/pandas-core any more comments? Otherwise I am planning to open a vote in a few days |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this version.
As i indicated in #56507 I really want NO inplace at all, but if there is to be inplace. There is NO way these should be returning |
the 10 years ago thing would have been before type annotations. having a different return type depending on inplace makes annotations really ungainly |
oh i agree that the type annotations are a problem so -1 on returning self |
I don't remember that we changed this at some point (and so also no idea what the reasons would have been to do so). Do you have some link to that discussion or issue/PR? Can you explain a bit more why you think this is a bad idea? You mention that it encourages the use of the keyword, but 1) many people use it for not having to re-assign to the same variable or invent a new variable name (i.e. not having to do |
#1893 it wasn't that hard to fine. Having these return self also now adds a HUGE amount of complexity to the api. you have the standard
|
id be on board with making insert etc always return a new object, but that is probably out of scope here |
my one other objection here is that this diverges from the standard library as well, so to me, returning self is -1 |
Thanks for that link. Quoting Wes from that issue:
So while you consider this a complication of the API, it's also a simplification in other ways. In the PDEP, we list those different advantages and disadvantages (here, although we could have added a more explicit disadvantage about returning self being atypical for an inplace method vs other inplace methods without a keyword), and make the trade-off in favor of returning self:
From my quick reading of #1893, the main argument that is made in the issue is that returning It is true that this deviates from the inplace methods like |
Does this change break a lot of uses of the pandas dataframe/series extension API? It is argued that this reason for inplace is not adequate: "To save the result to the same variable / update the original variable (avoid the pattern of reassigning to the same variable)". But it breaks cases such as below (toy example):
because doing
only updates a reference internal to the accessor. It is very difficult to update the reference in the correct location from within an accessor. We need to be able to mutate the data frame directly. |
thanks @Rinfore - I think the idea is that you should make def drop_invalid_stocks(self) -> pd.DataFrame:
return self._df.dropna(subset=['ticker', 'price']) and use it as df = df.stocks.drop_invalid_stocks() |
cc @pandas-dev/pandas-core @pandas-dev/pandas-triage
Closes #16529