PDEP-8: Inplace methods in pandas #51466

Open
wants to merge 16 commits into base: main

lithomas1
Member

@lithomas1 lithomas1 commented Feb 17, 2023

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

Closes #16529

Co-Authored-By: Joris Van den Bossche <[email protected]>
Co-Authored-By: Patrick Hoefler <[email protected]>
@lithomas1 lithomas1 added the Copy / view semantics, inplace (Relating to inplace parameter or equivalent), and PDEP (pandas enhancement proposal) labels Feb 17, 2023
@datapythonista datapythonista changed the title PDEP 8: Inplace methods in pandas PDEP-8: Inplace methods in pandas Feb 26, 2023
Member

@MarcoGorelli MarcoGorelli left a comment

Nice, this is well-written!

Comment on lines 20 to 22
- For example, the ``fillna`` method would retain its ``inplace`` keyword, while ``dropna`` (potentially shrinks the
length of a ``DataFrame``/``Series``) and ``rename`` (alters labels not values) would lose their ``inplace``
keywords
Member

just to clarify - even with inplace, fillna isn't necessarily inplace, right? might be worth clarifying that it's an operation which has the possibility of operating inplace

Member

Yep, correct; see the example below

Comment on lines 32 to 33
- What should happen when ``inplace=True`` but the original pandas object cannot be operated inplace on because it
shares its values with another object?
Member

what would be an example, something like

In [11]: lst = [1,2,3]

In [12]: ser = pd.Series([lst])

In [13]: ser
Out[13]:
0    [1, 2, 3]
dtype: object

?

Member

This is a good point, nested data are not covered by CoW right now, it's only about arrays and pandas objects. An example would be the following:

from pandas import Series

ser = Series([1, 2, 3])
view = ser[:]
# Both share the same data since [:] creates a view
ser.replace(1, 5, inplace=True)  # makes a copy of the data because we don't want to update view
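
For anyone who wants to try this locally, here is a minimal runnable sketch of the same scenario. It assumes a pandas version where Copy-on-Write can be switched on through `pd.options.mode.copy_on_write` (or is already always on); the guarded option setting and the `shared_before` / `shared_after` names are only for this illustration. With Copy-on-Write active, the expectation is that `view` keeps its original values and the two objects stop sharing memory after the `replace` call.

```python
import warnings

import numpy as np
import pandas as pd

# Opt in to Copy-on-Write where it is still optional.
# Assumption: on versions where CoW is always enabled, this setting is
# redundant, so any warning or error raised by it is ignored here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.options.mode.copy_on_write = True
    except Exception:
        pass

ser = pd.Series([1, 2, 3])
view = ser[:]  # a second pandas object looking at the same array

shared_before = np.shares_memory(ser.to_numpy(), view.to_numpy())

# Requested "inplace", but the array is shared with `view`, so under
# Copy-on-Write the data has to be copied before it is modified.
ser.replace(1, 5, inplace=True)

shared_after = np.shares_memory(ser.to_numpy(), view.to_numpy())

print(ser.tolist())
print(view.tolist())
print(shared_before, shared_after)
```

So `inplace=True` here updates the pandas object `ser`, while the underlying array is not necessarily modified in place.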

|:--------------|
| ``insert`` |
| ``pop`` |
| ``update`` |
Member

is update always inplace? e.g. in

In [43]: df1 = pd.DataFrame({'numbers': [1, np.nan]})

In [44]: df2 = pd.DataFrame({'numbers': ['foo', 'bar']})

In [45]: df1.update(df2)

Member

@phofl phofl Feb 28, 2023

It always operates inplace on the pandas object, but not necessarily on the underlying data/array. Your example would update df1, but the array would be copied before the update actually happens. Same as with replace and friends when the value to set is not compatible with the array dtype

Member

sure but then should it belong in group 2?

Member

There is no inplace keyword, so we don't have to change anything, same with insert and pop

Member

@MarcoGorelli MarcoGorelli Feb 28, 2023

I don't understand the title then

methods that always operate inplace

As in, methods that always operate inplace on the pandas object?

Because in another part of the tutorial the wording

Some of the methods with an inplace keyword can actually work inplace

is used - I find it confusing when you're talking about "actually working inplace" and "modifying the pandas object"

Member

Yes, that would make it clearer, thanks

Member

If we change the keywords do we just go all in and make those copy_pandas_object and copy_underlying_data? Or something similar.

Obviously more verbose, but yeah, I agree with @MarcoGorelli's sentiment that inplace is vague to the point of misinterpretation

Member

copy_pandas_object won't be necessary with CoW. We will only need a keyword where we actually can modify the underlying data inplace.

Member

@WillAyd to be clear, my comment above was about terminology to use in the text explaining this, not for actual keyword names (from your comment I get the impression you are talking about actual keywords)

Member

I'm probably way too abstracted from all of this, but I find the terminology confusing. I didn't realize that CoW only ever referred to the pandas "container" object and didn't take a point of view on the underlying data. So if that is the long-term commitment then yeah, copy_pandas_object is a bit redundant, although I wonder if there is other terminology we need to cover if/when CoW extends to the underlying data

Generally it feels like there is a lot to be gained from a more explicit keyword than inplace when we want to modify the underlying data; I think this is in line with @Dr-Irv's points

@jreback
Contributor

jreback commented Feb 28, 2023

I would say this proposal has a fatal flaw. What's to stop inplace / copy keywords being added back going forwards?

Further the notion of an inplace method returning self is just very confusing in itself.

Thus I would be -1 on a partial change as this PDEP proposes.

That said, I would be +1 on removing all uses of inplace (to the extent that a copy keyword is needed for CoW compat, that is unavoidable)

Reasons are similar for what you are proposing.

  • performance is not meaningful with CoW
  • syntax with inplace is not great

@phofl
Member

phofl commented Feb 28, 2023

Adding back inplace in other methods than suggested in the PDEP is useless, because with CoW all these methods already return views if possible, so no benefit of adding inplace/copy back in.

The main issue with methods in group 2 is that they will always create copies under CoW because modifying the underlying array inplace violates CoW principles when returning a new object.

import numpy as np
from pandas import DataFrame

df = DataFrame({"a": [1.5, np.nan, 2.5]})
df2 = df.fillna(100.5)

If we modify the array without copying, we would always change df and df2. With inplace this could avoid copying, but yeah I agree that the syntax with inplace is terrible.
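
A small runnable sketch of that point, using only public pandas and NumPy calls (the `shares` variable is just a name for this illustration): because `fillna` returns a new object, the original frame has to keep its missing value, which is only possible if the filled result lives in its own array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.5, np.nan, 2.5]})
df2 = df.fillna(100.5)

# The original still contains the missing value ...
print(df["a"].tolist())   # [1.5, nan, 2.5]
# ... and only the returned object has it filled.
print(df2["a"].tolist())  # [1.5, 100.5, 2.5]

# The two columns cannot be backed by the same memory, otherwise
# filling one would have changed the other.
shares = np.shares_memory(df["a"].to_numpy(), df2["a"].to_numpy())
print(shares)  # False
```

This is the copy that an explicit inplace-style option would let the user skip when the original `df` is no longer needed.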

Member Author

@lithomas1 lithomas1 left a comment

Made some suggestions for edits in response to the review comments. Planning on committing this in a couple of days if no comments.

@lithomas1
Member Author

I'm also wondering if we should add an inplace keyword to EA methods as well.

Currently fillna is the only EA method that would benefit from this, but interpolate (#50950) would benefit from this too, if it was added to the EA interface.

Member

@datapythonista datapythonista left a comment

Personally I think this PDEP is too big, and a bit difficult to read and discuss. Feels like we could have a much simpler PDEP only for groups 3 and 4, see if there is consensus and try to get that approved, and then take the rest separately (in one or even more specific PRs).

Of course a bit of context is missed, but personally feels like it's not a problem to address the removal of the inplace parameter where it's not needed, and then discuss the cases where it can make more sense.

@simonjayhawkins
Member

If Copy-on-Write is not made the default in the future what are the implications for this PDEP?

actual" shallow copy, but protected under Copy-on-Write), but those behaviour changes are covered by the Copy-on-Write
proposal[^1].

## Alternatives
Member

Maybe call this section 'Rejected Ideas'.

@simonjayhawkins
Member

Personally I think this PDEP is too big, and a bit difficult to read and discuss.

maybe worth trying to avoid repetition to reduce the word count. For instance, where there are bullet points, be more concise where the detail is expanded on later.

Comment on lines +74 to +76
Note: there are also operations (not methods) that work inplace in pandas, such as indexing (
e.g. `df.loc[0, "col"] = val`) or inplace operators (e.g. `df += 1`). This is out of scope for this PDEP, as we focus on
the inplace behaviour of DataFrame and Series _methods_.
Member

maybe a subsection for out of scope and pull down the out of scope related items from the abstract

@simonjayhawkins
Member

Personally I think this PDEP is too big, and a bit difficult to read and discuss.

maybe worth trying to avoid repetition to reduce the word count. For instance, where there are bullet points, be more concise where the detail is expanded on later.

could the Open Questions section be completely removed and maybe open issues instead?

@jorisvandenbossche
Member

Personally I think this PDEP is too big, and a bit difficult to read and discuss.

@datapythonista If possible to be more specific, are there specific sections you find difficult to read that we would improve, or something in the general flow / structure of the document that is confusing?

Yes, the text is quite long, but that's also because we wanted to be thorough, complete, and clear (but for sure we can improve on that based on feedback).
Personally I think we can certainly try to discuss this as a single PDEP, many of the aspects are very much related.


If we want to reduce the size, I would personally remove most of the discussion of the copy keyword (apart from a brief mention how it interacts with inplace, and in the rejected ideas section), and focus on only the inplace keyword. In the end, copy=False does not do anything inplace, it's just another way to avoid a copy (one of the reasons to use inplace=True as well). But the fact that the copy keyword essentially becomes obsolete is entirely due to Copy-on-Write (PDEP-7). In fact, we already implemented that in main when CoW is enabled.

@datapythonista
Member

@datapythonista If possible to be more specific, are there specific sections you find difficult to read that we would improve, or something in the general flow / structure of the document that is confusing?

From past discussions, I think it should be easy to reach consensus to remove inplace from where it doesn't make sense. I think a PDEP dedicated to that should be much simpler and focussed, and we can get that part out of the question. And I personally see it as independent from how inplace should behave in the places where it makes sense.

Mixing all together with the different categories of functions and methods is what I had in mind when I said it's a bit difficult to read (or in my mind, more difficult than two separate PDEPs).

The part that I found confusing is the last part of Proposed changes and reasoning. It's not clear if the proposal is to use df = df.replace(..., inplace=True). Also, I should probably get more into the details of CoW, but without a very good understanding of it, it's not easy to understand why having inplace is better than just replacing the data (or making the copy, as you call it).

In any case, excellent work with the content of the PDEP. To be clear, my comments are only related to possible ways to make this PDEP easier to understand and discuss. The analysis is great, and I think this PDEP is very valuable. We've been discussing this since before pandas 1.0... :)

| `reindex_like` | `copy` |
| `truncate` | `copy` |

Although all of these methods either `inplace` or `copy`, they can never operate inplace because the nature of the
Contributor

"these methods have either the inplace or copy keywords, they can never..."

The original object would be modified before the reference goes out of scope.

To avoid triggering a copy when a value would actually get replaced, we will keep the `inplace` argument for those
methods.
Contributor

@Dr-Irv Dr-Irv Mar 11, 2023

If this is the only group of methods that retain the inplace keyword, can we then change the return type to be the object itself instead of None ? I see you address this below, but maybe make a mention here.

Member

Yeah this is one of the ideas we had how to make this more compatible with the rest

@rhshadrach
Member

2. with inplace=True silently falling back to a copy is an issue when you manipulate large data. I would largely prefer an exception

If it were always clear beforehand whether an operation can be done inplace, I wouldn't have too much opposition. However, I don't believe that's the case. I'd be in favor of a warning if we implement #55385.

3. returning self with inplace=True goes against the Python convention.

Where is this convention written down?

@arnaudlegout
Contributor

arnaudlegout commented Nov 14, 2023

If it were always clear beforehand whether an operation can be done inplace, I wouldn't have too much opposition. However, I don't believe that's the case. I'd be in favor of a warning if we implement #55385.

I wasn't aware of #55385. An opt-in performance warning is by far the best option. However, if #55385 is not implemented, raising in case of a copy with inplace=True will be an improvement. Indeed, you are right that it was never clear whether an operation was done inplace, and my understanding is that CoW is the attempt to solve two issues: make the behavior consistent, and make the performance predictable. Having an inplace=True that silently falls back to a copy will be a regression in that respect.

 3. returning self with inplace=True goes against the Python convention.

Where is this convention written down?

This behavior is explicitly described in the official documentation. See for instance https://docs.python.org/3/faq/design.html#why-doesn-t-list-sort-return-the-sorted-list
Also, I am not aware of any counter example in the builtins and stdlib.
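
To make the referenced builtin behaviour concrete, here is a short sketch using only the standard library (variable names are arbitrary): the mutating method `list.sort` returns `None`, while the non-mutating `sorted` returns a new list.

```python
numbers = [3, 1, 2]

# list.sort mutates the list and returns None.
result = numbers.sort()
print(result)   # None
print(numbers)  # [1, 2, 3]

# sorted leaves its argument untouched and returns a new list.
original = [3, 1, 2]
new_list = sorted(original)
print(original)  # [3, 1, 2]
print(new_list)  # [1, 2, 3]
```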

@rhshadrach
Member

rhshadrach commented Nov 14, 2023

Where is this convention written down?

This behavior is explicitly described in the official documentation. See for instance docs.python.org/3/faq/design.html#why-doesn-t-list-sort-return-the-sorted-list

While I agree this is an argument for not returning an instance of self, I find calling this a "convention" far too strong. In the case of pandas, it seems clear to me the benefits of returning the instance - namely promoting method chaining - outweighs this drawback.

@jorisvandenbossche
Member

Thanks for the feedback, @arnaudlegout!

2. with inplace=True silently falling back to a copy is an issue when you manipulate large data. I would largely prefer an exception (e.g., InplaceError) when I develop my code

I was also going to suggest whether a warning would be sufficient for this use case.
Personally I think an error would be too annoying. Especially when developing, you might be testing code interactively, and when doing that you will easily trigger a copy (eg because there is a reference from checking some values interactively, like the IPython example that is included in the PDEP). Getting those exceptions while testing things interactively will become annoying IMO.
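
To illustrate the trade-off being discussed, here is a toy model written in plain Python. Nothing in it is pandas API: `ToyColumn`, `CopyFallbackWarning`, and `replace_inplace` are hypothetical names, and the owner list is a simplified stand-in for reference tracking. It only sketches the idea of an inplace request that falls back to a copy when the buffer is shared, and reports that with a warning rather than an exception.

```python
import warnings


class CopyFallbackWarning(UserWarning):
    """Hypothetical warning category; not part of pandas."""


class ToyColumn:
    """Toy stand-in for a column whose buffer may be shared with views."""

    def __init__(self, values, _owners=None):
        self._values = values
        # Every object sharing one buffer also shares this owner list.
        self._owners = _owners if _owners is not None else []
        self._owners.append(self)

    def view(self):
        # A view reuses the buffer instead of copying it.
        return ToyColumn(self._values, self._owners)

    def replace_inplace(self, old, new):
        if len(self._owners) > 1:
            # Buffer is shared: warn, then detach onto a private copy so
            # that the other owners do not see the change.
            warnings.warn(
                "buffer is shared; falling back to a copy",
                CopyFallbackWarning,
                stacklevel=2,
            )
            self._owners.remove(self)
            self._values = list(self._values)
            self._owners = [self]
        for i, v in enumerate(self._values):
            if v == old:
                self._values[i] = new


# Sole owner: the buffer is modified in place and nothing is reported.
col = ToyColumn([1, 2, 3])
col.replace_inplace(1, 5)

# Shared buffer: the request still succeeds, but via a copy plus a warning.
shared = ToyColumn([1, 2, 3])
view = shared.view()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    shared.replace_inplace(1, 5)

print(shared._values, view._values, len(caught))
```

With a warning, an interactive session that happens to hold an extra reference keeps working; raising instead would turn the same situation into a hard failure.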

@arnaudlegout
Contributor

While I agree this is an argument for not returning an instance of self, I find calling this a "convention" far too strong. In the case of pandas, it seems clear to me the benefits of returning the instance - namely promoting method chaining - outweighs this drawback.

My proposition was not to prevent method chaining, but to make it explicit (when inplace=True) with a method argument (e.g., chain=True). The behavior you propose does not exist in Python builtins and does not exist in numpy. Making it implicit might be a difficulty for newcomers. As in-place modification is quite rare in the pandas API, I believe adding a method argument to make the chaining explicit is not overkill. I understand that a drawback is that if you chain methods and forget the chain=True argument then you will introduce a bug. This drawback might be too important for the benefit.

Another option is to explain in the documentation of each method with an inplace argument that inplace=True makes the change in place and the method will return self. Maybe this is enough. At least, I want to make clear that, as the proposed behavior could be surprising, there must be clear documentation of the effect of inplace=True.

Contributor

@Dr-Irv Dr-Irv left a comment

I am fine with the proposal. Suggested some grammatical and wording changes for clarification.

Comment on lines +68 to +69
e.g. `df.loc[0, "col"] = val`) or inplace operators (e.g. `df += 1`). This is out of scope for this PDEP, as we focus on
the inplace behaviour of DataFrame and Series _methods_.
Contributor

Maybe better to say "as we focus on the inplace behavior of DataFrame and Series methods, whether or not those methods have an inplace keyword."

Member

It's true we also list the methods that don't have an inplace keyword, but there is not any actual change for those included in the PDEP (I think we list them more for completeness), so not sure it is needed to focus on that aspect here.

@jorisvandenbossche
Member

Another option is to explain in the documentation of each method with an inplace argument that inplace=True makes the change in place and the method will return self. Maybe this is enough. At least, I want to make clear that, as the proposed behavior could be surprising, there must be clear documentation of the effect of inplace=True.

Yes, if we do this, we will certainly have to document this very well (personally, given this is not the default behaviour and so most users won't run into this, I think I am personally fine with the atypical behaviour of returning self)

@Dr-Irv
Contributor

Dr-Irv commented Nov 20, 2023

Another option is to explain in the documentation of each method with an inplace argument that inplace=True makes the change in place and the method will return self. Maybe this is enough. At least, I want to make clear that, as the proposed behavior could be surprising, there must be clear documentation of the effect of inplace=True.

Yes, if we do this, we will certainly have to document this very well (personally, given this is not the default behaviour and so most users won't run into this, I think I am personally fine with the atypical behaviour of returning self)

I did think of a place where it might be surprising. If you use pandas in a notebook (or even in just interactive python), and you had a cell in a notebook where the last line uses the inplace=True argument, under current behavior, there would be no output for that cell. With the change to returning self, the output would then be displayed.

I still like the idea of returning self, but thought it was worthwhile to point out this change in behavior.

@jorisvandenbossche
Member

@Dr-Irv thanks for the review! Committed your grammatical suggestions (I clearly need to learn the proper usage of either / neither ;))

@jorisvandenbossche
Member

@pandas-dev/pandas-core any more comments? Otherwise I am planning to open a vote in a few days

Contributor

@Dr-Irv Dr-Irv left a comment

I'm fine with this version.

@jreback
Contributor

jreback commented Dec 15, 2023

As I indicated in #56507

I really want NO inplace at all, but if there is to be inplace, there is NO way these should be returning self. This is just a terrible idea. It happened about 10 years ago and took ages to revert. This reinforces a model that inplace is actually ok, when in fact it generally is not.

@jbrockmendel
Member

the 10 years ago thing would have been before type annotations. having a different return type depending on inplace makes annotations really ungainly
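
For readers less familiar with the annotation problem, here is a sketch of what a return type that depends on `inplace` looks like for a type checker. `ToySeries` is a made-up class for illustration and this is not copied from the pandas source or stubs; it only shows the general `typing.overload` plus `Literal` pattern.

```python
from __future__ import annotations

from typing import Literal, overload


class ToySeries:
    """Toy container; only mimics the shape of an inplace-style API."""

    def __init__(self, values: list[float | None]) -> None:
        self.values = values

    # One overload per value of `inplace`, because the return type differs.
    @overload
    def fillna(self, value: float, *, inplace: Literal[False] = ...) -> ToySeries: ...

    @overload
    def fillna(self, value: float, *, inplace: Literal[True]) -> None: ...

    def fillna(self, value: float, *, inplace: bool = False) -> ToySeries | None:
        filled = [value if v is None else v for v in self.values]
        if inplace:
            # Mutate the existing list and return nothing.
            self.values[:] = filled
            return None
        return ToySeries(filled)


s = ToySeries([1.0, None])
out = s.fillna(0.0)                       # new object, `s` untouched
untouched = list(s.values)
returned = s.fillna(0.0, inplace=True)    # mutates `s`, returns None
```

If the method returned the object in both cases, a single signature returning `ToySeries` would be enough, which is the simplification argument made for returning self.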

@jreback
Contributor

jreback commented Dec 16, 2023

oh i agree that the type annotations are a problem
my point is that returning self on inplace was a mistake 10 years ago and is a mistake now

so -1 on returning self
this simply is encouraging wrong behavior

@jorisvandenbossche
Member

It happened about 10 years ago and took ages to revert

I don't remember that we changed this at some point (and so also no idea what the reasons would have been to do so). Do you have some link to that discussion or issue/PR?

Can you explain a bit more why you think this is a bad idea? You mention that it encourages the use of the keyword, but 1) many people use it for not having to re-assign to the same variable or invent a new variable name (i.e. not having to do df_long_name = df_long_name.method()), but those use cases are mostly deprecated (and will go away, since most methods will lose the inplace keyword) and for this use case you also don't care about the return value (thus, it will not encourage this use case), and 2) for those methods that keep the inplace keyword, such as fillna, it is not "wrong behaviour" to do the fillna inplace in a method chain (which is what returning self enables). It simply avoids one additional copy of the data.

@jreback
Contributor

jreback commented Dec 18, 2023

@jorisvandenbossche

#1893 it wasn't that hard to find.

Having these return self also now adds a HUGE amount of complexity to the api. you have the standard inplace methods, such as insert and .loc which return None, now you are adding a different case for these inplace methods. tbh I would either:

  • remove all inplace everywhere; the supposed performance benefits of having 4 methods with inplace are dubious
  • if you insist on the above, then rename these to ffill_inplace and so on (to avoid the typing issue); don't love this, these still should return None

@jbrockmendel
Member

id be on board with making insert etc always return a new object, but that is probably out of scope here

@jreback
Contributor

jreback commented Dec 18, 2023

my one other objection here is that this diverges from the standard library as well, so to me, returning self is -1

@jorisvandenbossche
Member

#1893 it wasn't that hard to find.

Having these return self also now adds a HUGE amount of complexity to the api. you have the standard inplace methods, such as insert and .loc which return None, now you are adding a different case for these inplace methods.

Thanks for that link. Quoting Wes from that issue:

At some point I'd decided it was better to have a consistent API w.r.t. return values (vs. None in the inplace=True case), e.g. whether or not inplace=True, you can always count on getting a reference back to the modified object.

So while you consider this a complication of the API, it's also a simplification in other ways. In the PDEP, we list those different advantages and disadvantages (here, although we could have added a more explicit disadvantage about returning self being atypical for an inplace method vs other inplace methods without a keyword), and make the trade-off in favor of returning self:

we think the advantages of simplifying return types and enabling methods chains outweighs the special case of returning an identical object.

From my quick reading of #1893, the main argument that is made in the issue is that returning self is not needed because they exactly use inplace=True to avoid the re-assignment. But with this PDEP, we explicitly decide to no longer consider the "no re-assignment" to be a use case of inplace=True. Sure, for those methods where we keep the keyword, you can still do it. But we don't consider that the reason for using the keyword (hence, we propose to remove the keyword in most methods. Such as in set_index, which was one of the methods that are mentioned in #1893).

It is true that this deviates from the inplace methods like insert and update (note that .loc is not a method, it's a setitem operation, that doesn't return anything), or inplace methods in the stdlib or in numpy. But a big difference is that the methods that are under discussion in this PR are not "pure" inplace methods. They all, by default, don't work inplace, and have an explicit inplace keyword (which none of the stdlib or numpy methods have), and so we can very explicitly define and document what the behaviour of this keyword is.
(I would personally also say that those few pure inplace methods we still have in pandas, like insert and pop, are more historical artifacts, and it is those methods that are inconsistent with the typical pandas API)

@Rinfore

Rinfore commented Jan 2, 2024

Does this change break a lot of uses of the pandas dataframe/series extension API?

It is argued that this reason for inplace is not adequate: "To save the result to the same variable / update the original variable (avoid the pattern of reassigning to the same variable)".

But it breaks cases such as below (toy example):

import pandas as pd


@pd.api.extensions.register_dataframe_accessor('stocks')
class StockAccessor:

    def __init__(self, df: pd.DataFrame) -> None:
        self._validate(df)
        self._df = df

    @staticmethod
    def _validate(df):
        # verify stock object has price and ticker
        if "ticker" not in df.columns or "price" not in df.columns:
            raise AttributeError("Must have 'ticker' and 'price'.")

    def drop_invalid_stocks(self):
        # mutate the wrapped DataFrame directly (relies on inplace=True)
        self._df.dropna(subset=['ticker', 'price'], inplace=True)

because doing

self._df = self._df.dropna(subset=['ticker', 'price']) within drop_invalid_stocks

only updates a reference internal to the accessor. It is very difficult to update the reference in the correct location from within an accessor. We need to be able to mutate the data frame directly.

@MarcoGorelli
Member

thanks @Rinfore - I think the idea is that you should make drop_invalid_stocks non-inplace as well, so you'd have

    def drop_invalid_stocks(self) -> pd.DataFrame:
        return self._df.dropna(subset=['ticker', 'price'])

and use it as

df = df.stocks.drop_invalid_stocks()
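
Putting the two variants side by side in a self-contained sketch may help. The accessor name `stocks_demo`, the helper `make_frame`, and both method names are made up for this illustration; only `register_dataframe_accessor` and `dropna` are real pandas API. Rebinding inside the accessor leaves the caller's DataFrame untouched, while returning the result and reassigning gives the caller the filtered frame.

```python
import pandas as pd


# "stocks_demo" is a made-up accessor name used only for this sketch.
@pd.api.extensions.register_dataframe_accessor("stocks_demo")
class StockDemoAccessor:
    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def drop_invalid_rebind(self) -> None:
        # Rebinds only the accessor's private reference; the DataFrame
        # object held by the caller is not changed.
        self._df = self._df.dropna(subset=["ticker", "price"])

    def drop_invalid(self) -> pd.DataFrame:
        # Non-inplace variant: hand the result back to the caller.
        return self._df.dropna(subset=["ticker", "price"])


def make_frame() -> pd.DataFrame:
    # One valid row, one row without a ticker, one row without a price.
    return pd.DataFrame(
        {"ticker": ["AAA", None, "CCC"], "price": [1.0, 2.0, None]}
    )


df_rebind = make_frame()
df_rebind.stocks_demo.drop_invalid_rebind()
rows_after_rebind = len(df_rebind)      # still 3 rows

df_reassign = make_frame()
df_reassign = df_reassign.stocks_demo.drop_invalid()
rows_after_reassign = len(df_reassign)  # 1 row
```

The cost of the second pattern is that every call site has to reassign, which is the concern raised about accessors that currently rely on `inplace=True`.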

Labels
Copy / view semantics inplace Relating to inplace parameter or equivalent PDEP pandas enhancement proposal
Development

Successfully merging this pull request may close these issues.

API/DEPR: Deprecate inplace parameter