
ENH: Add 'inplace' parameter to DataFrame.append() #2801

Closed
darindillon opened this issue Feb 5, 2013 · 39 comments

@darindillon

DataFrame.append() ought to have an "inplace=True" parameter to allow modifying the existing dataframe rather than copying it. This would be a big performance gain for large dataframes.

@wesm
Member

wesm commented Feb 6, 2013

It actually wouldn't, because new arrays still have to be allocated and the data copied over.
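This can be checked directly. A minimal sketch (the frames here are made up for illustration, and pd.concat stands in for append, which was later removed) showing that the result lives in freshly allocated memory:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
row = pd.DataFrame({"A": [5.0], "B": [6.0]})

# Concatenating along rows builds a brand-new block of memory;
# the original frame's buffers are not extended in place.
out = pd.concat([df, row], ignore_index=True)

print(np.shares_memory(df["A"].to_numpy(), out["A"].to_numpy()))  # False
```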

@darindillon
Author

Hmm, interesting. Well, it would be convenient to have the parameter anyway, just to simplify code (even if there's no performance boost).

@jreback
Contributor

jreback commented Feb 6, 2013

i often append to really big tables on disk (using HDFStore)

http://pandas.pydata.org/pandas-docs/stable/io.html#storing-in-table-format

@knowitnothing

Isn't it possible to pre-allocate a larger-than-initially-needed DataFrame (possibly via a parameter) and make short appends efficient? It would be nice to combine that with resizes that go beyond the immediate needs, reducing reallocations. This all seems obvious, so since I've never touched the pandas code I assume there is some impeding reason for not doing it?

The case I'm thinking about is that of data coming in real-time, and then one appends a DataFrame with a single entry to a larger one.

@jreback
Contributor

jreback commented May 8, 2013

can you give an example of how you are using this (and include some parameters that would 'simulate' what you are doing)?

as an aside, one way to possibly mitigate this is to create new frames every so often (depending on your frequency of updates), then concat them together in one fell swoop (so you are appending to only a very small frame)
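The accumulate-then-concat idea above can be sketched as follows (chunk counts and sizes are made up, and pd.concat stands in for append):

```python
import numpy as np
import pandas as pd

chunks = []  # collect the small incoming frames as they arrive
for _ in range(100):
    chunks.append(pd.DataFrame(np.random.randn(10, 2), columns=list("AB")))

# one concatenation over all chunks, instead of 100 appends that each
# reallocate and copy the whole accumulated frame
result = pd.concat(chunks, ignore_index=True)
print(result.shape)  # (1000, 2)
```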

@knowitnothing

I'm not using pandas for the case I mentioned, but I'm considering it. I guess by "an example" you mean an extended version of the last sentence in my previous comment?

So here is the extended example: the program receives live data from a given exchange. Let us restrict that to "trade" data, i.e. whenever a sell or buy order is filled on a given exchange, the program receives a message saying that a buy/sell order was filled at a given price and volume. There might be additional details, but they are irrelevant here.

Now suppose this exchange is just starting and the first trade on it just happened: create a DataFrame for it. A new trade happens: append the just-received data to the earlier DataFrame. And so on. It is very useful to resample this up-to-the-last-update DataFrame with pandas so we can apply different analyses to it in real time. It is also convenient that the DataFrame can be stored in HDF5; while not a pandas-specific feature, pandas provides an easy way to do so. It might be the case that appending data to HDF5 is fast enough for this situation, and that pandas can retrieve the appended DataFrame from storage fast enough too. I have no benchmark data for this, by the way.

@jreback
Contributor

jreback commented May 9, 2013

appending to HDF5 will be very easy to do here, to save a record of what you are doing, and you will be able to read from that HDF5 (in the same process, sequentially): e.g. you write, then read, then do your processing.

Doing this in separate processes is problematic; there is no 'locking' of the HDF5 file per se.

This still allocates memory for the entire read-back, though.

There is nothing conceptually wrong with appending to an existing frame; it has to allocate new memory, but unless you are dealing with REALLY big frames this shouldn't be a problem:

In [1]: df = DataFrame(randn(100000,2),columns=list('AB'))

In [2]: df2 = DataFrame(randn(10,2),columns=list('AB'))

In [3]: %timeit df.append(df2)
1000 loops, best of 3: 431 us per loop

I suspect your bottleneck will not be this at all, but the actual operations you want to do on the frame

In [7]: df3 = df.append(df2)

In [8]: %timeit df3.mean()
100 loops, best of 3: 3.04 ms per loop

Write your program and profile it

my favorite saying: premature optimization is the root of all evil

@knowitnothing

The dataframes can get big, but I guess it depends on what you mean by big. I have this data stored in another format, ~5 million rows right now; "importing" it into a DataFrame is a one-time heavy process, and that is fine. What worries me is reallocating for 5M + 1 rows, then 5M + 1 + 1, on each append.

If the implementation takes O(n) for something that could be amortized to O(1), then this could become a bottleneck (or maybe already is for some application, which then moved on to something else).

@jreback
Contributor

jreback commented May 9, 2013

you are much better off doing a marginal calculation anyhow

if you are adding 1 point to 5M then it doesn't affect the stats of the 5M,
so I would just calc the stats you need, write them to HDF for storage and later retrieval, and do your calc
should be much more efficient
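The marginal-calculation idea can be sketched as a running statistic that is updated from each new point alone, so the millions of existing rows are never re-scanned (the class name and values here are illustrative, not from the thread):

```python
# Running mean: O(1) work per new observation instead of O(n) per recompute.
class RunningMean:
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, x):
        self.n += 1
        self.total += x

    @property
    def mean(self):
        return self.total / self.n

rm = RunningMean()
for price in [10.0, 11.0, 12.0]:
    rm.update(price)
print(rm.mean)  # 11.0
```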

@knowitnothing

Thus my earlier point: "... It might be the case that appending data to HDF5 is fast enough for this situation ...". I would actually continuously store new data in HDF5 by appending to what I currently have. And then I would use a subset of this stored DataFrame to do the analysis.

The possible advantage of not using HDF5 is that we could guarantee all the data is in memory; otherwise we have to trust HDF5 to be good/fast enough.

@jreback
Contributor

jreback commented May 9, 2013

Here's a way to preallocate:
create the frame bigger than you need (e.g. the existing size + the expected growth)

fill in rows, incrementing your indexer (realloc if you run out of space)
calc your function on a selection <= the indexer
repeat

In [7]: df = DataFrame(index=range(5),columns=list('AB'))

In [8]: df.iloc[0] = Series(dict(A = 10, B = 5))

In [9]: df.iloc[1] = Series(dict(A = 11, B = 6))

In [10]: def f(x,indexer):
   ....:     return x.iloc[0:indexer]*2
   ....: 

In [11]: f(df,2)
Out[11]: 
    A   B
0  20  10
1  22  12

In [12]: df.iloc[2] = Series(dict(A = 12, B = 7))

In [13]: f(df,3)
Out[13]: 
    A   B
0  20  10
1  22  12
2  24  14

@jreback
Contributor

jreback commented May 9, 2013

you can do a combination of all of these approaches, you know your data and your workflow best

@knowitnothing

The problem with your prealloc example is that you know the index values; I don't know them beforehand. Can you set the index to NaN and later modify it without incurring more than constant time? Thinking about it, I guess I could use timestamp_{i-1} + 1 nanosecond for the prealloc, but I would still need to update the index when inserting the actual data. Is that possible? It would mostly address the initial suggestion.

@jreback
Contributor

jreback commented May 9, 2013

use the index like I did, add your 'index' as another column (which can be NaN, then filled in as you fill the rows), then

func(df.iloc[0:indexer].set_index('my_index'))
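Putting the two suggestions together, a possible sketch (column names and values are made up) of preallocating with a plain positional index and keeping the "real" index in a column:

```python
import pandas as pd

# Preallocate more rows than currently needed; the real key lives in a column.
df = pd.DataFrame(index=range(5), columns=["my_index", "A", "B"], dtype=float)

indexer = 0
for ts, a, b in [(1.0, 10.0, 5.0), (2.0, 11.0, 6.0)]:
    df.iloc[indexer] = [ts, a, b]  # fill the next free row in place
    indexer += 1

# only the filled rows, re-keyed on the real index, feed the calculation
view = df.iloc[0:indexer].set_index("my_index")
print(view["A"].sum())  # 21.0
```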

@knowitnothing

I will properly evaluate these suggestions, thank you :)

@jreback
Contributor

jreback commented May 9, 2013

good luck

@jreback jreback closed this as completed Sep 20, 2013
@ghost

ghost commented Jun 16, 2016

hey, "premature optimization is the root of all evil"! Awesome quote! Strange that this issue is closed and yet I get "TypeError: append() got an unexpected keyword argument 'inplace'".
I know that with scientists all variables are usually global. But if you attempt a proper software design (using methods and arguments) and you want to append to a dataframe in a callback somewhere, this breaks the design. Back to evil global variables again!

@vincent-yao27

@jreback An inplace parameter for append() is really needed in for..in loops.

for df in df_list:
  df = df.append(...) # no effects on df_list

In the case above, there are still counter-intuitive workarounds like

for idx in range(len(df_list)):
  df_list[idx] = df_list[idx].append(...)

However, in some cases it just doesn't work.

A_df_list, B_df_list = ...
df_list = A_df_list + B_df_list
for idx in range(len(df_list)):
  df_list[idx] = df_list[idx].append(...)
  # no effects on A_df_list and B_df_list

@NumesSanguis

NumesSanguis commented Dec 27, 2018

@jreback, I agree with @vincent-yao27. An inplace=True parameter would be useful in for loops when you deal with multiple dataframes.
It is even more useful when you have, e.g., a function that takes a series to append to a dataframe:

def add_ds_to_df(df, ds):
    df = df.append(ds, ignore_index=True)
    return df  # unnecessary need to return dataframe

With inplace:

def add_ds_to_df(df, ds):
    df.append(ds, ignore_index=True, inplace=True)

@zdwhite

zdwhite commented Jul 9, 2019

Why is this issue closed a year and a half on???

@Anntuanette

Anntuanette commented Jul 10, 2019

The inplace option is very much needed when you modify a table inside procedures. Writing table_var = table_var.append(..) inside a procedure def modify(table_var) will only rebind the local name table_var instead of modifying the procedure's argument. So you would really want table_var.append(.., inplace=True) here.
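The rebinding behavior described here is plain Python semantics and can be demonstrated (function names are made up; .loc setting-with-enlargement stands in for the missing inplace append, and assumes a default integer index):

```python
import pandas as pd

def modify_rebind(table_var):
    # rebinds the local name only; the caller's frame is untouched
    table_var = pd.concat([table_var, pd.DataFrame({"A": [3]})],
                          ignore_index=True)

def modify_mutate(table_var):
    # setting-with-enlargement mutates the very object the caller holds
    table_var.loc[len(table_var)] = [3]

df = pd.DataFrame({"A": [1, 2]})
modify_rebind(df)
print(len(df))  # 2 -- unchanged
modify_mutate(df)
print(len(df))  # 3 -- grew "in place" (a copy still happens under the hood)
```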

@iljya

iljya commented Mar 4, 2020

It seems quite a number of people are interested in the inplace parameter for the append method for reasons of good software design (vs. performance). Could someone from the team weigh in on the difficulty of adding this and prioritize it? Or at least reopen the issue? :)

@jreback
Contributor

jreback commented Mar 4, 2020

It seems quite a number of people are interested in the inplace parameter for the append method for reasons of good software design (vs. performance). Could someone from the team weigh-in on the difficulty of adding this and prioritize? Or at least reopen the issue? :)

how is inplace good sw design at all?

it’s completely non idiomatic, makes code very hard to read and adds magical things that are not apparent from context

we are going to remove this as soon as possible

@iljya

iljya commented Mar 4, 2020 via email

@jreback
Copy link
Contributor

jreback commented Mar 4, 2020

inplace was requested (and upvoted) for the purpose of avoiding global variables (see above), so that a function could modify a data frame in place. Avoiding global variables is what I was referring to with "good sw design".

and using global variables like that is not good design at all

@iljya

iljya commented Mar 4, 2020 via email

@jreback
Contributor

jreback commented Mar 4, 2020

you don’t need inplace to avoid globals

in any event, inplace is being deprecated

@NumesSanguis

NumesSanguis commented Mar 5, 2020

@jreback Thanks for replying. Is the stance on inplace being bad your opinion, or is it shared among the pandas team? There are some good examples above, in my opinion unrelated to globals, that argue for having inplace. Also, to me that keyword is straightforward enough that I cannot agree with the "hard to read / magic" opinion.

Has there been any public discussion about whether to drop inplace? Before your comment I was not aware that it would be deprecated.

@jreback
Contributor

jreback commented Mar 5, 2020

@NumesSanguis it is both my opinion and that of virtually all of the core team; there is an issue about the deprecation

Also, to me that keyword is straightforward enough that I cannot agree with making code hard to read / magic opinion

this is what inplace causes; the result is magical / hard to read code

@stevennic
Contributor

It seems quite a number of people are interested in the inplace parameter for the append method for reasons of good software design (vs. performance). Could someone from the team weigh-in on the difficulty of adding this and prioritize? Or at least reopen the issue? :)

how is inplace good sw design at all?

it’s completely non idiomatic, makes code very hard to read and adds magical things that are not apparent from context

we are going to remove this as soon as possible

Then why have inplace for other functions like drop?

In my opinion having an inplace parameter improves readability, just like it does for drop, regardless of any performance benefit.

What you call "magical things" I could call "a layer of abstraction".

@dleuthe

dleuthe commented Sep 25, 2020

Is there any update regarding this issue? It seems quite important judging by the upvotes; why was it closed so long ago?
inplace would be great for avoiding global variables, especially when using for..in loops.

@TomAugspurger
Contributor

We're discussing deprecating DataFrame.append in #35407.

We feel that the name doesn't accurately reflect the memory usage of the method, and would like to discourage code that's similar to some of the examples posted in this thread.

@jonas-eschle

how is inplace good sw design at all?

it’s completely non idiomatic, makes code very hard to read and adds magical things that are not apparent from context

we are going to remove this as soon as possible

sarcasm on
Good point. Let's take this to the Python core dev list and argue for removing the in-place operations on list, dict... and in general for removing mutability (as it is bad software design). Then let's remove mutability from pandas, et voilà, no more bad software design.

Then, after all these changes, we can reconsider this argument here ;)
sarcasm off

No, this is sometimes just efficient. There is software and there is software: some has such a small memory footprint that it simply doesn't matter, but some, such as scientific software, does not have that luxury. Having the option to do things in-place can be a necessity.

@mzeitlin11
Member

Good point. Let's move that to the Python core dev list and argue to remove the inplace operations on list, dicts... in general to remove mutability (as it is bad software design). Then let's remove mutability from pandas, et voilà, no bad software design anymore.

This is less about mutability being bad software design and more about it being bad design to have a parameter inplace which suggests an operation happens without a copy, when in fact it can't be guaranteed not to copy (especially in the case of append, where arrays grow, so a new allocation is necessary). The confusion mutability can cause then doesn't have any benefit over just assigning a variable to the result (since that is what the inplace option would be doing under the hood anyway).

But some, such as scientific software, has. And having the option to do things in-place can be a necessity

If inplace actually avoided a copy and saved memory, it would make sense to include it. But it will still copy anyway, so it just adds confusion: a user might think inplace saves memory, when it actually doesn't.
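The same constraint holds at the NumPy level underneath pandas: growing an array cannot reuse the old allocation, as a quick check shows.

```python
import numpy as np

a = np.arange(5)
b = np.append(a, 5)  # returns a new, larger buffer; 'a' is untouched

print(np.shares_memory(a, b))  # False
print(a.size, b.size)          # 5 6
```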

@Mahdi-Hosseinali

Mahdi-Hosseinali commented Jan 13, 2022

Mutability is not only about guaranteeing that an operation happens in place; sometimes you just need to keep the reference. For example, I have a generator that might need to update one of its inputs. It is horrible software design, but this is the case in pretty much every production environment: legacy code ends up with bad design because it has evolved. Not being able to mutate a dataframe in place (even if everything is copied behind the scenes) just makes it harder to find a workaround in these cases. It is possible with dictionaries and lists.

@zdwhite

zdwhite commented Jan 19, 2022

If inplace actually avoided a copy and saved memory it would make sense to include. But it will still copy anyway, so it just gives additional confusion since a user might think inplace saves memory when it actually doesn't.

This is interesting; I never thought it saved memory, and if it did I'd understand why it would automatically be included. I really just always thought it seemed silly to continually re-assign things for this operation, and other users have pointed out that operations which feel like the inverse don't force the same workflow. It feels intuitive for this parameter to exist, and less intuitive to always have to explicitly state the re-assignment.

Every time you think about adding to your series, dataframe, etc., you intuitively reach for the inplace parameter, despite being told thousands of times it doesn't exist in this context. Is it really that hard to do the re-assignment? No, but it's no secret that this thread is about what seems intuitive, not about whether what you want this thing to do is actually better.

I thought I should chime in as I feel like I kicked a hornet's nest over two years ago.

@stevennic
Contributor

This really seems like a clash of principles: Pythonic vs. functional programming, simplicity vs. immutability. On the Python platform I personally think the former should take precedence over the latter; the reverse would be true on a functional platform.

@Mahdi-Hosseinali

The bottom line is that there are a lot of cases where the programmer wants to keep the reference and modify the content. They might not always be in line with best practices, but they can be necessary. If the concern is that the "inplace" keyword is misleading, a more descriptive keyword could be introduced.

@smojef

smojef commented Sep 21, 2022

Can someone please provide workaround code that keeps the reference of a DataFrame object (the outer shell) while modifying its internal content (adding/modifying rows/columns)?
For example, when a module is passed a list of DataFrames to update, how can that module append rows to the existing DataFrame references (inside the list) without .append( , inplace=True)?
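One possible workaround sketch (not from the thread, and assuming each frame has a default RangeIndex): setting-with-enlargement via .loc grows a frame while keeping the same object, so references held elsewhere, e.g. inside a list, see the update. The function and variable names are made up.

```python
import pandas as pd

def append_rows(frames, rows):
    # mutates each frame the list references; no rebinding, no return needed
    for df in frames:
        for row in rows:
            df.loc[len(df)] = row  # setting-with-enlargement keeps the object

a = pd.DataFrame({"x": [1]})
b = pd.DataFrame({"x": [10]})
frames = [a, b]
append_rows(frames, [[2], [3]])
print(a["x"].tolist())  # [1, 2, 3]
print(b["x"].tolist())  # [10, 2, 3]
```

Note that each enlargement still reallocates and copies under the hood, so this preserves references, not performance.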
