
ENH: Add 'inplace' parameter to DataFrame.append() #2801

Closed
darindillon opened this issue Feb 5, 2013 · 39 comments

@darindillon

DataFrame.append() ought to have an "inplace=True" parameter to allow modifying the existing dataframe rather than copying it. This would be a big performance gain for large dataframes.

@wesm
Member

wesm commented Feb 6, 2013

It actually wouldn't, because new arrays still have to be allocated and the data copied over.
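This can be checked directly. A minimal sketch (the frames here are made up for illustration, and pd.concat stands in for append, which was later removed) showing that the result lives in freshly allocated memory:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
row = pd.DataFrame({"A": [5.0], "B": [6.0]})

# Concatenating along rows builds a brand-new block of memory;
# the original frame's buffers are not extended in place.
out = pd.concat([df, row], ignore_index=True)

print(np.shares_memory(df["A"].to_numpy(), out["A"].to_numpy()))  # False
```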

@darindillon
Author

Hmm, interesting. Well, it would be convenient to have the parameter anyway, just to simplify code (even if there's no performance boost).

@jreback
Contributor

jreback commented Feb 6, 2013

i often append to really big tables on disk (using HDFStore)

http://pandas.pydata.org/pandas-docs/stable/io.html#storing-in-table-format

@knowitnothing

Isn't it possible to pre-allocate a larger-than-initially-needed DataFrame (possibly via a parameter) and make short appends efficient? It would be nice to combine that with resizes that go beyond the immediate needs, reducing reallocations. This all seems obvious, so since I've never touched the pandas code I assume there is some impeding reason for not doing it?

The case I'm thinking about is that of data coming in real-time, and then one appends a DataFrame with a single entry to a larger one.

@jreback
Contributor

jreback commented May 8, 2013

can you give an example of how you are using this (and include some parameters that would 'simulate' what you are doing)?

as an aside, one way to possibly mitigate this is to create new frames every so often (depending on your frequency of updates), then concat them together in one fell swoop (so you are appending to only a very small frame)
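The accumulate-then-concat idea above can be sketched as follows (chunk counts and sizes are made up, and pd.concat stands in for append):

```python
import numpy as np
import pandas as pd

chunks = []  # collect the small incoming frames as they arrive
for _ in range(100):
    chunks.append(pd.DataFrame(np.random.randn(10, 2), columns=list("AB")))

# one concatenation over all chunks, instead of 100 appends that each
# reallocate and copy the whole accumulated frame
result = pd.concat(chunks, ignore_index=True)
print(result.shape)  # (1000, 2)
```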

@knowitnothing

I'm not using pandas for the case I mentioned, but I'm considering it. I guess by "an example" you mean an extended version of the last sentence in my previous comment?

So here is the extended example: the program receives live data from a given exchange. Let us restrict that to "trade" data, i.e. whenever a sell or buy order is filled on a given exchange, the program receives a message saying that a buy/sell order was filled at a given price and volume. There might be additional details, but they are irrelevant here.

Now suppose this exchange is just starting and the first trade on it just happened: create a DataFrame for it. A new trade happens: append the just-received data to the earlier DataFrame. And so on. It is very useful to resample this up-to-the-last-update DataFrame with pandas so we can apply different analyses to it in real time. It is also convenient that the DataFrame can be stored in HDF5; while not a pandas-specific feature, pandas provides an easy way to do so. It might be the case that appending data to HDF5 is fast enough for this situation, and that pandas can retrieve the appended DataFrame from storage fast enough too. I have no benchmark data for this, by the way.

@jreback
Contributor

jreback commented May 9, 2013

appending to HDF5 will be very easy to do here, to save a record of what you are doing, and you will be able to read from that HDF5 (in the same process, sequentially): e.g. you write, then read, then do your processing.

Doing this in separate processes is problematic; there is no 'locking' of the HDF5 file per se.

This still allocates memory for the entire read-back, though.

There is nothing conceptually wrong with appending to an existing frame; it has to allocate new memory, but unless you are dealing with REALLY big frames this shouldn't be a problem:

In [1]: df = DataFrame(randn(100000,2),columns=list('AB'))

In [2]: df2 = DataFrame(randn(10,2),columns=list('AB'))

In [3]: %timeit df.append(df2)
1000 loops, best of 3: 431 us per loop

I suspect your bottleneck will not be this at all, but the actual operations you want to do on the frame

In [7]: df3 = df.append(df2)

In [8]: %timeit df3.mean()
100 loops, best of 3: 3.04 ms per loop

Write your program and profile it

my favorite saying: premature optimization is the root of all evil

@knowitnothing

The dataframes can get big, but I guess it depends on what you mean by big. I have this data stored in another format, ~5 million rows right now; "importing" it into a DataFrame is a one-time heavy process, and that is fine. What worries me is reallocating for 5M + 1 rows, then 5M + 1 + 1, on each append.

If the implementation takes O(n) for something that could be amortized to O(1), then this could become a bottleneck (or maybe already is for some application, which then moved on to something else).

@jreback
Contributor

jreback commented May 9, 2013

you are much better off doing a marginal calculation anyhow

if you are adding 1 point to 5M then it doesn't affect the stats of the 5M,
so I would just calc the stats you need, write them to HDF for storage and later retrieval, and do your calc
should be much more efficient
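The marginal-calculation idea can be sketched as a running statistic that is updated from each new point alone, so the millions of existing rows are never re-scanned (the class name and values here are illustrative, not from the thread):

```python
# Running mean: O(1) work per new observation instead of O(n) per recompute.
class RunningMean:
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, x):
        self.n += 1
        self.total += x

    @property
    def mean(self):
        return self.total / self.n

rm = RunningMean()
for price in [10.0, 11.0, 12.0]:
    rm.update(price)
print(rm.mean)  # 11.0
```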

@knowitnothing

Thus my earlier point: "... It might be the case that appending data to HDF5 is fast enough for this situation ...". I would actually continuously store new data in HDF5 by appending to what I currently have. And then I would use a subset of this stored DataFrame to do the analysis.

The possible advantage of not using HDF5 is that we could guarantee all the data is in memory; otherwise we have to trust HDF5 to be good/fast enough.

@jreback
Contributor

jreback commented May 9, 2013

Here's a way to preallocate:
create the frame bigger than you need (e.g. the existing size + the expected growth)

fill in rows, incrementing your indexer (realloc if you run out of space)
calc your function on a selection <= the indexer
repeat

In [7]: df = DataFrame(index=range(5),columns=list('AB'))

In [8]: df.iloc[0] = Series(dict(A = 10, B = 5))

In [9]: df.iloc[1] = Series(dict(A = 11, B = 6))

In [10]: def f(x,indexer):
   ....:     return x.iloc[0:indexer]*2
   ....: 

In [11]: f(df,2)
Out[11]: 
    A   B
0  20  10
1  22  12

In [12]: df.iloc[2] = Series(dict(A = 12, B = 7))

In [13]: f(df,3)
Out[13]: 
    A   B
0  20  10
1  22  12
2  24  14

@jreback
Contributor

jreback commented May 9, 2013

you can do a combination of all of these approaches, you know your data and your workflow best

@knowitnothing

The problem with your prealloc example is that you know the index values; I don't know them beforehand. Can you set the index to NaN and later modify it without incurring more than constant time? Thinking about it, I guess I could use timestamp_{i-1} + 1 nanosecond for the prealloc, but I would still need to update the index when inserting the actual data. Is that possible? It would mostly address the initial suggestion.

@jreback
Contributor

jreback commented May 9, 2013

use the index like I did, add your 'index' as another column (which can be NaN, then filled in as you fill the rows), then

func(df.iloc[0:indexer].set_index('my_index'))
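Putting the two suggestions together, a possible sketch (column names and values are made up) of preallocating with a plain positional index and keeping the "real" index in a column:

```python
import pandas as pd

# Preallocate more rows than currently needed; the real key lives in a column.
df = pd.DataFrame(index=range(5), columns=["my_index", "A", "B"], dtype=float)

indexer = 0
for ts, a, b in [(1.0, 10.0, 5.0), (2.0, 11.0, 6.0)]:
    df.iloc[indexer] = [ts, a, b]  # fill the next free row in place
    indexer += 1

# only the filled rows, re-keyed on the real index, feed the calculation
view = df.iloc[0:indexer].set_index("my_index")
print(view["A"].sum())  # 21.0
```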

@knowitnothing

I will properly evaluate these suggestions, thank you :)

@jreback
Contributor

jreback commented May 9, 2013

good luck

@jreback jreback closed this as completed Sep 20, 2013
@ghost

ghost commented Jun 16, 2016

hey, "premature optimization is the root of all evil"! Awesome quote! Strange that this issue is closed and yet I get "TypeError: append() got an unexpected keyword argument 'inplace'".
I know that with scientists all variables are usually global. But if you attempt a proper software design (using methods and arguments) and you want to append to a dataframe in a callback somewhere, this breaks the design. Back to evil global variables again!

@vincent-yao27

@jreback An inplace parameter for append() is really needed in for..in loops.

for df in df_list:
  df = df.append(...) # no effects on df_list

In the case above, there are still counter-intuitive workarounds like

for idx in range(len(df_list)):
  df_list[idx] = df_list[idx].append(...)

However, in some cases it just doesn't work.

A_df_list, B_df_list = ...
df_list = A_df_list + B_df_list
for idx in range(len(df_list)):
  df_list[idx] = df_list[idx].append(...)
  # no effects on A_df_list and B_df_list

@NumesSanguis

NumesSanguis commented Dec 27, 2018

@jreback, I agree with @vincent-yao27. An inplace=True parameter would be useful in for loops when you deal with multiple dataframes.
It is even more useful when you have, e.g., a function that takes a series to append to a dataframe:

def add_ds_to_df(df, ds):
    df = df.append(ds, ignore_index=True)
    return df  # unnecessary need to return dataframe

With inplace:

def add_ds_to_df(df, ds):
    df.append(ds, ignore_index=True, inplace=True)

@zdwhite

zdwhite commented Jul 9, 2019

Why is this issue closed a year and a half on???

@Anntuanette

Anntuanette commented Jul 10, 2019

The inplace option is very much needed when you modify a table inside procedures. Writing table_var = table_var.append(..) inside a procedure def modify(table_var) will only rebind the local name table_var instead of modifying the procedure's argument. So you would really want table_var.append(.., inplace=True) here.
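The rebinding behavior described here is plain Python semantics and can be demonstrated (function names are made up; .loc setting-with-enlargement stands in for the missing inplace append, and assumes a default integer index):

```python
import pandas as pd

def modify_rebind(table_var):
    # rebinds the local name only; the caller's frame is untouched
    table_var = pd.concat([table_var, pd.DataFrame({"A": [3]})],
                          ignore_index=True)

def modify_mutate(table_var):
    # setting-with-enlargement mutates the very object the caller holds
    table_var.loc[len(table_var)] = [3]

df = pd.DataFrame({"A": [1, 2]})
modify_rebind(df)
print(len(df))  # 2 -- unchanged
modify_mutate(df)
print(len(df))  # 3 -- grew "in place" (a copy still happens under the hood)
```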

@iljya

iljya commented Mar 4, 2020

It seems quite a number of people are interested in the inplace parameter for the append method for reasons of good software design (vs. performance). Could someone from the team weigh in on the difficulty of adding this and prioritize it? Or at least reopen the issue? :)

@jreback
Contributor

jreback commented Mar 4, 2020

It seems quite a number of people are interested in the inplace parameter for the append method for reasons of good software design (vs. performance). Could someone from the team weigh-in on the difficulty of adding this and prioritize? Or at least reopen the issue? :)

how is inplace good sw design at all?

it’s completely non idiomatic, makes code very hard to read and adds magical things that are not apparent from context

we are going to remove this as soon as possible

@iljya

iljya commented Mar 4, 2020 via email

@jreback
Copy link
Contributor

jreback commented Mar 4, 2020

inplace was requested (and upvoted) for the purpose of avoiding global variables (see above), so that a function could modify a data frame in place. Avoiding global variables is what I was referring to with "good sw design".

and using global variables like that is not good design at all

@iljya

iljya commented Mar 4, 2020 via email

@jreback
Contributor

jreback commented Mar 4, 2020

you don’t need inplace to avoid globals

in any event, inplace is being deprecated

@NumesSanguis

NumesSanguis commented Mar 5, 2020

@jreback Thanks for replying. Is the stance on inplace being bad your opinion, or is it shared among the pandas team? There are some good examples above, in my opinion unrelated to globals, that argue for having inplace. Also, to me that keyword is straightforward enough that I cannot agree with the "hard to read / magic" opinion.

Has there been any public discussion about whether to drop inplace? Before your comment I was not aware that it would be deprecated.

@jreback
Contributor

jreback commented Mar 5, 2020

@NumesSanguis it is both my opinion and that of virtually all of the core team; there is an issue about the deprecation

Also, to me that keyword is straightforward enough that I cannot agree with making code hard to read / magic opinion

this is what inplace causes; the result is magical / hard to read code

@stevennic
Contributor

It seems quite a number of people are interested in the inplace parameter for the append method for reasons of good software design (vs. performance). Could someone from the team weigh-in on the difficulty of adding this and prioritize? Or at least reopen the issue? :)

how is inplace good sw design at all?

it’s completely non idiomatic, makes code very hard to read and adds magical things that are not apparent from context

we are going to remove this as soon as possible

Then why have inplace for other functions like drop?

In my opinion having an inplace parameter improves readability, just like it does for drop, regardless of any performance benefit.

What you call "magical things" I could call "a layer of abstraction".

@dleuthe

dleuthe commented Sep 25, 2020

Is there any update regarding this issue? It seems quite important judging by the upvotes; why was it closed so long ago?
inplace would be great for avoiding global variables, especially when using for..in loops.

@TomAugspurger
Contributor

We're discussing deprecating DataFrame.append in #35407.

We feel that the name doesn't accurately reflect the memory usage of the method, and would like to discourage code that's similar to some of the examples posted in this thread.

@jonas-eschle

how is inplace good sw design at all?

it’s completely non idiomatic, makes code very hard to read and adds magical things that are not apparent from context

we are going to remove this as soon as possible

sarcasm on
Good point. Let's take this to the Python core dev list and argue for removing the in-place operations on list, dict... and in general for removing mutability (as it is bad software design). Then let's remove mutability from pandas, et voilà, no more bad software design.

Then, after all these changes, we can reconsider this argument here ;)
sarcasm off

No, this is sometimes just efficient. There is software and there is software: some has such a small memory footprint that it simply doesn't matter, but some, such as scientific software, does not have that luxury. Having the option to do things in-place can be a necessity.

@mzeitlin11
Member

Good point. Let's move that to the Python core dev list and argue to remove the inplace operations on list, dicts... in general to remove mutability (as it is bad software design). Then let's remove mutability from pandas, et voilà, no bad software design anymore.

This is less about mutability being bad software design and more about it being bad design to have a parameter inplace which suggests an operation happens without a copy, when in fact it can't be guaranteed not to copy (especially in the case of append, where arrays grow, so a new allocation is necessary). The confusion mutability can cause then doesn't have any benefit over just assigning a variable to the result (since that is what the inplace option would be doing under the hood anyway).

But some, such as scientific software, has. And having the option to do things in-place can be a necessity

If inplace actually avoided a copy and saved memory, it would make sense to include it. But it will still copy anyway, so it just adds confusion: a user might think inplace saves memory, when it actually doesn't.
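The same constraint holds at the NumPy level underneath pandas: growing an array cannot reuse the old allocation, as a quick check shows.

```python
import numpy as np

a = np.arange(5)
b = np.append(a, 5)  # returns a new, larger buffer; 'a' is untouched

print(np.shares_memory(a, b))  # False
print(a.size, b.size)          # 5 6
```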

@Mahdi-Hosseinali

Mahdi-Hosseinali commented Jan 13, 2022

Mutability is not only about guaranteeing that an operation happens in place; sometimes you just need to keep the reference. For example, I have a generator that might need to update one of its inputs. It is horrible software design, but this is the case in pretty much every production environment: legacy code ends up with bad design because it has evolved. Not being able to mutate a dataframe in place (even if everything is copied behind the scenes) just makes it harder to find a workaround in these cases. It is possible with dictionaries and lists.

@zdwhite

zdwhite commented Jan 19, 2022

If inplace actually avoided a copy and saved memory it would make sense to include. But it will still copy anyway, so it just gives additional confusion since a user might think inplace saves memory when it actually doesn't.

This is interesting; I never thought it saved memory, and if it did I'd understand why it would automatically be included. I really just always thought it seemed silly to continually re-assign things for this operation, and other users have pointed out that operations which feel like the inverse don't force the same workflow. It feels intuitive for this parameter to exist, and less intuitive to always have to explicitly state the re-assignment.

Every time you think about adding to your series, dataframe, etc., you intuitively reach for the inplace parameter, despite being told thousands of times it doesn't exist in this context. Is it really that hard to do the re-assignment? No, but it's no secret that this thread is about what seems intuitive, not about whether what you want this thing to do is actually better.

I thought I should chime in as I feel like I kicked a hornet's nest over two years ago.

@stevennic
Contributor

This really seems like a clash of principles: Pythonic vs. functional programming, simplicity vs. immutability. On the Python platform I personally think the former should take precedence over the latter; the reverse would be true on a functional platform.

@Mahdi-Hosseinali

The bottom line is that there are a lot of cases where the programmer wants to keep the reference and modify the content. They might not always be in line with best practices, but they can be necessary. If the concern is that the "inplace" keyword is misleading, a more descriptive keyword could be introduced.

@smojef

smojef commented Sep 21, 2022

Can someone please provide workaround code that keeps the reference of a DataFrame object (the outer shell) while modifying its internal content (adding/modifying rows/columns)?
For example, when a module is passed a list of DataFrames to update, how can that module append rows to the existing DataFrame references (inside the list) without .append( , inplace=True)?
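One possible workaround sketch (not from the thread, and assuming each frame has a default RangeIndex): setting-with-enlargement via .loc grows a frame while keeping the same object, so references held elsewhere, e.g. inside a list, see the update. The function and variable names are made up.

```python
import pandas as pd

def append_rows(frames, rows):
    # mutates each frame the list references; no rebinding, no return needed
    for df in frames:
        for row in rows:
            df.loc[len(df)] = row  # setting-with-enlargement keeps the object

a = pd.DataFrame({"x": [1]})
b = pd.DataFrame({"x": [10]})
frames = [a, b]
append_rows(frames, [[2], [3]])
print(a["x"].tolist())  # [1, 2, 3]
print(b["x"].tolist())  # [10, 2, 3]
```

Note that each enlargement still reallocates and copies under the hood, so this preserves references, not performance.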
