ENH: Add 'inplace' parameter to DataFrame.append() #2801
DataFrame.append() ought to have an inplace=True parameter to allow modifying the existing DataFrame rather than copying it. This would be a big performance gain for large DataFrames.
Comments
It actually wouldn't, because new arrays still have to be allocated and the data copied over |
Hmm, interesting. Well, it would be convenient to have the parameter anyway, just to simplify code (even if there's no performance boost) |
i often append to really big tables on disk (using HDFStore) http://pandas.pydata.org/pandas-docs/stable/io.html#storing-in-table-format |
Isn't it possible to preallocate a larger-than-initially-needed DataFrame (possibly via a parameter) and make short appends efficient? It would be nice to combine that with resizes that go beyond the immediate needs, reducing reallocations. This all seems obvious, so since I have never touched the pandas code, I guess there is some reason preventing it? The case I'm thinking of is data coming in in real time, where one appends a DataFrame with a single entry to a larger one. |
can you give an example of how you are using this (and include some parameters that would 'simulate' what you are doing)? as an aside, a way of possibly mitigating this is to create new frames every so often (depending on your frequency of updates), then concat them together in one fell swoop, so you are only ever appending to a very small frame |
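For instance, a minimal sketch of that batching idea (the class name, column names, and batch size are illustrative, not from the thread):

```python
import pandas as pd

class TradeBuffer:
    """Accumulate incoming rows cheaply in a list; merge into the big frame in batches."""

    def __init__(self, flush_every=1000):  # hypothetical batch size
        self.df = pd.DataFrame(columns=["price", "volume"])
        self.rows = []
        self.flush_every = flush_every

    def on_trade(self, price, volume):
        self.rows.append({"price": price, "volume": volume})
        if len(self.rows) >= self.flush_every:
            # one reallocation per batch instead of one per appended row
            self.df = pd.concat(
                [self.df, pd.DataFrame(self.rows)], ignore_index=True
            )
            self.rows.clear()
```

This keeps each incoming append O(1) on a plain Python list and pays the copy cost only once per batch.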
I'm not using pandas for the case I mentioned, but I'm considering it. By "an example" I guess you mean an extended version of the last sentence of my previous comment? Here it is: the program receives live data from a given exchange. Let us restrict that to "trade" data, i.e. whenever a sell or buy order is filled on a given exchange, the program receives a message saying that an order was filled at a given price and volume. There might be additional details, but they are irrelevant here. Now, suppose this exchange is just starting and the first trade on it just happened: create a DataFrame for it. A new trade happens: append the just-received trade to the earlier DataFrame. And so on. It is very interesting to use pandas to resample this up-to-the-last-update DataFrame so we can apply different analyses to it in real time. It is also very interesting that the DataFrame can be stored in HDF5; while not a pandas feature per se, pandas provides an easy way to do so. It might be the case that appending data to HDF5 is fast enough for this situation, and that pandas can retrieve the appended DataFrame from storage fast enough too. I have no benchmark data for this, by the way. |
appending to HDF5 will be very easy to do here, to save a record of what you are doing, and you will be able to read from that HDF5 (in the same process and sequentially), e.g. you write, then read, and do your processing. Doing this in separate processes is problematic; there is no 'locking' of the HDF5 file per se. This still allocates memory for the entire read-back. There is nothing conceptually wrong with appending to an existing frame; it has to allocate new memory, but unless you are dealing with REALLY big frames, this shouldn't be a problem.
I suspect your bottleneck will not be this at all, but the actual operations you want to do on the frame.
Write your program and profile it. My favorite saying: premature optimization is the root of all evil |
The DataFrames can get big, but I guess it depends on what you mean by big. I have this data stored in another format, ~5 million rows right now; "importing" it into a DataFrame is a one-time heavy process, and that is fine. I'm worried about reallocating 5M + 1 rows, then 5M + 1 + 1, and so on, for each append. If the implementation takes O(n) for something that could be amortized to O(1), this could become a bottleneck (or maybe already is for some application, which then moved on to something else). |
you are much better off doing a marginal calculation anyhow; if you are adding 1 point to 5M, then it doesn't affect the stats of the 5M |
Thus my earlier point: "... It might be the case that appending data to HDF5 is fast enough for this situation ...". I would continuously store new data in HDF5 by appending to what I currently have, and then use a subset of this stored DataFrame to do the analysis. The possible advantage of not using HDF5 is that we could guarantee that all the data is in memory; otherwise we have to trust HDF5 to be good/fast enough. |
Here's a way to preallocate: fill in rows and increment your indexer, reallocating if you run out of space (a sketch follows below).
|
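The snippet itself is not preserved above, so the following is only a sketch of the preallocate-and-fill pattern being described (sizes and column names are made up); it doubles the allocation whenever the fill pointer runs past the end:

```python
import numpy as np
import pandas as pd

nrows = 1024  # initial preallocation (arbitrary)
df = pd.DataFrame(np.nan, index=range(nrows), columns=["price", "volume"])
indexer = 0   # next free row

def add_row(price, volume):
    global df, indexer
    if indexer >= len(df):
        # out of space: double the allocation so appends stay amortized O(1)
        extra = pd.DataFrame(np.nan, index=range(len(df)), columns=df.columns)
        df = pd.concat([df, extra], ignore_index=True)
    df.iloc[indexer] = [price, volume]
    indexer += 1
```

Consumers then work on the filled part only, e.g. `df.iloc[0:indexer]`.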
you can do a combination of all of these approaches, you know your data and your workflow best |
The problem with your prealloc example is that you know the index values; I don't know them beforehand. Can you set the index to NaN and later modify it without incurring more than constant time? Thinking about this... I guess I could use timestamp_{i-1} + 1 nanosecond for the prealloc, but I would still need to update the index when inserting actual data. Is that possible? It would mostly solve the initial suggestion. |
use the index like I did, add your 'index' as another column (which can be NaN, then filled in as you fill the rows), then func(df.iloc[0:indexer].set_index('my_index')) |
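Spelled out (a sketch continuing the preallocation example above; `func` stands in for whatever analysis you actually run):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.nan, index=range(4), columns=["price", "my_index"])
df.iloc[0] = [101.5, 1.0]  # fill the real timestamps into 'my_index' as rows arrive
df.iloc[1] = [101.7, 2.0]
indexer = 2                # two rows filled so far

def func(frame):           # stand-in for the real analysis
    return frame["price"].mean()

# hand downstream code only the filled slice, re-indexed on the fly
result = func(df.iloc[0:indexer].set_index("my_index"))
```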
I will properly evaluate these suggestions, thank you :) |
good luck |
hey "premature optimization is the root of all evil"! Awesome quote! Strange that this issue is closed and I get "TypeError: append() got an unexpected keyword argument 'inplace'". |
@jreback A case like this has no effect:

```python
for df in df_list:
    df = df.append(...)  # no effect on df_list
```

In the case above, there is still a counter-intuitive workaround:

```python
for idx in range(len(df_list)):
    df_list[idx] = df_list[idx].append(...)
```

However, in some cases it just doesn't work:

```python
A_df_list, B_df_list = ...
df_list = A_df_list + B_df_list
for idx in range(len(df_list)):
    df_list[idx] = df_list[idx].append(...)
# no effect on A_df_list and B_df_list
```
|
@jreback, I agree with @vincent-yao27. |
Why is this issue closed a year and a half on??? |
An inplace option is very much needed when you modify a table inside procedures; otherwise the append has to be written back and rebound at every call site (a sketch follows below). |
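A sketch of the kind of procedure meant here (the function name and data are illustrative, using the `append` of the time, since removed in pandas 2.0): rebinding the parameter inside the function is invisible to the caller, so the frame has to be passed back out and rebound.

```python
import pandas as pd

def log_trade(df, row):
    # rebinding the local name 'df' has no effect outside this function...
    df = df.append(row, ignore_index=True)
    return df  # ...so the result must be returned and rebound by the caller

trades = pd.DataFrame(columns=["price"])
trades = log_trade(trades, {"price": 101.5})  # caller must remember to rebind
```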
It seems quite a number of people are interested in the inplace parameter for the append method, for reasons of good software design (vs. performance). Could someone from the team weigh in on the difficulty of adding this and prioritize it? Or at least reopen the issue? :) |
how is inplace good sw design at all? it's completely non-idiomatic, makes code very hard to read, and adds magical things that are not apparent from context. we are going to remove this as soon as possible |
inplace was requested (and upvoted) for the purpose of avoiding global variables (see above), so that a function could modify a data frame in place. Avoiding global variables is what I was referring to with "good sw design".
|
and using global variables like that is not good design at all |
that's exactly my point
|
you don't need inplace to avoid globals. in any event, inplace is being deprecated |
@jreback Thanks for replying. Is the stance on inplace your own, or shared by the core team? Has there been any public discussion about whether to drop inplace? |
@NumesSanguis it is both my opinion and that of virtually all of the core team; there is an issue about deprecation.
this is what inplace causes; the result is magical / hard-to-read code |
Then why have inplace on so many other methods in the first place? In my opinion, having an inplace option here would be consistent with the rest of the API. What you call "magical things" I could call "a layer of abstraction". |
Is there any update regarding this issue? It seems quite important given the upvotes; why was it closed so long ago? |
We're discussing deprecating DataFrame.append in #35407. We feel that the name doesn't accurately reflect the memory usage of the method, and would like to discourage code that's similar to some of the examples posted in this thread. |
sarcasm on: Then, after all these changes, we can reconsider this argument here ;) No, this is sometimes just efficient. There is software and there is software: some has such a small memory footprint that it simply doesn't matter, but for some, such as scientific software, it does. And having the option to do things in place can be a necessity |
This is less about mutability being bad software design and more about it being bad software design to have a parameter that changes what a method does and returns. |
Mutability is not always idiomatically wrong just because it gives the assurance that things happen in place. For example, I have a generator that might need to update one of its inputs. It is horrible software design, but this is the case in pretty much every production environment: some legacy code ends up with bad design because the code has evolved. Not being able to mutate that dataframe in place (even with everything being copied behind the scenes) just makes it harder to find a workaround in these cases. This is possible with dictionaries and lists. |
This is interesting; I never thought about whether it saved memory. If it did, I'd understand why it would automatically be included. I really just always thought it seemed silly to continually reassign things for this operation, and other users have pointed out that operations which feel like the inverse don't force the same workflow. It feels intuitive for this parameter to exist, and less intuitive to always have to explicitly state the reassignment. Every time you are thinking about adding to your Series, DataFrame, etc., you intuitively reach for something in-place. I thought I should chime in, as I feel like I kicked a hornet's nest over two years ago. |
This really seems like a clash of principles: Pythonic vs. Functional programming. Simplicity vs. Immutability. On the Python platform, I personally think the former should take precedence over the latter. The reverse would be true on a Functional platform. |
The bottom line is that there are a lot of cases where the programmer wants to keep the reference and modify the content. They might not always be in line with best practices, but they can be necessary. If the concern is that the "inplace" keyword is misleading, a more descriptive keyword could be introduced. |
Can someone please provide workaround code that keeps the reference of a DataFrame object (the outer shell) while modifying its internal content (adding/modifying rows/columns)? |
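One possible workaround, not from this thread: pandas' setting-with-enlargement through `.loc` adds rows without rebinding the variable, so every reference to the object sees the change (memory is still reallocated internally; only the Python object identity is preserved). A minimal sketch, assuming a default RangeIndex:

```python
import pandas as pd

df = pd.DataFrame({"price": [101.5]})
alias = df                 # a second reference to the same object

df.loc[len(df)] = [101.7]  # setting with enlargement: appends a row "in place"
df["volume"] = [10, 20]    # adding a column also keeps the same object

assert alias is df         # the outer shell is unchanged
print(alias)               # the alias sees both modifications
```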