read_csv using C engine and chunksize can grow memory usage exponentially in 0.24.0rc1 #24805
Comments
cc @gfyoung
@PMeira : Thanks for reporting this! I can confirm this error as well. Did your code example previously work with 0.23.4?
Code from the OP works for me on 0.23.4.
@gfyoung Yes, it worked fine with version 0.23.4. I first noticed it from a failing test in another module that I'm trying to port to recent versions of Pandas.
@PMeira @h-vetinari : Thanks for looking into this! Sounds like we have a regression on our hands...
@gfyoung do you have time to do this for 0.24.0? I don't think we have a release date set yet, but sometime in the next week or so?
@TomAugspurger : Yep, I'm going to look into this on the weekend.
@PMeira : Your observations are validated by what happens behind the scenes, as your numbers produce a snowball effect that causes the memory allocation to double with every iteration of reading. It is indeed an edge case, as your numbers work just perfectly to make the allocated memory land exactly on powers of 2. In fact, your "smaller example" fails for me for that reason on my local machine. I think I have a patch for this that prevents the memory usage from growing exponentially, but I need to test to make sure I didn't break anything else with it.
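For illustration, here is a toy model of the snowball described in this comment. It is not the actual tokenizer code; the variable names (`words_cap`), the growth rule, and the way the request is computed are assumptions chosen only to show how requests that land exactly on powers of 2 can re-trigger doubling on every chunk:

```python
CHUNKSIZE = 1024          # words per chunk (an exact power of two)
NUM_CHUNKS = 20

words_cap = CHUNKSIZE     # buffer capacity carried over between chunks

for chunk_index in range(NUM_CHUNKS):
    request = words_cap   # reservation derived from the old capacity,
                          # not from the words actually in the chunk
    while words_cap <= request:   # "<=": an exact power-of-two hit still doubles
        words_cap *= 2
    print(f"chunk {chunk_index:2d}: capacity = {words_cap:>15,} words")
```

The `<=` in the growth check is the "powers of 2" part: a request that lands exactly on the current capacity would already fit, yet it still triggers a doubling, and the next request is then derived from that doubled capacity, so in this model the buffer doubles on every chunk instead of staying flat.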
The edge case where we hit powers of 2 every time during allocation can be painful. Closes pandas-devgh-24805. xref pandas-devgh-23527.
@gfyoung Thank you for the patch! I reran the affected code with the patch and it now runs without issues for several chunksize values. It seems the edge case is really the number of chunks: since memory basically doubled while processing each chunk, a large(ish) number of chunks would exhaust the memory resources. E.g. this presented the same issue but works fine now with the patched version:

```python
import pandas as pd

CHUNKSIZE = 10_000_000
NUM_ROWS = CHUNKSIZE * 20

with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))

for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c', low_memory=False)):
    print(chunk_index)
```
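One way to double-check this quantitatively (a suggestion, not something from the original thread) is to print the process's peak RSS after each chunk: on an affected build it climbs steadily, while on a fixed build it stays roughly flat. Note that `resource` is Unix-only, and `ru_maxrss` is reported in KiB on Linux but in bytes on macOS:

```python
import resource

import pandas as pd

CHUNKSIZE = 10_000_000

# peak resident set size of this process, as reported by getrusage;
# a steadily climbing value across chunks indicates the growth bug
for chunk_index, chunk in enumerate(
    pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c', low_memory=False)
):
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(chunk_index, peak)
```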
* Fix memory growth bug in read_csv: The edge case where we hit powers of 2 every time during allocation can be painful. Closes pandas-devgh-24805. xref pandas-devgh-23527.
* TST: Add ASV benchmark for issue
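For reference, an ASV benchmark for this scenario could look roughly like the sketch below. The class and method names, file location, and row counts are assumptions rather than the exact benchmark added in the PR; `peakmem_`-prefixed methods ask ASV to record the peak memory of the process while the method runs.

```python
import os

import pandas as pd


class ReadCSVMemoryGrowth:
    """Benchmark peak memory of chunked read_csv (illustrative sketch)."""

    chunksize = 20
    num_rows = 1000
    fname = "__chunked_read_benchmark__.csv"

    def setup(self):
        # write a single-column CSV with num_rows integer rows
        with open(self.fname, "w") as f:
            for i in range(self.num_rows):
                f.write(f"{i}\n")

    def teardown(self):
        os.remove(self.fname)

    def peakmem_chunked_read(self):
        # iterate all chunks; ASV records the peak memory reached here
        for chunk in pd.read_csv(
            self.fname, chunksize=self.chunksize, engine="c", low_memory=False
        ):
            pass
```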
Code Sample
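A minimal reproducer consistent with the description below; the exact values here are illustrative (the original report used a very small chunksize to make the growth obvious):

```python
import pandas as pd

NUM_ROWS = 1_000_000
CHUNKSIZE = 1024

# build a single-column CSV of integers
with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))

# on an affected build (0.24.0rc1, C engine), memory grows with every chunk
for chunk_index, chunk in enumerate(
    pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')
):
    print(chunk_index)
```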
Problem description
In v0.24.0rc1, using `chunksize` in `pandas.read_csv` with the C engine causes exponential memory growth (`engine='python'` works fine). The code sample I listed uses a very small chunksize to better illustrate the issue, but the issue happens with more realistic values like `NUM_ROWS = 1000000` and `CHUNKSIZE = 1024`. The `low_memory` parameter in `pd.read_csv()` doesn't affect the behavior.

On Windows, the process becomes very slow as memory usage grows. On Linux, an out-of-memory exception is raised after some chunks are processed and the buffer length grows too much. For example:
I tried to debug the C code from the tokenizer as of 0bd454c. The unexpected behavior seems present since 011b79f which introduces these lines (and other changes) to fix #23509:
`pandas/pandas/_libs/src/parser/tokenizer.c`, lines 294 to 306 in 0bd454c
I'm not familiar with the code, so I could be misinterpreting it, but I believe that code block, coupled with how `self->words_cap` and `self->max_words_cap` are handled, could be the source of the issue. There are some potentially misleading variable names like `nbytes`, which seems to refer to a number of bytes that is later interpreted as `nbytes` tokens -- I couldn't follow what's happening, but hopefully this report helps.

It seems the issue could also be related to #16537 and #21516, but the specific changes that cause it are newer, not present in previous releases.
Expected Output
Output of `pd.show_versions()`