
read_csv using C engine and chunksize can grow memory usage exponentially in 0.24.0rc1 #24805

Closed

PMeira opened this issue on Jan 16, 2019 · 9 comments

Labels: IO CSV (read_csv, to_csv), Regression (functionality that used to work in a prior pandas version)
Milestone: 0.24.0

Comments

PMeira commented Jan 16, 2019

Code Sample

import pandas as pd

NUM_ROWS = 1000
CHUNKSIZE = 20

with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))
        
for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
    print(chunk_index)

Problem description

In v0.24.0rc1, using chunksize in pandas.read_csv with the C engine causes exponential memory growth (engine='python' works fine).

The code sample above uses a very small chunksize to better illustrate the issue, but it also happens with more realistic values such as NUM_ROWS = 1000000 and CHUNKSIZE = 1024. The low_memory parameter of pd.read_csv() doesn't affect the behavior.
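
As noted above, engine='python' works fine, so forcing the Python engine is a possible stopgap until the C parser is fixed (a sketch based on the sample above; the Python engine is generally slower):

import pandas as pd

CHUNKSIZE = 20

# Workaround sketch (not a fix): the pure-Python parser avoids the runaway
# memory growth seen with engine='c', at the cost of slower parsing.
for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='python')):
    print(chunk_index)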

On Windows, the process becomes very slow as memory usage grows.

On Linux, an out-of-memory exception is raised after some chunks are processed and the buffer length grows too much. For example:

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Traceback (most recent call last):
  File "test_csv.py", line 10, in <module>
    for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1110, in __next__
    return self.get_chunk()
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1168, in get_chunk
    return self.read(nrows=size)
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1134, in read
    ret = self._engine.read(nrows)
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1977, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 920, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 962, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 949, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2166, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory

I tried to debug the C code from the tokenizer as of 0bd454c. The unexpected behavior seems to be present since 011b79f, which introduced these lines (among other changes) to fix #23509:

/**
 * If we are reading in chunks, we need to be aware of the maximum number
 * of words we have seen in previous chunks (self->max_words_cap), so
 * that way, we can properly allocate when reading subsequent ones.
 *
 * Otherwise, we risk a buffer overflow if we mistakenly under-allocate
 * just because a recent chunk did not have as many words.
 */
if (self->words_len + nbytes < self->max_words_cap) {
    length = self->max_words_cap - nbytes;
} else {
    length = self->words_len;
}

I'm not familiar with the code, so I could be misinterpreting it, but I believe that code block, coupled with how self->words_cap and self->max_words_cap are handled, could be the source of the issue. Some of the variable names are potentially misleading -- for example, nbytes seems to refer to a number of bytes that is later interpreted as a count of nbytes tokens. I couldn't fully follow what's happening, but hopefully this report helps.
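
To make the suspicion a bit more concrete, here is a rough Python sketch of the feedback loop I think might be happening. The names are borrowed from tokenizer.c, but the logic is a simplified guess rather than the actual implementation:

# Simplified guess at the suspected feedback loop -- NOT the actual pandas code.
CHUNK_TOKENS = 20              # tokens parsed per chunk (CHUNKSIZE from the sample)

words_cap = 0                  # current capacity of the words buffer
max_words_cap = 0              # largest capacity seen in any previous chunk

for chunk in range(10):
    words_len = CHUNK_TOKENS
    nbytes = CHUNK_TOKENS      # value the C code passes as the "space" argument

    # The block quoted above:
    if words_len + nbytes < max_words_cap:
        length = max_words_cap - nbytes
    else:
        length = words_len

    # grow_buffer-style doubling: keep doubling while length + nbytes >= cap.
    cap = words_cap
    while length + nbytes >= cap:
        cap = cap * 2 if cap else 2
    words_cap = cap

    # End of chunk: remember the largest capacity ever allocated, then trim
    # the buffer back down to roughly what this chunk actually used.
    max_words_cap = max(max_words_cap, words_cap)
    words_cap = words_len

    print('chunk {}: capacity grown to {}'.format(chunk, max_words_cap))

If this guess is anywhere near right, the capacity carried into each new chunk roughly doubles even though every chunk only needs about 40 slots in this toy model, which would fit the out-of-memory failure after a couple dozen chunks regardless of the file size.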

It seems the issue could also be related to #16537 and #21516, but the specific changes that cause it are newer and not present in previous releases.

Expected Output

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.25-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0rc1
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

WillAyd commented Jan 16, 2019

cc @gfyoung

WillAyd added the IO CSV (read_csv, to_csv) label on Jan 16, 2019

gfyoung commented Jan 16, 2019

@PMeira : Thanks for reporting this! I can confirm this error as well.

Did your code example previously work with 0.23.4 by any chance?

h-vetinari commented

Code from the OP works for me on 0.23.4.


PMeira commented Jan 16, 2019

@gfyoung Yes, it worked fine with version 0.23.4, just checked to be sure.

I first noticed it from a failing test in another module that I'm trying to port to recent versions of Pandas.


gfyoung commented Jan 16, 2019

@PMeira @h-vetinari : Thanks for looking into this! Sounds like we have a regression on our hands...

gfyoung added the Regression (functionality that used to work in a prior pandas version) label on Jan 16, 2019
TomAugspurger commented

@gfyoung do you have time to do this for 0.24.0? I don't think we have a release date set yet, but sometime in the next week or so?


gfyoung commented Jan 18, 2019

@TomAugspurger : Yep, I'm going to look into this on the weekend.

gfyoung added this to the 0.24.0 milestone on Jan 18, 2019

gfyoung commented Jan 19, 2019

@PMeira : Your observations are validated by what happens behind the scenes, as your numbers produce a snowball effect that causes the memory allocation to double with every iteration of reading.

It is indeed an edge case: your numbers line up just perfectly to make the allocated memory hit powers of 2 every time. In fact, your "smaller example" fails for me on my local machine for that very reason.

I think I have a patch for this that prevents the memory usage from growing exponentially, but I need to test to make sure I didn't break anything else with it.
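
In case it helps anyone reproduce or verify, a quick way to watch the per-chunk memory is something like the following (this assumes psutil is installed; it is not part of pandas itself):

import os
import psutil          # third-party package, assumed to be installed for this check
import pandas as pd

proc = psutil.Process(os.getpid())
for i, chunk in enumerate(pd.read_csv('test.csv', chunksize=20, engine='c')):
    rss_mib = proc.memory_info().rss / 2**20
    print('chunk {}: {:.1f} MiB resident'.format(i, rss_mib))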

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 19, 2019
The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.
gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 19, 2019
The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.

PMeira commented Jan 19, 2019

@gfyoung Thank you for the patch! I reran the affected code with the patch and it now runs without issues for several chunksize values.

It seems the edge case is really the number of chunks: since the memory basically doubled while processing each chunk, a large(ish) number of chunks would exhaust the available memory. For example, this presented the same issue before but works fine now with the patched tokenizer.c:

import pandas as pd

CHUNKSIZE = 10_000_000
NUM_ROWS = CHUNKSIZE * 20

with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))

for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c', low_memory=False)):
    print(chunk_index)

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 20, 2019
The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.
jreback pushed a commit that referenced this issue Jan 20, 2019
* Fix memory growth bug in read_csv

The edge case where we hit powers of 2
every time during allocation can be painful.

Closes gh-24805.

xref gh-23527.

* TST: Add ASV benchmark for issue
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
* Fix memory growth bug in read_csv

The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.

* TST: Add ASV benchmark for issue