
read_csv using C engine and chunksize can grow memory usage exponentially in 0.24.0rc1 #24805

Closed

PMeira opened this issue on Jan 16, 2019 · 9 comments

Labels: IO CSV (read_csv, to_csv), Regression (functionality that used to work in a prior pandas version)
Milestone: 0.24.0

Comments

PMeira commented Jan 16, 2019

Code Sample

import pandas as pd

NUM_ROWS = 1000
CHUNKSIZE = 20

with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))
        
for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
    print(chunk_index)

Problem description

In v0.24.0rc1, using chunksize in pandas.read_csv with the C engine causes exponential memory growth (engine='python' works fine).

The code sample above uses a very small chunksize to better illustrate the issue, but it also happens with more realistic values such as NUM_ROWS = 1000000 and CHUNKSIZE = 1024. The low_memory parameter of pd.read_csv() doesn't affect the behavior.
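
As noted above, engine='python' works fine, so forcing the Python engine is a possible stopgap until the C parser is fixed (a sketch based on the sample above; the Python engine is generally slower):

import pandas as pd

CHUNKSIZE = 20

# Workaround sketch (not a fix): the pure-Python parser avoids the runaway
# memory growth seen with engine='c', at the cost of slower parsing.
for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='python')):
    print(chunk_index)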

On Windows, the process becomes very slow as memory usage grows.

On Linux, an out-of-memory exception is raised after some chunks are processed and the buffer length grows too much. For example:

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Traceback (most recent call last):
  File "test_csv.py", line 10, in <module>
    for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c')):
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1110, in __next__
    return self.get_chunk()
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1168, in get_chunk
    return self.read(nrows=size)
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1134, in read
    ret = self._engine.read(nrows)
  File "/home/meira/.conda/envs/pandas024/lib/python3.6/site-packages/pandas/io/parsers.py", line 1977, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 920, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 962, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 949, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2166, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory

I tried to debug the C code from the tokenizer as of 0bd454c. The unexpected behavior seems to be present since 011b79f, which introduced these lines (among other changes) to fix #23509:

/**
 * If we are reading in chunks, we need to be aware of the maximum number
 * of words we have seen in previous chunks (self->max_words_cap), so
 * that way, we can properly allocate when reading subsequent ones.
 *
 * Otherwise, we risk a buffer overflow if we mistakenly under-allocate
 * just because a recent chunk did not have as many words.
 */
if (self->words_len + nbytes < self->max_words_cap) {
    length = self->max_words_cap - nbytes;
} else {
    length = self->words_len;
}

I'm not familiar with the code, so I could be misinterpreting it, but I believe that code block, coupled with how self->words_cap and self->max_words_cap are handled, could be the source of the issue. Some of the variable names are potentially misleading -- for example, nbytes seems to refer to a number of bytes that is later interpreted as a count of nbytes tokens. I couldn't fully follow what's happening, but hopefully this report helps.
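
To make the suspicion a bit more concrete, here is a rough Python sketch of the feedback loop I think might be happening. The names are borrowed from tokenizer.c, but the logic is a simplified guess rather than the actual implementation:

# Simplified guess at the suspected feedback loop -- NOT the actual pandas code.
CHUNK_TOKENS = 20              # tokens parsed per chunk (CHUNKSIZE from the sample)

words_cap = 0                  # current capacity of the words buffer
max_words_cap = 0              # largest capacity seen in any previous chunk

for chunk in range(10):
    words_len = CHUNK_TOKENS
    nbytes = CHUNK_TOKENS      # value the C code passes as the "space" argument

    # The block quoted above:
    if words_len + nbytes < max_words_cap:
        length = max_words_cap - nbytes
    else:
        length = words_len

    # grow_buffer-style doubling: keep doubling while length + nbytes >= cap.
    cap = words_cap
    while length + nbytes >= cap:
        cap = cap * 2 if cap else 2
    words_cap = cap

    # End of chunk: remember the largest capacity ever allocated, then trim
    # the buffer back down to roughly what this chunk actually used.
    max_words_cap = max(max_words_cap, words_cap)
    words_cap = words_len

    print('chunk {}: capacity grown to {}'.format(chunk, max_words_cap))

If this guess is anywhere near right, the capacity carried into each new chunk roughly doubles even though every chunk only needs about 40 slots in this toy model, which would fit the out-of-memory failure after a couple dozen chunks regardless of the file size.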

It seems the issue could also be related to #16537 and #21516, but the specific changes that cause it are newer and not present in previous releases.

Expected Output

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.25-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0rc1
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

WillAyd commented Jan 16, 2019

cc @gfyoung

WillAyd added the IO CSV (read_csv, to_csv) label on Jan 16, 2019

gfyoung commented Jan 16, 2019

@PMeira : Thanks for reporting this! I can confirm this error as well.

Did your code example previously work with 0.23.4 by any chance?

h-vetinari commented

Code from the OP works for me on 0.23.4.


PMeira commented Jan 16, 2019

@gfyoung Yes, it worked fine with version 0.23.4, just checked to be sure.

I first noticed it from a failing test in another module that I'm trying to port to recent versions of Pandas.


gfyoung commented Jan 16, 2019

@PMeira @h-vetinari : Thanks for looking into this! Sounds like we have a regression on our hands...

gfyoung added the Regression (functionality that used to work in a prior pandas version) label on Jan 16, 2019
TomAugspurger commented

@gfyoung do you have time to do this for 0.24.0? I don't think we have a release date set yet, but sometime in the next week or so?


gfyoung commented Jan 18, 2019

@TomAugspurger : Yep, I'm going to look into this on the weekend.

gfyoung added this to the 0.24.0 milestone on Jan 18, 2019

gfyoung commented Jan 19, 2019

@PMeira : Your observations are validated by what happens behind the scenes, as your numbers produce a snowball effect that causes the memory allocation to double with every iteration of reading.

It is indeed an edge case: your numbers line up just perfectly to make the allocated memory hit powers of 2 every time. In fact, your "smaller example" fails for me on my local machine for that very reason.

I think I have a patch for this that prevents the memory usage from growing exponentially, but I need to test to make sure I didn't break anything else with it.
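
In case it helps anyone reproduce or verify, a quick way to watch the per-chunk memory is something like the following (this assumes psutil is installed; it is not part of pandas itself):

import os
import psutil          # third-party package, assumed to be installed for this check
import pandas as pd

proc = psutil.Process(os.getpid())
for i, chunk in enumerate(pd.read_csv('test.csv', chunksize=20, engine='c')):
    rss_mib = proc.memory_info().rss / 2**20
    print('chunk {}: {:.1f} MiB resident'.format(i, rss_mib))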

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 19, 2019
The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.
gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 19, 2019
The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.

PMeira commented Jan 19, 2019

@gfyoung Thank you for the patch! I reran the affected code with the patch and it now runs without issues for several chunksize values.

It seems the edge case is really the number of chunks: since the memory basically doubled while processing each chunk, a large(ish) number of chunks would exhaust the available memory. For example, this presented the same issue before but works fine now with the patched tokenizer.c:

import pandas as pd

CHUNKSIZE = 10_000_000
NUM_ROWS = CHUNKSIZE * 20

with open('test.csv', 'w') as f:
    for i in range(NUM_ROWS):
        f.write('{}\n'.format(i))

for chunk_index, chunk in enumerate(pd.read_csv('test.csv', chunksize=CHUNKSIZE, engine='c', low_memory=False)):
    print(chunk_index)

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 20, 2019
The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.
jreback pushed a commit that referenced this issue Jan 20, 2019
* Fix memory growth bug in read_csv

The edge case where we hit powers of 2
every time during allocation can be painful.

Closes gh-24805.

xref gh-23527.

* TST: Add ASV benchmark for issue
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
* Fix memory growth bug in read_csv

The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.

* TST: Add ASV benchmark for issue