
C error: Buffer overflow caught on CSV with chunksize #23509
Closed
dgrahn opened this issue Nov 5, 2018 · 29 comments
Labels: Bug, IO CSV (read_csv, to_csv)

Comments

dgrahn commented Nov 5, 2018

Code Sample

This will create the error, but it is slow. I recommend downloading the file directly.

import pandas
filename = 'https://github.com/pandas-dev/pandas/files/2548189/debug.txt'
for chunk in pandas.read_csv(filename, chunksize=1000, names=range(2504)):
    pass

Problem description

I get the following exception only while using the C engine. This is similar to #11166.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Expected Output

None. It should just loop through the file.

Output of pd.show_versions()

Both machines exhibit the exception.

RedHat
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.14.4.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 39.1.0
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Windows 7
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 18.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
TomAugspurger (Contributor) commented:
Have you been able to narrow down what exactly in the linked file is causing the exception?

gfyoung added the IO CSV (read_csv, to_csv) label Nov 5, 2018
dgrahn (Author) commented Nov 5, 2018

@TomAugspurger I have not. I'm unsure how to debug the C engine.

gfyoung (Member) commented Nov 5, 2018

@dgrahn : I have strong reason to believe that this file is actually malformed. Run this code:

with open("debug.txt", "r") as f:
    data = f.readlines()

lengths = set()

# Get row width
#
# Delimiter is definitely ","
for l in data:
    l = l.strip()
    lengths.add(len(l.split(",")))

print(lengths)

This will output:

{2304, 1154, 2054, 904, 1804, 654, 1554, 404, 2454, 1304, 154, 2204, 1054, 1954, 804, 1704, 554, 1454, 304, 2354, 1204, 54, 2104, 954, 1854, 704, 1604, 454, 2504, 1354, 204, 2254, 1104, 2004, 854, 1754, 604, 1504, 354, 2404, 1254, 104, 2154, 1004, 1904, 754, 1654, 504, 1404, 254}

If the file were correctly formatted, there would be only one row width.

dgrahn (Author) commented Nov 5, 2018

@gfyoung It's not formatted incorrectly. It's a jagged CSV because I didn't want to bloat the file with lots of empty columns. That's why I use the names parameter.
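
(For reference, a minimal sketch of the padding behavior being relied on here, using a made-up in-memory CSV: read_csv pads short rows out to the width of names with NaN.)

import io
import pandas as pd

# Hypothetical jagged input: rows with 1, 3, and 2 fields.
jagged = io.StringIO('1\n2,3,4\n5,6\n')

# names wider than any row makes the parser pad missing trailing
# columns with NaN instead of rejecting the ragged rows.
df = pd.read_csv(jagged, names=range(5))
print(df.shape)  # (3, 5)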

gfyoung (Member) commented Nov 5, 2018

@dgrahn : Yes, it is, according to our definition. We need properly formatted CSVs, and that means having the same number of commas across the board for all rows. Jagged CSVs unfortunately do not meet that criterion.

dgrahn (Author) commented Nov 5, 2018

@gfyoung It works when reading the entire CSV. How can I debug this for chunks? Neither saving the extra columns nor reading the entire file is a feasible option. This is already a subset of a 7 GB file.

gfyoung (Member) commented Nov 5, 2018

It works when reading the entire CSV.

@dgrahn : Given that you mention that it's a subset, what do you mean by "entire CSV"? Are you referring to the entire 7 GB file or all of debug.txt? On my end, I cannot read all of debug.txt.

dgrahn (Author) commented Nov 5, 2018

@gfyoung When I use the following, I'm able to read the entire CSV.

pd.read_csv('debug.csv', names=range(2504))

The debug file contains the first 7k lines of a file with more than 2.6M lines.

gfyoung (Member) commented Nov 5, 2018

@dgrahn : I'm not sure you actually answered my question. Let me rephrase:

Are you able to read the file that you posted to GitHub in its entirety (via pd.read_csv)?

dgrahn (Author) commented Nov 5, 2018

@gfyoung I'm able to read the debug file using the below code. But it fails when introducing the chunks. Does that answer the question?

pd.read_csv('debug.csv', names=range(2504))

gfyoung (Member) commented Nov 5, 2018

Okay, got it. So I'm definitely not able to read all of debug.txt in its entirety (Ubuntu 64-bit, 0.23.4). What version of pandas are you using (and on which OS)?

dgrahn (Author) commented Nov 5, 2018

@gfyoung Details are included in the original post. Both Windows 7 and RedHat. 0.23.4 on RedHat, 0.23.0 on Windows 7.

dgrahn (Author) commented Nov 5, 2018

Interestingly, when chunksize=10 it fails around line 6,810. When chunksize=100, it fails around line 3,100.

More details:

chunksize=1: no failure
chunksize=3: no failure
chunksize=4: failure at rows 92-96
chunksize=5: failure at rows 5515-5520
chunksize=10: failure at rows 6810-6820
chunksize=100: failure at rows 3100-3200
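
A sketch of how these failure ranges can be located for a given chunksize (assuming a local copy of the attachment saved as debug.txt; the helper name is made up): count rows until the parser raises, then report the last good offset.

import pandas as pd

def find_failure(filename, chunksize, ncols=2504):
    rows_read = 0
    try:
        for chunk in pd.read_csv(filename, chunksize=chunksize, names=range(ncols)):
            rows_read += len(chunk)
    except pd.errors.ParserError:
        # The failure lies somewhere inside the chunk that could not be parsed.
        return rows_read, rows_read + chunksize
    return rows_read, None  # no failure

print(find_failure('debug.txt', chunksize=10))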

gfyoung (Member) commented Nov 5, 2018

Details are included in the original post. Both Windows 7 and RedHat. 0.23.4 on RedHat, 0.23.0 on Windows 7.

I saw, but I wasn't sure whether you meant that it worked on both environments.

dgrahn (Author) commented Nov 5, 2018

Here's a smaller file which exhibits the same behavior.
minimal.txt

import pandas as pd

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549461/minimal.txt', names=range(2504), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)

gfyoung (Member) commented Nov 5, 2018

Okay, so I managed to read the file in its entirety on another environment. The C engine is "filling in the blanks" thanks to the names parameter that you passed in, so while I'm still wary of the jagged CSV format, pandas is a little more generous than I recalled.

As for the discrepancies, as was already noted in the older issue, passing in engine="python" works across the board. Thus, it remains to debug the C code and see why it breaks...

(@dgrahn : BTW, that is your answer to: "how would I debug chunks")
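
(For anyone hitting this before a fix lands, the workaround mentioned above looks like the following, assuming a local copy of the attachment saved as debug.txt. The Python engine is slower but sidesteps the C tokenizer entirely.)

import pandas as pd

for chunk in pd.read_csv('debug.txt', chunksize=1000, names=range(2504), engine='python'):
    pass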

gfyoung (Member) commented Nov 5, 2018

Here's a smaller file which exhibits the same behavior.

@dgrahn : Oh, that's very nice! Might you by any chance be able to make the file "skinnier"?

(the smaller the file, the easier it would be for us to test)

dgrahn (Author) commented Nov 5, 2018

@gfyoung Working on it now.

dgrahn (Author) commented Nov 5, 2018

@gfyoung Ok. So it gets weirder: names=range(2397) and below works, names=range(2398) and above fails.

import pandas as pd

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549525/skinnier.txt', names=range(2397), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)

print('-----')

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549525/skinnier.txt', names=range(2398), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)

Each line has the following number of columns:

801
801
451
901
- chunk divider -
1001
1
201
1001

skinnier.txt

dgrahn (Author) commented Nov 5, 2018

@gfyoung Ok. I have a minimal example.

minimal.txt

0
0
0
0
0
0
0
0,0

import pandas as pd

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549561/minimal.txt', names=range(5), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)

gfyoung (Member) commented Nov 5, 2018

@dgrahn : Nice! I'm on my phone currently, so a couple of questions:

  • Can you read this file in its entirety?
  • Does reading this file in chunks work with the Python engine?

Also, why do you have to pass in names=range(5) (and not, say, range(2))?

dgrahn (Author) commented Nov 5, 2018

@gfyoung Ok. I tried chunksizes from 1-20 and column counts from 2-20 (a sketch of the sweep appears after this list).

  • Reading the entire file worked for columns 2-20.
  • The Python engine worked for columns 2-20.
  • The C engine failed for the following conditions:
chunk=2, columns=7
chunk=2, columns=15
chunk=3, columns=7
chunk=3, columns=15
chunk=4, columns=5
chunk=6, columns=7
chunk=6, columns=15
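
A sketch of the kind of sweep that produces these results (it writes the eight-line reproducer to a local minimal.txt; the exact ranges and variable names are illustrative, not taken from the thread):

import pandas as pd

# Recreate the minimal reproducer: seven rows with one field, then one row with two.
with open('minimal.txt', 'w') as f:
    f.write('0\n' * 7 + '0,0\n')

failures = []
for ncols in range(2, 21):          # width passed via names=range(ncols)
    for chunksize in range(1, 21):
        try:
            for _ in pd.read_csv('minimal.txt', names=range(ncols), chunksize=chunksize):
                pass
        except pd.errors.ParserError:
            failures.append((chunksize, ncols))

print(failures)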

dgrahn (Author) commented Nov 5, 2018

@gfyoung I've tried varying the number of columns in the last row. Here are my results.

1 column

All work.

2 columns

chunksize, columns
2, 7
2, 15
3, 7
3, 15
4, 5
6, 7
6, 15

3 columns

chunksize, columns
2, 6
2, 7
2, 14
2, 15
3, 6
3, 7
3, 14
3, 15
4, 5
4, 10
5, 7
5, 15
6, 6
6, 7
6, 14
6, 15

4 columns

chunksize, columns
2, 13
2, 14
2, 15
3, 13
3, 14
3, 15
4, 5
4, 10
5, 7
5, 15
6, 13
6, 14
6, 15

gfyoung (Member) commented Nov 5, 2018

@dgrahn : Thanks for the very thorough investigation! That is very helpful. I'll take a look at the C code later today and see what might be causing the discrepancy.

dgrahn (Author) commented Nov 5, 2018

@gfyoung I tried to debug it myself by following the dev guide, but it says pandas has no attribute read_csv, so I think I'd better rely on your findings.

gfyoung (Member) commented Nov 6, 2018

So I think I know what's happening. In short, with the C engine, we are able to allocate and de-allocate memory as we see fit. In our attempt to optimize space consumption after reading each chunk, the parser frees up all of the space needed to read a full row (i.e. 2,504 elements).

Unfortunately, when it tries to allocate again (at least when using this dataset), it comes across one of the "skinnier" rows, causing it to under-allocate and crash with the buffer overflow error (which is a safety measure and not a core-dumping error).
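
A loose pure-Python analogy of that failure mode (this is not the actual tokenizer code, just the shape of the bug; the row widths mirror the eight-line reproducer above):

# Seven 1-field rows followed by one 2-field row, read with chunksize=4.
row_widths = [1, 1, 1, 1, 1, 1, 1, 2]
chunksize = 4

capacity = 5  # initial buffer sized from names=range(5)
for start in range(0, len(row_widths), chunksize):
    chunk = row_widths[start:start + chunksize]
    for width in chunk:
        if width > capacity:
            # The C parser's safety check fires here instead of writing
            # past the end of its buffer.
            raise MemoryError('Buffer overflow caught - possible malformed input file.')
    # The bug in a nutshell: after each chunk the buffer is trimmed to the
    # widest row just seen, forgetting that a later row may be as wide as
    # the declared names.
    capacity = max(chunk)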

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 6, 2018
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.
gfyoung (Member) commented Nov 6, 2018

@dgrahn : I was able to patch it and can now read your debug.txt dataset successfully! PR soon.

gfyoung added the Bug label Nov 6, 2018
dgrahn (Author) commented Nov 6, 2018

Thank you! Can you point me to directions on integrating that change? Should I use a nightly build?

gfyoung (Member) commented Nov 6, 2018

@dgrahn : My changes are still being reviewed for merging into master, but you can install the branch immediately to test it on your current files.

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 11, 2018
jreback added this to the 0.24.0 milestone Nov 11, 2018
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 12, 2018
jreback pushed a commit that referenced this issue Nov 12, 2018
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this issue Nov 14, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019