
C error: Buffer overflow caught on CSV with chunksize #23509
Closed
dgrahn opened this issue Nov 5, 2018 · 29 comments
Labels: Bug, IO CSV (read_csv, to_csv)

Comments

dgrahn commented Nov 5, 2018

Code Sample

This will create the error, but it is slow. I recommend downloading the file directly.

import pandas
filename = 'https://github.com/pandas-dev/pandas/files/2548189/debug.txt'
for chunk in pandas.read_csv(filename, chunksize=1000, names=range(2504)):
    pass

Problem description

I get the following exception only while using the C engine. This is similar to #11166.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Expected Output

None. It should just loop through the file.

Output of pd.show_versions()

Both machines exhibit the exception.

RedHat
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.14.4.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 39.1.0
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Windows 7
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 18.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
TomAugspurger (Contributor) commented:
Have you been able to narrow down what exactly in the linked file is causing the exception?

gfyoung added the IO CSV (read_csv, to_csv) label Nov 5, 2018
dgrahn (Author) commented Nov 5, 2018

@TomAugspurger I have not. I'm unsure how to debug the C engine.

gfyoung (Member) commented Nov 5, 2018

@dgrahn : I have strong reason to believe that this file is actually malformed. Run this code:

with open("debug.txt", "r") as f:
    data = f.readlines()

lengths = set()

# Get row width
#
# Delimiter is definitely ","
for l in data:
    l = l.strip()
    lengths.add(len(l.split(",")))

print(lengths)

This will output:

{2304, 1154, 2054, 904, 1804, 654, 1554, 404, 2454, 1304, 154, 2204, 1054, 1954, 804, 1704, 554, 1454, 304, 2354, 1204, 54, 2104, 954, 1854, 704, 1604, 454, 2504, 1354, 204, 2254, 1104, 2004, 854, 1754, 604, 1504, 354, 2404, 1254, 104, 2154, 1004, 1904, 754, 1654, 504, 1404, 254}

If the file were correctly formatted, there would be only one row width.

dgrahn (Author) commented Nov 5, 2018

@gfyoung It's not formatted incorrectly. It's a jagged CSV because I didn't want to bloat the file with lots of empty columns. That's why I use the names parameter.
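
(For reference, a minimal sketch of the padding behavior being relied on here, using a made-up in-memory CSV: read_csv pads short rows out to the width of names with NaN.)

import io
import pandas as pd

# Hypothetical jagged input: rows with 1, 3, and 2 fields.
jagged = io.StringIO('1\n2,3,4\n5,6\n')

# names wider than any row makes the parser pad missing trailing
# columns with NaN instead of rejecting the ragged rows.
df = pd.read_csv(jagged, names=range(5))
print(df.shape)  # (3, 5)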

gfyoung (Member) commented Nov 5, 2018

@dgrahn : Yes, it is, according to our definition. We need properly formatted CSVs, and that means having the same number of commas across the board for all rows. Jagged CSVs unfortunately do not meet that criterion.

dgrahn (Author) commented Nov 5, 2018

@gfyoung It works when reading the entire CSV. How can I debug this for chunks? Neither saving the extra columns nor reading the entire file is a feasible option. This is already a subset of a 7 GB file.

gfyoung (Member) commented Nov 5, 2018

It works when reading the entire CSV.

@dgrahn : Given that you mention that it's a subset, what do you mean by "entire CSV"? Are you referring to the entire 7 GB file or all of debug.txt? On my end, I cannot read all of debug.txt.

dgrahn (Author) commented Nov 5, 2018

@gfyoung When I use the following, I'm able to read the entire CSV.

pd.read_csv('debug.csv', names=range(2504))

The debug file contains the first 7k lines of a file with more than 2.6M lines.

gfyoung (Member) commented Nov 5, 2018

@dgrahn : I'm not sure you actually answered my question. Let me rephrase:

Are you able to read the file that you posted to GitHub in its entirety (via pd.read_csv)?

dgrahn (Author) commented Nov 5, 2018

@gfyoung I'm able to read the debug file using the below code. But it fails when introducing the chunks. Does that answer the question?

pd.read_csv('debug.csv', names=range(2504))

gfyoung (Member) commented Nov 5, 2018

Okay, got it. So I'm definitely not able to read all of debug.txt in its entirety (Ubuntu 64-bit, 0.23.4). What version of pandas are you using (and on which OS)?

dgrahn (Author) commented Nov 5, 2018

@gfyoung Details are included in the original post. Both Windows 7 and RedHat. 0.23.4 on RedHat, 0.23.0 on Windows 7.

dgrahn (Author) commented Nov 5, 2018

Interestingly, when chunksize=10 it fails around line 6,810. When chunksize=100, it fails around line 3,100.

More details:

chunksize=1: no failure
chunksize=3: no failure
chunksize=4: failure at rows 92-96
chunksize=5: failure at rows 5515-5520
chunksize=10: failure at rows 6810-6820
chunksize=100: failure at rows 3100-3200
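
A sketch of how these failure ranges can be located for a given chunksize (assuming a local copy of the attachment saved as debug.txt; the helper name is made up): count rows until the parser raises, then report the last good offset.

import pandas as pd

def find_failure(filename, chunksize, ncols=2504):
    rows_read = 0
    try:
        for chunk in pd.read_csv(filename, chunksize=chunksize, names=range(ncols)):
            rows_read += len(chunk)
    except pd.errors.ParserError:
        # The failure lies somewhere inside the chunk that could not be parsed.
        return rows_read, rows_read + chunksize
    return rows_read, None  # no failure

print(find_failure('debug.txt', chunksize=10))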

gfyoung (Member) commented Nov 5, 2018

Details are included in the original post. Both Windows 7 and RedHat. 0.23.4 on RedHat, 0.23.0 on Windows 7.

I saw, but I wasn't sure whether you meant that it worked on both environments.

dgrahn (Author) commented Nov 5, 2018

Here's a smaller file which exhibits the same behavior.
minimal.txt

import pandas as pd

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549461/minimal.txt', names=range(2504), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)

gfyoung (Member) commented Nov 5, 2018

Okay, so I managed to read the file in its entirety on another environment. The C engine is "filling in the blanks" thanks to the names parameter that you passed in, so while I'm still wary of the jagged CSV format, pandas is a little more generous than I recalled.

As for the discrepancies, as was already noted in the older issue, passing in engine="python" works across the board. Thus, it remains to debug the C code and see why it breaks...

(@dgrahn : BTW, that is your answer to: "how would I debug chunks")
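
(For anyone hitting this before a fix lands, the workaround mentioned above looks like the following, assuming a local copy of the attachment saved as debug.txt. The Python engine is slower but sidesteps the C tokenizer entirely.)

import pandas as pd

for chunk in pd.read_csv('debug.txt', chunksize=1000, names=range(2504), engine='python'):
    pass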

gfyoung (Member) commented Nov 5, 2018

Here's a smaller file which exhibits the same behavior.

@dgrahn : Oh, that's very nice! Might you by any chance be able to make the file "skinnier"?

(the smaller the file, the easier it would be for us to test)

dgrahn (Author) commented Nov 5, 2018

@gfyoung Working on it now.

dgrahn (Author) commented Nov 5, 2018

@gfyoung Ok. So it gets weirder: names=range(2397) and below works, names=range(2398) and above fails.

import pandas as pd

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549525/skinnier.txt', names=range(2397), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)

print('-----')

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549525/skinnier.txt', names=range(2398), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)

Each line has the following number of columns:

801
801
451
901
- chunk divider -
1001
1
201
1001

skinnier.txt

dgrahn (Author) commented Nov 5, 2018

@gfyoung Ok. I have a minimal example.

minimal.txt

0
0
0
0
0
0
0
0,0

import pandas as pd

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549561/minimal.txt', names=range(5), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)

gfyoung (Member) commented Nov 5, 2018

@dgrahn : Nice! I'm on my phone currently, so a couple of questions:

  • Can you read this file in its entirety?
  • Does reading this file in chunks work with the Python engine?

Also, why do you have to pass in names=range(5) (and not, say, range(2))?

dgrahn (Author) commented Nov 5, 2018

@gfyoung Ok. I tried chunksizes from 1-20 and column counts from 2-20 (a sketch of the sweep appears after this list).

  • Reading the entire file worked for columns 2-20.
  • The Python engine worked for columns 2-20.
  • The C engine failed for the following conditions:
chunk=2, columns=7
chunk=2, columns=15
chunk=3, columns=7
chunk=3, columns=15
chunk=4, columns=5
chunk=6, columns=7
chunk=6, columns=15
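
A sketch of the kind of sweep that produces these results (it writes the eight-line reproducer to a local minimal.txt; the exact ranges and variable names are illustrative, not taken from the thread):

import pandas as pd

# Recreate the minimal reproducer: seven rows with one field, then one row with two.
with open('minimal.txt', 'w') as f:
    f.write('0\n' * 7 + '0,0\n')

failures = []
for ncols in range(2, 21):          # width passed via names=range(ncols)
    for chunksize in range(1, 21):
        try:
            for _ in pd.read_csv('minimal.txt', names=range(ncols), chunksize=chunksize):
                pass
        except pd.errors.ParserError:
            failures.append((chunksize, ncols))

print(failures)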

dgrahn (Author) commented Nov 5, 2018

@gfyoung I've tried varying the number of columns in the last row. Here are my results.

1 column

All work.

2 columns

chunksize, columns
2, 7
2, 15
3, 7
3, 15
4, 5
6, 7
6, 15

3 columns

chunksize, columns
2, 6
2, 7
2, 14
2, 15
3, 6
3, 7
3, 14
3, 15
4, 5
4, 10
5, 7
5, 15
6, 6
6, 7
6, 14
6, 15

4 columns

chunksize, columns
2, 13
2, 14
2, 15
3, 13
3, 14
3, 15
4, 5
4, 10
5, 7
5, 15
6, 13
6, 14
6, 15

gfyoung (Member) commented Nov 5, 2018

@dgrahn : Thanks for the very thorough investigation! That is very helpful. I'll take a look at the C code later today and see what might be causing the discrepancy.

dgrahn (Author) commented Nov 5, 2018

@gfyoung I tried to debug it myself by following the dev guide, but it says pandas has no attribute read_csv, so I think I'd better rely on your findings.

gfyoung (Member) commented Nov 6, 2018

So I think I know what's happening. In short, with the C engine, we are able to allocate and de-allocate memory as we see fit. In our attempt to optimize space consumption after reading each chunk, the parser frees up all of the space needed to read a full row (i.e. 2,504 elements).

Unfortunately, when it tries to allocate again (at least when using this dataset), it comes across one of the "skinnier" rows, causing it to under-allocate and crash with the buffer overflow error (which is a safety measure and not a core-dumping error).
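
A loose pure-Python analogy of that failure mode (this is not the actual tokenizer code, just the shape of the bug; the row widths mirror the eight-line reproducer above):

# Seven 1-field rows followed by one 2-field row, read with chunksize=4.
row_widths = [1, 1, 1, 1, 1, 1, 1, 2]
chunksize = 4

capacity = 5  # initial buffer sized from names=range(5)
for start in range(0, len(row_widths), chunksize):
    chunk = row_widths[start:start + chunksize]
    for width in chunk:
        if width > capacity:
            # The C parser's safety check fires here instead of writing
            # past the end of its buffer.
            raise MemoryError('Buffer overflow caught - possible malformed input file.')
    # The bug in a nutshell: after each chunk the buffer is trimmed to the
    # widest row just seen, forgetting that a later row may be as wide as
    # the declared names.
    capacity = max(chunk)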

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 6, 2018
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.
gfyoung (Member) commented Nov 6, 2018

@dgrahn : I was able to patch it and can now read your debug.txt dataset successfully! PR soon.

gfyoung added the Bug label Nov 6, 2018
dgrahn (Author) commented Nov 6, 2018

Thank you! Can you point me to directions on integrating that change? Should I use a nightly build?

gfyoung (Member) commented Nov 6, 2018

@dgrahn : My changes are still being reviewed for merging into master, but you can install the branch immediately to test it on your current files.

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 11, 2018
jreback added this to the 0.24.0 milestone Nov 11, 2018
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 12, 2018
jreback pushed a commit that referenced this issue Nov 12, 2018
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this issue Nov 14, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019