C error: Buffer overflow caught on CSV with chunksize #23509
Comments
Have you been able to narrow down what exactly in the linked file is causing the exception?

@TomAugspurger I have not. I'm unsure how to debug the C engine.
@dgrahn: I have strong reason to believe that this file is actually malformed. Run this code:

```python
with open("debug.txt", "r") as f:
    data = f.readlines()

lengths = set()

# Get row width
#
# Delimiter is definitely ","
for l in data:
    l = l.strip()
    lengths.add(len(l.split(",")))

print(lengths)
```

This will output:

```
{2304, 1154, 2054, 904, 1804, 654, 1554, 404, 2454, 1304, 154, 2204, 1054, 1954, 804, 1704, 554, 1454, 304, 2354, 1204, 54, 2104, 954, 1854, 704, 1604, 454, 2504, 1354, 204, 2254, 1104, 2004, 854, 1754, 604, 1504, 354, 2404, 1254, 104, 2154, 1004, 1904, 754, 1654, 504, 1404, 254}
```

If the file were correctly formatted, this set would contain exactly one row width.
@gfyoung It's not formatted incorrectly. It's a jagged CSV because I didn't want to bloat the file with lots of empty columns. That's why I use the `names` parameter.
@dgrahn: Yes, it is, according to our definition. We need properly formatted CSVs, and that means having the same number of commas across the board for all rows. Jagged CSVs unfortunately do not meet that criterion.
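For context, a minimal sketch of what the `names` trick does to a jagged file when it is read in one shot (the data is inlined here purely for illustration):

```python
import io
import pandas as pd

# Jagged two-row CSV: the second row is shorter than the first.
jagged = io.StringIO("1,2,3,4,5\n6,7\n")

# `names` fixes the width up front; missing cells become NaN.
df = pd.read_csv(jagged, names=range(5))
print(df)
#    0  1    2    3    4
# 0  1  2  3.0  4.0  5.0
# 1  6  7  NaN  NaN  NaN
```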
@gfyoung It works when reading the entire CSV. How can I debug this for chunks? Neither saving the extra columns nor reading the entire file is a feasible option. This is already a subset of a 7 GB file.
@dgrahn: Given that you mention that it's a subset, what do you mean by "entire CSV"? Are you referring to the entire 7 GB file or all of `debug.txt`?
@gfyoung When I use the following, I'm able to read the entire CSV. The debug file contains the first 7k lines of a file with more than 2.6M.
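Presumably a whole-file read along these lines (the file name and column count are assumptions taken from the surrounding comments):

```python
import pandas as pd

# Whole-file read: `names` pads every jagged row out to 2,504 columns,
# and omitting chunksize means a single parse over the whole file.
df = pd.read_csv('debug.txt', names=range(2504))
print(df.shape)
```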
@dgrahn: I'm not sure you actually answered my question. Let me rephrase: Are you able to read the file that you posted to GitHub in its entirety (via `pd.read_csv` without `chunksize`)?
@gfyoung I'm able to read the debug file with the whole-file read shown above, but it fails when introducing the chunks. Does that answer the question?
Okay, got it. So I'm definitely not able to read all of `debug.txt` on my end. What environment and pandas version are you running?
@gfyoung Details are included in the original post. Both Windows 7 and RedHat; 0.23.4 on RedHat, 0.23.0 on Windows 7.
Interestingly, it works when the file is read without chunks. More details are in the original post.
I saw, but I wasn't sure whether you meant that it worked on both environments.
Here's a smaller file which exhibits the same behavior.

```python
import pandas as pd

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549461/minimal.txt',
                     names=range(2504), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)
```
Okay, so I managed to read the file in its entirety on another environment. The C engine is "filling in the blanks" thanks to the `names` parameter. As for the discrepancies, as was already noted in the older issue, passing in `engine='python'` makes the chunked read go through. (@dgrahn: BTW, that is your answer to "how would I debug chunks".)
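Assuming the workaround being referred to is indeed the Python engine, a sketch (file name, column count, and chunk size carried over from the earlier examples):

```python
import pandas as pd

# Workaround sketch: the Python engine handles the jagged rows in
# chunks, trading parsing speed for laxer buffer handling.
i = 0
for c in pd.read_csv('debug.txt', names=range(2504),
                     engine='python', chunksize=1000):
    i += len(c)
print(i)  # expect the full row count, with no buffer-overflow error
```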
@dgrahn: Oh, that's very nice! Might you by any chance be able to make the file "skinnier"? (The smaller the file, the easier it would be for us to test.)

@gfyoung Working on it now.
@gfyoung Ok. So it gets weirder: 2397 and below works, 2398 and above fails.

```python
import pandas as pd

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549525/skinnier.txt',
                     names=range(2397), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)

print('-----')

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549525/skinnier.txt',
                     names=range(2398), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)
```

Each line has the following number of columns:
@gfyoung Ok. I have a minimal example.

```python
import pandas as pd

i = 0
for c in pd.read_csv('https://github.com/pandas-dev/pandas/files/2549561/minimal.txt',
                     names=range(5), chunksize=4):
    print(f'{i}-{i+len(c)}')
    i += len(c)
```
@dgrahn: Nice! I'm on my phone currently, so a couple of questions:

Also, why do you have to pass in `names`?
@gfyoung Ok. I tried different …
@gfyoung I've tried varying the number of columns in the last row. Here are my results:

- 1 column: all work.
- 2 columns:
- 3 columns:
- 4 columns:
@dgrahn: Thanks for the very thorough investigation! That is very helpful. I'll take a look at the C code later today and see what might be causing the discrepancy.
@gfyoung I tried to debug it myself by following the dev guide, but it says pandas has no attribute …
So I think I know what's happening. In short, with the C engine, we are able to allocate and de-allocate memory as we see fit. In our attempt to optimize space consumption after reading each chunk, the parser frees up all of the space needed to read a full row (i.e. 2,504 elements). Unfortunately, when it tries to allocate again (at least when using this dataset), it comes across one of the "skinnier" rows, causing it to under-allocate and crash with the buffer overflow error (which is a safety measure, not a core-dumping error).
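A toy model of that failure mode (a pure-Python illustration only; the real logic lives in pandas' C tokenizer, and the "buffer" here is reduced to a row-width capacity):

```python
def read_chunks(rows, chunksize):
    for start in range(0, len(rows), chunksize):
        chunk = rows[start:start + chunksize]
        # All buffers from the previous chunk were freed, so the new
        # buffer is sized from the first row the parser sees next...
        capacity = len(chunk[0])
        for row in chunk:
            # ...and a wider row later in the chunk trips the guard.
            if len(row) > capacity:
                raise MemoryError("Buffer overflow caught - "
                                  "possible malformed input file.")

rows = [[0] * 5] * 4 + [[0] * 2] + [[0] * 5]  # chunk 2 starts skinny
read_chunks(rows, chunksize=4)  # raises on the wide row in chunk 2
```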
With jagged CSVs, we risk being too quick to dump memory that we need to allocate, because previous chunks would have indicated much larger rows than we can anticipate in subsequent chunks. Closes gh-23509.
@dgrahn: I was able to patch it and can now read your `debug.txt` in its entirety.
Thank you! Can you point me to directions on integrating that change? Should I use a nightly build?
@dgrahn: My changes are still being reviewed for merging into `master`.
Code Sample
This will create the error, but it is slow. I recommend downloading the file directly.
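A repro sketch using a downloaded copy of `debug.txt` (the column count and chunk size are assumptions based on the comments above):

```python
import pandas as pd

# Repro sketch: stream the jagged CSV in chunks with the C engine.
# On affected versions the loop dies with:
#   pandas.errors.ParserError: Error tokenizing data. C error:
#   Buffer overflow caught - possible malformed input file.
i = 0
for c in pd.read_csv('debug.txt', names=range(2504), chunksize=1000):
    print(f'{i}-{i+len(c)}')
    i += len(c)
```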
Problem description
I get a `Buffer overflow caught` exception only while using the C engine. This is similar to #11166.
Expected Output
None. It should just loop through the file.
Output of `pd.show_versions()`
Both machines exhibit the exception.
RedHat
Windows 7