[ENH]: mention chunksize argument when to_csv fails #12690
This error is occurring because of the way a CSV file is written: although no single column reaches the maximum column size in cudf, all columns are converted to string type, and then all of those string columns are concatenated together into a single string column containing the data for the entire dataframe (this is then written to file). That concatenated column can overflow the size limit even when the individual columns do not. To avoid the error, write a specified number of rows at a time by providing a `chunksize` argument; this requests that the conversion and concatenation happen one chunk of rows at a time, as in the sketch below.
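A minimal sketch of the workaround (the file names and the chunk size are placeholders, not values from the original report):

```python
import cudf

# Placeholder input; the real file was shared privately.
df = cudf.read_parquet("data.parquet")

# Write 1,000,000 rows at a time so the intermediate string column
# built for each chunk stays below the column size limit.
df.to_csv("data.csv", chunksize=1_000_000)
```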
Perhaps we should catch this error in cudf and raise a more informative one (suggesting to specify `chunksize`).
Since writing to CSV files is implemented by converting all columns in a dataframe to strings and then concatenating those columns, attempting to write a large dataframe to CSV without specifying a chunk size can easily overflow the maximum column size. Currently the error message is rather inscrutable: it says only that the requested size of a string column exceeds the column size limit. To help the user, catch this error and provide a useful error message that points them towards setting the `chunksize` argument. So that we don't produce false-positive advice, tighten the scope by catching only `OverflowError`; to do this, make partial progress towards resolving #10200 by throwing `std::overflow_error` when checking for overflow of string column lengths.

Closes #12690.

Authors:
- Lawrence Mitchell (https://github.com/wence-)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Ashwin Srinath (https://github.com/shwina)
- Nghia Truong (https://github.com/ttnghia)
- Karthikeyan (https://github.com/karthikeyann)

URL: #12705
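A rough sketch of the catch-and-rehint pattern on the Python side (the wrapper function and message wording are illustrative assumptions, not the actual cudf implementation):

```python
def _to_csv_with_hint(df, path, **kwargs):
    # Illustrative wrapper: libcudf's std::overflow_error surfaces in
    # Python as OverflowError, which we re-raise with actionable advice.
    try:
        df.to_csv(path, **kwargs)
    except OverflowError as err:
        raise OverflowError(
            "CSV write failed because the concatenated string column "
            "exceeds the column size limit; try passing a smaller "
            "`chunksize` to to_csv to write fewer rows at a time."
        ) from err
```

Catching only `OverflowError` (rather than a broader exception class) keeps the hint from firing on unrelated failures, which is why the C++ side is changed to throw `std::overflow_error` specifically.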
Describe the bug
I am facing the following error when saving a CSV file after reading a DataFrame from parquet.
Steps/Code to reproduce bug
The file is too big to share here; I will share it with @beckernick internally. A hypothetical reproduction of the flow is sketched below.
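Since the actual parquet file is not public, the path below is a placeholder; the shape of the repro follows the description above (read a large DataFrame from parquet, then write it to CSV):

```python
import cudf

# Placeholder path; the real file is large and was shared privately.
df = cudf.read_parquet("large_file.parquet")

# Fails with an overflow error when the dataframe is large enough,
# because the whole table is stringified and concatenated at once.
df.to_csv("large_file.csv")
```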
Expected behavior
No error.
Environment overview (please complete the following information)
RAPIDS 22.12 using NGC container.
Environment details
DGX-1 Server