[ENH]: mention chunksize argument when to_csv fails #12690

Closed
miguelusque opened this issue Feb 3, 2023 · 2 comments · Fixed by #12705
Labels: bug (Something isn't working) · improvement (Improvement / enhancement to an existing function) · Python (Affects Python cuDF API.)

Describe the bug
I am facing the following error when writing a DataFrame to a CSV file after reading it from a Parquet file.

import cudf

df = cudf.read_parquet("10_000_000a_rows.parquet")
df.to_csv("10_000_000a_rows.csv",index=False)

RuntimeError Traceback (most recent call last)
/tmp/ipykernel_1285/3527798299.py in
----> 1 df.to_csv("10_000_000a_rows.csv",index=False)

/opt/conda/envs/rapids/lib/python3.9/site-packages/cudf/core/dataframe.py in to_csv(self, path_or_buf, sep, na_rep, columns, header, index, encoding, compression, line_terminator, chunksize, storage_options)
6332 from cudf.io import csv
6333
-> 6334 return csv.to_csv(
6335 self,
6336 path_or_buf=path_or_buf,

/opt/conda/envs/rapids/lib/python3.9/contextlib.py in inner(*args, **kwds)
77 def inner(*args, **kwds):
78 with self._recreate_cm():
---> 79 return func(*args, **kwds)
80 return inner
81

/opt/conda/envs/rapids/lib/python3.9/site-packages/cudf/io/csv.py in to_csv(df, path_or_buf, sep, na_rep, columns, header, index, encoding, compression, line_terminator, chunksize, storage_options)
239 )
240 else:
--> 241 libcudf.csv.write_csv(
242 df,
243 path_or_buf=path_or_buf,

csv.pyx in cudf._lib.csv.write_csv()

csv.pyx in cudf._lib.csv.write_csv()

RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/column/column_factories.cpp:82: Column size cannot be negative.

Steps/Code to reproduce bug
The file is too big to share here. I will share the file with @beckernick internally.

Expected behavior
No error.

Environment overview (please complete the following information)
RAPIDS 22.12 using NGC container.

Environment details
DGX-1 Server

@miguelusque added labels Needs Triage (Need team to review and classify) and bug (Something isn't working) Feb 3, 2023

wence- commented Feb 3, 2023

To avoid this error, write a fixed number of rows at a time by passing chunksize=N to to_csv.

This error is occurring because, although no single column exceeds the maximum row count in cudf, a CSV file is written by converting all columns in a table to string type and then concatenating those columns into a single string column containing the data for the entire dataframe (this is then written to file). By providing a chunksize argument, you request that this concatenation happen N rows at a time, which bounds the total size of the output string column.
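For example, applied to the reproducer above (the value 1_000_000 here is an arbitrary illustration; any chunk size small enough to keep each chunk's string output under the column size limit will do):

import cudf

df = cudf.read_parquet("10_000_000a_rows.parquet")
# Convert and write 1,000,000 rows at a time, bounding the size of the
# intermediate string column that holds each chunk's CSV text.
df.to_csv("10_000_000a_rows.csv", index=False, chunksize=1_000_000)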


wence- commented Feb 3, 2023

Perhaps we should catch this error in cudf and raise a more informative one (suggesting to specify chunksize).

@wence- wence- changed the title [BUG] Error when saving CSV file: Column size cannot be negative [ENH]: mention chunksize argument when to_csv fails Feb 3, 2023
@wence- wence- self-assigned this Feb 3, 2023
@wence- added labels improvement (Improvement / enhancement to an existing function) and Python (Affects Python cuDF API.), and removed Needs Triage (Need team to review and classify) Feb 3, 2023
wence- added a commit to wence-/cudf that referenced this issue Feb 6, 2023
rapids-bot bot pushed a commit that referenced this issue Feb 17, 2023
Since writing to CSV files is implemented by converting all columns in
a dataframe to strings, and then concatenating those columns, when we
attempt to write a large dataframe to CSV without specifying a chunk
size, we can easily overflow the maximum column size.

Currently the error message is rather inscrutable: that the requested
size of a string column exceeds the column size limit. To help the
user, catch this error and provide a useful error message that points
them towards setting the `chunksize` argument.

So that we don't produce false-positive advice, tighten the scope by only catching `OverflowError`. To do this, make partial progress towards resolving #10200 by throwing `std::overflow_error` when checking for overflow of string column lengths.

Closes #12690.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Ashwin Srinath (https://github.com/shwina)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #12705
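A minimal Python-level sketch of the approach the commit message describes (hypothetical and simplified; in the actual change the overflow check lives in libcudf, whose std::overflow_error surfaces in Python as OverflowError):

def write_csv_with_hint(df, path_or_buf, **kwargs):
    # _write_csv is a hypothetical stand-in for the libcudf-backed
    # writer. Only OverflowError is caught, so unrelated failures are
    # not masked with misleading advice about chunksize.
    try:
        _write_csv(df, path_or_buf, **kwargs)
    except OverflowError as err:
        raise OverflowError(
            "Writing the CSV overflowed the maximum column size. "
            "Consider passing chunksize=<n> to to_csv to write the "
            "output a fixed number of rows at a time."
        ) from err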