[ENH]: mention chunksize argument when to_csv fails #12690

Closed
miguelusque opened this issue Feb 3, 2023 · 2 comments · Fixed by #12705
Labels: bug (Something isn't working) · improvement (Improvement / enhancement to an existing function) · Python (Affects Python cuDF API.)

Describe the bug
I am facing the following error when writing a DataFrame to a CSV file after reading it from a Parquet file.

import cudf

df = cudf.read_parquet("10_000_000a_rows.parquet")
df.to_csv("10_000_000a_rows.csv",index=False)

RuntimeError Traceback (most recent call last)
/tmp/ipykernel_1285/3527798299.py in
----> 1 df.to_csv("10_000_000a_rows.csv",index=False)

/opt/conda/envs/rapids/lib/python3.9/site-packages/cudf/core/dataframe.py in to_csv(self, path_or_buf, sep, na_rep, columns, header, index, encoding, compression, line_terminator, chunksize, storage_options)
6332 from cudf.io import csv
6333
-> 6334 return csv.to_csv(
6335 self,
6336 path_or_buf=path_or_buf,

/opt/conda/envs/rapids/lib/python3.9/contextlib.py in inner(*args, **kwds)
77 def inner(*args, **kwds):
78 with self._recreate_cm():
---> 79 return func(*args, **kwds)
80 return inner
81

/opt/conda/envs/rapids/lib/python3.9/site-packages/cudf/io/csv.py in to_csv(df, path_or_buf, sep, na_rep, columns, header, index, encoding, compression, line_terminator, chunksize, storage_options)
239 )
240 else:
--> 241 libcudf.csv.write_csv(
242 df,
243 path_or_buf=path_or_buf,

csv.pyx in cudf._lib.csv.write_csv()

csv.pyx in cudf._lib.csv.write_csv()

RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/column/column_factories.cpp:82: Column size cannot be negative.

Steps/Code to reproduce bug
The file is too big to share here. I will share the file with @beckernick internally.

Expected behavior
No error.

Environment overview (please complete the following information)
RAPIDS 22.12 using NGC container.

Environment details
DGX-1 Server

@miguelusque added labels Needs Triage (Need team to review and classify) and bug (Something isn't working) Feb 3, 2023

wence- commented Feb 3, 2023

To avoid this error, write a fixed number of rows at a time by passing chunksize=N to to_csv.

This error is occurring because, although no single column exceeds the maximum row count in cudf, a CSV file is written by converting all columns in a table to string type and then concatenating those columns into a single string column containing the data for the entire dataframe (this is then written to file). By providing a chunksize argument, you request that this concatenation happen N rows at a time, which bounds the total size of the output string column.
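For example, applied to the reproducer above (the value 1_000_000 here is an arbitrary illustration; any chunk size small enough to keep each chunk's string output under the column size limit will do):

import cudf

df = cudf.read_parquet("10_000_000a_rows.parquet")
# Convert and write 1,000,000 rows at a time, bounding the size of the
# intermediate string column that holds each chunk's CSV text.
df.to_csv("10_000_000a_rows.csv", index=False, chunksize=1_000_000)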


wence- commented Feb 3, 2023

Perhaps we should catch this error in cudf and raise a more informative one (suggesting to specify chunksize).

@wence- wence- changed the title [BUG] Error when saving CSV file: Column size cannot be negative [ENH]: mention chunksize argument when to_csv fails Feb 3, 2023
@wence- wence- self-assigned this Feb 3, 2023
@wence- added labels improvement (Improvement / enhancement to an existing function) and Python (Affects Python cuDF API.), and removed Needs Triage (Need team to review and classify) Feb 3, 2023
wence- added a commit to wence-/cudf that referenced this issue Feb 6, 2023
rapids-bot bot pushed a commit that referenced this issue Feb 17, 2023
Since writing to CSV files is implemented by converting all columns in
a dataframe to strings, and then concatenating those columns, when we
attempt to write a large dataframe to CSV without specifying a chunk
size, we can easily overflow the maximum column size.

Currently the error message is rather inscrutable: that the requested
size of a string column exceeds the column size limit. To help the
user, catch this error and provide a useful error message that points
them towards setting the `chunksize` argument.

So that we don't produce false-positive advice, tighten the scope by only catching `OverflowError`. To do this, make partial progress towards resolving #10200 by throwing `std::overflow_error` when checking for overflow of string column lengths.

Closes #12690.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Ashwin Srinath (https://github.com/shwina)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #12705
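A minimal Python-level sketch of the approach the commit message describes (hypothetical and simplified; in the actual change the overflow check lives in libcudf, whose std::overflow_error surfaces in Python as OverflowError):

def write_csv_with_hint(df, path_or_buf, **kwargs):
    # _write_csv is a hypothetical stand-in for the libcudf-backed
    # writer. Only OverflowError is caught, so unrelated failures are
    # not masked with misleading advice about chunksize.
    try:
        _write_csv(df, path_or_buf, **kwargs)
    except OverflowError as err:
        raise OverflowError(
            "Writing the CSV overflowed the maximum column size. "
            "Consider passing chunksize=<n> to to_csv to write the "
            "output a fixed number of rows at a time."
        ) from err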