[FEA] Add large strings support in CSV writer #16137

Closed
GregoryKimball opened this issue Jun 30, 2024 · 1 comment · Fixed by #16148
Labels: cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)

Comments

@GregoryKimball
Contributor

Is your feature request related to a problem? Please describe.
The libcudf CSV writer throws a std::overflow_error when trying to write a dataframe or a chunk that exceeds 2.1B characters.

Describe the solution you'd like
There are a few options:

  1. Fix the overflow error in the CSV writer implementation. This would be a libcudf change. Since the CSV writer is fairly simple and relies on the libcudf strings API, there may be a straightforward solution.
  2. Change the default chunksize in libcudf or cuDF from std::numeric_limits<size_type>::max() to something based on the dataframe size, e.g. chunksize = len(df) // (df.memory_usage(deep=True).sum() / 500_000_000.)
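
To make option 2 concrete, here is a minimal sketch of the row-count calculation as a standalone function. The name `suggest_chunksize`, its parameters, and the clamping to at least 1 row and at most `num_rows` are assumptions for illustration, not existing cuDF or libcudf API; the 500 MB target comes from the formula above.

```python
def suggest_chunksize(num_rows: int, total_bytes: int,
                      target_bytes: int = 500_000_000) -> int:
    """Return a rows-per-chunk value so each chunk holds roughly target_bytes.

    Hypothetical helper: assumes rows are of similar size, so bytes scale
    linearly with the number of rows.
    """
    if num_rows <= 0 or total_bytes <= 0:
        # Nothing to size against; fall back to the smallest valid chunk.
        return 1
    # Integer form of len(df) // (total_bytes / target_bytes).
    rows = num_rows * target_bytes // total_bytes
    # Never return 0 rows, and never more rows than the dataframe has.
    return max(1, min(num_rows, rows))


# 3,000,000 rows holding about 3 GB of data gives 500,000 rows per chunk.
print(suggest_chunksize(3_000_000, 3_000_000_000))
```

With a cuDF dataframe, the inputs would presumably be `len(df)` and `int(df.memory_usage(deep=True).sum())`, with the result passed as the `chunksize` argument of `df.to_csv`.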

Additional context
As far as fixing the root cause, here is a quick repro:

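    import cudf
    # 1,000 rows of 1,000 characters each (~1 MB of string data)
    df = cudf.DataFrame({'text': ['a'] * 1000})
    df['text'] = df['text'].str.repeat(1000)
    # 1,000,000 rows (~1 GB)
    data_1gb = cudf.concat([df] * 1000, ignore_index=True)
    # 3,000,000 rows (~3 GB), beyond the ~2.1B-character limit
    df = cudf.concat([data_1gb] * 3, ignore_index=True)
    df.to_csv('/raid/tmp.csv')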

which throws:

  File "csv.pyx", line 547, in cudf._lib.csv.write_csv
OverflowError: Writing CSV file with chunksize=3000000 failed. Consider providing a smaller chunksize argument.

but I couldn't immediately find where the std::overflow_error is getting thrown in io/csv/writer_impl.cu.
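
For scale, the arithmetic behind the repro can be checked without a GPU. This sketch assumes libcudf's size_type is a signed 32-bit integer, which matches the 2.1B figure in the description; it counts only the characters in the text column, so the actual CSV output (index, delimiters, newlines) would be somewhat larger.

```python
# Upper bound of a signed 32-bit integer (assumed width of libcudf's size_type).
SIZE_TYPE_MAX = 2**31 - 1

# The repro builds 1000 rows, concatenates them 1000 times, then 3 times.
rows = 1000 * 1000 * 3
chars_per_row = 1000  # 'a' repeated 1000 times

total_chars = rows * chars_per_row
exceeds_limit = total_chars > SIZE_TYPE_MAX

print(f"rows={rows:,} total_chars={total_chars:,} limit={SIZE_TYPE_MAX:,}")
print(f"exceeds size_type limit: {exceeds_limit}")
```

The row count matches the chunksize=3000000 reported in the error message, meaning the whole dataframe was being written as a single chunk.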

As for chunksize, I did some analysis, and using smaller chunks of ~100 MB shouldn't have a significant performance impact.

GregoryKimball added the feature request, libcudf, and cuIO labels on Jun 30, 2024
@davidwendt
Contributor

I know where this is occurring and will look into this.
