[FEA] Add large strings support in CSV writer #16137

GregoryKimball · 2024-06-30T21:05:35Z

Is your feature request related to a problem? Please describe.
The libcudf CSV writer throws a std::overflow_error when trying to write a dataframe or a chunk that exceeds 2.1B characters.

Describe the solution you'd like
There are a few options:

- Fix the overflow error in the CSV writer implementation. This would be a libcudf change. Since the CSV writer is fairly simple and relies on the libcudf strings API, there may be a straightforward solution.
- Change the default chunksize in libcudf or cuDF from std::numeric_limits<size_type>::max() to something based on the dataframe size, e.g. chunksize = len(df) // (df.memory_usage(deep=True).sum() / 500_000_000.)
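A minimal sketch of the second option, assuming a target of roughly 500 MB per chunk as in the formula above. The helper name `suggest_chunksize` and its parameters are hypothetical and not part of cuDF or libcudf; it takes plain integers so the arithmetic can be checked without a GPU.

```python
def suggest_chunksize(num_rows: int, total_bytes: int,
                      target_bytes: int = 500_000_000) -> int:
    """Return a row count per chunk so each chunk holds about target_bytes.

    Hypothetical helper: same idea as
    len(df) // (df.memory_usage(deep=True).sum() / 500_000_000.),
    but in integer arithmetic and clamped to a sensible range.
    """
    if num_rows <= 0 or total_bytes <= 0:
        # Empty or degenerate input: any positive chunksize works.
        return 1
    if total_bytes <= target_bytes:
        # Everything fits in a single chunk.
        return num_rows
    # Scale the row count by the ratio of target size to total size,
    # and never return fewer than one row per chunk.
    return max(1, num_rows * target_bytes // total_bytes)


# Round numbers matching the repro in this issue: 3M rows, about 3 GB.
print(suggest_chunksize(3_000_000, 3_000_000_000))
```

With a cuDF dataframe the inputs would presumably be `len(df)` and `int(df.memory_usage(deep=True).sum())`, with the result passed as the `chunksize` argument of `df.to_csv`. A single row that is itself larger than the limit would still fail, so this only works around the problem.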

Additional context
As for fixing the root cause, here is a quick repro:

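    import cudf
    # 1,000 rows of 1,000 characters each (~1 MB of string data)
    df = cudf.DataFrame({'text': ['a'] * 1000})
    df['text'] = df['text'].str.repeat(1000)
    # ~1 GB, then ~3 GB: more characters than size_type can index
    data_1gb = cudf.concat([df] * 1000, ignore_index=True)
    df = cudf.concat([data_1gb] * 3, ignore_index=True)
    df.to_csv('/raid/tmp.csv')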

which throws:

  File "csv.pyx", line 547, in cudf._lib.csv.write_csv
OverflowError: Writing CSV file with chunksize=3000000 failed. Consider providing a smaller chunksize argument.

but I couldn't immediately find where the std::overflow_error is getting thrown in io/csv/writer_impl.cu.
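For reference, the arithmetic behind the repro can be checked on the CPU. This sketch assumes size_type is a signed 32-bit integer, which matches the 2.1B figure quoted above, and counts only the characters in the text column (the written CSV also contains the index, delimiters, and newlines).

```python
# Largest value a signed 32-bit integer can hold: 2_147_483_647
SIZE_TYPE_MAX = 2**31 - 1

# The repro builds 1,000 rows, concatenates them 1,000 times, then 3 times.
rows = 1000 * 1000 * 3
# Each row holds 'a' repeated 1,000 times.
chars_per_row = 1000
total_chars = rows * chars_per_row

print(rows)                         # matches chunksize=3000000 in the error
print(total_chars)
print(total_chars > SIZE_TYPE_MAX)  # True: one chunk exceeds the limit
```

Because the default chunksize covers all 3,000,000 rows, the writer has to build a single strings column larger than the limit, which is consistent with the OverflowError shown above.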

As for chunksize, my analysis suggests that using smaller chunks of ~100 MB shouldn't have a significant performance impact.


davidwendt · 2024-07-01T12:46:36Z

I know where this is occurring and will look into this.

GregoryKimball added the feature request, libcudf, and cuIO labels Jun 30, 2024

GregoryKimball added this to libcudf Jun 30, 2024

davidwendt self-assigned this Jul 1, 2024

davidwendt mentioned this issue Jul 1, 2024

Use strings concatenate to support large strings in CSV writer #16148 (merged)

rapids-bot bot closed this as completed in #16148 Jul 5, 2024

rapids-bot bot closed this as completed in 37defc6 Jul 5, 2024

GregoryKimball removed this from libcudf Jul 22, 2024
