Produce useful guidance on overflow error in to_csv #12705

Merged (10 commits) on Feb 17, 2023

@wence- (Contributor) commented Feb 6, 2023

Description

Since writing to CSV files is implemented by converting all columns in
a dataframe to strings, and then concatenating those columns, when we
attempt to write a large dataframe to CSV without specifying a chunk
size, we can easily overflow the maximum column size.
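
As a rough illustration of the size argument above, here is a minimal sketch in plain Python. It assumes the column size limit is the maximum of a signed 32-bit integer, and the helper names `needs_chunking` and `rows_per_chunk` are hypothetical, not part of cuDF.

```python
# Assumption: one string column can hold at most INT32_MAX characters.
INT32_MAX = 2**31 - 1


def needs_chunking(n_rows: int, avg_row_bytes: int, limit: int = INT32_MAX) -> bool:
    """Return True if all rows, rendered as text, would not fit in one column."""
    return n_rows * avg_row_bytes > limit


def rows_per_chunk(avg_row_bytes: int, limit: int = INT32_MAX) -> int:
    """Largest row count whose concatenated text stays within `limit`."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return max(1, limit // avg_row_bytes)
```

For example, 100 million rows at roughly 30 bytes each come to about 3e9 characters, which exceeds the assumed limit, so the write would have to be split into chunks.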

Currently the error message is rather inscrutable: that the requested
size of a string column exceeds the column size limit. To help the
user, catch this error and provide a useful error message that points
them towards setting the chunksize argument.
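
The catch-and-re-raise pattern described above might look roughly like the following sketch. The wrapper name `write_with_guidance`, the `write_fn` callback, and the message wording are illustrative assumptions, not the code in this PR.

```python
def write_with_guidance(write_fn, chunksize=None):
    """Call write_fn(chunksize); on overflow, re-raise with advice.

    `write_fn` is a hypothetical stand-in for the low-level CSV writer.
    """
    try:
        return write_fn(chunksize)
    except OverflowError as err:
        # Only OverflowError is caught, so unrelated failures pass
        # through unchanged and never get misleading advice attached.
        raise OverflowError(
            f"Writing CSV file with chunksize={chunksize} failed. "
            "Consider providing a smaller chunksize argument."
        ) from err
```

Chaining with `from err` keeps the original low-level message available in the traceback for anyone who needs it.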

So that we don't produce false-positive advice, tighten the scope by
only catching OverflowError. To do this, make partial progress
towards resolving #10200 by throwing std::overflow_error when
checking for overflow of string column lengths.
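
To show the shape of the overflow check, here is a sketch in Python rather than C++. It assumes a signed 32-bit size limit, and `sizes_to_offsets` here is a hypothetical pure-Python function, not the libcudf utility. Python's `OverflowError` stands in for `std::overflow_error`; as far as I know, Cython's default exception translation maps the latter to the former.

```python
from itertools import accumulate

# Assumption: offsets must fit in a signed 32-bit integer.
INT32_MAX = 2**31 - 1


def sizes_to_offsets(sizes, limit=INT32_MAX):
    """Turn per-row string sizes into offsets, checking the total size.

    Python integers do not wrap, so the running sum is exact and can be
    compared against the limit after accumulation.
    """
    offsets = [0, *accumulate(sizes)]
    if offsets[-1] > limit:
        raise OverflowError(
            "Size of output exceeds the column size limit"
        )
    return offsets
```

Raising a dedicated overflow exception, rather than a generic logic error, is what lets the caller distinguish this failure from other errors.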

Closes #12690.

Partial progress towards rapidsai#10200. This will enable catching and
re-raising a useful overflow message in to_csv if the requested
dataframe write cannot be converted to a single string column without
overflow.
@wence- added the bug, libcudf, Python, and breaking labels Feb 6, 2023
@wence- requested review from a team as code owners February 6, 2023 11:23
@wence- force-pushed the wence/fix/issue-12690 branch from 5b6efeb to 5184a8a February 6, 2023 11:51
@codecov (bot) commented Feb 6, 2023

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@e4ffcbb). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 169e7d0 differs from pull request most recent head 9ff488d. Consider uploading reports for the commit 9ff488d to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.04   #12705   +/-   ##
===============================================
  Coverage                ?   85.81%           
===============================================
  Files                   ?      158           
  Lines                   ?    25146           
  Branches                ?        0           
===============================================
  Hits                    ?    21578           
  Misses                  ?     3568           
  Partials                ?        0           


@davidwendt (Contributor) left a comment

C++ LGTM

@shwina (Contributor) left a comment

Python changes look good!

python/cudf/cudf/_lib/csv.pyx (outdated, resolved)

@bdice previously requested changes Feb 7, 2023
cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh (outdated, resolved)
cpp/include/cudf/strings/detail/strings_children.cuh (outdated, resolved)
cpp/src/strings/regex/utilities.cuh (outdated, resolved)

@wence- requested a review from bdice February 8, 2023 10:52
@karthikeyann (Contributor) left a comment

LGTM 👍

@wence- dismissed bdice’s stale review February 14, 2023 12:04

Made appropriate fixes.

@wence- (Contributor, Author) commented Feb 14, 2023

/merge

@davidwendt (Contributor)

@wence- (Contributor, Author) commented Feb 16, 2023

There is one more test to fix as found here:
https://github.com/rapidsai/cudf/actions/runs/4192959486/attempts/1#summary-11386077118

Thanks @davidwendt, will do!

@rapids-bot (bot) merged commit 2969b24 into rapidsai:branch-23.04 Feb 17, 2023
@wence- deleted the wence/fix/issue-12690 branch February 22, 2023 11:02
Successfully merging this pull request may close these issues.

[ENH]: mention chunksize argument when to_csv fails