Create an int8 column in read_csv when all elements are missing #12110

Merged

vuule
Contributor

@vuule vuule commented Nov 9, 2022

Description

The CSV reader creates int8 columns when all elements are null. However, when all elements in a column are missing (e.g. the names option includes more columns than the CSV file contains), the CSV reader creates an int64 column. Such columns need eight times as much device memory for their data (8 bytes per element instead of 1).

This PR changes the behavior so that all columns with no valid elements are created as int8.
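
A minimal C++ sketch of the rule described above, not the actual libcudf code. The `type_counts` struct, the `inferred_type` enum, and the `infer_type` function are hypothetical names made up for illustration, and the sketch assumes that a column whose elements are all missing shows up as a histogram with every count at zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins for illustration; not the libcudf types.
enum class inferred_type { INT8, INT64, FLOAT64, STRING };

struct type_counts {
  std::int32_t null_count{};
  std::int32_t int_count{};
  std::int32_t float_count{};
  std::int32_t string_count{};

  // Number of elements that were actually counted in the column (nulls included).
  std::int32_t total_count() const
  {
    return null_count + int_count + float_count + string_count;
  }
};

// Picks a column type from the per-column counts gathered over num_rows rows.
inline inferred_type infer_type(type_counts const& c, std::int32_t num_rows)
{
  assert(num_rows >= 0);
  // No valid elements: every element is null, or (assumption) every element
  // is missing so nothing was counted. Use the smallest type in both cases.
  if (c.null_count == num_rows || c.total_count() == 0) { return inferred_type::INT8; }
  if (c.string_count > 0) { return inferred_type::STRING; }
  if (c.float_count > 0) { return inferred_type::FLOAT64; }
  return inferred_type::INT64;
}
```

Under these assumptions, a `type_counts` with `null_count == 3` over three rows and a default-constructed `type_counts` both map to `inferred_type::INT8`, while a column with at least one integer and no floats or strings still maps to `inferred_type::INT64`.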

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule self-assigned this Nov 9, 2022
@vuule vuule added the improvement (Improvement / enhancement to an existing function) and breaking (Breaking change) labels Nov 9, 2022
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 9, 2022
@vuule vuule added the cuIO cuIO issue label Nov 9, 2022
codecov bot commented Nov 9, 2022

Codecov Report

Base: 87.47% // Head: 88.12% // Increases project coverage by +0.64% 🎉

Coverage data is based on head (809a19e) compared to base (f817d96).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #12110      +/-   ##
================================================
+ Coverage         87.47%   88.12%   +0.64%     
================================================
  Files               133      135       +2     
  Lines             21826    22133     +307     
================================================
+ Hits              19093    19504     +411     
+ Misses             2733     2629     -104     
| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/core/column/interval.py | 85.45% <0.00%> (-9.10%) ⬇️ |
| python/cudf/cudf/io/text.py | 91.66% <0.00%> (-8.34%) ⬇️ |
| python/cudf/cudf/core/_base_index.py | 81.28% <0.00%> (-4.27%) ⬇️ |
| python/cudf/cudf/io/json.py | 92.06% <0.00%> (-2.68%) ⬇️ |
| python/cudf/cudf/utils/utils.py | 89.91% <0.00%> (-0.69%) ⬇️ |
| python/cudf/cudf/core/column/timedelta.py | 90.17% <0.00%> (-0.58%) ⬇️ |
| python/cudf/cudf/core/column/datetime.py | 89.21% <0.00%> (-0.51%) ⬇️ |
| python/cudf/cudf/core/column/column.py | 87.96% <0.00%> (-0.46%) ⬇️ |
| python/dask_cudf/dask_cudf/core.py | 73.72% <0.00%> (-0.41%) ⬇️ |
| python/cudf/cudf/io/parquet.py | 90.45% <0.00%> (-0.39%) ⬇️ |

... and 43 more

@vuule vuule marked this pull request as ready for review November 10, 2022 19:32
@vuule vuule requested a review from a team as a code owner November 10, 2022 19:32
@vuule vuule requested review from harrism and nvdbaranec November 10, 2022 19:32
Member

@PointKernel PointKernel left a comment

Nice small fix. Thanks!

@vuule vuule added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label Nov 14, 2022
Member

@harrism harrism left a comment

Looks good.

@@ -33,6 +33,11 @@ struct column_type_histogram {
cudf::size_type positive_small_int_count{};
cudf::size_type big_int_count{};
cudf::size_type bool_count{};
auto total_count() const
Member

Perhaps for the future -- seems like this struct would be more versatile and more easily extendible (and totals easier to calculate in a bug-free way) if you store a std::array (or similar) of counts and index it using constant names for indices. Then this line could just be a std::accumulate (or similar).
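
A rough C++ sketch of what this suggestion could look like, offered as an illustration rather than a proposed patch. The `count_index` enum and the `column_type_histogram_sketch` struct are hypothetical names; only three categories are listed, mirroring the members visible in the diff context above, and the remaining ones are left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical index names, one per counted category.
enum count_index : std::size_t {
  positive_small_int,
  big_int,
  boolean,
  num_counts  // keep last: number of categories
};

struct column_type_histogram_sketch {
  std::array<std::int32_t, num_counts> counts{};  // all counts start at zero

  // Access a single count by its named index.
  std::int32_t& operator[](count_index i)
  {
    assert(i < num_counts);
    return counts[i];
  }

  // Sum of all counts; a newly added category is included automatically.
  std::int32_t total_count() const
  {
    return std::accumulate(counts.begin(), counts.end(), std::int32_t{0});
  }
};
```

With this layout, adding a category means adding one enumerator before `num_counts`, and `total_count()` needs no change. If the struct is also used from device code, a device-compatible array type would likely be needed in place of `std::array`; that is outside the scope of this sketch.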

@vuule
Contributor Author

vuule commented Nov 15, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b2e5069 into rapidsai:branch-22.12 Nov 15, 2022
@vuule vuule deleted the bug-read_csv-empty=column-type branch November 15, 2022 18:51