Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the issue of empty lists having empty offsets #10935

Merged

Conversation

devavret
Copy link
Contributor

@devavret devavret commented May 23, 2022

Added special handling for list cases where empty lists have empty offsets. Generally the size of offsets column in a list column is list column size + 1. For empty list columns, offsets are allowed to be empty instead of size 1. This caused issues in processing lists in the parquet writer.

This PR makes use of the list_column_device_view wrapper to handle this case wherever possible and replaces empty list columns with a temporary list column with offset of size 1 where list_column_device_view could not be used.

Fixes #10536

@devavret devavret requested a review from a team as a code owner May 23, 2022 21:07
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 23, 2022
@devavret devavret requested a review from vuule May 23, 2022 21:07
@vuule vuule added bug Something isn't working non-breaking Non-breaking change cuIO cuIO issue labels May 23, 2022
@codecov
Copy link

codecov bot commented May 23, 2022

Codecov Report

Merging #10935 (a9b04ed) into branch-22.06 (54789ee) will increase coverage by 0.02%.
The diff coverage is 93.75%.

❗ Current head a9b04ed differs from pull request most recent head 4368496. Consider uploading reports for the commit 4368496 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10935      +/-   ##
================================================
+ Coverage         86.30%   86.32%   +0.02%     
================================================
  Files               144      144              
  Lines             22665    22668       +3     
================================================
+ Hits              19560    19569       +9     
+ Misses             3105     3099       -6     
Impacted Files Coverage Δ
python/cudf/cudf/core/indexed_frame.py 91.70% <ø> (ø)
python/cudf/cudf/utils/ioutils.py 79.47% <87.50%> (-0.13%) ⬇️
python/cudf/cudf/io/avro.py 78.57% <100.00%> (ø)
python/cudf/cudf/io/csv.py 91.80% <100.00%> (ø)
python/cudf/cudf/io/json.py 97.56% <100.00%> (ø)
python/cudf/cudf/io/orc.py 92.77% <100.00%> (ø)
python/cudf/cudf/io/parquet.py 90.83% <100.00%> (ø)
python/cudf/cudf/io/text.py 100.00% <100.00%> (ø)
python/cudf/cudf/core/dataframe.py 93.78% <0.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/string.py 88.78% <0.00%> (+0.12%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6acf226...4368496. Read the comment docs.

@devavret
Copy link
Contributor Author

rerun tests

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@PointKernel
Copy link
Member

rerun tests

@davidwendt
Copy link
Contributor

Please add a description to this PR since it will appear in the changelog.

@vuule vuule added the 0 - Waiting on Author Waiting for author to respond to review label May 24, 2022
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after adding PR description.

@devavret devavret requested a review from PointKernel May 25, 2022 07:34
std::unique_ptr<column> empty_list_offset_col;
if (has_empty_list_offsets) {
empty_list_offset_col = make_fixed_width_column(data_type(type_id::INT32), 1);
cudaMemsetAsync(empty_list_offset_col->mutable_view().head(), 0, sizeof(size_type), stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this could be done with a simple column_view wrapper around a pointer to
__constant__ int32_t empty_list_offset_data = 0;
This is probably not worth doing for a single value just to save the Memset call.

@raydouglass raydouglass merged commit a57d1e1 into rapidsai:branch-22.06 May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Waiting on Author Waiting for author to respond to review bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Parquet writes of a LIST of LISTS with a null validity crashes
6 participants