Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STRUCT column support for cudf::merge. #8422

Merged
merged 7 commits into from
Jun 16, 2021

Conversation

nvdbaranec
Copy link
Contributor

Partially addresses #8050

Adds support for merging of struct columns. The struct columns cannot be used as keys in the merge.

@nvdbaranec nvdbaranec added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Jun 1, 2021
@nvdbaranec nvdbaranec requested a review from a team as a code owner June 1, 2021 19:32
Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update copyright

cpp/src/merge/merge.cu Outdated Show resolved Hide resolved
@nvdbaranec nvdbaranec requested a review from rgsl888prabhu June 3, 2021 19:58
@nvdbaranec
Copy link
Contributor Author

rerun tests

@codecov
Copy link

codecov bot commented Jun 14, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@d30695c). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 25988fd differs from pull request most recent head 9113c5f. Consider uploading reports for the commit 9113c5f to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.08    #8422   +/-   ##
===============================================
  Coverage                ?   82.91%           
===============================================
  Files                   ?      110           
  Lines                   ?    18094           
  Branches                ?        0           
===============================================
  Hits                    ?    15002           
  Misses                  ?     3092           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d30695c...9113c5f. Read the comment docs.

cpp/src/merge/merge.cu Outdated Show resolved Hide resolved
Comment on lines +378 to +389
rmm::device_buffer validity =
lcol.has_nulls() || rcol.has_nulls()
? create_null_mask(merged_size, mask_state::UNINITIALIZED, stream, mr)
: rmm::device_buffer{};
if (lcol.has_nulls() || rcol.has_nulls()) {
materialize_bitmask(lcol,
rcol,
static_cast<bitmask_type*>(validity.data()),
merged_size,
row_order_.data(),
stream);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use is_nullable here and let the caller be responsible for introspecting the data. If either of the inputs are nullable, the output should be nullable, regardless of whether there any invalid elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely not consistent with how cudf works though. We try and drop validity whenever possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a mixed bag whether we do or not, but I think there's a strong argument for letting the user decide. Everywhere else we strongly prefer to avoid data introspection, and null-counting is data introspection. Switching to is_nullable means we are no longer introspecting the validity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrhemstad Pinging Jake.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving because there's no point in blocking on this.

Copy link
Contributor

@cwharris cwharris Jun 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just want to start a general discussion.

Copy link
Contributor Author

@nvdbaranec nvdbaranec Jun 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the meta here is that we should be expecting null count to be already computed more often than not, so at least on the performance side it's theoretically free.

@nvdbaranec nvdbaranec requested a review from cwharris June 14, 2021 22:06
Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just one small recommended change.

cpp/src/merge/merge.cu Show resolved Hide resolved
@codereport
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 468c573 into rapidsai:branch-21.08 Jun 16, 2021
@codereport codereport mentioned this pull request Jun 16, 2021
rapids-bot bot pushed a commit that referenced this pull request Jun 16, 2021
Small cleanup post #8422

Authors:
  - Conor Hoekstra (https://github.com/codereport)

Approvers:
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - Mark Harris (https://github.com/harrism)

URL: #8534
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants