Change cudf::test::make_null_mask to also return null-count #13081

Merged: 15 commits, Apr 14, 2023

@davidwendt davidwendt commented Apr 6, 2023

Description

Change the cudf::test::make_null_mask to return both the null-mask and the null-count. Callers can then use this null-count instead of UNKNOWN_NULL_COUNT. These changes include removing UNKNOWN_NULL_COUNT usage from the libcudf C++ test source code.
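To make the new return shape concrete, here is a minimal host-only sketch. It is not the cudf::test implementation: the names `make_host_null_mask` and `mask_word`, the 32-bit word size, and the LSB-first bit order are assumptions made for illustration. The point is only that the null count falls out of the same pass that builds the mask, so callers never need a placeholder count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in, not the cudf::test API.
// Assumption: bit i is set when element i is valid, LSB-first in 32-bit words.
using mask_word = std::uint32_t;

template <typename ValidityIterator>
std::pair<std::vector<mask_word>, std::size_t> make_host_null_mask(ValidityIterator begin,
                                                                   ValidityIterator end)
{
  auto const distance = std::distance(begin, end);
  assert(distance >= 0);  // the range must be well-formed

  constexpr std::size_t bits_per_word = 32;
  auto const size = static_cast<std::size_t>(distance);
  std::vector<mask_word> mask((size + bits_per_word - 1) / bits_per_word, 0);

  // Count nulls in the same pass that sets the validity bits.
  std::size_t null_count = 0;
  std::size_t index      = 0;
  for (auto it = begin; it != end; ++it, ++index) {
    if (*it) {
      mask[index / bits_per_word] |= mask_word{1} << (index % bits_per_word);
    } else {
      ++null_count;
    }
  }
  return {std::move(mask), null_count};
}
```

A caller would then write something like `auto [mask, null_count] = make_host_null_mask(v.begin(), v.end());` and pass both values on to a column factory.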

One side effect surfaced along the way: a strings column with all nulls can technically have no children, but using UNKNOWN_NULL_COUNT allowed the check for this to be bypassed. As a result, many utilities started to fail once UNKNOWN_NULL_COUNT was removed. The factory was modified to remove the check, so an all-null strings column now gets an offsets column and an empty chars column as children.

More code will likely need to change when UNKNOWN_NULL_COUNT is no longer used as a default parameter for factories and other column functions.

No behavior is changed. Since the cudf::test::make_null_mask is technically a public API, this PR could be marked as a breaking change as well.

Contributes to: #11968

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added the 2 - In Progress, libcudf, improvement, and non-breaking labels Apr 6, 2023
@davidwendt davidwendt self-assigned this Apr 6, 2023
@davidwendt davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Apr 10, 2023
@davidwendt davidwendt marked this pull request as ready for review April 10, 2023 12:49
@davidwendt davidwendt requested a review from a team as a code owner April 10, 2023 12:49
@davidwendt davidwendt requested review from vyasr and mythrocks April 10, 2023 12:49
rapids-bot bot pushed a commit that referenced this pull request Apr 11, 2023
Removes `using namespace cudf;` from the gtests source code to make it easier to read, in particular to find where utilities and function calls are implemented. Also removes a few `using namespace cudf::test;` usages, which by extension include namespace `cudf`.

Found these while working on #13081
Reference #11734

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)

URL: #13089
@davidwendt davidwendt requested review from a team as code owners April 11, 2023 14:57
@github-actions github-actions bot added the CMake and Java labels Apr 11, 2023
@mythrocks mythrocks left a comment

Sorry for the delayed review. LGTM! A couple of minor nitpicks.

cpp/include/cudf_test/column_wrapper.hpp (resolved)
cpp/tests/copying/scatter_list_tests.cpp (resolved)
cpp/tests/interop/from_arrow_test.cpp (outdated, resolved)
@vyasr vyasr left a comment

For a detail API this is used a ton in tests... In the end I suspect we're going to need to expose a number of our currently detail-only mask APIs publicly. Anyway this LGTM for now.

Comment on lines +257 to +276
thrust::host_vector<std::string> host_data(c.size());
if (c.size() > c.null_count()) {
  auto const scv     = strings_column_view(c);
  auto const h_chars = cudf::detail::make_std_vector_sync<char>(
    cudf::device_span<char const>(scv.chars().data<char>(), scv.chars().size()),
    cudf::get_default_stream());
  auto const h_offsets = cudf::detail::make_std_vector_sync(
    cudf::device_span<cudf::offset_type const>(
      scv.offsets().data<cudf::offset_type>() + scv.offset(), scv.size() + 1),
    cudf::get_default_stream());

  // build std::string vector from chars and offsets
  std::transform(
    std::begin(h_offsets),
    std::end(h_offsets) - 1,
    std::begin(h_offsets) + 1,
    host_data.begin(),
    [&](auto start, auto end) { return std::string(h_chars.data() + start, end - start); });
}
return {std::move(host_data), bitmask_to_host(c)};

The git diff here is a bit hard to read, but IIUC this whole change is just to preallocate the vector and then early return if you don't have any non-null entries to write, is that correct?

Contributor Author

Yes, that is correct.
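
The pattern confirmed in this thread can be tried without a GPU. The sketch below is a host-only analogue, not the cudf code: `to_host_strings` and its parameters are hypothetical, and plain `std::vector` stands in for the device-to-host copies. It preallocates one string per row and only walks the offsets when at least one row is non-null, so an all-null column never touches the (possibly empty) chars and offsets buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-only analogue of the reviewed snippet.
// offsets holds size + 1 entries; row i spans chars[offsets[i], offsets[i + 1]).
std::vector<std::string> to_host_strings(std::vector<char> const& chars,
                                         std::vector<std::int32_t> const& offsets,
                                         std::size_t size,
                                         std::size_t null_count)
{
  // Preallocate one (empty) string per row; null rows simply stay empty.
  std::vector<std::string> host_data(size);

  // Only read chars/offsets when there is at least one non-null row.
  if (size > null_count) {
    assert(offsets.size() == size + 1);
    // Pair each offset with its successor to get [start, end) for every row.
    std::transform(offsets.begin(),
                   offsets.end() - 1,
                   offsets.begin() + 1,
                   host_data.begin(),
                   [&](auto start, auto end) {
                     return std::string(chars.data() + start,
                                        static_cast<std::size_t>(end - start));
                   });
  }
  return host_data;
}
```

Because the guard compares `size` with `null_count`, this shortcut is only as reliable as the null count passed in, which fits the PR's move away from placeholder counts.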

cpp/include/cudf_test/column_wrapper.hpp (outdated, resolved)

  int begin_bit = 0;
  int end_bit = 800;
- auto gold_splice_mask = cudf::test::detail::make_null_mask(validity_bit.begin() + begin_bit,
-                                                            validity_bit.begin() + end_bit);
+ auto gold_splice_mask = std::get<0>(cudf::test::detail::make_null_mask(

Out of curiosity, is there a reason you prefer std::get<0> to .first for pairs? For consistency with tuples?

Contributor Author

I like that this gets us the first value from the structured binding without an unused variable and without resorting to the vagueness of .first and .second.
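
For readers comparing the options discussed here, a small self-contained sketch follows. `make_mask_and_count` is a hypothetical stand-in that returns a pair, as the changed utility does. The element types are placeholders chosen so the example compiles on its own.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in returning {mask, null_count}; the types are placeholders.
std::pair<std::vector<int>, int> make_mask_and_count()
{
  return {std::vector<int>{1, 0, 1}, 1};
}

// Option discussed in the thread: take only the first element, by position.
std::vector<int> mask_only() { return std::get<0>(make_mask_and_count()); }

// The same idea for the second element.
int count_only() { return std::get<1>(make_mask_and_count()); }

// Alternatives, shown as comments for comparison:
//   auto [mask, count] = make_mask_and_count();  // count is unused if only mask is needed
//   auto mask = make_mask_and_count().first;     // works, but the name says little
```

`std::get<N>` applies to both `std::pair` and `std::tuple`, so call sites written this way would not need to change if the return type later grew a third element.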

@davidwendt

/merge

@rapids-bot rapids-bot bot merged commit 4481142 into rapidsai:branch-23.06 Apr 14, 2023
@davidwendt davidwendt deleted the make-null-mask-count branch April 14, 2023 21:38
rapids-bot bot pushed a commit that referenced this pull request Apr 17, 2023
)

Add a `null_count` parameter to the `cudf::io::json::experimental::detail::parse_data` function, which already accepts a `null_mask`. Normally, the callers already know the count. This function can use the parameter to help build the output column.

Found while working on #13081
Contributes to: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13107
Labels
3 - Ready for Review, CMake, improvement, Java, libcudf, non-breaking