Support structs of lists in row lexicographic comparator #13005

Merged on Apr 10, 2023 (25 commits)
Changes from 20 commits
2 changes: 1 addition & 1 deletion cpp/include/cudf/table/experimental/row_operators.cuh
@@ -397,7 +397,7 @@ class device_row_comparator {
}

if (lcol.num_child_columns() == 0) {
- return cuda::std::pair(weak_ordering::EQUIVALENT, depth);
+ return cuda::std::pair(weak_ordering::EQUIVALENT, std::numeric_limits<int>::max());
}

// Non-empty structs have been modified to only have 1 child when using this.
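One way to read this one-line change, sketched as a host-side toy model rather than the cudf device code: assume the row-level loop stores the second element of the returned pair as the depth of the last null it saw, and skips any later decomposed column whose depth is greater. Under that assumption, a non-null empty struct that reports its own `depth` would hide the columns that follow it, while reporting `std::numeric_limits<int>::max()` hides nothing. The names `Ordering`, `ColumnResult`, and `compare_row` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <limits>
#include <vector>

enum class Ordering { Less, Equivalent, Greater };

// Outcome of comparing one decomposed column for a fixed pair of rows.
struct ColumnResult {
  int depth;           // depth assigned to this column by the decomposition
  Ordering state;      // ordering this column reports
  int reported_depth;  // second element of the pair the column comparison returns
};

// Assumed row-level loop: a column deeper than the last reported depth is skipped.
Ordering compare_row(std::vector<ColumnResult> const& columns)
{
  int last_null_depth = std::numeric_limits<int>::max();
  for (auto const& col : columns) {
    assert(col.depth >= 0);
    if (col.depth > last_null_depth) { continue; }
    last_null_depth = col.reported_depth;
    if (col.state != Ordering::Equivalent) { return col.state; }
  }
  return Ordering::Equivalent;
}
```

For a `Struct<>` column at depth 0 followed by two columns at depth 1, a reported depth of 0 makes `compare_row` skip both followers and return `Equivalent`, even if the second column says `Less`. A reported depth of `max()` lets the second column decide the result.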
39 changes: 33 additions & 6 deletions cpp/src/table/row_operators.cu
@@ -85,13 +85,20 @@ table_view remove_struct_child_offsets(table_view table)
/**
* @brief Decompose all struct columns in a table
*
- * If a struct column is a tree with N leaves, then this function decomposes the tree into
+ * If a structs column is a tree with N leaves, then this function decomposes the tree into
* N "linear trees" (branch factor == 1) and prunes common parents. Also returns a vector of
* per-column `depth`s.
*
* A `depth` value is the number of nested levels as parent of the column in the original,
* non-decomposed table, which are pruned during decomposition.
*
+ * Special handling is needed when a structs column has a lists column as its first child. In
+ * that case, the function decomposes the tree of N leaves into N+1 linear trees, in which the
+ * second tree is generated by extracting the leaf of the first tree. This ensures that no
+ * structs column in the output has a lists column as a child. Note that structs with lists
+ * children in subsequent positions do not require any special treatment, because the struct
+ * parent is pruned for all subsequent children.
+ *
* For example, if the original table has a column `Struct<Struct<int, float>, decimal>`,
*
* S1
@@ -113,7 +120,7 @@ table_view remove_struct_child_offsets(table_view table)
* The depth of the first column is 0 because it contains all its parent levels, while the depth
* of the second column is 2 because two of its parent struct levels were pruned.
*
- * Similarly, a struct column of type Struct<int, Struct<float, decimal>> is decomposed as follows
+ * Similarly, a struct column of type `Struct<int, Struct<float, decimal>>` is decomposed as follows
*
* S1
* / \
@@ -148,6 +155,10 @@ table_view remove_struct_child_offsets(table_view table)
* The list parents are still needed to define the range of elements in the leaf that belong to the
* same row.
*
+ * In the case of a structs column whose first child is a lists column, such as
+ * `Struct<List<int>, float>`, after decomposition we get three columns: `Struct<>`,
+ * `List<int>`, and `float`.
+ *
* @param table The table whose struct columns to decompose.
* @param column_order The per-column order if using output with lexicographic comparison
* @param null_precedence The per-column null precedence
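The decomposition described in this comment can be modelled on a toy type tree. The sketch below is a simplified stand-in and not the cudf implementation: `Node`, `Decomposed`, `walk`, and `decompose` are hypothetical names, a list node is given a single child (no offsets column), and each branch records only type names. In the toy output the trailing `S` of the first branch plays the role of `Struct<>`.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

enum class Kind { Leaf, List, Struct };

struct Node {
  Kind kind;
  std::string name;
  std::vector<Node> children;
};

struct Decomposed {
  // A deque keeps earlier branches at stable addresses while new ones are appended.
  std::deque<std::vector<std::string>> branches;
  std::vector<int> depths;  // one depth per branch
};

void walk(Node const& node, std::vector<std::string>* branch, int depth, Decomposed& out)
{
  branch->push_back(node.name);
  if (node.kind == Kind::List) {
    assert(node.children.size() == 1);  // toy lists have exactly one child
    walk(node.children[0], branch, depth + 1, out);
  } else if (node.kind == Kind::Struct) {
    for (std::size_t i = 0; i < node.children.size(); ++i) {
      // Start a new branch for every child after the first, and also for the
      // first child when it is a list (the special case described above).
      bool const first_child_is_list = (i == 0 && node.children[0].kind == Kind::List);
      if (i > 0 || first_child_is_list) {
        out.depths.push_back(depth + 1);
        branch = &out.branches.emplace_back();
      }
      walk(node.children[i], branch, depth + 1, out);
    }
  }
}

Decomposed decompose(Node const& root)
{
  Decomposed out;
  out.depths.push_back(0);
  walk(root, &out.branches.emplace_back(), 0, out);
  return out;
}
```

In this model, `Struct<List<int>, float>` gives the branches `[S]`, `[L, int]`, `[float]` with depths `0, 1, 1`. `Struct<Struct<int, float>, decimal>` gives `[S1, S2, int]`, `[float]`, `[decimal]` with depths `0, 2, 1`, which agrees with the depths stated in the comment.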
@@ -180,7 +191,12 @@ auto decompose_structs(table_view table,
c->children[lists_column_view::child_column_index].get(), branch, depth + 1);
} else if (c->type().id() == type_id::STRUCT) {
for (size_t child_idx = 0; child_idx < c->children.size(); ++child_idx) {
- if (child_idx > 0) {
+ // When child_idx == 0, we also cut off the current branch if its first child is a
+ // lists column.
+ // In such cases, the last column of the current branch will be `Struct<List,...>` and
+ // it will be modified to empty struct type `Struct<>` later on.
+ if (child_idx > 0 ||
+ (child_idx == 0 && c->children[0]->type().id() == type_id::LIST)) {
verticalized_col_depths.push_back(depth + 1);
branch = &flattened.emplace_back();
}
@@ -194,6 +210,19 @@ auto decompose_structs(table_view table,

for (auto const& branch : flattened) {
column_view temp_col = *branch.back();

+ // Change `Struct<List,...>` into empty struct type `Struct<>`.
+ if (temp_col.type().id() == type_id::STRUCT &&
+ (temp_col.num_children() > 0 && temp_col.child(0).type().id() == type_id::LIST)) {
+ temp_col = column_view(temp_col.type(),
+ temp_col.size(),
+ temp_col.head(),
+ temp_col.null_mask(),
+ UNKNOWN_NULL_COUNT,
+ temp_col.offset(),
+ {});
+ }

Review thread on the `UNKNOWN_NULL_COUNT` argument:

Contributor: Can the null count be known here?

Contributor Author: It can be queried with `temp_col.null_count()`, which may currently trigger a kernel launch. Good point: given our plan to remove `UNKNOWN_NULL_COUNT`, using `temp_col.null_count()` here would avoid having to modify this again in the future.

Contributor Author: Oh, this is also addressed in #13102.

for (auto it = branch.crbegin() + 1; it < branch.crend(); ++it) {
auto const& prev_col = *(*it);
auto children =
@@ -260,7 +289,7 @@ auto decompose_structs(table_view table,
* This helper function generates dremel data for any list-type columns in a
* table. This data is necessary for lexicographic comparisons.
*/
- auto list_lex_preprocess(table_view table, rmm::cuda_stream_view stream)
+ auto list_lex_preprocess(table_view const& table, rmm::cuda_stream_view stream)
{
std::vector<detail::dremel_data> dremel_data;
std::vector<detail::dremel_device_view> dremel_device_views;
@@ -293,8 +322,6 @@ void check_lex_compatibility(table_view const& input)
check_column(list_col.child());
} else if (c.type().id() == type_id::STRUCT) {
for (auto child = c.child_begin(); child < c.child_end(); ++child) {
- CUDF_EXPECTS(child->type().id() != type_id::LIST,
- "Cannot lexicographic compare a table with a STRUCT of LIST column");
check_column(*child);
}
}
4 changes: 2 additions & 2 deletions cpp/tests/CMakeLists.txt
@@ -295,8 +295,8 @@ endif()
# ##################################################################################################
# * sort tests ------------------------------------------------------------------------------------
ConfigureTest(
- SORT_TEST sort/segmented_sort_tests.cpp sort/sort_test.cpp sort/stable_sort_tests.cpp
- sort/rank_test.cpp
+ SORT_TEST sort/segmented_sort_tests.cpp sort/sort_nested_types_tests.cpp sort/sort_test.cpp
+ sort/stable_sort_tests.cpp sort/rank_test.cpp
GPUS 1
PERCENT 70
)
12 changes: 10 additions & 2 deletions cpp/tests/groupby/structs_tests.cpp
@@ -294,11 +294,12 @@ TYPED_TEST(groupby_structs_test, all_null_input)
test_sum_agg(keys, values, expected_keys, expected_values);
}

- TYPED_TEST(groupby_structs_test, lists_are_unsupported)
+ TYPED_TEST(groupby_structs_test, lists_as_keys)
{
using V = int32_t; // Type of Aggregation Column.
using M0 = int32_t; // Type of STRUCT's first (i.e. 0th) member.
using M1 = TypeParam; // Type of STRUCT's second (i.e. 1st) member.
+ using R = cudf::detail::target_type_t<V, cudf::aggregation::SUM>;

// clang-format off
auto values = fwcw<V> { 0, 1, 2, 3, 4 };
@@ -307,5 +308,12 @@ TYPED_TEST(groupby_structs_test, lists_are_unsupported)
// clang-format on
auto keys = cudf::test::structs_column_wrapper{{member_0, member_1}};

- EXPECT_THROW(test_sum_agg(keys, values, keys, values), cudf::logic_error);
+ // clang-format off
+ auto expected_values = fwcw<R> { 3, 5, 2 };
+ auto expected_member_0 = lcw<M0> { {1,1}, {2,2}, {3,3} };
+ auto expected_member_1 = fwcw<M1>{ 1, 2, 3 };
+ // clang-format on
+ auto expected_keys = cudf::test::structs_column_wrapper{{expected_member_0, expected_member_1}};
+
+ test_sum_agg(keys, values, expected_keys, expected_values);
}
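The updated test now expects a sum aggregation keyed on a struct that contains a list. What that means can be sketched with the standard library alone, using a `std::pair<std::vector<int>, int>` as a stand-in for `Struct<List<int>, int>`. This is an illustration and not cudf code: `Key` and `sum_by_key` are hypothetical names, and the sample keys in the note below the block are an assumption, because the test's input keys are not visible in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Stand-in for one row of a Struct<List<int>, int> key column.
using Key = std::pair<std::vector<int>, int>;

// Sum `values` per distinct key; std::map keeps the keys in lexicographic order.
std::map<Key, long long> sum_by_key(std::vector<Key> const& keys, std::vector<int> const& values)
{
  assert(keys.size() == values.size());
  std::map<Key, long long> sums;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    sums[keys[i]] += values[i];
  }
  return sums;
}
```

With the values `0, 1, 2, 3, 4` from the test and assumed keys `({1,1}, 1)`, `({2,2}, 2)`, `({3,3}, 3)`, `({1,1}, 1)`, `({2,2}, 2)`, the sums per key are `3`, `5`, and `2`, which is consistent with `expected_values` above.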
181 changes: 181 additions & 0 deletions cpp/tests/sort/sort_nested_types_tests.cpp
@@ -0,0 +1,181 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_wrapper.hpp>
#include <cudf_test/iterator_utilities.hpp>
#include <cudf_test/type_lists.hpp>

#include <cudf/copying.hpp>
#include <cudf/sorting.hpp>

using int32s_lists = cudf::test::lists_column_wrapper<int32_t>;
using int32s_col = cudf::test::fixed_width_column_wrapper<int32_t>;
using strings_col = cudf::test::strings_column_wrapper;
using structs_col = cudf::test::structs_column_wrapper;

using namespace cudf::test::iterators;

constexpr auto null{0};

struct NestedStructTest : public cudf::test::BaseFixture {
};

TEST_F(NestedStructTest, SimpleStructsOfListsNoNulls)
{
auto const input = [] {
auto child = int32s_lists{{4, 2, 0}, {2}, {0, 5}, {1, 5}, {4, 1}};
return structs_col{{child}};
}();

{
auto const expected_order = int32s_col{2, 3, 1, 4, 0};
auto const order = cudf::sorted_order(cudf::table_view{{input}});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}

{
auto const expected_order = int32s_col{0, 4, 1, 3, 2};
auto const order = cudf::sorted_order(cudf::table_view{{input}}, {cudf::order::DESCENDING});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}
}
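The expected orders in this test follow plain lexicographic comparison of the list rows. A minimal host-side check, assuming `std::vector<int>`'s built-in `operator<` as the reference ordering (the helper name `lexicographic_order` is hypothetical and not part of cudf):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Return the row indices that sort `rows` lexicographically.
std::vector<int> lexicographic_order(std::vector<std::vector<int>> const& rows,
                                     bool descending = false)
{
  std::vector<int> order(rows.size());
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
    return descending ? rows[b] < rows[a] : rows[a] < rows[b];
  });
  assert(order.size() == rows.size());
  return order;
}
```

Applied to the rows `{4, 2, 0}, {2}, {0, 5}, {1, 5}, {4, 1}`, this gives `2, 3, 1, 4, 0` ascending and `0, 4, 1, 3, 2` descending, the same index orders the test expects.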

TEST_F(NestedStructTest, SimpleStructsOfListsWithNulls)
{
auto const input = [] {
auto child =
int32s_lists{{{4, 2, null}, null_at(2)}, {2}, {{null, 5}, null_at(0)}, {0, 5}, {4, 1}};
return structs_col{{child}};
}();

{
auto const expected_order = int32s_col{2, 3, 1, 4, 0};
auto const order = cudf::sorted_order(cudf::table_view{{input}});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}

{
auto const expected_order = int32s_col{0, 4, 1, 3, 2};
auto const order = cudf::sorted_order(cudf::table_view{{input}}, {cudf::order::DESCENDING});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}
}

TEST_F(NestedStructTest, StructsHaveListsNoNulls)
{
// Input has equal elements, thus needs to be tested by stable sort.
auto const input = [] {
auto child0 = int32s_lists{{4, 2, 0}, {}, {5}, {4, 1}, {4, 0}, {}, {}};
auto child1 = int32s_col{1, 2, 5, 0, 3, 3, 4};
return structs_col{{child0, child1}};
}();

{
auto const expected_order = int32s_col{1, 5, 6, 4, 3, 0, 2};
auto const order = cudf::stable_sorted_order(cudf::table_view{{input}});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}

{
auto const expected_order = int32s_col{2, 0, 3, 4, 6, 5, 1};
auto const order =
cudf::stable_sorted_order(cudf::table_view{{input}}, {cudf::order::DESCENDING});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}
}

TEST_F(NestedStructTest, StructsHaveListsWithNulls)
{
// Input has equal elements, thus needs to be tested by stable sort.
auto const input = [] {
auto child0 =
int32s_lists{{{4, 2, null}, null_at(2)}, {}, {} /*NULL*/, {5}, {4, 1}, {4, 0}, {}, {}};
auto child1 = int32s_col{{1, 2, null, 5, null, 3, 3, 4}, nulls_at({2, 4})};
return structs_col{{child0, child1}, null_at(2)};
}();

{
auto const expected_order = int32s_col{2, 1, 6, 7, 5, 4, 0, 3};
auto const order = cudf::stable_sorted_order(cudf::table_view{{input}});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}

{
auto const expected_order = int32s_col{3, 0, 4, 5, 7, 6, 1, 2};
auto const order =
cudf::stable_sorted_order(cudf::table_view{{input}}, {cudf::order::DESCENDING});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}
}
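The null handling these expectations imply (null rows first when ascending, last when descending) can be reproduced on the host with `std::optional`, whose relational operators order an empty optional before any value. The sketch below is a stand-in under that assumption and not cudf code: `NullableInt`, `NullableList`, `Row`, and `stable_order` are hypothetical names, and a null struct row is modelled as an empty `Row`.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <optional>
#include <utility>
#include <vector>

using NullableInt  = std::optional<int>;
using NullableList = std::vector<NullableInt>;
// One row of Struct<List<int>, int>; an empty optional models a null struct row.
using Row = std::optional<std::pair<NullableList, NullableInt>>;

// Stable index sort; std::optional compares an empty value as smaller than any value.
std::vector<int> stable_order(std::vector<Row> const& rows, bool descending = false)
{
  std::vector<int> order(rows.size());
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
    return descending ? rows[b] < rows[a] : rows[a] < rows[b];
  });
  assert(order.size() == rows.size());
  return order;
}
```

Feeding in the eight rows of this test, with row 2 as an empty `Row` and the null element and null member as `std::nullopt`, yields `2, 1, 6, 7, 5, 4, 0, 3` ascending and `3, 0, 4, 5, 7, 6, 1, 2` descending, matching the test's expected orders.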

TEST_F(NestedStructTest, StructsOfStructsHaveListsNoNulls)
{
// Input has equal elements, thus needs to be tested by stable sort.
auto const input = [] {
auto child0 = [] {
auto child0 = int32s_lists{{4, 2, 0}, {}, {5}, {4, 1}, {4, 0}, {}, {}};
auto child1 = int32s_col{1, 2, 5, 0, 3, 3, 4};
return structs_col{{child0, child1}};
}();
auto child1 = int32s_lists{{4, 2, 0}, {}, {5}, {4, 1}, {4, 0}, {}, {}};
auto child2 = int32s_col{1, 2, 5, 0, 3, 3, 4};
return structs_col{{child0, child1, child2}};
}();

{
auto const expected_order = int32s_col{1, 5, 6, 4, 3, 0, 2};
auto const order = cudf::stable_sorted_order(cudf::table_view{{input}});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}

{
auto const expected_order = int32s_col{2, 0, 3, 4, 6, 5, 1};
auto const order =
cudf::stable_sorted_order(cudf::table_view{{input}}, {cudf::order::DESCENDING});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}
}

TEST_F(NestedStructTest, StructsOfStructsHaveListsWithNulls)
{
// Input has equal elements, thus needs to be tested by stable sort.
auto const input = [] {
auto child0 = [] {
auto child0 =
int32s_lists{{{4, 2, null}, null_at(2)}, {}, {} /*NULL*/, {5}, {4, 1}, {4, 0}, {}, {}};
auto child1 = int32s_col{{1, 2, null, 5, null, 3, 3, 4}, nulls_at({2, 4})};
return structs_col{{child0, child1}, null_at(2)};
}();
auto child1 =
int32s_lists{{{4, 2, null}, null_at(2)}, {}, {} /*NULL*/, {5}, {4, 1}, {4, 0}, {}, {}};
auto child2 = int32s_col{{1, 2, null, 5, null, 3, 3, 4}, nulls_at({2, 4})};
return structs_col{{child0, child1, child2}, null_at(2)};
}();

{
auto const expected_order = int32s_col{2, 1, 6, 7, 5, 4, 0, 3};
auto const order = cudf::stable_sorted_order(cudf::table_view{{input}});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}

{
auto const expected_order = int32s_col{3, 0, 4, 5, 7, 6, 1, 2};
auto const order =
cudf::stable_sorted_order(cudf::table_view{{input}}, {cudf::order::DESCENDING});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_order, order->view());
}
}