Support lists of structs in row lexicographic comparator #12953

ttnghia · 2023-03-15T22:02:56Z

This implements support for lexicographic comparison for lists-of-structs, following the proposed idea in #11222:

The child column of the lists-of-structs column is replaced by an integer column of its rank values.
In the cases of comparing two tables, such child columns from both tables are concatenated, ranked, then split back into new child columns to replace the original child columns for each table.
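As a rough illustration of the first step (not the libcudf implementation), the sketch below models a lists column in plain C++ as offsets plus a child vector, replaces the struct child with 0-based dense ranks, and then compares rows using only integers. The names `struct_elem`, `lists_column`, `replace_child_with_ranks` and `row_less` are hypothetical and exist only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one struct element: (int, string), compared field by field.
using struct_elem = std::pair<int, std::string>;

// Toy lists column: row i owns child elements [offsets[i], offsets[i + 1]).
template <typename T>
struct lists_column {
  std::vector<int> offsets;
  std::vector<T> child;
};

// Replace the struct child with 0-based dense ranks; the offsets are reused unchanged.
lists_column<int> replace_child_with_ranks(lists_column<struct_elem> const& input)
{
  auto const& child = input.child;
  std::vector<std::size_t> order(child.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) { return child[a] < child[b]; });

  std::vector<int> ranks(child.size());
  int rank = -1;
  for (std::size_t k = 0; k < order.size(); ++k) {
    // Advance the rank only when the element differs from its sorted predecessor.
    if (k == 0 || child[order[k - 1]] < child[order[k]]) { ++rank; }
    ranks[order[k]] = rank;
  }
  return lists_column<int>{input.offsets, ranks};
}

// Lexicographic "row i < row j" on the transformed column: only integers are compared.
bool row_less(lists_column<int> const& col, int i, int j)
{
  auto const first = col.child.begin();
  return std::lexicographical_compare(first + col.offsets[i], first + col.offsets[i + 1],
                                      first + col.offsets[j], first + col.offsets[j + 1]);
}
```

For the rows `[{1, "b"}]`, `[{1, "a"}, {9, "z"}]` and `[]`, the child becomes `[1, 0, 2]`, and the row order derived from the ranks matches the order of the original struct elements.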

Depends on:

Support structs of lists in row lexicographic comparator #13005

Closes #11222.

Signed-off-by: Nghia Truong <[email protected]>

This reverts commit 3609edf.

Signed-off-by: Nghia Truong <[email protected]>

cpp/src/table/row_operators.cu

vyasr

Took a first pass through today. I'm going to try to review this piecemeal to avoid stalling progress, so in this review I focused on getting a handle on the core algorithms. I have a few concerns about specific aspects, but overall nothing too bad. I feel pretty confident that a couple more rounds of review will get us there.

cpp/src/table/row_operators.cu

vyasr · 2023-04-26T00:05:44Z

cpp/src/table/row_operators.cu

+  // Dense ranks should be used instead of first rank.
+  // Consider this example: `input = [ [{0, "a"}, {3, "c"}], [{0, "a"}, {2, "b"}] ]`.
+  // If first rank is used, `transformed_input = [ [0, 3], [1, 2] ]`. Comparing them will lead
+  // to the result row(0) < row(1) which is incorrect.
+  // With dense rank, `transformed_input = [ [0, 2], [0, 1] ]`, producing correct comparison.


Basically this is just saying that we need exactly equal elements to map to the same rank, right?

Yes.

As a consequence, with 0-based ranks as in the example above, the maximum rank plus 1 should equal the cardinality of the input data, not the size of the input data.

Assuming that you're invoking cardinality vs size in order to treat this as a set, yes I think we're on the same page.
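The example from the code comment can be reproduced with a small standalone sketch. It is plain C++ rather than libcudf code: `elem`, `rank_kind`, `rank_elements` and `row0_less_than_row1` are hypothetical names, and ranks are 0-based to match the comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

using elem = std::pair<int, std::string>;  // hypothetical stand-in for a struct row

enum class rank_kind { FIRST, DENSE };

// Rank the flattened child elements (0-based for readability).
std::vector<int> rank_elements(std::vector<elem> const& v, rank_kind kind)
{
  std::vector<std::size_t> order(v.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  // A stable sort keeps equal elements in input order, which is what "first" rank means.
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return v[a] < v[b]; });

  std::vector<int> ranks(v.size());
  int dense = -1;
  for (std::size_t k = 0; k < order.size(); ++k) {
    if (k == 0 || v[order[k - 1]] < v[order[k]]) { ++dense; }
    // FIRST: position in sorted order, so ties get distinct ranks.
    // DENSE: ties share one rank and ranks have no gaps.
    ranks[order[k]] = (kind == rank_kind::FIRST) ? static_cast<int>(k) : dense;
  }
  return ranks;
}

// Compare list row 0 (child elements [0, 2)) against row 1 (child elements [2, 4)).
bool row0_less_than_row1(std::vector<int> const& r)
{
  assert(r.size() == 4);
  return std::lexicographical_compare(r.begin(), r.begin() + 2, r.begin() + 2, r.end());
}
```

With the child `{0, "a"}, {3, "c"}, {0, "a"}, {2, "b"}`, first rank gives `[0, 3, 1, 2]` and reports row 0 < row 1, because the two equal `{0, "a"}` elements receive different ranks. Dense rank gives `[0, 2, 0, 1]`, so the comparison falls through to the second element and row 0 is not less than row 1.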

cpp/src/table/row_operators.cu

vyasr · 2023-04-26T00:31:05Z

cpp/src/table/row_operators.cu

+
+      // Recursively call transformation on the child column.
+      auto [new_child_lhs, new_child_rhs_opt, out_cols_child_lhs, out_cols_child_rhs] =
+        transform_lists_of_structs(child_lhs, child_rhs_opt, column_null_order, stream);


OK yes, this kind of recursion makes sense. It is what I'm worried might be missing in the struct code path above if there are deeper levels of nesting.

The original input should be passed through decompose_structs before flowing into this function, so this function never sees a structs-of-lists column. It can only see structs-of-structs-of-... of one column of a basic type, lists-of-lists (including nested), or lists-of-structs (including nested).

Ah right, and decompose_structs will find structs at any level of nesting, even inside lists, right? The key here is ensuring that even when there's mixed nesting, every level gets processed correctly (or at least, whatever constitutes the "leaf" nodes according to that function).

> decompose_structs will find structs at any level of nesting, even inside lists, right?

Yes. All possible mixed nesting at any level will be processed to make sure this function will not miss any cases.
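To make the recursion concrete, here is a toy model that works on column types only (no data, no libcudf types). It assumes, as described above, that the input has already been decomposed so the function only needs to walk down through list levels. `kind`, `column_type`, `transform_lists_of_structs_sketch` and `describe` are hypothetical names for this sketch, not the actual signatures in row_operators.cu.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class kind { INT, STRUCT, LIST, RANKS };

// Toy type tree: a LIST has exactly one child, a STRUCT has one child per field.
struct column_type {
  kind k;
  std::vector<column_type> children;
};

// Walk down through list levels; a struct child directly under a list is
// replaced by a single integer column of ranks (modeled here as kind::RANKS).
column_type transform_lists_of_structs_sketch(column_type const& col)
{
  if (col.k != kind::LIST) { return col; }
  assert(col.children.size() == 1);
  column_type const& child = col.children.front();
  if (child.k == kind::STRUCT) {
    return column_type{kind::LIST, {column_type{kind::RANKS, {}}}};
  }
  // Lists-of-lists (or lists of basic types): recurse on the child column.
  return column_type{kind::LIST, {transform_lists_of_structs_sketch(child)}};
}

// Render a type tree such as LIST<LIST<STRUCT<INT,INT>>> for inspection.
std::string describe(column_type const& col)
{
  if (col.k == kind::INT) { return "INT"; }
  if (col.k == kind::RANKS) { return "RANKS"; }
  std::string out = (col.k == kind::LIST) ? "LIST<" : "STRUCT<";
  for (std::size_t i = 0; i < col.children.size(); ++i) {
    if (i > 0) { out += ","; }
    out += describe(col.children[i]);
  }
  return out + ">";
}
```

In this model `LIST<LIST<STRUCT<INT,INT>>>` becomes `LIST<LIST<RANKS>>`, while `LIST<INT>` is returned unchanged.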

cpp/src/table/row_operators.cu

vyasr · 2023-04-26T00:58:20Z

cpp/include/cudf/table/experimental/row_operators.cuh

+   * Note that the output of this factory function should not be used in `two_table_comparator` if
+   * the input table contains lists-of-structs. In such cases, please use the overload
+   * `preprocessed_table::create(table_view const&, table_view const&,...)` to preprocess both input
+   * tables at the same time.


Hmm this seems error prone. I want to make sure I understand, I think there are three cases:

1. Only have a single table and want to use self_comparator: Call preprocessed_table::create(tbl, ...);

2. Have two tables and want to use two_table_comparator, but neither table has lists of structs: Call preprocessed_table::create(lhs, ...) and preprocessed_table::create(rhs, ...) independently.

3. Have two tables and want to use two_table_comparator, and the tables may contain lists of structs: Call preprocessed_table::create(lhs, rhs, ...)

Is that right? If so, the differences between cases 2 and 3 seem tricky for callers to remember and it would be nice to avoid that. I don't have any great ideas yet though: without defining a completely different return type, I don't see a way to prevent the second case from being used. I just wonder if we should always advise use of the new create and eat the cost to simplify developers' lives.

Agreed. I don't want to encourage independent preprocessing for two-table comparators (equality or lexicographic). Enforcing joint preprocessing at the call site gives us far more control over future changes, and we can always fall back to independent preprocessing internally.

I've just added a comment to encourage using the overload create(lhs, rhs): 3a12c2d.

We should make sure that the two-table overload is being used everywhere in libcudf before merging this. I think that's in-scope for this PR -- or at least an immediate follow-up.

Agreed, let's do that in this PR so that there's no chance of examples of the undesirable pattern lingering that could be copy-pasted.

Unfortunately, that change is quite large: it requires reimplementing the hash_join class, which diverges too much from this PR. I plan to do it in two separate PRs instead.

Sorry I was wrong. When I was searching for two_table_comparator, I found a lot of instances and mistakenly considered all of them, including the instances of equality::two_table_comparator.

In fact, libcudf currently has only one instance of lexicographic::two_table_comparator and it doesn't need to be updated (already constructed properly). So this is no longer needed.
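For readers wondering why cases 2 and 3 differ at all, the following plain C++ sketch shows the underlying issue. Integer keys stand in for the struct elements of the lists' child columns, and `dense_rank`, `rank_independently` and `rank_jointly` are hypothetical helpers for illustration, not libcudf functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using ranks_pair = std::pair<std::vector<int>, std::vector<int>>;

// 0-based dense rank of each key within `keys`.
std::vector<int> dense_rank(std::vector<int> const& keys)
{
  std::vector<int> uniq(keys);
  std::sort(uniq.begin(), uniq.end());
  uniq.erase(std::unique(uniq.begin(), uniq.end()), uniq.end());

  std::vector<int> ranks;
  ranks.reserve(keys.size());
  for (int k : keys) {
    auto const pos = std::lower_bound(uniq.begin(), uniq.end(), k);
    ranks.push_back(static_cast<int>(pos - uniq.begin()));
  }
  return ranks;
}

// Case 2: each table is ranked on its own, so ranks are only meaningful within one table.
ranks_pair rank_independently(std::vector<int> const& lhs, std::vector<int> const& rhs)
{
  return {dense_rank(lhs), dense_rank(rhs)};
}

// Case 3: concatenate, rank once, split back, so ranks are comparable across tables.
ranks_pair rank_jointly(std::vector<int> const& lhs, std::vector<int> const& rhs)
{
  std::vector<int> all(lhs);
  all.insert(all.end(), rhs.begin(), rhs.end());
  auto const ranks = dense_rank(all);
  auto const mid   = ranks.begin() + static_cast<std::ptrdiff_t>(lhs.size());
  return {std::vector<int>(ranks.begin(), mid), std::vector<int>(mid, ranks.end())};
}
```

With `lhs = {50, 70}` and `rhs = {10, 20}`, independent ranking maps both children to `[0, 1]`, so a comparator working on ranks would treat the rows as equal. Joint ranking yields `[2, 3]` and `[0, 1]`, which preserves the fact that every lhs element is greater than every rhs element.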

bdice

We've discussed a lot of follow-up work for this PR. I don't have a lot more to say about the current state, since we've deferred so much for now, so I am approving. Please make sure all the follow-up steps are organized and documented well.

cpp/src/table/row_operators.cu

vyasr

I've stared at this enough to largely convince myself of the correctness of the algorithm at this point. I have some requests left, but they're mostly around testing or easy code quality improvements. In the interest of unblocking downstream development I am OK with merging this with more of the cleanup left as downstream work as long as it can be done in parallel with some of the feature development happening in other PRs.

vyasr · 2023-05-02T18:25:17Z

cpp/include/cudf/detail/sorting.hpp

+                             order column_order,
+                             null_policy null_handling,
+                             null_order null_precedence,
+                             bool percentage,


Can we go ahead and make this an enum?

If this is just exposing an API that already existed in a source file and changing this would affect other code paths I'm OK punting to a follow-up.

This is just exposing the rank API in the detail:: namespace. Changing this would be breaking, so let's do it in a follow-up PR.

vyasr · 2023-05-02T18:26:07Z

cpp/include/cudf/table/experimental/row_operators.cuh

+   * Note that the output of this factory function should not be used in `two_table_comparator` if
+   * the input table contains lists-of-structs. In such cases, please use the overload
+   * `preprocessed_table::create(table_view const&, table_view const&,...)` to preprocess both input
+   * tables at the same time.


Agreed, let's do that in this PR so that there's no chance of examples of the undesirable pattern lingering that could be copy-pasted.

cpp/include/cudf/table/experimental/row_operators.cuh

vyasr · 2023-05-02T22:41:27Z

cpp/src/table/row_operators.cu

+  // Dense ranks should be used instead of first rank.
+  // Consider this example: `input = [ [{0, "a"}, {3, "c"}], [{0, "a"}, {2, "b"}] ]`.
+  // If first rank is used, `transformed_input = [ [0, 3], [1, 2] ]`. Comparing them will lead
+  // to the result row(0) < row(1) which is incorrect.
+  // With dense rank, `transformed_input = [ [0, 2], [0, 1] ]`, producing correct comparison.


Assuming that you're invoking cardinality vs size in order to treat this as a set, yes I think we're on the same page.

cpp/tests/search/search_list_test.cpp

cpp/tests/sort/sort_nested_types_tests.cpp

cpp/tests/table/experimental_row_operator_tests.cu

cpp/benchmarks/sort/sort_lists.cpp

vyasr

We've pushed a lot of work to follow-ups at this point, so I'm good with merging this so that we can get started on that work.

cpp/tests/search/search_list_test.cpp

vyasr · 2023-05-03T15:01:37Z

cpp/tests/search/search_list_test.cpp

+  CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_upper_bound, *result_upper_bound, verbosity);
+}
+
+TEST_F(ListBinarySearch, CrazyListTest)


Love the name. Thank you for writing this test! It gives me a lot more confidence that we're handling the arbitrary depth correctly.

ttnghia · 2023-05-03T15:07:34Z

Thanks all for reviewing. I'm merging this now so that we can unblock the follow-up work ASAP. The next step will be addressing

~~[FEA] Update code where two_table_comparator constructed from preprocessed_table #13228~~ (no longer needed)
~~[FEA] More optimization for lexicographic comparison of nested types #12932~~ (no longer needed)
[FEA] Refactor transform_lists_of_structs in row_operators.cu #13287

ttnghia · 2023-05-03T15:07:46Z

/merge

ttnghia and others added 30 commits March 2, 2023 15:55

Add tests

ff1bc7e

Signed-off-by: Nghia Truong <[email protected]>

Complete tests

b02abae

Signed-off-by: Nghia Truong <[email protected]>

Disable unsupported conditions

3f6f2f3

Signed-off-by: Nghia Truong <[email protected]>

Reverse row_operator.cu

3609edf

Signed-off-by: Nghia Truong <[email protected]>

Revert "Reverse row_operator.cu"

7aede42

This reverts commit 3609edf.

Update tests

452f1b1

Signed-off-by: Nghia Truong <[email protected]>

Fix offset

ce8f088

Signed-off-by: Nghia Truong <[email protected]>

Change return type to unique_ptr

799f2ae

Adapt with changes

de7437e

Update copyright year

6cc4390

Add more variable

4092aae

Signed-off-by: Nghia Truong <[email protected]>

Implement flattening

bc9ecc4

Signed-off-by: Nghia Truong <[email protected]>

Merge branch 'refactor_flatten_columns' into sort_nested_types

c9aaa79

Complete implementation

284a79c

Signed-off-by: Nghia Truong <[email protected]>

Cleanup

c11b6e5

Signed-off-by: Nghia Truong <[email protected]>

Fix tests

6d45f8f

Fix orders

a242dff

Signed-off-by: Nghia Truong <[email protected]>

Fix orders again

8e77216

Signed-off-by: Nghia Truong <[email protected]>

Cleanup

1238853

Signed-off-by: Nghia Truong <[email protected]>

Update tests

ec45e32

Signed-off-by: Nghia Truong <[email protected]>

Add variable storing auxiliary data

71fc858

Signed-off-by: Nghia Truong <[email protected]>

Support lists of structs

c8b0634

Signed-off-by: Nghia Truong <[email protected]>

Fix copyright year

4d35317

Fix null order

c61beef

Merge branch 'branch-23.04' into refactor_flatten_columns

5fe136d

include cleanup for cudf/detail/structs/utilities.hpp

37913be

Cleanup

5ce6f6f

Merge branch 'branch-23.04' into refactor_flatten_columns

32085c8

Fix comments

fdd9b23

Support arbitrary nested input

35b87ba

ttnghia added 2 commits April 24, 2023 06:31

Merge branch 'branch-23.06' into two_tables_nested_types

895e2c5

Merge branch 'branch-23.06' into two_tables_nested_types

8a4f1af

bdice reviewed Apr 25, 2023


bdice reviewed Apr 25, 2023


Rename *ranked_children into *has_ranked_children

3e4b7b2

bdice reviewed Apr 25, 2023


bdice reviewed Apr 25, 2023


vyasr reviewed Apr 26, 2023


ttnghia and others added 4 commits April 25, 2023 19:56

Use enum decompose_lists_column instead of boolean value

eb73c75

Merge branch 'branch-23.06' into two_tables_nested_types

c086579

Add comment

3a12c2d

Fix docs

3f38b82

ttnghia mentioned this pull request Apr 26, 2023

[FEA] Update code where two_table_comparator constructed from preprocessed_table #13228

Closed

ttnghia and others added 4 commits April 26, 2023 17:53

Merge branch 'branch-23.06' into two_tables_nested_types

766127c

Merge branch 'branch-23.06' into two_tables_nested_types

364a312

Merge branch 'branch-23.06' into two_tables_nested_types

b254361

Merge branch 'branch-23.06' into two_tables_nested_types

b0ee15a

bdice approved these changes May 2, 2023


vyasr requested changes May 2, 2023


ttnghia and others added 5 commits May 2, 2023 20:42

Move constexpr order

a44d75a

Rename fuction and extract check_physical_element_comparator

4fd2ab5

Change unit tests

f0b713d

Add a complex unit test

ed89fc3

Merge branch 'branch-23.06' into two_tables_nested_types

5122a80

ttnghia requested a review from vyasr May 3, 2023 04:32

vyasr approved these changes May 3, 2023


rapids-bot bot merged commit d0a7dec into rapidsai:branch-23.06 May 3, 2023

ttnghia deleted the two_tables_nested_types branch May 3, 2023 18:01
