Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support structs of lists in row lexicographic comparator #13005

Merged
merged 25 commits into from
Apr 10, 2023

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Mar 24, 2023

This fixes the lexicographic comparator that cannot handle the input having structs of lists. The new implementation mainly changes the helper functions decompose_structs. In particular:

  • If a structs column has its first child is a lists column, the first column of the result table will no longer be Struct<Struct<...<List<SomeType>...> (i.e., nested structs ultimately having one child).
  • Instead, the first output column will be nested empty structs: Struct<...Struct<>>...>. The innermost child column List<SomeType> is output as the second column in the result table.

Depends on:

Closes #11672.

@ttnghia ttnghia added feature request New feature or request 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 24, 2023
@ttnghia ttnghia self-assigned this Mar 24, 2023
@ttnghia ttnghia removed the improvement Improvement / enhancement to an existing function label Mar 24, 2023
@github-actions github-actions bot added the CMake CMake build issue label Mar 24, 2023
@ttnghia ttnghia marked this pull request as ready for review March 24, 2023 16:05
@ttnghia ttnghia requested a review from a team as a code owner March 24, 2023 16:05
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 29, 2023

A NULL in i does not mean a NULLs in S2.

No, if S1 is null then all its descendent must be considered as nulls. This is the property of structs: the inner levels cannot be non-nulls while the parent/higher level is null.

Note that the structs column may be "wrongly" constructed such that such property is not enforced. In such cases, such struct column must be corrected by structs::detail::push_down_nulls. I believe that the experimental comparator already calls that API during preprocessing.

Edit: I was wrong. The lex. comparator preprocessor doesn't call push_down_nulls because it is not needed. If the top level is null, the inner levels will not be considered at all. So whether they are nulls or not doesn't matter.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 29, 2023

For the example you posted, you may get the correct structs column because the make_structs_column API calls structs::detail::superimpose_nulls to nullify the inner levels.

@divyegala
Copy link
Member

divyegala commented Mar 29, 2023

In my example, it's not a top-level NULL though, is it? i and S2 are both children of S1. I'm not saying S1 is NULL, I'm saying i is NULL. Let me try to write an MRE for this, maybe I am wrong entirely.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 29, 2023

For the example above, you will have this table row: [ {null}, {5.5}, 3 ], and depth when stopping at the first column is 1, not 0 because the depth value increases when going down:

.

@divyegala
Copy link
Member

For the example above, you will have this table row: [ {null}, {5.5}, 3 ], and depth when stopping at the first column is 1, not 0 because the depth value increases when going down:

But depth is already returned before that, isn't it?

Regardless, it looks like I am wrong. This test passes:

  auto s1 = []() {
    auto i = cudf::test::fixed_width_column_wrapper<int>{{0, 0}, cudf::test::iterators::nulls_at({0, 1})};

    auto f = cudf::test::fixed_width_column_wrapper<float>{11.5, 5.5};
    auto d = cudf::test::fixed_width_column_wrapper<int>{3, 3};

    auto s2 = cudf::test::structs_column_wrapper{{f, d}};

    return cudf::test::structs_column_wrapper{{i, s2}};
  }();

  cudf::test::print(s1);
  cudf::test::fixed_width_column_wrapper<int32_t> expected{1, 0};

  auto s1_tview = cudf::table_view{{s1}};
  auto got = cudf::sorted_order(s1_tview);
  cudf::test::print(got->view());
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, got->view());

I just don't understand how since according to the written example S1<i> is of depth 0 and S2<f> is of depth 1.

When you find a NULL in S<i> at depth 0, then S2<f> at depth 1 is supposed to be skipped according to the code, unless the example is marking them wrongly.

Anyway, thank you very much for walking me through this. Tomorrow I'll print the depths vector to see what the actual contents are.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 29, 2023

Initially, the depth array is [0, 1, 2]. When processing the first column, depth = 0 as the initial value. Then ++depth is executed at

on the same column before returning due to null detected. So you will have depth=1 returned.

The returned depth is the depth value at which null is detected, not the original depth value of the top level.

@ttnghia ttnghia changed the base branch from branch-23.04 to branch-23.06 March 31, 2023 16:04
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of questions to make sure I follow the change.

cpp/src/table/row_operators.cu Outdated Show resolved Hide resolved
cpp/src/table/row_operators.cu Outdated Show resolved Hide resolved
Copy link
Member

@divyegala divyegala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks very close, and I really like the test coverage. @ttnghia is there anything else that you are looking to finish?

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one open request to clarify a comment, otherwise LGTM!

@ttnghia ttnghia requested a review from divyegala April 7, 2023 17:07
cpp/src/table/row_operators.cu Outdated Show resolved Hide resolved
cpp/src/table/row_operators.cu Outdated Show resolved Hide resolved
temp_col.size(),
temp_col.head(),
temp_col.null_mask(),
UNKNOWN_NULL_COUNT,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the null count be known here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be queried by temp_col.null_count() which may trigger a kernel launch (now). Good point, I think with our plan to remove UNKNOWN_NULL_COUNT then using temp_col.null_count() here would avoid modifying it again in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is also addressed in #13102.

@ttnghia
Copy link
Contributor Author

ttnghia commented Apr 10, 2023

/merge

@rapids-bot rapids-bot bot merged commit f357892 into rapidsai:branch-23.06 Apr 10, 2023
@ttnghia ttnghia deleted the sort_structs_of_lists branch April 11, 2023 02:38
rapids-bot bot pushed a commit that referenced this pull request May 3, 2023
This implements support for lexicographic comparison for lists-of-structs, following the proposed idea in #11222:
 * The child column of the lists-of-structs column is replaced by an integer column of its rank values. 
 * In the cases of comparing two tables, such child columns from both tables are concatenated, ranked, then split back into new child columns to replace the original child columns for each table.

Depends on:
 * #13005

Closes #11222.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12953
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Enable structs of lists in the new row lexicographic comparator
4 participants