-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support lists of structs in row lexicographic comparator #12953
Support lists of structs in row lexicographic comparator #12953
Conversation
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
This reverts commit 3609edf.
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a first pass through today, going to try and review this piecemeal to stop stalling progress so focused on getting a handle on the core algorithms in this review. I have a few concerns about specific aspects, but overall nothing too bad. I feel pretty confident that a couple more rounds of review will get us there.
// Dense ranks should be used instead of first rank. | ||
// Consider this example: `input = [ [{0, "a"}, {3, "c"}], [{0, "a"}, {2, "b"}] ]`. | ||
// If first rank is used, `transformed_input = [ [0, 3], [1, 2] ]`. Comparing them will lead | ||
// to the result row(0) < row(1) which is incorrect. | ||
// With dense rank, `transformed_input = [ [0, 2], [0, 1] ]`, producing correct comparison. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically this is just saying that we need exactly equal elements to map to the same rank, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
As a consequence, the maximum value of the rank (minus 1) should be the cardinality of the input data, not the size of the input data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that you're invoking cardinality vs size in order to treat this as a set, yes I think we're on the same page.
|
||
// Recursively call transformation on the child column. | ||
auto [new_child_lhs, new_child_rhs_opt, out_cols_child_lhs, out_cols_child_rhs] = | ||
transform_lists_of_structs(child_lhs, child_rhs_opt, column_null_order, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK yes this kind of recursion makes sense, is what I'm worried might be missing in the struct code path above if there are deeper levels of nesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original input should be passed in decompse_struct
before flowing into this function. So basically this function cannot see any structs-of-lists column. It can only see structs-of-structs-of..... of one column of basic types, or lists-of-lists (including nested), or lists-of-structs (including nested).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right and decompose_structs
will find structs at any level of nesting, even inside lists, right? That's the key here is ensuring that even when there's mixed nesting every level gets processed correctly (or at least, whatever constitutes the "leaf" nodes according to that function).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decompose_structs
will find structs at any level of nesting, even inside lists, right?
Yes. All possible mixed nesting at any level will be processed to make sure this function will not miss any cases.
* Note that the output of this factory function should not be used in `two_table_comparator` if | ||
* the input table contains lists-of-structs. In such cases, please use the overload | ||
* `preprocessed_table::create(table_view const&, table_view const&,...)` to preprocess both input | ||
* tables at the same time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this seems error prone. I want to make sure I understand, I think there are three cases:
- Only have a single table and want to use
self_comparator
: Callpreprocessed_table::create(tbl, ...);
- Have two tables and want to use
two_table_comparator
, but neither table has lists of structs: Callpreprocessed_table::create(lhs, ...)
andpreprocessed_table::create(rhs, ...)
independently. - Have two tables and want to use
two_table_comparator
, and the tables may contain lists of structs: Callpreprocessed_table::create(lhs, rhs, ...)
Is that right? If so, the differences between cases 2 and 3 seem tricky for callers to remember and it would be nice to avoid that. I don't have any great ideas yet though, without defining a completely different return type I don't see a way to avoid the second case from being used. I just wonder if we should always advise use of the new create
and eat the cost to simplify developers' lives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I don't want to encourage independent preprocessing for two-table comparators (equality or lexicographic). Enforcing joint preprocessing at the call site gives us far more control over future changes, and we can always fall back to independent preprocessing internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just add a comment to encourage using the overload create(lhs, rhs)
: 3a12c2d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make sure that the two-table overload is being used everywhere in libcudf before merging this. I think that's in-scope for this PR -- or at least an immediate follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's do that in this PR so that there's no chance of examples of the undesirable pattern lingering that could be copy-pasted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, that change is somewhat very large, requiring the hash_join
class to be reimplemented which is too much diverged from this PR. I plan to have two more separate PRs for doing so instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I was wrong. When I was searching for two_table_comparator
, I found a lot of instances and mistakenly considered all of them, including the instances of equality::two_table_comparator
.
In fact, libcudf currently has only one instance of lexicographic::two_table_comparator
and it doesn't need to be updated (already constructed properly). So this is no longer needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've discussed a lot of follow-up work for this PR. I don't have a lot more to say about the current state, since we've deferred so much for now, so I am approving. Please make sure all the follow-up steps are organized and documented well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've stared at this enough to largely convince myself of the correctness of the algorithm at this point. I have some requests left, but they're mostly around testing or easy code quality improvements. In the interest of unblocking downstream development I am OK with merging this with more of the cleanup left as downstream work as long as it can be done in parallel with some of the feature development happening in other PRs.
order column_order, | ||
null_policy null_handling, | ||
null_order null_precedence, | ||
bool percentage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we go ahead and make this an enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is just exposing an API that already existed in a source file and changing this would affect other code paths I'm OK punting to a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just exposing the rank
API in detail::
namespace. Changing this would be breaking so let's do it in some follow up PR.
* Note that the output of this factory function should not be used in `two_table_comparator` if | ||
* the input table contains lists-of-structs. In such cases, please use the overload | ||
* `preprocessed_table::create(table_view const&, table_view const&,...)` to preprocess both input | ||
* tables at the same time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's do that in this PR so that there's no chance of examples of the undesirable pattern lingering that could be copy-pasted.
// Dense ranks should be used instead of first rank. | ||
// Consider this example: `input = [ [{0, "a"}, {3, "c"}], [{0, "a"}, {2, "b"}] ]`. | ||
// If first rank is used, `transformed_input = [ [0, 3], [1, 2] ]`. Comparing them will lead | ||
// to the result row(0) < row(1) which is incorrect. | ||
// With dense rank, `transformed_input = [ [0, 2], [0, 1] ]`, producing correct comparison. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that you're invoking cardinality vs size in order to treat this as a set, yes I think we're on the same page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've pushed a lot of work to follow-ups at this point, so I'm good with merging this so that we can get started on that work.
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_upper_bound, *result_upper_bound, verbosity); | ||
} | ||
|
||
TEST_F(ListBinarySearch, CrazyListTest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the name. Thank you for writing this test! It gives me a lot more confidence that we're handling the arbitrary depth correctly.
Thanks all for reviewing. I'm merging this now, so can unblock the follow-up work ASAP. The next step will be addressing |
/merge |
This implements support for lexicographic comparison for lists-of-structs, following the proposed idea in #11222:
Depends on:
Closes #11222.