[QST] Should datatype equality comparisons traverse nested children? #14494
Comments
The compile-time failure with Since the columns and tables are type-erased, the compiler cannot complain about mismatching join keys: it cannot see that they are mismatching. Is the problem you observe that the test you wrote with |
@ZelboK Can you try and see if the |
@wence-
Does not actually run into any runtime failures. (If it matters) I generally run specific tests this way |
tl;dr: I think this is not raising an error by design, but I am not sure of the history of that design, or whether it is, in fact, desirable behaviour. For the cudf/cpp/src/join/hash_join.cu Lines 557 to 577 in 5e58e71
So, if the join is "trivial" (for example either table is empty for an inner join) we immediately return a null result. Otherwise we check if the
The scale only has any bearing on the result if using a decimal type (which we're not doing here). So two
In any case, the current implementation just looks at the top level type id, and so a column of The story with the semi-joins is the same, though the check happens elsewhere. Since those joins are wrappers around
where The way the nested types are attached to columns, if you just have the
Rather than
The "child" datatypes are only available by traversing the children of a column with a
bool has_equal_types_nested(column_view const& lhs, column_view const& rhs)
{
  return lhs.type() == rhs.type()
    and std::equal(lhs.child_begin(), lhs.child_end(), rhs.child_begin(), rhs.child_end(),
                   [](auto const& left, auto const& right) { return has_equal_types_nested(left, right); });
} |
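To make the difference between a top-level and a nested comparison concrete, here is a minimal, self-contained sketch. It does not use libcudf: `type_id`, `toy_column`, `make_struct`, and both comparison functions are hypothetical stand-ins invented for illustration, and the real `column_view` API may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT libcudf types.
enum class type_id { INT32, STRING, STRUCT };

struct toy_column {
  type_id id;
  std::vector<toy_column> children;  // empty for non-nested types
};

// Top-level check: looks only at the type id of the column itself.
inline bool top_level_types_equal(toy_column const& lhs, toy_column const& rhs)
{
  return lhs.id == rhs.id;
}

// Nested check: additionally requires the same number of children,
// each with recursively equal types.
inline bool nested_types_equal(toy_column const& lhs, toy_column const& rhs)
{
  return lhs.id == rhs.id &&
         std::equal(lhs.children.begin(), lhs.children.end(),
                    rhs.children.begin(), rhs.children.end(),
                    [](toy_column const& l, toy_column const& r) {
                      return nested_types_equal(l, r);
                    });
}

// Helper (hypothetical): build a struct column from a list of child type ids.
inline toy_column make_struct(std::vector<type_id> const& child_ids)
{
  toy_column result{type_id::STRUCT, {}};
  for (auto id : child_ids) { result.children.push_back(toy_column{id, {}}); }
  return result;
}

// Worked example: two struct columns whose children differ.
inline void run_example()
{
  auto const ints    = make_struct({type_id::INT32, type_id::INT32});
  auto const strings = make_struct({type_id::STRING, type_id::INT32});
  assert(top_level_types_equal(ints, strings));  // both are STRUCT at the top level
  assert(!nested_types_equal(ints, strings));    // the first child differs
  assert(nested_types_equal(ints, ints));
}
```

In this toy model the two struct columns pass the top-level check and fail the nested one, which mirrors the behaviour asked about in this issue.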
As an aside, this also means that mismatched types can slip through and produce a result when a table is empty, whereas with non-empty tables the same call would raise a runtime exception. |
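The ordering issue can be sketched with a toy model. Everything here (`toy_type`, `toy_table`, `outcome`, both join functions, and the exception message) is hypothetical and only mimics the control flow described above: the trivial-join shortcut runs before the type check. The second function shows one possible alternative ordering, not a claim about how libcudf should or does fix this.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-ins for illustration only; these are NOT libcudf types.
enum class toy_type { INT32, STRING };
enum class outcome { empty_result, joined };

struct toy_table {
  toy_type key_type;
  std::size_t num_rows;
};

// Mimics the described flow: the trivial (empty) case returns early,
// so the type check is never reached for empty inputs.
inline outcome toy_inner_join(toy_table const& left, toy_table const& right)
{
  if (left.num_rows == 0 || right.num_rows == 0) { return outcome::empty_result; }
  if (left.key_type != right.key_type) {
    throw std::invalid_argument("mismatched join key types");
  }
  return outcome::joined;  // the actual join work is elided
}

// Alternative ordering: validate types first, then take the shortcut.
inline outcome toy_inner_join_validate_first(toy_table const& left, toy_table const& right)
{
  if (left.key_type != right.key_type) {
    throw std::invalid_argument("mismatched join key types");
  }
  if (left.num_rows == 0 || right.num_rows == 0) { return outcome::empty_result; }
  return outcome::joined;
}

// Small helper for the example: reports whether a call throws.
template <typename F>
bool throws_invalid_argument(F&& f)
{
  try {
    f();
  } catch (std::invalid_argument const&) {
    return true;
  }
  return false;
}
```

With an empty `INT32` table and a non-empty `STRING` table, `toy_inner_join` returns an empty result while `toy_inner_join_validate_first` throws; with both tables non-empty, both versions throw.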
Thank you for the explanations! I appreciate it :) I personally believe it would be inappropriate for me to comment on whether or not this design is problematic because I am far too new to libcudf to visualize how much effort is required, if this is problematic, among other things. I mentioned it to @bdice and figured it was worth mentioning in an issue as well. I want to make it clear, I'm not actually blocked by anything. I like to write tests to better understand functionality and encountered this case on the way. |
Thanks, I've made some small edits to the issue to pose it as a question. |
Yes, this is a bug in my view. The design probably predates nested type support. Here, the docs clearly state that types must match (though in a "detail" method): cudf/cpp/include/cudf/detail/join.hpp Line 180 in 8da6204
The bug is here, because this step doesn't look at nested types: cudf/cpp/src/join/hash_join.cu Lines 570 to 575 in 8da6204
There is a file full of utility functions (https://github.com/rapidsai/cudf/blob/branch-24.02/cpp/src/utilities/type_checks.cpp) that can be used for this purpose. |
Addresses most of #14527. See also #14494.
This PR expands the use of `cudf::column_types_equal(lhs, rhs)` and adds new methods `cudf::column_scalar_types_equal`, `cudf::scalar_types_equal`, and `cudf::all_column_types_equal`. These type check functions are now employed throughout the code base instead of raw checks like `a.type() == b.type()` because those do not correctly handle nested types.
Authors:
- Bradley Dice (https://github.com/bdice)
Approvers:
- David Wendt (https://github.com/davidwendt)
- Lawrence Mitchell (https://github.com/wence-)
- Yunsong Wang (https://github.com/PointKernel)
URL: #14531
I reviewed the code base and updated the function |
Describe the bug
I have a tendency of running mini "experiments" to better understand functionality behavior through tests. Found some unexpected behavior while writing a test that should (likely?) throw an exception.
EDIT (@wence-):
This doesn't raise because the top-level data type ids match (both the left and right columns are structs). So, question: should datatype equality traverse nested children?
END EDIT
Steps/Code to reproduce bug
Expected behavior
If you run this test with left_join, it will return an SFINAE error.
When I get more time I can look into this more deeply. For now just wanted to mention.