[FEA] Audit data-type checking for nested type equality #14527
Comments
Addresses most of #14527. See also #14494. This PR expands the use of `cudf::column_types_equal(lhs, rhs)` and adds new methods `cudf::column_scalar_types_equal`, `cudf::scalar_types_equal`, and `cudf::all_column_types_equal`. These type check functions are now employed throughout the code base instead of raw checks like `a.type() == b.type()` because those do not correctly handle nested types.
Authors:
- Bradley Dice (https://github.com/bdice)
Approvers:
- David Wendt (https://github.com/davidwendt)
- Lawrence Mitchell (https://github.com/wence-)
- Yunsong Wang (https://github.com/PointKernel)
URL: #14531
@wence- I tackled a lot of these in #14531. I updated the developer guide as well: https://github.com/rapidsai/cudf/blob/branch-24.06/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#comparing-data-types
However, I don't know of a clear way to enforce that nested type checks follow the prescribed approach. Do you think there is more work we should consider doing before closing this issue?
I don't think there's anything else to do here. People implementing type-equality checking on columns need to be aware that for nested types you need the columns hanging around, but the dev guide handles that. It might have been expedient when initially implementing nested types to separate the structure from the storage more fully by having a nested column be described by a pair of
Great. Thanks for your helpful reviews! I’ll close this.
Is your feature request related to a problem? Please describe.
libcudf columns carry a datatype around, so that you can distinguish between, say, an int32 column and an int64 column. This datatype tag is, however, single level, so if we only have the data type, we can't distinguish between nested types whose top-level type tag is the same (e.g. `list(list(int32))` and `list(int32)`).
There are various algorithms where libcudf checks that columns are "the same" type. I believe it is intended that those should throw when two nested types with the same top-level type-tag are passed, but the child type-tags are different. However, many places do not check the nested case and happily proceed as long as the top-level matches.
For one example, see #14494.
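To make the failure mode concrete, here is a minimal sketch using a toy column model. `toy_column`, `type_id`, `shallow_types_equal`, `nested_types_equal`, `make_int32`, and `make_list` are hypothetical stand-ins invented for illustration, not libcudf's real classes or functions, and the toy list has a single child where a real list column also carries offsets. It contrasts a top-level tag comparison with a check that also traverses the children.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT libcudf types.
enum class type_id { INT32, INT64, LIST };

struct toy_column {
  type_id id;                        // single-level type tag
  std::vector<toy_column> children;  // nested columns, empty for leaf types
  type_id type() const { return id; }
};

// Shallow check: compares only the top-level tag, like `a.type() == b.type()`.
bool shallow_types_equal(toy_column const& lhs, toy_column const& rhs)
{
  return lhs.type() == rhs.type();
}

// Recursive check: compares the tag, then every pair of children in order.
bool nested_types_equal(toy_column const& lhs, toy_column const& rhs)
{
  if (lhs.type() != rhs.type()) { return false; }
  if (lhs.children.size() != rhs.children.size()) { return false; }
  for (std::size_t i = 0; i < lhs.children.size(); ++i) {
    if (!nested_types_equal(lhs.children[i], rhs.children[i])) { return false; }
  }
  return true;
}

// Helpers to build toy columns of a given (nested) type.
toy_column make_int32() { return toy_column{type_id::INT32, {}}; }
toy_column make_list(toy_column child)
{
  return toy_column{type_id::LIST, {std::move(child)}};
}
```

With these definitions, `shallow_types_equal(make_list(make_int32()), make_list(make_list(make_int32())))` returns `true` because both top-level tags are `LIST`, while `nested_types_equal` on the same pair returns `false` because the children differ.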
Describe the solution you'd like
It should be decided if data type equality in these cases should apply to nested types. If yes, the checking of `left.type() == right.type()` should be replaced by one of the utility type-checking routines that traverses nested children.
Describe alternatives you've considered
The above solution has the (possible) disadvantage that once a `data_type` tag is detached from a column, there is no way to check equality in the case of nested types, which means that the checking described above can only be performed with a column to hand. One could imagine changing the type tag definition from (morally): to capturing the nestedness in the datatype definition itself.
This would have the advantage that we wouldn't have to go through and check equality (or remember to check for nested type equality going forward), but might be too heavyweight.
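As a rough illustration of what "capturing the nestedness in the datatype definition itself" could look like, the sketch below uses a hypothetical `nested_type` struct with a hypothetical `type_tag` enum. Neither is taken from libcudf or from the definitions in the original issue. The point is only that equality becomes an ordinary value comparison that needs no column.

```cpp
#include <vector>

// Hypothetical tag enum for illustration only; NOT libcudf's type_id.
enum class type_tag { INT32, INT64, LIST };

// Hypothetical self-describing type: the tag plus the types of all children.
struct nested_type {
  type_tag id;
  std::vector<nested_type> children;  // empty for leaf types
};

// Equality recurses through the children via std::vector's operator==,
// so no column is needed to compare two nested types.
bool operator==(nested_type const& lhs, nested_type const& rhs)
{
  return lhs.id == rhs.id && lhs.children == rhs.children;
}

bool operator!=(nested_type const& lhs, nested_type const& rhs)
{
  return !(lhs == rhs);
}
```

The trade-off mentioned above shows up directly: every type value now owns a vector of children, which is heavier than a single-level tag that can be copied trivially.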
Additional context
Likely candidate files (via `rg type\(\).*==` and some manual pruning). This is probably an incomplete list.