Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Audit data-type checking for nested type equality #14527

Closed
32 tasks
wence- opened this issue Nov 29, 2023 · 3 comments
Closed
32 tasks

[FEA] Audit data-type checking for nested type equality #14527

wence- opened this issue Nov 29, 2023 · 3 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@wence-
Copy link
Contributor

wence- commented Nov 29, 2023

Is your feature request related to a problem? Please describe.

libcudf columns carry a datatype around, so that you can distinguish between, say, an int32 column and an int64 column. This datatype tag is, however, single level, so if we only have the data type, we can't distinguish between nested types whose top-level type tag is the same (e.g. list(list(int32)) and list(int32)).

There are various algorithms where libcudf checks that columns are "the same" type. I believe it is intended that those should throw when two nested types with the same top-level type-tag are passed, but the child type-tags are different. However, many places do not check the nested case and happily proceed as long as the top-level matches.

For one example, see #14494.

Describe the solution you'd like

It should be decided if data type equality in these cases should apply to nested types. If yes, the checking of left.type() == right.type() should be replaced by one of the utility type-checking routines that traverses nested children.

Describe alternatives you've considered

The above solution has the (possible) disadvantage that once a data_type tag is detached from a column, there is no way to check equality in the case of nested types, which means that the checking described above can only be performed with a column to hand. One could imagine changing the type tag definition from (morally):

data TypeTag = List | Struct | Int32 | Int64 | ...

to

data TypeTag a = List a | Struct a | Int32 | Int64 | ...

capturing the nestedness in the datatype definition itself.

This would have the advantage that we wouldn't have to go through and check equality (or remember to check for nested type equality going forward), but might be too heavyweight.

Additional context

Likely candidate files (via rg type\(\).*== and some manual pruning). This is probably an incomplete list.

  • include/cudf/detail/scatter.cuh
  • include/cudf/table/table_view.hpp
  • src/copying/concatenate.cu
  • src/copying/copy.cu
  • src/copying/copy_range.cu
  • src/copying/scatter.cu
  • src/copying/shift.cu
  • src/dictionary/add_keys.cu
  • src/dictionary/detail/concatenate.cu
  • src/dictionary/remove_keys.cu
  • src/dictionary/replace.cu
  • src/dictionary/search.cu
  • src/dictionary/set_keys.cu
  • src/filling/fill.cu
  • src/filling/sequence.cu
  • src/groupby/groupby.cu
  • src/interop/dlpack.cpp
  • src/join/hash_join.cu
  • src/labeling/label_bins.cu
  • src/lists/contains.cu
  • src/lists/sequences.cu
  • src/reductions/reductions.cpp
  • src/reductions/segmented/max.cu
  • src/reductions/segmented/min.cu
  • src/reductions/segmented/reductions.cpp
  • src/replace/clamp.cu
  • src/replace/nulls.cu
  • src/replace/replace.cu
  • src/rolling/detail/lead_lag_nested.cuh
  • src/rolling/detail/range_window_bounds.hpp
  • src/search/contains_scalar.cu
  • src/transform/one_hot_encode.cu
@wence- wence- added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. labels Nov 29, 2023
@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress and removed Needs Triage Need team to review and classify labels Dec 14, 2023
@GregoryKimball GregoryKimball moved this to In progress in libcudf Dec 14, 2023
rapids-bot bot pushed a commit that referenced this issue May 2, 2024
Addresses most of #14527. See also #14494.

This PR expands the use of `cudf::column_types_equal(lhs, rhs)` and adds new methods `cudf::column_scalar_types_equal`, `cudf::scalar_types_equal`, and `cudf::all_column_types_equal`.

These type check functions are now employed throughout the code base instead of raw checks like `a.type() == b.type()` because those do not correctly handle nested types.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Lawrence Mitchell (https://github.com/wence-)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #14531
@bdice
Copy link
Contributor

bdice commented May 6, 2024

@wence- I tackled a lot of these in #14531. I updated the developer guide as well: https://github.com/rapidsai/cudf/blob/branch-24.06/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#comparing-data-types

However, I don't know of a clear way to enforce that nested type checks follow the prescribed approach. Do you think there is more work we should consider doing before closing this issue?

@wence-
Copy link
Contributor Author

wence- commented May 7, 2024

@wence- I tackled a lot of these in #14531. I updated the developer guide as well: https://github.com/rapidsai/cudf/blob/branch-24.06/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#comparing-data-types

However, I don't know of a clear way to enforce that nested type checks follow the prescribed approach. Do you think there is more work we should consider doing before closing this issue?

I don't think there's anything else to do here. People implementing type-equality checking on columns need to be aware that for nested types you need the columns hanging around, but the dev guide handles that.

It might have been expedient when initially implementing nested types to separate the structure from the storage more fully by having a nested column be described by a pair of (nested_datatype, std::vector<column>), so that determining type equality can just traverse the datatype, but I think the (marginal) gain that might have is not worth it now.

@bdice
Copy link
Contributor

bdice commented May 7, 2024

Great. Thanks for your helpful reviews! I’ll close this.

@bdice bdice closed this as completed May 7, 2024
@bdice bdice removed the 2 - In Progress Currently a work in progress label May 7, 2024
@GregoryKimball GregoryKimball removed this from libcudf Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

3 participants