Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Provide shallow equality comparison for column_view #9139

Closed
jrhemstad opened this issue Aug 27, 2021 · 0 comments · Fixed by #9185 or #9312
Closed

[FEA] Provide shallow equality comparison for column_view #9139

jrhemstad opened this issue Aug 27, 2021 · 0 comments · Fixed by #9185 or #9312
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@jrhemstad
Copy link
Contributor

Is your feature request related to a problem? Please describe.

Given two arbitrary column_views, I would like to be able to know in constant time if the two objects are views of identical logical columns.

Describe the solution you'd like

In short, I want a is_shallow_equal(column_view, column_view) (name TBD) that returns true if the two views reference the exact same data (i.e., their data/null_mask pointers are equal, they have equal offsets/size/type/num_children, and all of these things apply recursively to all descendants) Otherwise, returns false, even if the two columns view equivalent data.

Additional context

The primary motivation for this function is in the context of groupby where we maintain a cache of already computed aggregations. We would like to be able to unambiguously detect if we have already computed a particular aggregation for a given column. We'll likewise need a hash function that hashes all of the "shallow" state of a column_view.

For a fascinating look into Regularity and why std::span doesn't have an operator==, check out:

https://abseil.io/blog/20180531-regular-types
https://stackoverflow.com/questions/60633668/why-does-stdspan-lack-the-comparison-operators

@jrhemstad jrhemstad added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Aug 27, 2021
@jrhemstad jrhemstad added this to the Time Series Analysis milestone Aug 30, 2021
rapids-bot bot pushed a commit that referenced this issue Sep 22, 2021
…view (#9185)

Fixes #9140 
Added `shallow_hash(column_view)`
Added unit tests

It computes hash values based on the shallow states of `column_view`:
type, size, data pointer, null_mask pointer,  offset, and the hash value of the children. 
`null_count` is not used since it is a cached value and it may vary based on contents of `null_mask`, and may be pre-computed or not.

Fixes #9139
Added `is_shallow_equivalent(column_view, column_view)` ~shallow_equal~
Added unit tests

It compares two column_views based on the shallow states of column_view:
type, size, data pointer, null_mask pointer, offset, and the column_view of the children.
null_count is not used since it is a cached value and it may vary based on contents of null_mask, and may be pre-computed or not.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)
  - David Wendt (https://github.com/davidwendt)

URL: #9185
rapids-bot bot pushed a commit that referenced this issue Sep 27, 2021
…view (#9312)

Fixes #9140 
Added `shallow_hash(column_view)`
Added unit tests
SWIPAT approval complete

It computes hash values based on the shallow states of `column_view`:
type, size, data pointer, null_mask pointer,  offset, and the hash value of the children. 
`null_count` is not used since it is a cached value and it may vary based on contents of `null_mask`, and may be pre-computed or not.

Fixes #9139
Added `is_shallow_equivalent(column_view, column_view)` ~shallow_equal~
Added unit tests

It compares two column_views based on the shallow states of column_view:
type, size, data pointer, null_mask pointer, offset, and the column_view of the children.
null_count is not used since it is a cached value and it may vary based on contents of null_mask, and may be pre-computed or not.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9312
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment