Add shallow hash function and shallow equality comparison for column_view #9185
Can one of the admins verify this patch?
cpp/src/column/column_view.cpp
Outdated
combine_hash(hash, std::hash<void const*>{}(input.head()));
combine_hash(hash, std::hash<void const*>{}(input.null_mask()));
I don't think you should include the `head` or the `null_mask` pointer in the hash when the size is 0. Those pointers could be garbage if the column doesn't have any elements. In fact, you should explicitly add a test for this.
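To make the suggestion concrete, here is a minimal sketch of a hash that skips the pointers for empty views. `toy_view`, `combine`, and `toy_shallow_hash` are hypothetical stand-ins written for illustration; they are not the cudf types or the PR's `combine_hash`, and the Boost-style combiner is an assumption.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical stand-in for a column view; NOT the real cudf::column_view.
struct toy_view {
  int type_id;            // element type tag
  std::size_t size;       // number of elements
  void const* head;       // base data pointer (may be garbage when size == 0)
  void const* null_mask;  // validity mask pointer (may be garbage when size == 0)
  std::size_t offset;     // index of the first element
};

// Boost-style combiner (an assumption; the PR's combine_hash may differ).
inline std::size_t combine(std::size_t seed, std::size_t value)
{
  return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Hash the shallow state, skipping both pointers when the view is empty.
inline std::size_t toy_shallow_hash(toy_view const& v)
{
  std::size_t h = std::hash<int>{}(v.type_id);
  h             = combine(h, std::hash<std::size_t>{}(v.size));
  h             = combine(h, std::hash<std::size_t>{}(v.offset));
  if (v.size > 0) {
    h = combine(h, std::hash<void const*>{}(v.head));
    h = combine(h, std::hash<void const*>{}(v.null_mask));
  }
  return h;
}
```

With this shape, two empty views of the same type and offset hash identically no matter what their pointers hold, which is the case the requested test would cover.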
then `slice(col_1, {0,0})[0] == slice(col_2, {0,0})[0]`. Is this correct behavior?
I'm not sure I understand your example. Are you saying two columns without any elements will produce the same hash value? I think that's okay. With an empty column, there is no physical column that we're viewing, only a conceptual one. So all empty column_views of the same type conceptually view the same column.
Yes. That's what I meant.
`shallow_hash(slice(column_view(col_1), {0,0})[0]) == shallow_hash(slice(column_view(col_1_copy), {0,0})[0])` fails for nested types only, because the children's sizes may not be zero after slicing even if the parent's size is zero. As a result `child.data` is compared, and the pointers are different.
auto col_new = std::make_unique<cudf::column>(*col);
auto col_new_view = col_new->view();
auto col_sliced = cudf::slice(col_view, {0, 0, 1, 1, col_view.size(), col_view.size()});
auto col_new_sliced = cudf::slice(col_new_view, {0, 0, 1, 1, col_view.size(), col_view.size()});
EXPECT_EQ(shallow_hash(col_sliced[0]), shallow_hash(col_new_sliced[0]));
`parent.size()` could be propagated to `shallow_hash(children)` too, but that does not seem right (should it propagate one level, or all the way to the leaves?).
We can't simply ignore the children altogether, because two struct column views with different child types should be assumed to have different hashes, even if both are empty.
I'm not suggesting you ignore children, but instead, ignore the values of `data()` and `null_mask()` when the column's size is zero. If two columns `c0, c1` both have `c0.size() == 0 == c1.size()` but they have children with different sizes/types/offsets/etc. then they are not shallow equal. This should be taken care of automatically when you recurse to the children and see that some of the shallow state is different.
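As a rough illustration of that recursion, the sketch below compares a toy view type rather than `cudf::column_view`. `toy_view` and `toy_shallow_equal` are hypothetical names; the logic only mirrors the rule described in this comment (pointers ignored at size zero, children always compared), which is not necessarily what was finally merged.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a nested column view; NOT the real cudf type.
struct toy_view {
  int type_id;
  std::size_t size;
  void const* data;
  void const* null_mask;
  std::size_t offset;
  std::vector<toy_view> children;
};

// Compare shallow state. Pointers are ignored when size is zero, but the
// children are always visited, so differing child state is still detected.
inline bool toy_shallow_equal(toy_view const& lhs, toy_view const& rhs)
{
  if (lhs.type_id != rhs.type_id || lhs.size != rhs.size || lhs.offset != rhs.offset) {
    return false;
  }
  if (lhs.size > 0 && (lhs.data != rhs.data || lhs.null_mask != rhs.null_mask)) {
    return false;
  }
  return std::equal(lhs.children.begin(),
                    lhs.children.end(),
                    rhs.children.begin(),
                    rhs.children.end(),
                    [](toy_view const& l, toy_view const& r) { return toy_shallow_equal(l, r); });
}
```

Two empty parents over the same child compare equal even when the parents' own pointers differ, while two empty parents whose non-empty children point at different buffers do not.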
Commenting the outcome of the offline discussion:
- For an empty column, ignore the children, check that the types are the same, and ignore size, offset, and pointers; only look at the nested data types.
- Call it `is_shallow_equivalent` instead of `is_shallow_equal`.
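One way to read the agreed rule is sketched below on a toy type: once a view (or any ancestor) is empty, only the type tree is compared. The names `toy_view`, `toy_equivalent_impl`, and `toy_is_shallow_equivalent` are hypothetical, and this is an interpretation of the bullet points above, not the merged cudf code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a nested column view; NOT the real cudf type.
struct toy_view {
  int type_id;
  std::size_t size;
  void const* data;
  void const* null_mask;
  std::size_t offset;
  std::vector<toy_view> children;
};

// When both views are empty, or a parent was empty, compare only types.
inline bool toy_equivalent_impl(toy_view const& lhs,
                                toy_view const& rhs,
                                bool is_parent_empty = false)
{
  bool const is_empty = (lhs.size == 0 && rhs.size == 0) || is_parent_empty;
  if (lhs.type_id != rhs.type_id) { return false; }
  if (!is_empty && (lhs.size != rhs.size || lhs.data != rhs.data ||
                    lhs.null_mask != rhs.null_mask || lhs.offset != rhs.offset)) {
    return false;
  }
  // Emptiness is propagated all the way down to the leaves.
  return std::equal(lhs.children.begin(),
                    lhs.children.end(),
                    rhs.children.begin(),
                    rhs.children.end(),
                    [is_empty](toy_view const& l, toy_view const& r) {
                      return toy_equivalent_impl(l, r, is_empty);
                    });
}

inline bool toy_is_shallow_equivalent(toy_view const& lhs, toy_view const& rhs)
{
  return toy_equivalent_impl(lhs, rhs);
}
```

Under this reading, two empty struct-like views whose children differ only in size, offset, or pointers are equivalent, but a difference in any nested type still makes them non-equivalent.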
Co-authored-by: Jake Hemstad <[email protected]>
…/cudf into fea-shallow_equal_columnview
Co-authored-by: Jake Hemstad <[email protected]>
…/cudf into fea-shallow_hash_columnview
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #9185 +/- ##
================================================
- Coverage 10.85% 10.83% -0.03%
================================================
Files 115 116 +1
Lines 19158 18781 -377
================================================
- Hits 2080 2035 -45
+ Misses 17078 16746 -332
Added license for SWIPAT.
I agree with David's last suggestion, and I had one minor comment as well, but aside from that this looks good to me now.
rhs.child_end(),
[is_empty](auto const& lhs_child, auto const& rhs_child) {
  return shallow_equivalent_impl(lhs_child, rhs_child, is_empty);
});
Maybe this has already been discussed but why not implement this function like:
bool shallow_equivalent_impl(column_view const& lhs,
column_view const& rhs,
bool is_parent_empty = false)
{
return shallow_hash_impl(lhs) == shallow_hash_impl(rhs);
}
Shouldn't the `hash` values and the `equivalent` results be consistent anyway? The `hash_combine` function does not look like it would have a significant impact here.
It seems this function is also accessing child size/offset values even if the parent is empty. And keeping the `hash` and `equivalent` functions in sync may be challenging.
`equivalent` cannot use `hash` because of hash collisions: two column_views that are not equivalent may end up with the same hash. Two equivalent column_views will always have the same hash, but the converse is not always true.
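The point about collisions can be shown with a deliberately weak hash. Everything in this sketch (`toy_key`, `weak_hash`, `key_equal`, `count_distinct`) is hypothetical and unrelated to the cudf code; it only shows why a hash container needs a separate equality predicate.

```cpp
#include <cstddef>
#include <unordered_set>

// Hypothetical key type used only to illustrate hash collisions.
struct toy_key {
  int type_id;
  std::size_t size;
};

// Deliberately weak hash: many distinct keys collide.
struct weak_hash {
  std::size_t operator()(toy_key const& k) const { return k.size % 4; }
};

// Exact comparison, independent of the hash.
struct key_equal {
  bool operator()(toy_key const& a, toy_key const& b) const
  {
    return a.type_id == b.type_id && a.size == b.size;
  }
};

// Number of distinct keys the set keeps after inserting both arguments.
inline std::size_t count_distinct(toy_key const& a, toy_key const& b)
{
  std::unordered_set<toy_key, weak_hash, key_equal> s;
  s.insert(a);
  s.insert(b);
  return s.size();
}
```

Keys with sizes 4 and 8 share a hash here yet stay distinct in the set, because the container falls back on `key_equal` inside a bucket. Defining equality as "hashes match" would wrongly merge them.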
Co-authored-by: David Wendt <[email protected]>
@gpucibot merge
… column_view (#9185)" (#9283) Reverts #9185 More details on PR #9185 Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Devavret Makkar (https://github.com/devavret) - Conor Hoekstra (https://github.com/codereport) URL: #9283
…ed on column_view instead of requests. (#9195) This PR updates the groupby result_cache to use `pair<column_view, aggregation>` as the key to an unordered_map. This allows intermediate results to be cached based on the column view, so results for children column_views can be cached and reused in other aggregation_requests. Depends on #9185: shallow_hash and is_shallow_equivalent are used for column_view. Additional context: This change is required to cache children column intermediate results in #9154 and allows them to be shared across multiple aggregation requests. Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9195
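For readers unfamiliar with the pattern, a cache keyed on a (view, aggregation) pair might look roughly like the sketch below. `toy_view`, `toy_agg`, `key_hash`, `key_equal`, and `toy_cache` are hypothetical; the real result_cache in #9195 uses cudf types and may be structured differently.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins; NOT cudf::column_view or cudf::aggregation.
struct toy_view {
  int type_id;
  std::size_t size;
  void const* data;
};
enum class toy_agg { sum, min, max };

using cache_key = std::pair<toy_view, toy_agg>;

// Hash the shallow state of the view together with the aggregation kind.
struct key_hash {
  std::size_t operator()(cache_key const& k) const
  {
    auto mix = [](std::size_t seed, std::size_t v) {
      return seed ^ (v + 0x9e3779b9 + (seed << 6) + (seed >> 2));
    };
    std::size_t h = std::hash<int>{}(k.first.type_id);
    h             = mix(h, std::hash<std::size_t>{}(k.first.size));
    h             = mix(h, std::hash<void const*>{}(k.first.data));
    return mix(h, std::hash<int>{}(static_cast<int>(k.second)));
  }
};

// Equality must agree with the hash: equal keys always hash the same.
struct key_equal {
  bool operator()(cache_key const& a, cache_key const& b) const
  {
    return a.first.type_id == b.first.type_id && a.first.size == b.first.size &&
           a.first.data == b.first.data && a.second == b.second;
  }
};

// Maps (view, aggregation) to a cached result; int stands in for a column.
using toy_cache = std::unordered_map<cache_key, int, key_hash, key_equal>;
```

Any two views over the same buffer with the same type and size find the same cache entry for a given aggregation, which is what lets intermediate results be shared across requests.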
Fixes #9140
- Added `shallow_hash(column_view)`
- Added unit tests

It computes hash values based on the shallow states of `column_view`: type, size, data pointer, null_mask pointer, offset, and the hash value of the children.
`null_count` is not used since it is a cached value and it may vary based on the contents of `null_mask`, and may be pre-computed or not.

Fixes #9139
- Added `is_shallow_equivalent(column_view, column_view)` (previously `shallow_equal`)
- Added unit tests

It compares two column_views based on the shallow states of column_view: type, size, data pointer, null_mask pointer, offset, and the column_view of the children.
`null_count` is not used since it is a cached value and it may vary based on the contents of `null_mask`, and may be pre-computed or not.
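The description above can be sketched on a toy type: the hash folds in the children recursively and leaves the cached null count out. `toy_view`, `combine`, and `toy_shallow_hash` are hypothetical names, and the combiner is an assumption rather than the one used in the PR.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a nested column view; NOT the real cudf type.
struct toy_view {
  int type_id;
  std::size_t size;
  void const* data;
  void const* null_mask;
  std::size_t offset;
  long null_count;  // cached value; -1 means "not computed yet"
  std::vector<toy_view> children;
};

// Boost-style combiner (an assumption).
inline std::size_t combine(std::size_t seed, std::size_t value)
{
  return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Hash type, size, pointers, offset, and children; null_count is excluded
// because it is a cache that may or may not have been filled in.
inline std::size_t toy_shallow_hash(toy_view const& v)
{
  std::size_t h = std::hash<int>{}(v.type_id);
  h             = combine(h, std::hash<std::size_t>{}(v.size));
  h             = combine(h, std::hash<void const*>{}(v.data));
  h             = combine(h, std::hash<void const*>{}(v.null_mask));
  h             = combine(h, std::hash<std::size_t>{}(v.offset));
  for (toy_view const& child : v.children) {
    h = combine(h, toy_shallow_hash(child));
  }
  return h;
}
```

Two copies of a view that differ only in whether `null_count` has been computed produce the same hash in this sketch.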