Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add shallow hash function and shallow equality comparison for column_view #9185

Merged

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Sep 7, 2021

Fixes #9140
Added shallow_hash(column_view)
Added unit tests

It computes hash values based on the shallow states of column_view:
type, size, data pointer, null_mask pointer, offset, and the hash value of the children.
null_count is not used since it is a cached value and it may vary based on contents of null_mask, and may be pre-computed or not.

Fixes #9139
Added is_shallow_equivalent(column_view, column_view) shallow_equal
Added unit tests

It compares two column_views based on the shallow states of column_view:
type, size, data pointer, null_mask pointer, offset, and the column_view of the children.
null_count is not used since it is a cached value and it may vary based on contents of null_mask, and may be pre-computed or not.

@karthikeyann karthikeyann requested a review from a team as a code owner September 7, 2021 18:51
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Sep 7, 2021
@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Sep 7, 2021
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Sep 7, 2021
Comment on lines 90 to 91
combine_hash(hash, std::hash<void const*>{}(input.head()));
combine_hash(hash, std::hash<void const*>{}(input.null_mask()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should include the head or the null_mask pointer in the hash when the size is 0. Those pointers could be garbage if the column doesn't have any elements. In fact, you should explicitly add a test for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then slice(col_1, {0,0})[0] == slice(col_2, {0,0})[0]. Is this correct behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand your example. Are you saying two columns without any elements will produce the same hash value? I think that's okay. With an empty column, there is no physical column that we're viewing, only a conceptual one. So all empty column_views of the same type conceptually view the same column.

Copy link
Contributor Author

@karthikeyann karthikeyann Sep 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That's what I meant.
shallow_hash(slice(column_view(col_1), {0,0})[0]) == shallow_hash(slice(column_view(col_1_copy), {0,0})[0]) fails for nested types only because children sizes may not be zero after slicing even if parent size is zero, and hence child.data is compared and they are different.

    auto col_new        = std::make_unique<cudf::column>(*col);
    auto col_new_view   = col_new->view();
    auto col_sliced     = cudf::slice(col_view, {0, 0, 1, 1, col_view.size(), col_view.size()});
    auto col_new_sliced = cudf::slice(col_new_view, {0, 0, 1, 1, col_view.size(), col_view.size()});
    EXPECT_EQ(shallow_hash(col_sliced[0]), shallow_hash(col_new_sliced[0]));

parent.size() could be propagated to shallow_hash(children) too, but it does not seem right. (propagate one level or all the way to leaf).
We can't simply ignore children altogether because 2 struct column view with different children types should be assumed to have different hash, even both are empty.

Copy link
Contributor

@jrhemstad jrhemstad Sep 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not suggesting you ignore children, but instead, ignore the values of data() and null_mask() when the column's size is zero. If two columns c0, c1 both have c0.size() == 0 == c1.size() but they have children with different sizes/types/offsets/etc. then they are not shallow equal. This should be taken care of automatically when you recurse to the children and see that some of the shallow state is different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commenting outcome of offline discussion:

  • for empty column, ignore the children, check types are same, ignore size, offset, pointers.
    only look at the nested data types.
  • call it is_shallow_equivalent instead of is_shallow_equal.

cpp/src/column/column_view.cpp Outdated Show resolved Hide resolved
cpp/tests/column/column_view_shallow_test.cpp Outdated Show resolved Hide resolved
cpp/tests/column/column_view_shallow_test.cpp Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Sep 18, 2021

Codecov Report

Merging #9185 (e5bfd50) into branch-21.10 (3ee3ecf) will decrease coverage by 0.02%.
The diff coverage is 0.00%.

❗ Current head e5bfd50 differs from pull request most recent head 1a5f367. Consider uploading reports for the commit 1a5f367 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.10    #9185      +/-   ##
================================================
- Coverage         10.85%   10.83%   -0.03%     
================================================
  Files               115      116       +1     
  Lines             19158    18781     -377     
================================================
- Hits               2080     2035      -45     
+ Misses            17078    16746     -332     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/multiindex.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/text.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
python/custreamz/custreamz/tests/conftest.py 71.42% <0.00%> (-7.15%) ⬇️
python/custreamz/custreamz/tests/test_kafka.py 35.71% <0.00%> (-7.15%) ⬇️
...ython/custreamz/custreamz/tests/test_dataframes.py 96.97% <0.00%> (-2.42%) ⬇️
python/dask_cudf/dask_cudf/backends.py 82.31% <0.00%> (-0.43%) ⬇️
... and 53 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f08d6f1...1a5f367. Read the comment docs.

@karthikeyann
Copy link
Contributor Author

added license for SWIPAT.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with David's last suggestion, and I had one minor comment as well, but aside from that this looks good to me now.

cpp/src/column/column_view.cpp Outdated Show resolved Hide resolved
rhs.child_end(),
[is_empty](auto const& lhs_child, auto const& rhs_child) {
return shallow_equivalent_impl(lhs_child, rhs_child, is_empty);
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this has already been discussed but why not implement this function like:

bool shallow_equivalent_impl(column_view const& lhs,
                             column_view const& rhs,
                             bool is_parent_empty = false)
{
   return shallow_hash_impl(lhs) == shallow_hash_impl(rhs);
}

Should not the `hash` values and `equivalent` results be consistent anyway? The `hash_combine` function does not look like it would be a significant impact here.

It seems this function is also accessing child size/offset values even if the parent is empty. And keeping the hash and equivalent functions in sync may be challenging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equivalent cannot use hash because of hash collison. 2 column_view that are not equivalent may end up having same hash. 2 equivalent column_view will have same hash, but vice versa may not be true always.

@karthikeyann karthikeyann removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels Sep 21, 2021
@jrhemstad
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1cb527f into rapidsai:branch-21.10 Sep 22, 2021
karthikeyann added a commit that referenced this pull request Sep 23, 2021
rapids-bot bot pushed a commit that referenced this pull request Sep 23, 2021
rapids-bot bot pushed a commit that referenced this pull request Oct 5, 2021
…ed on column_view instead of requests. (#9195)

This PR updates groupby result_cache to use `pair<column_view, aggregation>` as key to unordered_map.
This allows to cache intermediate results based on the column view. So, it is possible to cache children column_view results and can be resused in other aggregation_request.

Depends on #9185
shallow_hash and is_shallow_equivalent are used for column_view.

Additional context:
This change is required to cache children column intermediate results in #9154 and allows to be shared across multiple aggregation requests.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9195
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue DO NOT MERGE Hold off on merging; see PR for details feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
6 participants