Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor
contains_table
with cuco::static_set #14064Refactor
contains_table
with cuco::static_set #14064Changes from 8 commits
8f2294f
f2fd994
6c85572
40dfead
bec3cd6
5ada788
204bd45
6ae6f2a
9c7f4f6
237fd70
7283dd3
0a8b678
93ba686
eca017f
fc0980f
b2934e5
218ab4f
6fcaa46
9553104
0b67f9b
71a8793
cd75057
e1125c3
37f7048
4f6af5d
de99c48
cb9614d
b848ada
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure it's worth the effort. I assume we shouldn't rely on any manual alignment for code formatting/linting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manual alignment is for dev reading, to enhance aesthetics/readability, not for the generated docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless of alignment, I think all lines should start with a single space and an asterisk.
https://github.com/rapidsai/cudf/blob/branch-23.10/cpp/doxygen/developer_guide/DOCUMENTATION.md#block-comments
I think this should apply to blank lines within the block comment as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good catch. Definitely yes, something went wrong with my IDE formatting. Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the case I'm interested in investigating in the other thread. Is the
<false, true>
dispatch providing special functionality or performance advantages that can't be achieved with 2 dispatches (<true, true>
or<false, false>
)?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I see your point.
Ideally, users should always check nested against nested. Having the
<false, true>
dispatch allows users to check nested needles against a flat haystack (though I'm not sure either if it's a legit use case), and the haystack set insertion would be faster than the<true, true>
dispatch.For my education, in this particular case, I assume
<true, true>
would work fine but less efficiently since we do redundant nested-related checks for a flat column. But would<false, false>
produce proper results?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked with @divyegala, this will cause runtime errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm misunderstanding the problem -- but it seems like a nested needle could never be found in a haystack without nesting. Shouldn't that fail because the types of the needle and haystack don't match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. The current
contains_table
allows this combination and returns 0 in this case. I'm trying to align with the existing behavior. Maybe @ttnghia knows more about what is the expected behavior here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the thing: when hashing the column with nulls, the validity (true/false) of the column will be hashed into the output hash:
When using hash table (map/set), we must ensure that the hash values for rows in both tables are consistent. That means, if one side table has nulls and hashed with nulls, the other side must also be hashed with the same
has_nulls
. Otherwise, one table is hashed differently than the other and you will not find the correct "contains" output. That's why you see the hashers are constructed usinghas_any_nulls
.And that is only relevant to hashing.
For equality comparator only (without hashing), we can use different
has_nulls
for tables.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a different issue. Null handling is properly propagated to both self and two-table comparators.
The focus was whether it's legit to have mismatched column types between haystack and needles. After checking the public
contains
API, I think the answer is clearer now: I've updated the code so now it will throw if column types mismatch. The implementation is therefore simplified as @bdice suggested. Nice catch, thanks! 🙏