Implement GroupColumn support for StringView / ByteView (faster grouping performance) #12809
This looks great -- I am going to try and hook it up and write a few tests
Thanks @Rachelint
Thanks @alamb, I am working on implementing the remaining main functions.
The remaining work is to add tests.
Amazing @Rachelint -- thank you -- I actually hacked a bit on it too on a plane ride -- I pushed what I had here: #12883. Maybe you can use / repurpose the tests. I'll try and find time to review this weekend, but I may not have as much time as normal.
Thanks, that helps a lot!
It is close to ready; let's add more unit test cases first.
Do not re-validate output is utf8
The comments about readability improvements are addressed. I am adding tests for better coverage.
@alamb 👍 Thanks for the reminder about test coverage. After checking the code again more carefully, I found that some test cases indeed don't cover the code paths I expected. I have refined the tests for
I plan to complete my review today (sorry I was out yesterday).
I went through this again @Rachelint -- this is really really neat and very cool (and fast 🚀 )
Thank you for your contributions to helping this project along. I can't wait to see how fast DataFusion 43.0.0 is on ClickBench
I merged up from main and plan to merge this PR tomorrow unless anyone else would like time to review. FYI @XiangpengHao and @Dandandan and @jayzhan211
Really exciting!
let arr = array.as_byte_view::<B>();

// Null row case, set and return
if arr.is_null(row) {
It would be nice in the future to avoid those null checks in GroupColumn (even if the input is a nullable field) for batches containing no nulls.
That really makes sense.
And I found that even for batches containing some nulls, we have already checked which rows are actually null in create_hashes.
Maybe we could reuse that check result?
It might make sense to pull the null / not null check into the caller of this function 🤔
🤔 I filed an issue about this (#12944), and I am trying the straightforward approach of using null_count.
Let's see the performance improvement.
Yeah, we need to do the check on the batch (if the batch contains no nulls, follow the fast path that omits the check), so I think the calling side indeed needs to perform this.
possibly interesting: one of the reasons special casing nulls/no-nulls can be helpful is that it permits better auto vectorization, as we are documenting here: apache/arrow-rs#6554
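A minimal sketch of the batch-level dispatch discussed in this thread, assuming a simplified column type rather than the real Arrow arrays. `Column`, `GroupValues`, and `append_batch` are hypothetical names for illustration only; they are not the DataFusion API. The point is that the caller checks `null_count` once per batch, so the no-null loop carries no per-row branch.

```rust
/// Hypothetical stand-in for an Arrow array: values plus an optional validity mask.
struct Column<'a> {
    values: Vec<&'a str>,
    /// `None` means the batch has no nulls; `Some(mask)` marks valid rows with `true`.
    validity: Option<Vec<bool>>,
}

impl<'a> Column<'a> {
    /// Number of null rows in this batch.
    fn null_count(&self) -> usize {
        self.validity
            .as_ref()
            .map_or(0, |mask| mask.iter().filter(|valid| !**valid).count())
    }
}

/// Hypothetical builder that stores group values together with a null flag per row.
#[derive(Default)]
struct GroupValues {
    values: Vec<String>,
    nulls: Vec<bool>,
}

impl GroupValues {
    /// Caller-side dispatch: decide once per batch, not once per row.
    fn append_batch(&mut self, column: &Column<'_>, rows: &[usize]) {
        if column.null_count() == 0 {
            // Fast path: the loop body has no null branch.
            for &row in rows {
                self.values.push(column.values[row].to_string());
                self.nulls.push(false);
            }
        } else {
            let mask = column
                .validity
                .as_ref()
                .expect("a non-zero null count implies a validity mask");
            // Slow path: check validity for every row.
            for &row in rows {
                if mask[row] {
                    self.values.push(column.values[row].to_string());
                    self.nulls.push(false);
                } else {
                    // Null rows get an empty placeholder value.
                    self.values.push(String::new());
                    self.nulls.push(true);
                }
            }
        }
    }
}
```

Whether this split pays off in practice depends on the measurements mentioned for #12944; the sketch only shows where the check would move.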
}
}

fn equal_to_inner(&self, lhs_row: usize, array: &ArrayRef, rhs_row: usize) -> bool {
Should we use GenericByteViewArray::compare_unchecked here?
Related usage: https://github.com/apache/arrow-rs/blob/master/arrow-ord/src/cmp.rs#L568
It seems we can't use it currently, because it only accepts two GenericByteViewArrays as input:
pub unsafe fn compare_unchecked(
    left: &GenericByteViewArray<T>,
    left_idx: usize,
    right: &GenericByteViewArray<T>,
    right_idx: usize,
) -> std::cmp::Ordering {
But in the equal_to_inner function, one of the inputs is a ByteViewGroupValueBuilder, and only the other is a GenericByteViewArray.
Actually, I implemented equal_to_inner by copying and modifying code from compare_unchecked.
🤔 Maybe we can make compare_unchecked accept more than just GenericByteViewArray (perhaps by defining a new trait for the inputs), and reuse it rather than copying code in the future?
I see, makes sense 👍
🤔 Maybe we can make compare_unchecked accept more than just GenericByteViewArray (perhaps by defining a new trait for the inputs), and reuse it rather than copying code in the future?
I agree, probably in follow-up work.
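A rough sketch of what such a shared trait could look like, written against plain `u128` views instead of the real arrow-rs types. `ViewSource`, `Store`, `make_view`, and `views_equal` are hypothetical names, not existing arrow-rs or DataFusion items. It assumes the standard Arrow view layout: a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix followed by a buffer index and an offset.

```rust
/// Build a 16-byte view for `value` stored at `offset` in buffer `buffer_index`.
fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    if value.len() <= 12 {
        // Short value: stored inline, zero padded.
        bytes[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long value: 4-byte prefix, then buffer index and offset.
        bytes[4..8].copy_from_slice(&value[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Hypothetical trait: anything that exposes views and data buffers,
/// so a builder and an array could share one comparison routine.
trait ViewSource {
    fn view(&self, row: usize) -> u128;
    fn buffer(&self, index: usize) -> &[u8];
}

/// Resolve the out-of-line bytes of a long (> 12 bytes) view.
fn long_value<S: ViewSource>(source: &S, view: u128) -> &[u8] {
    let len = view as u32 as usize;
    let buffer_index = (view >> 64) as u32 as usize;
    let offset = (view >> 96) as u32 as usize;
    &source.buffer(buffer_index)[offset..offset + len]
}

fn views_equal<L: ViewSource, R: ViewSource>(
    lhs: &L,
    lhs_row: usize,
    rhs: &R,
    rhs_row: usize,
) -> bool {
    let (l, r) = (lhs.view(lhs_row), rhs.view(rhs_row));
    // The low 64 bits hold the length and the first 4 bytes: a cheap early exit.
    if l as u64 != r as u64 {
        return false;
    }
    if (l as u32) <= 12 {
        // Inline values: the whole view is the value.
        return l == r;
    }
    long_value(lhs, l) == long_value(rhs, r)
}

/// Toy implementor standing in for both the builder and the array.
#[derive(Default)]
struct Store {
    views: Vec<u128>,
    buffers: Vec<Vec<u8>>,
}

impl Store {
    fn push(&mut self, value: &[u8]) {
        if value.len() <= 12 {
            self.views.push(make_view(value, 0, 0));
            return;
        }
        if self.buffers.is_empty() {
            self.buffers.push(Vec::new());
        }
        let offset = self.buffers[0].len() as u32;
        self.buffers[0].extend_from_slice(value);
        self.views.push(make_view(value, 0, offset));
    }
}

impl ViewSource for Store {
    fn view(&self, row: usize) -> u128 {
        self.views[row]
    }
    fn buffer(&self, index: usize) -> &[u8] {
        &self.buffers[index]
    }
}
```

This sketch only checks equality; an ordering comparison like compare_unchecked would need extra care around prefix byte order.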
Nice work! Looks good to me; I left a minor comment.
debug_assert!(value_len > 12);
let require_cap = self.in_progress.len() + value_len;

// If current block isn't big enough, flush it and create a new in progress block
Should this be "If current block is big enough"?
Should this be "If current block is big enough"?
Maybe it can be improved to "If the current in_progress block does not have enough room to hold the new value"?
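To make the intent of that comment concrete, here is a small sketch of the flush-and-start-a-new-block step, assuming a fixed maximum block size. `BlockBuffer`, `append`, and `max_block_size` are hypothetical names; only `in_progress` and `require_cap` come from the snippet under review, and the real builder may differ in its details.

```rust
/// Hypothetical holder for out-of-line (> 12 bytes) values.
struct BlockBuffer {
    /// Blocks that are full and no longer written to.
    completed: Vec<Vec<u8>>,
    /// The block currently being filled.
    in_progress: Vec<u8>,
    /// Assumed fixed capacity per block.
    max_block_size: usize,
}

impl BlockBuffer {
    fn new(max_block_size: usize) -> Self {
        Self {
            completed: Vec::new(),
            in_progress: Vec::with_capacity(max_block_size),
            max_block_size,
        }
    }

    /// Appends `value` and returns `(buffer_index, offset)` for its view.
    fn append(&mut self, value: &[u8]) -> (usize, usize) {
        let require_cap = self.in_progress.len() + value.len();

        // If the in-progress block does not have enough room to hold the
        // new value, flush it and start a new in-progress block.
        // (Skipping the flush for an empty block is a choice made in this sketch.)
        if require_cap > self.max_block_size && !self.in_progress.is_empty() {
            let flushed = std::mem::replace(
                &mut self.in_progress,
                Vec::with_capacity(self.max_block_size),
            );
            self.completed.push(flushed);
        }

        // The in-progress block always sits right after the completed ones.
        let buffer_index = self.completed.len();
        let offset = self.in_progress.len();
        self.in_progress.extend_from_slice(value);
        (buffer_index, offset)
    }
}
```

For example, with a 32-byte block size, appending 20, 10, and then 20 bytes places the first two values in block 0 and the third at the start of block 1.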
Well done 👍
🚀
Which issue does this PR close?
Closes #12771
Rationale for this change
The new column-based multi group-by values implementation has proved to be performant, but it does not yet support byte view columns.
This PR adds that support, for better performance when string view is enabled by default.
What changes are included in this PR?
Support the new column-based multi group values implementation for byte view columns.
Are these changes tested?
Yes, tested by new unit tests and e2e tests (most of them with help from @alamb).
Are there any user-facing changes?
No.