Use NullBufferBuilder instead of BooleanBufferBuilder for creating Null masks #14115
Comments
I think this would be a good first issue -- it is relatively straightforward. For example, a good way to start might be to find the use of
And just replace its use with
This alone might improve performance noticeably.
take
BTW @Chen-Yuan-Lai I very much suggest doing this task as multiple smaller PRs if possible (e.g. make a PR to replace the use in Correlation, make another PR to replace a single other use). That will help us review the code more quickly and avoid blocking some of the changes if we hit a roadblock on one of them.
@alamb Sure, thanks for the suggestion.
Maybe once we have done one or two PRs we can use them as examples and file tickets for the other uses (to do the work in parallel). It is going to be awesome.
I looked a little at this usage: datafusion/datafusion/physical-plan/src/aggregates/group_values/null_builder.rs Lines 20 to 32 in 63b94c8
It turns out
Thanks @alamb for pointing out the detail. I will take it into consideration.
@alamb It seems there are some remaining examples, and I'm working on them. Should we reopen the issue 😆?
@alamb I found that datafusion/datafusion/physical-plan/src/aggregates/group_values/null_builder.rs Lines 110 to 132 in 63b94c8
Yes, done, sorry -- I think GitHub automatically closed it on me.
I think What I suggest is wrapping the NullBufferBuilder like this:
pub(crate) struct MaybeNullBufferBuilder(NullBufferBuilder);
impl MaybeNullBufferBuilder {
    ...
    pub fn is_null(&self, row: usize) -> bool {
        // call inner NullBufferBuilder method
        self.0.get_bit(row) == 0
    }
    ...
}
In order to do this you will need to ensure NullBufferBuilder has all the required methods, which will require an upstream PR to arrow-rs (for example to add
Does that make sense @Chen-Yuan-Lai ?
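To make the wrapper idea above concrete, here is a self-contained sketch that compiles without arrow-rs. `SketchNullBuilder` is a hypothetical stand-in for NullBufferBuilder, and its `is_valid` accessor is an assumption standing in for whatever read method ends up being added upstream. Only the shape of `MaybeNullBufferBuilder::is_null` is the point.

```rust
/// Hypothetical stand-in for arrow-rs `NullBufferBuilder`.
/// Assumption: the real type would gain a read accessor via an upstream PR.
#[derive(Debug, Default)]
struct SketchNullBuilder {
    /// Number of rows appended so far.
    len: usize,
    /// `true` = valid; stays `None` until the first null is appended.
    validity: Option<Vec<bool>>,
}

impl SketchNullBuilder {
    fn append(&mut self, valid: bool) {
        // Materialize the mask lazily, back-filling earlier rows as valid.
        if !valid && self.validity.is_none() {
            self.validity = Some(vec![true; self.len]);
        }
        if let Some(v) = self.validity.as_mut() {
            v.push(valid);
        }
        self.len += 1;
    }

    /// Hypothetical read accessor: without a mask, every row is valid.
    fn is_valid(&self, row: usize) -> bool {
        assert!(row < self.len, "row out of bounds");
        self.validity.as_ref().map_or(true, |v| v[row])
    }
}

/// Thin wrapper exposing a null-oriented API on top of the inner builder.
pub(crate) struct MaybeNullBufferBuilder(SketchNullBuilder);

impl MaybeNullBufferBuilder {
    pub fn new() -> Self {
        Self(SketchNullBuilder::default())
    }

    /// Append one row; `is_null == true` records a null.
    pub fn append(&mut self, is_null: bool) {
        self.0.append(!is_null);
    }

    /// Delegate to the inner builder rather than tracking state here.
    pub fn is_null(&self, row: usize) -> bool {
        !self.0.is_valid(row)
    }
}
```

The wrapper holds no state of its own, so swapping the stand-in for the real builder would only touch the delegating calls.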
Oh! I got it, I will create an issue for implementing some required methods in
Is your feature request related to a problem or challenge?
DataFusion uses BooleanBuffer in several places to create Null buffers. I thought there was a clever optimization for handling data with no nulls, which I filed in arrow-rs: BooleanBufferBuilder for non nullable columns arrow-rs#6973
However, @tustvold pointed out that NullBufferBuilder has exactly the optimization described.
I looked at the DataFusion codebase and found we have several examples of using BooleanBufferBuilder rather than NullBufferBuilder:
https://github.com/search?q=repo%3Aapache%2Fdatafusion%20BooleanBufferBuilder&type=code
It even has a reimplementation of the NullBufferBuilder optimization 🤦 :
datafusion/datafusion/physical-plan/src/aggregates/group_values/null_builder.rs
Lines 20 to 32 in 63b94c8
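The optimization being discussed can be illustrated without arrow-rs at all. The following is a minimal std-only sketch, assuming a hypothetical `LazyNullMask` type (it is not the arrow-rs implementation, and it uses a `Vec<bool>` rather than a packed bitmap): rows are only counted until the first null arrives, and the validity mask is allocated at that point.

```rust
/// Hypothetical illustration of the "no nulls, no buffer" idea.
/// This is NOT the arrow-rs `NullBufferBuilder`.
#[derive(Debug, Default)]
pub struct LazyNullMask {
    /// Number of rows appended so far.
    len: usize,
    /// Validity mask (`true` = valid), allocated on the first null only.
    mask: Option<Vec<bool>>,
}

impl LazyNullMask {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append a valid (non-null) row. Only a counter bump while no null
    /// has been seen.
    pub fn append_non_null(&mut self) {
        if let Some(mask) = self.mask.as_mut() {
            mask.push(true);
        }
        self.len += 1;
    }

    /// Append a null row, materializing the mask if this is the first null.
    pub fn append_null(&mut self) {
        let len = self.len;
        // Every row seen before the first null was valid.
        self.mask.get_or_insert_with(|| vec![true; len]).push(false);
        self.len += 1;
    }

    /// Returns `None` when every row was valid, so callers can skip the mask.
    pub fn finish(self) -> Option<Vec<bool>> {
        self.mask
    }
}
```

For columns that never contain a null, the builder in this sketch never allocates, which is the saving the issue is after.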
Describe the solution you'd like
I would like to switch DataFusion to using NullBufferBuilder instead of BooleanBufferBuilder as much as possible.
Note that until the following PR is available, this will involve adding an explicit dependency on arrow_buffer:
NullBufferBuilder in the arrow crate arrow-rs#6975
Describe alternatives you've considered
No response
Additional context
No response