Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use NullBufferBuilder instead of BooleanBufferBuilder for creating Null masks #14115

Closed
alamb opened this issue Jan 13, 2025 · 12 comments · Fixed by #14181, #14183, #14201, #14321 or #14325
Closed

Use NullBufferBuilder instead of BooleanBufferBuilder for creating Null masks #14115

alamb opened this issue Jan 13, 2025 · 12 comments · Fixed by #14181, #14183, #14201, #14321 or #14325
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers performance Make DataFusion faster

Comments

@alamb
Copy link
Contributor

alamb commented Jan 13, 2025

Is your feature request related to a problem or challenge?

DataFusion uses BooleanBuffer in several places to create Null buffers. I thought there was a clever optimization for handling data with no nulls which I filed in arrow-rs

However, @tustvold pointed out that NullBufferBuilder has exactly the optimization described:

I looked at the DataFusion codebase and found we have several examples of using BooleanBufferBuilder rather than NullBufferBuilder:

https://github.com/search?q=repo%3Aapache%2Fdatafusion%20BooleanBufferBuilder&type=code

It even has a reimplementation of the NullBufferBuilder optimization 🤦 :

/// Builder for an (optional) null mask
///
/// Optimized for avoid creating the bitmask when all values are non-null
#[derive(Debug)]
pub(crate) enum MaybeNullBufferBuilder {
/// seen `row_count` rows but no nulls yet
NoNulls { row_count: usize },
/// have at least one null value
///
/// Note this is an Arrow *VALIDITY* buffer (so it is false for nulls, true
/// for non-nulls)
Nulls(BooleanBufferBuilder),
}

Describe the solution you'd like

I would like to switch DataFusion to using NullBufferBuilder instead of BooleanBufferBuilder as much as possible

Note that until the following PR is availble, this will involve adding an explicit dependency on arrow_buffer

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added enhancement New feature or request performance Make DataFusion faster good first issue Good for newcomers labels Jan 13, 2025
@alamb
Copy link
Contributor Author

alamb commented Jan 13, 2025

I think this would be a good first issue -- as it relatively straightforward

For exmaple, a good way to start might be to find the use of BooleanBufferBuilder in one file such as correlation.rs:

let mut nulls = BooleanBufferBuilder::new(n);

And just replace its use with NullBufferBuilder

This alone might improve performance noticably

@Chen-Yuan-Lai
Copy link
Contributor

take

@alamb
Copy link
Contributor Author

alamb commented Jan 14, 2025

BTW @Chen-Yuan-Lai I very much suggest doing this task as multiple smaller PRs if possible

(e.g. make a PR to replace the use in Correlation, make another PR to replace a single other use)

That will help us review the code more quickly and avoid blocking some of the changes if we hit some roadblock to one

@Chen-Yuan-Lai
Copy link
Contributor

@alamb Sure, thanks for the suggestion

@alamb
Copy link
Contributor Author

alamb commented Jan 14, 2025

Maybe once we have done one or two PRs we can use them as examples and file tickets for the other uses (to do the work in parallel). It is going to be awesome

@alamb
Copy link
Contributor Author

alamb commented Jan 15, 2025

I looked a little at this usage:

/// Builder for an (optional) null mask
///
/// Optimized for avoid creating the bitmask when all values are non-null
#[derive(Debug)]
pub(crate) enum MaybeNullBufferBuilder {
/// seen `row_count` rows but no nulls yet
NoNulls { row_count: usize },
/// have at least one null value
///
/// Note this is an Arrow *VALIDITY* buffer (so it is false for nulls, true
/// for non-nulls)
Nulls(BooleanBufferBuilder),
}

It turns out NullBuferBuilder would likely need to provide acess to the inner builder so we could implement is_null

@Chen-Yuan-Lai
Copy link
Contributor

Thanks @alamb for pointing out the detail. I will take it into consideration.

@Chen-Yuan-Lai
Copy link
Contributor

@alamb It seems there are some remaining examples. And I'm working on it, should we reopen the issue 😆 ?

@Chen-Yuan-Lai
Copy link
Contributor

Chen-Yuan-Lai commented Jan 20, 2025

@alamb I found that NullBufferBuilder doesn't have public methods for directly accessing the inner builder (only has as_slice() that returns the inner builder as a slice), so I can't modify the inner builder like trucate() in take_n(). Do you have any suggestions for how I might work around this limitation? 🙏

pub fn take_n(&mut self, n: usize) -> Option<NullBuffer> {
match self {
Self::NoNulls { row_count } => {
*row_count -= n;
None
}
Self::Nulls(builder) => {
// Copy over the values at n..len-1 values to the start of a
// new builder and leave it in self
//
// TODO: it would be great to use something like `set_bits` from arrow here.
let mut new_builder = BooleanBufferBuilder::new(builder.len());
for i in n..builder.len() {
new_builder.append(builder.get_bit(i));
}
std::mem::swap(&mut new_builder, builder);
// take only first n values from the original builder
new_builder.truncate(n);
Some(NullBuffer::from(new_builder.finish()))
}
}
}

@alamb
Copy link
Contributor Author

alamb commented Jan 20, 2025

@alamb It seems there are some remaining examples. And I'm working on it, should we reopen the issue 😆 ?

Yes, done, sorry -- I think github automatically closed it on me

@alamb alamb reopened this Jan 20, 2025
@alamb
Copy link
Contributor Author

alamb commented Jan 20, 2025

@alamb I found that NullBufferBuilder doesn't have public methods for directly accessing the inner builder (only has as_slice() that returns the inner builder as a slice), so I can't modify the inner builder like trucate() in take_n(). Do you have any suggestions for how I might work around this limitation? 🙏

I think take_n is a pretty specialized function and should probably stay in DataFusion (unless someone wants to optimize it upstream in arrow)

What I suggest is wrapping the NullBufferBuilder like this:

pub(crate) struct MaybeNullBufferBuilder(NullBufferBulder);

impl MaybeNullBufferBuilder {
...
    pub fn is_null(&self, row: usize) -> bool {
     // call inner NullBufferBuilder method
     self.0.get_bit(row) == 0
     }
...
}

In order to do this you will need to ensure NullBufferBuilder has all the required methods which will require an upstream PR to arrow-rs (for example to add get_bit and capacity)

Does that make sense @Chen-Yuan-Lai ?

@Chen-Yuan-Lai
Copy link
Contributor

Chen-Yuan-Lai commented Jan 21, 2025

Oh! I got it, I wiil create an issue for implementing some required methods in NullBufferBuilder. Thanks @alamb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment