Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize BooleanBufferBuilder for non nullable columns #6973

Closed
alamb opened this issue Jan 13, 2025 · 3 comments
Closed

Optimize BooleanBufferBuilder for non nullable columns #6973

alamb opened this issue Jan 13, 2025 · 3 comments
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Jan 13, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It is farily common to have data with no null values (all valid). However, the current builders such as StringBuilder always use a BooleanBufferBuilder (source) which allocate and manage a BooleanBuffer even if there are no actual nulls

This is an important enough special case that I think @JasonLi-cn added a special StringArrayBuilder in DataFusion which has a special case for no nulls

https://github.com/apache/datafusion/blob/63b94c8f9e128b938e81b7e867ce6256a94d67e6/datafusion/functions/src/strings.rs#L62-L68

This is bad both due to the maintenance overhead of a second near identical copy with a bunch of unsafe (leading to potential security issues, such as this and this)

The only difference is that it has an optimization for building a boolean buffer

Describe the solution you'd like
I would like to avoid the code duplication in DataFusion by avoiding the null buffer creation for data without nulls.

I think doing this optimization in the arrow-rs level would benefit many usecases without nulls

Describe alternatives you've considered

@domodwyer implemented a strategy for this idea in InfluxDB IOx in a clever and elegant way in InfluxDB IOx. The high level idea is to always update length but only allocate/resize buffers when a non null (valid) value pushed

In the internal implementation, this looks like:

    pub fn push(&mut self, bit: Value) {
        // Only NULL values require modifying the bitmap.
        if bit == Value::Null {
            self.append_null();
        }

        // Advance the bitmap length.
        self.len += 1;
    }

    /// Consume this [`NullMaskBuilder`] and return the constructed null mask
    /// bitmap.
    ///
    /// NOTE: this may have length of 0 if only non-null values were pushed.
    pub(super) fn finish(self) -> Vec<u8> {
        self.bitmap
    }

So if only nulls are pushed, the output vector is entirely empty (and thus no allocations occured)

While arrow uses a validation mask (not a null mask) so we would have to invert the logic, I think the same basic idea would apply:

Basically we could update the append (and related functions) to only advance the buffer when a null was appended and avoid the potential resize here if all the data was only null:
https://docs.rs/arrow-buffer/54.0.0/src/arrow_buffer/builder/boolean.rs.html#88

We would then have to change how append and other functions worked

https://docs.rs/arrow-buffer/54.0.0/src/arrow_buffer/builder/boolean.rs.html#138

Additional context

@alamb alamb added enhancement Any new improvement worthy of a entry in the changelog arrow Changes to the arrow crate labels Jan 13, 2025
@alamb alamb changed the title Optimize BooleanBufferBuilder for non nullable columns Optimize BooleanBufferBuilder for non nullable columns Jan 13, 2025
@tustvold
Copy link
Contributor

tustvold commented Jan 13, 2025

I might be missing something but the builders use NullBufferBuilder which already implements the described optimisation. This was added in #2127 about 2 years ago...

@alamb
Copy link
Contributor Author

alamb commented Jan 13, 2025

I might be missing something but the builders use NullBufferBuilder which already implements the described optimisation. This was added in #2127 about 2 years ago...

I double checked, I agree with your assessment. I have made a PR to improve the docs for next time hopefully:

Thank you for the spot

@alamb
Copy link
Contributor Author

alamb commented Jan 13, 2025

I found one reason maybe that i couldn't find this before:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants