You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It is farily common to have data with no null values (all valid). However, the current builders such as StringBuilder always use a BooleanBufferBuilder (source) which allocate and manage a BooleanBuffer even if there are no actual nulls
This is an important enough special case that I think @JasonLi-cn added a special StringArrayBuilder in DataFusion which has a special case for no nulls
This is bad both due to the maintenance overhead of a second near identical copy with a bunch of unsafe (leading to potential security issues, such as this and this)
The only difference is that it has an optimization for building a boolean buffer
Describe the solution you'd like
I would like to avoid the code duplication in DataFusion by avoiding the null buffer creation for data without nulls.
I think doing this optimization in the arrow-rs level would benefit many usecases without nulls
Describe alternatives you've considered
@domodwyer implemented a strategy for this idea in InfluxDB IOx in a clever and elegant way in InfluxDB IOx. The high level idea is to always update length but only allocate/resize buffers when a non null (valid) value pushed
In the internal implementation, this looks like:
pubfnpush(&mutself,bit:Value){// Only NULL values require modifying the bitmap.if bit == Value::Null{self.append_null();}// Advance the bitmap length.self.len += 1;}/// Consume this [`NullMaskBuilder`] and return the constructed null mask/// bitmap.////// NOTE: this may have length of 0 if only non-null values were pushed.pub(super)fnfinish(self) -> Vec<u8>{self.bitmap}
So if only nulls are pushed, the output vector is entirely empty (and thus no allocations occured)
While arrow uses a validation mask (not a null mask) so we would have to invert the logic, I think the same basic idea would apply:
I might be missing something but the builders use NullBufferBuilder which already implements the described optimisation. This was added in #2127 about 2 years ago...
I might be missing something but the builders use NullBufferBuilder which already implements the described optimisation. This was added in #2127 about 2 years ago...
I double checked, I agree with your assessment. I have made a PR to improve the docs for next time hopefully:
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It is farily common to have data with no null values (all valid). However, the current builders such as
StringBuilder
always use a BooleanBufferBuilder (source) which allocate and manage aBooleanBuffer
even if there are no actual nullsThis is an important enough special case that I think @JasonLi-cn added a special
StringArrayBuilder
in DataFusion which has a special case for no nullshttps://github.com/apache/datafusion/blob/63b94c8f9e128b938e81b7e867ce6256a94d67e6/datafusion/functions/src/strings.rs#L62-L68
This is bad both due to the maintenance overhead of a second near identical copy with a bunch of
unsafe
(leading to potential security issues, such as this and this)The only difference is that it has an optimization for building a boolean buffer
Describe the solution you'd like
I would like to avoid the code duplication in DataFusion by avoiding the null buffer creation for data without nulls.
I think doing this optimization in the arrow-rs level would benefit many usecases without nulls
Describe alternatives you've considered
@domodwyer implemented a strategy for this idea in InfluxDB IOx in a clever and elegant way in InfluxDB IOx. The high level idea is to always update length but only allocate/resize buffers when a non null (valid) value pushed
In the internal implementation, this looks like:
So if only nulls are pushed, the output vector is entirely empty (and thus no allocations occured)
While arrow uses a validation mask (not a null mask) so we would have to invert the logic, I think the same basic idea would apply:
Basically we could update the append (and related functions) to only advance the buffer when a null was appended and avoid the potential resize here if all the data was only null:
https://docs.rs/arrow-buffer/54.0.0/src/arrow_buffer/builder/boolean.rs.html#88
We would then have to change how append and other functions worked
https://docs.rs/arrow-buffer/54.0.0/src/arrow_buffer/builder/boolean.rs.html#138
Additional context
The text was updated successfully, but these errors were encountered: