Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Array::logical_nulls to only copy when necessary #5209

Closed
wants to merge 2 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 14, 2023

Which issue does this PR close?

Closes #5208

Rationale for this change

I would like to avoid copies when unecessary

What changes are included in this PR?

  1. Return Cow<NullBuffer> rather than NullBuffer
  2. Update existing code
  3. Add comments and examples to make what is happening clearer (as I think using Cows requires finagling that is sometimes obtuse)

Are there any user-facing changes?

Yes, this is an API change

I tried to make it as easy as possible with documentation

@alamb alamb added the api-change Changes to the arrow API label Dec 14, 2023
@alamb alamb marked this pull request as ready for review December 14, 2023 15:59
@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 14, 2023
pub use binary_array::*;

mod boolean_array;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this re-organisation, it makes it harder to notice if you've missed a re-export

Copy link
Contributor Author

@alamb alamb Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do it on purpose -- my editor must have done it. I will revert it if we proceed with this PR

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if this is showing up in benchmarks as an overhead, a dyn dispatch is already quite expensive. In general something is a bit off if you're calling logical_nulls in a hot loop, as you should probably extract the null mask outside of this loop...

@alamb
Copy link
Contributor Author

alamb commented Dec 14, 2023

🤔 @tustvold and @Dandandan hvae points out that the overhead of copying NullBuffers is probably small

https://github.com/apache/arrow-rs/blob/802ed428f87051fdca31180430ddb0ecb2f60e8b/arrow-buffer/src/buffer/null.rs#L30C15-L33

The size of a NullBuffer is 48 bytes.

    println!("sizeof null buffer: {}", std::mem::size_of::<NullBuffer>());
sizeof null buffer: 48

So this PR would save copying 48 bytes and 1 atomic increment for the common case where the null buffer is not computed

@alamb
Copy link
Contributor Author

alamb commented Dec 14, 2023

I'm curious if this is showing up in benchmarks as an overhead, a dyn dispatch is already quite expensive. In general something is a bit off if you're calling logical_nulls in a hot loop, as you should probably extract the null mask outside of this loop...

No, this is only a theoretical concern

@tustvold
Copy link
Contributor

It is hard for me to approve a breaking change without a strong performance justification. It is also worth noting that Cow adds a branch on access, which might actually be far worse than some atomic increments...

@alamb
Copy link
Contributor Author

alamb commented Dec 14, 2023

It is hard for me to approve a breaking change without a strong performance justification. It is also worth noting that Cow adds a branch on access, which might actually be far worse than some atomic increments...

I agree to merge this we need performance benchmarks. I will try and find time to see if I can find some data one way or the other

@alamb
Copy link
Contributor Author

alamb commented Apr 16, 2024

Sorry for not closing this sooner.

BTW @andygrove added an automatic workflow to close stale PRs (like this one) in DataFusion: apache/datafusion#10046

It may be worth considering something like that for this repo too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid forced copy in Array::logical_nulls
2 participants