-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip decode steps in Parquet reader when nullable columns have no nulls #15332
Conversation
@abellina when you have a chance I'd be curious to see how this works with your benchmarks. cc @nvdbaranec |
@etseidl I'll take a look at this and we will try to run with our benchmarks, but I think it might be early next week. I am sorry about the delay in responding, I have to fix some communication settings on my end. I would not block this PR as we will detect regressions quickly in our benchmark runs. |
/ok to test |
The idea and changes look good to me. I would like to see improvement results from benchmarks as well. As always, thanks Ed for a great effort! 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few comments got left behind, otherwise 💯
Co-authored-by: Vukasin Milovanovic <[email protected]>
/ok to test |
Apologies for leaving a trailing whitespace in one of the comments. As they say, "this little maneuver will cost us 51 years" |
… to developer negligence
/ok to test |
😬 I'm fine with waiting to see how this works in the spark environment. It's easier to not commit than to have to revert 😄 |
I should have numbers for this today @etseidl sorry for the delay. |
Thanks for testing this @abellina 🙏 |
@etseidl I am not seeing a difference in our tests. I am having trouble getting baselines to be as fast as normal in this environment, so take this with a grain of salt, but if I compare last night's nightly against this branch it's just noise, some are higher some are lower, and the queries I expect to see impact from scan (query9, query88) don't really see a change.
|
Thanks @abellina. Looks like there are no regressions, this is good to merge. |
/merge |
Description
Closes #15266.
Some block and warp operations and some atomics can be avoided during Parquet page decoding when nullable columns are known to not contain any null values. One issue, however, is that since the columns are nullable, the null mask has to be set. My approach in this PR was to initialize nullable column buffers with an
ALL_VALID
mask rather thanALL_NULL
. This will work when nulls are present becausestore_validity()
sets the bitmask to all zeros before ORing in the passedvalid_mask
.Benchmarks modified to not emit nulls showed a good improvement in decoding time for fixed-width data types.
Checklist