-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add column indexes to Parquet writer #11302
Conversation
Co-authored-by: Bradley Dice <[email protected]>
fix reduce for aggregating types
@nvdbaranec and @hyperbolic2346 , thanks for the comments! Keep 'em coming! I think I've addressed most of them, but @nvdbaranec could you expand on the comment about calling functions with side effects from CUDF_EXPECTS. I'm happy to change, but just wondering what the downside is. |
If I can speak for him, since he is probably quit for the day, this is a sticky point in general related to macros. In past lives, we had macros like ASSERT(condition, reason), and people would put statements with side-effects into the condition. This worked until ASSERT was compiled to nothing in release builds and then suddenly things stopped working until you tried to debug it. I don't think we actually have a problem here, but my eye twitches when I see it as well. |
fair enough. don't want to cause any stress. I'll start reworking those tomorrow. It's beer-o-clock now :) |
I think I'm done with this round of fixes. Question: are there data types that don't support statistics at all (statistics_chunk.has_minmax is 0)? I don't have any tests for that if there are. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Yunsong Wang <[email protected]>
I don't know of any, but I'm not an expert here. :) |
These changes are a huge step in a great direction. I appreciate you taking the time to implement our suggestions. |
@gpucibot merge |
#11302 added `STATISTICS_COLUMN` to the `statistics_freq` enum in libcudf. This adds the same to python. Authors: - Ed Seidl (https://github.com/etseidl) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) - Vyas Ramasubramani (https://github.com/vyasr) URL: #11453
Adds `statistics_freq::STATISTICS_COLUMN` to list of parquet writer options to benchmark. This should have been included in #11302. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Nghia Truong (https://github.com/ttnghia) - Karthikeyan (https://github.com/karthikeyann) URL: #11955
Closes #9268.
The column indexes are actually two different structures. The column index itself which is essentially per-page min/max statistics, and the offset index which stores each page's location, compressed size, and first row index. Since the column index contains information already in the EncColumnChunk structure, I calculate and encode the column index per chunk on device, storing the result in a blob I added to the EncColumnChunk struct. The offset index requires information available only after writing the file, so it is created on the CPU and stored in the aggregate_writer_metadata struct. The indexes themselves are then written to the file before the footer.
The current implementation does not include truncation of the statistics as recommended. This will be addressed in a later PR.