Add column indexes to Parquet writer #11302

etseidl · 2022-07-19T21:31:42Z

Closes #9268.

The column indexes are actually two different structures: the column index itself, which is essentially per-page min/max statistics, and the offset index, which stores each page's location, compressed size, and first row index. Since the column index contains information already in the EncColumnChunk structure, I calculate and encode the column index per chunk on device, storing the result in a blob I added to the EncColumnChunk struct. The offset index requires information available only after writing the file, so it is created on the CPU and stored in the aggregate_writer_metadata struct. The indexes themselves are then written to the file before the footer.
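To make the two structures concrete, here is a minimal host-side sketch. The field names follow the Parquet format's `ColumnIndex`, `OffsetIndex`, and `PageLocation` definitions. The `page_info` struct and `build_offset_index` function are hypothetical names for illustration and are not the code in this PR. The sketch assumes the pages of a chunk are written back to back starting at `chunk_start`, and that each compressed size already includes the page header.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Mirrors Parquet's PageLocation: where a page lives and which row it starts at.
struct page_location {
  std::int64_t offset;                // byte offset of the page in the file
  std::int32_t compressed_page_size;  // page size in bytes, header included
  std::int64_t first_row_index;       // index of the page's first row within the row group
};

// Mirrors Parquet's OffsetIndex: one page_location per page.
struct offset_index {
  std::vector<page_location> page_locations;
};

// Mirrors the core of Parquet's ColumnIndex: per-page min/max statistics.
struct column_index {
  std::vector<bool> null_pages;         // true where a page holds only nulls
  std::vector<std::string> min_values;  // encoded minimum per page
  std::vector<std::string> max_values;  // encoded maximum per page
};

// Hypothetical per-page summary that is known once the pages have been written.
struct page_info {
  std::int32_t compressed_size;
  std::int64_t num_rows;
};

// Builds the offset index by accumulating page sizes and row counts.
// Assumption: pages are contiguous, beginning at chunk_start.
inline offset_index build_offset_index(std::int64_t chunk_start,
                                       std::vector<page_info> const& pages)
{
  offset_index idx;
  std::int64_t offset    = chunk_start;
  std::int64_t first_row = 0;
  for (auto const& p : pages) {
    idx.page_locations.push_back({offset, p.compressed_size, first_row});
    offset += p.compressed_size;
    first_row += p.num_rows;
  }
  return idx;
}
```

The offset index depends on final file offsets, which is why it can only be assembled after the pages are written, whereas the min/max values in the column index are known as soon as the page statistics are.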

The current implementation does not include truncation of the statistics as recommended. This will be addressed in a later PR.
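For reference, one common way to truncate binary min/max statistics while keeping them valid bounds is sketched below. This is an illustration only, not the approach taken in the later PR. `truncate_min` and `truncate_max` are hypothetical names, and the sketch assumes values are compared as unsigned bytes in lexicographic order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A prefix never compares greater than the full value, so cutting the
// minimum short still yields a valid lower bound.
inline std::string truncate_min(std::string const& value, std::size_t max_len)
{
  return value.substr(0, max_len);
}

// A prefix of the maximum may compare lower than the full value, so the last
// kept byte is incremented to restore an upper bound. Trailing 0xFF bytes
// cannot be incremented and are dropped first.
inline std::string truncate_max(std::string const& value, std::size_t max_len)
{
  if (value.size() <= max_len) { return value; }
  std::string prefix = value.substr(0, max_len);
  while (!prefix.empty()) {
    auto const last = static_cast<unsigned char>(prefix.back());
    if (last != 0xFF) {
      prefix.back() = static_cast<char>(last + 1);
      return prefix;
    }
    prefix.pop_back();
  }
  // Every kept byte was 0xFF: no shorter upper bound exists, keep the original.
  return value;
}
```

A reader that prunes pages with truncated statistics may read slightly more pages than necessary, but it never skips a page that contains matching rows.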


etseidl · 2022-07-22T01:14:49Z

@nvdbaranec and @hyperbolic2346, thanks for the comments! Keep 'em coming!

I think I've addressed most of them, but @nvdbaranec could you expand on the comment about calling functions with side effects from CUDF_EXPECTS? I'm happy to change, but just wondering what the downside is.

hyperbolic2346 · 2022-07-22T01:17:31Z

> could you expand on the comment about calling functions with side effects from CUDF_EXPECTS. I'm happy to change, but just wondering what the downside is.

If I can speak for him, since he has probably quit for the day, this is a sticky point in general related to macros. In past lives, we had macros like ASSERT(condition, reason), and people would put statements with side effects into the condition. This worked until ASSERT was compiled to nothing in release builds, and then suddenly things stopped working until you tried to debug it.

I don't think we actually have a problem here, but my eye twitches when I see it as well.
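The hazard described above can be reproduced with a small sketch. Both macros here are hypothetical stand-ins and not libcudf's: `ALWAYS_CHECK` evaluates its condition in every build, while `COMPILED_OUT_CHECK` models an ASSERT that expands to nothing in a release build. As the comment says, this is not thought to be an actual problem for CUDF_EXPECTS. The point is only that the pattern breaks as soon as a macro stops evaluating its argument.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical check that always evaluates its condition.
#define ALWAYS_CHECK(cond, reason) \
  do { if (!(cond)) { throw std::logic_error(reason); } } while (0)

// Hypothetical model of an ASSERT in a release build: the condition is discarded.
#define COMPILED_OUT_CHECK(cond, reason) ((void)0)

// A function with a side effect: it appends to the vector and reports success.
inline bool push_value(std::vector<int>& out, int v)
{
  out.push_back(v);
  return true;
}

// Side effect inside the condition, with the check active: the push happens.
inline std::size_t fragile_with_check()
{
  std::vector<int> out;
  ALWAYS_CHECK(push_value(out, 1), "push failed");
  return out.size();
}

// Same code, but the check is compiled out: the push silently disappears.
inline std::size_t fragile_without_check()
{
  std::vector<int> out;
  COMPILED_OUT_CHECK(push_value(out, 1), "push failed");
  return out.size();
}

// Robust form: run the side effect first, then check the stored result.
inline std::size_t robust()
{
  std::vector<int> out;
  bool const ok = push_value(out, 1);
  COMPILED_OUT_CHECK(ok, "push failed");
  (void)ok;  // silence the unused-variable warning when the check is compiled out
  return out.size();
}
```

Here `fragile_with_check()` and `robust()` both leave one element in the vector, while `fragile_without_check()` leaves none, which is the "works until release" behaviour described in the comment.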

etseidl · 2022-07-22T01:20:44Z

fair enough. don't want to cause any stress. I'll start reworking those tomorrow. It's beer-o-clock now :)

etseidl · 2022-07-22T23:14:28Z

I think I'm done with this round of fixes. Question: are there data types that don't support statistics at all (statistics_chunk.has_minmax is 0)? I don't have any tests for that if there are.

PointKernel

LGTM

cpp/src/io/parquet/page_enc.cu


hyperbolic2346 · 2022-07-26T18:34:12Z

> I think I'm done with this round of fixes. Question: are there data types that don't support statistics at all (statistics_chunk.has_minmax is 0)? I don't have any tests for that if there are.

I don't know of any, but I'm not an expert here. :)

hyperbolic2346 · 2022-07-26T18:51:35Z

These changes are a huge step in a great direction. I appreciate you taking the time to implement our suggestions.

vuule · 2022-07-26T18:56:39Z

@gpucibot merge

#11302 added `STATISTICS_COLUMN` to the `statistics_freq` enum in libcudf. This adds the same to python.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Ashwin Srinath (https://github.com/shwina)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #11453

Adds `statistics_freq::STATISTICS_COLUMN` to list of parquet writer options to benchmark. This should have been included in #11302.

Authors:
- Ed Seidl (https://github.com/etseidl)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Karthikeyan (https://github.com/karthikeyann)

URL: #11955

etseidl and others added 30 commits June 29, 2022 16:46

fix CheckPageRows to use datasources

7892c5a

add thrift support for parquet column and offset indexes

2ed90a0

fix a bug in writing of min/max statistics for decimal128 types in parquet

5303443

forgot to replace one fp_scratch

7349adb

Merge remote-tracking branch 'origin/feature/parquet-serde' into feature/colidx

617faf3

Merge remote-tracking branch 'origin/feature/decimal128_stats' into feature/colidx

8fba754

modify parquet writer to add column indexes

2a77e5b

change scratch to void*

65ea003

fix suggested by reviewer

6ef2f2b

Co-authored-by: Bradley Dice <[email protected]>

better documentation for read_footer function, and change return type to bool

80ec547

change typing in aggregation_type to match extrema_type

c4f0f9c

fix reduce for aggregating types

Merge branch 'rapidsai:branch-22.08' into feature/colidx

1591bdd

update copyright

7e8d038

Merge branch 'rapidsai:branch-22.08' into feature/colidx

f142141

use CUDF_EXPECTS rather than the EXPECT_XX macros when testing for valid input

b88807e

Merge remote-tracking branch 'origin/feature/11038' into feature/colidx

edb7f86

switch to using CUDF_EXPECTS more often, clean up debug statements

eed2920

add some comments to test code

ba6b9ac

Merge branch 'rapidsai:branch-22.08' into feature/colidx

47de717

Merge remote-tracking branch 'origin/feature/decimal128_stats' into feature/colidx

06822a6

Merge branch 'rapidsai:branch-22.08' into feature/colidx

5c4b50e

Merge branch 'rapidsai:branch-22.08' into feature/colidx

646135b

Merge branch 'rapidsai:branch-22.08' into feature/colidx

646d934

Merge branch 'branch-22.08' into feature/colidx

ef3997f

formatting

2133fda

add read_column_index and read_offset_index methods

e1f451c

add read_page_header

18f041b

add function to parse statistics

b330680

delete commented out line

562bf89

Merge branch 'rapidsai:branch-22.08' into feature/colidx

aae5aa9

etseidl added 6 commits July 22, 2022 07:50

do not call functions with side effects from macros

7cac483

refactor get_extremum

383925b

pass statistics_dtype rather than uint8_t

d814f6b

change converted_type to have enum type in parquet_column_device_view

f74d185

add more consts

82df8a9

add a little more clarification to column_index_buffer_size()

c6f3750

etseidl requested a review from nvdbaranec July 24, 2022 17:30

PointKernel approved these changes Jul 25, 2022


cpp/src/io/parquet/page_enc.cu

make compare constexpr per suggestion

17b3389

Co-authored-by: Yunsong Wang <[email protected]>

vuule requested a review from hyperbolic2346 July 25, 2022 21:31

nvdbaranec approved these changes Jul 26, 2022


hyperbolic2346 approved these changes Jul 26, 2022


rapids-bot bot merged commit 96f747b into rapidsai:branch-22.08 Jul 26, 2022

etseidl deleted the feature/colidx branch July 28, 2022 19:53

etseidl mentioned this pull request Aug 3, 2022

Add control of Parquet column index creation to python #11453

Merged

3 tasks

etseidl mentioned this pull request Aug 30, 2022

[BUG] Spark cannot do predicate push down on INTs and LONGs parquet columns written by CUDF #11626

Closed

etseidl mentioned this pull request Oct 20, 2022

Add full page indexes to Parquet writer benchmarks #11955

Merged

3 tasks
