[FEA] Add `encoding_stats` footer metadata to the parquet writer #15313

GregoryKimball · 2024-03-14T22:46:03Z

Is your feature request related to a problem? Please describe.

The parquet-cpp-arrow writer includes ColumnChunk encoding_stats after the ColumnChunk statistics in the Parquet file footer. The encoding stats are useful for providing a total page count, tracking RLE_DICTIONARY fallback to PLAIN encoding, and verifying optional V2 encodings such as DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY.
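To make the first two uses concrete, here is a minimal sketch that summarizes a list of encoding-stats entries. The `PageEncodingStats` dataclass and the `summarize` helper are hypothetical names for this illustration, not part of cudf or Arrow. The integer codes are assumed to follow the `PageType` and `Encoding` enums in the parquet-format Thrift definition, and the sample entries are made up.

```python
from dataclasses import dataclass

# Enum values assumed from parquet.thrift (PageType and Encoding).
DATA_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2 = 0, 2, 3
PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY = 0, 2, 8


@dataclass(frozen=True)
class PageEncodingStats:
    """Hypothetical stand-in for one encoding_stats entry in the footer."""
    page_type: int
    encoding: int
    count: int


def summarize(stats):
    """Return page counts and whether dictionary encoding fell back to PLAIN."""
    data = [s for s in stats if s.page_type in (DATA_PAGE, DATA_PAGE_V2)]
    has_dictionary_page = any(s.page_type == DICTIONARY_PAGE for s in stats)
    plain_data_pages = sum(s.count for s in data if s.encoding == PLAIN)
    return {
        "total_pages": sum(s.count for s in stats),
        "data_pages": sum(s.count for s in data),
        # A dictionary page plus PLAIN data pages suggests the writer
        # abandoned dictionary encoding partway through the chunk.
        "fell_back_to_plain": has_dictionary_page and plain_data_pages > 0,
    }


# Made-up example: one dictionary page, three dictionary-encoded data pages,
# then two PLAIN data pages after the fallback.
example = [
    PageEncodingStats(DICTIONARY_PAGE, PLAIN, 1),
    PageEncodingStats(DATA_PAGE, RLE_DICTIONARY, 3),
    PageEncodingStats(DATA_PAGE, PLAIN, 2),
]
summary = summarize(example)
print(summary)
```

Without encoding_stats, a reader would have to walk every page header in the chunk to recover the same counts.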

parquet-tools is a simple command-line tool for inspecting Parquet files.

Here is an example of the encoding_stats data from the parquet-cpp-arrow writer, version 14.0.2, followed by the footer from the cudf writer, which has no encoding_stats:

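import pandas as pd
import cudf

# Written through pandas (parquet-cpp-arrow)
df = pd.DataFrame({'a': [1,2]})
df.to_parquet('cpp-arrow.pq')

# Written by the cudf Parquet writer
df = cudf.DataFrame({'a': [1,2]})
df.to_parquet('cudf.pq')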

parquet-tools inspect --detail cpp-arrow.pq

ColumnChunk
    file_offset = 108
    meta_data = ColumnMetaData
        type = 2
        encodings = list
            0
            3
            8
        path_in_schema = list
            a
        codec = 1
        num_values = 2
        total_uncompressed_size = 100
        total_compressed_size = 104
        data_page_offset = 36
        dictionary_page_offset = 4
        statistics = Statistics
            max = b'\x02\x00\x00\x00\x00\x00\x00\x00'
            min = b'\x01\x00\x00\x00\x00\x00\x00\x00'
            max_value = b'\x02\x00\x00\x00\x00\x00\x00\x00'
            min_value = b'\x01\x00\x00\x00\x00\x00\x00\x00'
        encoding_stats = list
            PageEncodingStats
                page_type = 2
                count = 1
            PageEncodingStats
                encoding = 8
                count = 1

parquet-tools inspect --detail cudf.pq

ColumnChunk
    meta_data = ColumnMetaData
        type = 2
        encodings = list
            0
        path_in_schema = list
            a
        codec = 1
        num_values = 2
        total_uncompressed_size = 33
        total_compressed_size = 29
        data_page_offset = 4
        statistics = Statistics
            max_value = b'\x02\x00\x00\x00\x00\x00\x00\x00'
            min_value = b'\x01\x00\x00\x00\x00\x00\x00\x00'
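The dumps print Thrift enum values as bare integers. As a reading aid, the sketch below maps the codes that appear in them to names, assuming the `Type`, `Encoding`, and `CompressionCodec` enums from the parquet-format Thrift definition. `describe_chunk` and `decode_int64_stat` are hypothetical helper names used only for this illustration.

```python
# Enum values assumed from parquet.thrift (parquet-format).
PHYSICAL_TYPE = {
    0: "BOOLEAN", 1: "INT32", 2: "INT64", 3: "INT96",
    4: "FLOAT", 5: "DOUBLE", 6: "BYTE_ARRAY", 7: "FIXED_LEN_BYTE_ARRAY",
}
ENCODING = {
    0: "PLAIN", 2: "PLAIN_DICTIONARY", 3: "RLE", 4: "BIT_PACKED",
    5: "DELTA_BINARY_PACKED", 6: "DELTA_LENGTH_BYTE_ARRAY",
    7: "DELTA_BYTE_ARRAY", 8: "RLE_DICTIONARY", 9: "BYTE_STREAM_SPLIT",
}
CODEC = {
    0: "UNCOMPRESSED", 1: "SNAPPY", 2: "GZIP", 3: "LZO",
    4: "BROTLI", 5: "LZ4", 6: "ZSTD", 7: "LZ4_RAW",
}


def describe_chunk(type_id, encodings, codec):
    """Translate the integer codes of one ColumnChunk dump into enum names."""
    return {
        "type": PHYSICAL_TYPE[type_id],
        "encodings": [ENCODING[e] for e in encodings],
        "codec": CODEC[codec],
    }


def decode_int64_stat(raw):
    """INT64 min/max statistics are stored as 8 little-endian bytes."""
    return int.from_bytes(raw, byteorder="little", signed=True)


# Values copied from the two dumps above.
arrow_chunk = describe_chunk(2, [0, 3, 8], 1)
cudf_chunk = describe_chunk(2, [0], 1)
max_value = decode_int64_stat(b"\x02\x00\x00\x00\x00\x00\x00\x00")
min_value = decode_int64_stat(b"\x01\x00\x00\x00\x00\x00\x00\x00")
print(arrow_chunk, cudf_chunk, min_value, max_value)
```

Read this way, both chunks are INT64 with SNAPPY compression. The parquet-cpp-arrow chunk lists PLAIN, RLE, and RLE_DICTIONARY, while the cudf chunk lists only PLAIN.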


GregoryKimball · 2024-03-14T22:47:54Z

@etseidl Do you think this would be useful... or is it a waste of bytes?

etseidl · 2024-03-14T23:12:18Z

> Do you think this would be useful... or is it a waste of bytes?

I took a quick look at the parquet-format site to see why it was added. It seems that knowing all pages are dictionary encoded helps with predicate pushdown. I assume you can just decode the dictionary page and eliminate the entire column chunk if a filtering condition has no matches. Sounds like a good thing to add.
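A minimal sketch of that pushdown idea, under stated assumptions: encoding stats are given as `(page_type, encoding, count)` tuples using the parquet-format enum values, and the dictionary page has already been decoded into a Python list. The function names are hypothetical and this is not how any particular reader implements it.

```python
# Enum values assumed from parquet.thrift.
DATA_PAGE, DATA_PAGE_V2 = 0, 3
DICTIONARY_ENCODINGS = {2, 8}  # PLAIN_DICTIONARY, RLE_DICTIONARY


def all_data_pages_dictionary_encoded(encoding_stats):
    """True only if every data page in the chunk uses a dictionary encoding."""
    data = [
        (encoding, count)
        for page_type, encoding, count in encoding_stats
        if page_type in (DATA_PAGE, DATA_PAGE_V2) and count > 0
    ]
    return bool(data) and all(enc in DICTIONARY_ENCODINGS for enc, _ in data)


def can_skip_chunk(encoding_stats, dictionary_values, predicate):
    """Decide whether a column chunk can be skipped without reading data pages.

    Skipping is safe only when the dictionary holds every non-null value in
    the chunk, which requires that no data page fell back to another encoding.
    Predicates that can match nulls are outside the scope of this sketch.
    """
    if not all_data_pages_dictionary_encoded(encoding_stats):
        return False
    return not any(predicate(value) for value in dictionary_values)


# Made-up chunk: one dictionary page (type 2) and four RLE_DICTIONARY data pages.
fully_dict = [(2, 0, 1), (0, 8, 4)]
# Same chunk, but one data page fell back to PLAIN (encoding 0).
with_fallback = [(2, 0, 1), (0, 8, 3), (0, 0, 1)]
dictionary = [1, 2, 5]

skip_a = can_skip_chunk(fully_dict, dictionary, lambda v: v > 10)
skip_b = can_skip_chunk(fully_dict, dictionary, lambda v: v == 5)
skip_c = can_skip_chunk(with_fallback, dictionary, lambda v: v > 10)
print(skip_a, skip_b, skip_c)
```

The third case shows why the stats matter: once any data page is PLAIN, the dictionary no longer bounds the values in the chunk, so the reader has to scan it.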

Closes #15313

Authors:
- Ed Seidl (https://github.com/etseidl)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #15452

GregoryKimball added the feature request, libcudf, and cuIO labels Mar 14, 2024

GregoryKimball added this to the Parquet continuous improvement milestone Mar 14, 2024

GregoryKimball added this to libcudf Mar 14, 2024

etseidl mentioned this issue Apr 3, 2024

Add Parquet encoding statistics to column chunk metadata #15452

Merged

3 tasks

rapids-bot bot closed this as completed in #15452 Apr 26, 2024

GregoryKimball removed this from libcudf Jul 22, 2024
