parquet: Add an option to not parse the Page Index on each query #12547

progval · 2024-09-20T10:49:29Z

Is your feature request related to a problem or challenge?

CREATE TABLE does not parse the Page Index, and SELECT does not cache it, so it is re-parsed on every query. On large Parquet datasets this can take a significant amount of time, even for queries that return only a few rows.

For example, with a simple SELECT int_column, other_int_column WHERE int_column=123456 on a table with 184 billion rows (so about 9 million Page Index items, given the default page size of 20k rows), the scan reports these metrics:

output_rows=0, elapsed_compute=96ns, num_predicate_creation_errors=0, page_index_rows_filtered=0, predicate_evaluation_errors=0, row_groups_pruned_bloom_filter=21050, row_groups_matched_bloom_filter=0, file_open_errors=0, file_scan_errors=0, bytes_scanned=25023432248, row_groups_matched_statistics=21050, pushdown_rows_filtered=0, row_groups_pruned_statistics=173576, time_elapsed_scanning_total=16.763964ms, page_index_eval_time=3.153918ms, time_elapsed_scanning_until_data=16.745759ms, time_elapsed_processing=61.531313027s, time_elapsed_opening=96.012649352s, pushdown_eval_time=382ns
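To make the imbalance in these metrics easier to see, here is a minimal sketch that parses the line above and compares the time spent opening files with the time spent scanning. The parser and its names (`parse_metrics`, `METRICS`) are illustrative helpers written for this issue, not part of DataFusion, and they assume every value is either a plain integer or a duration with an `ns`, `µs`/`us`, `ms` or `s` suffix.

```python
import re

# Metrics line copied verbatim from the report above.
METRICS = (
    "output_rows=0, elapsed_compute=96ns, num_predicate_creation_errors=0, "
    "page_index_rows_filtered=0, predicate_evaluation_errors=0, "
    "row_groups_pruned_bloom_filter=21050, row_groups_matched_bloom_filter=0, "
    "file_open_errors=0, file_scan_errors=0, bytes_scanned=25023432248, "
    "row_groups_matched_statistics=21050, pushdown_rows_filtered=0, "
    "row_groups_pruned_statistics=173576, "
    "time_elapsed_scanning_total=16.763964ms, page_index_eval_time=3.153918ms, "
    "time_elapsed_scanning_until_data=16.745759ms, "
    "time_elapsed_processing=61.531313027s, time_elapsed_opening=96.012649352s, "
    "pushdown_eval_time=382ns"
)

# Conversion factors from each duration suffix to seconds.
_UNITS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0}
_DURATION = re.compile(r"^([0-9.]+)(ns|µs|us|ms|s)$")


def parse_metrics(line):
    """Return {name: value}; durations become seconds (float), counters stay int."""
    parsed = {}
    for item in line.split(", "):
        key, _, raw = item.partition("=")
        match = _DURATION.match(raw)
        if match:
            parsed[key] = float(match.group(1)) * _UNITS[match.group(2)]
        else:
            parsed[key] = int(raw)
    return parsed


metrics = parse_metrics(METRICS)
opening = metrics["time_elapsed_opening"]
scanning = metrics["time_elapsed_scanning_total"]
print(f"opening:  {opening:.3f} s")
print(f"scanning: {scanning * 1e3:.3f} ms")
print(f"opening / scanning: {opening / scanning:.0f}x")

# Estimate from the report: 184 billion rows at 20k rows per page.
estimated_pages = 184_000_000_000 // 20_000
print(f"estimated Page Index items: {estimated_pages:,}")
```

With these numbers, opening the files takes several thousand times longer than scanning them, which is consistent with the Page Index being re-parsed on every query.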

Describe the solution you'd like

Parse the Page Index once and for all, either on CREATE TABLE or lazily as SELECT queries read the files. (Note that for partitioned tables, the first SELECT may not read every file.)
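As a rough illustration of the lazy variant, the sketch below caches the parsed Page Index per file and re-parses only when the file's size or modification time changes. Everything here is hypothetical: `PageIndexCache`, its `get` method and the `loader` callback are names invented for this sketch and do not correspond to any existing DataFusion API.

```python
import threading


class PageIndexCache:
    """Hypothetical cache that parses each file's Page Index at most once."""

    def __init__(self, loader):
        # loader(path) stands in for whatever actually parses the Page Index.
        self._loader = loader
        self._entries = {}
        self._lock = threading.Lock()

    def get(self, path, size, last_modified):
        # Size and mtime are part of the key, so a rewritten file is re-parsed.
        key = (path, size, last_modified)
        # Holding the lock while loading keeps the sketch simple, at the cost
        # of serializing loads of different files.
        with self._lock:
            if key not in self._entries:
                self._entries[key] = self._loader(path)
            return self._entries[key]


# Usage with a fake loader that records how often it is called.
calls = []


def fake_loader(path):
    calls.append(path)
    return {"path": path, "pages": []}


cache = PageIndexCache(fake_loader)
first = cache.get("part-0.parquet", 1024, 1726829369)
second = cache.get("part-0.parquet", 1024, 1726829369)
print(len(calls))  # the second lookup is served from the cache
```

Because entries are created on first access, files in partitions that no query touches are never parsed, which is the trade-off relative to parsing everything on CREATE TABLE.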

Describe alternatives you've considered

The advanced_parquet_index example:

https://github.com/apache/datafusion/blob/3b93cc952b889cec2364ad2490ae18ecddb3ca49/datafusion-examples/examples/advanced_parquet_index.rs

However, it requires using the low-level API and is not available through the SQL or Python interfaces.

Additional context

No response


progval added the enhancement New feature or request label Sep 20, 2024

This was referenced Sep 20, 2024

parquet: Add option to cache file metadata #12548 (Closed)

parquet: Add finer metrics on operations covered by time_elapsed_opening #12585 (Merged)

parquet: Add support for user-provided metadata loaders #12592 (Open)
