parquet: Add an option to not parse the Page Index on each query #12547

Open
progval opened this issue Sep 20, 2024 · 0 comments
Labels
enhancement New feature or request

Is your feature request related to a problem or challenge?

CREATE TABLE does not parse the Page Index, and SELECT does not cache it, so every query parses it again. On large Parquet datasets, this makes queries take a significant amount of time even when they return only a few rows.

For example, a simple SELECT int_column, other_int_column WHERE int_column=123456 on a table with 184 billion rows (so about 9 million Page Index items, given the default page size of 20k rows) reports the following metrics:

output_rows=0, elapsed_compute=96ns, num_predicate_creation_errors=0, page_index_rows_filtered=0, predicate_evaluation_errors=0, row_groups_pruned_bloom_filter=21050, row_groups_matched_bloom_filter=0, file_open_errors=0, file_scan_errors=0, bytes_scanned=25023432248, row_groups_matched_statistics=21050, pushdown_rows_filtered=0, row_groups_pruned_statistics=173576, time_elapsed_scanning_total=16.763964ms, page_index_eval_time=3.153918ms, time_elapsed_scanning_until_data=16.745759ms, time_elapsed_processing=61.531313027s, time_elapsed_opening=96.012649352s, pushdown_eval_time=382ns
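The "about 9 million" figure above can be reproduced with a rough back-of-the-envelope calculation. This is only a sketch: the function name is made up for illustration, it assumes every page holds exactly the stated number of rows, and it counts entries for a single column. Real files would likely have somewhat more entries, since pages do not span row group boundaries.

```python
def estimated_page_index_entries(total_rows: int, rows_per_page: int) -> int:
    """Estimate Page Index entries for one column, assuming full pages.

    Hypothetical helper for illustration; not part of DataFusion.
    """
    if total_rows < 0:
        raise ValueError("total_rows must not be negative")
    if rows_per_page <= 0:
        raise ValueError("rows_per_page must be positive")
    # Integer ceiling division: a partially filled last page still
    # needs its own index entry.
    return (total_rows + rows_per_page - 1) // rows_per_page


# Figures taken from the report: 184 billion rows, 20k rows per page.
entries = estimated_page_index_entries(184_000_000_000, 20_000)
print(f"{entries:,} Page Index entries per column")
```

With the figures from the report this gives 9.2 million entries per column, all of which are parsed again on each query.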

Describe the solution you'd like

Parse the Page Index once and reuse it, either on CREATE TABLE or lazily as SELECT queries read the files. (Note that for partitioned tables, the first SELECT may not read every file.)
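To make the lazy variant concrete, here is a minimal sketch of the idea in Python. It is not DataFusion code: `LazyPageIndexCache`, `loader`, and `cached_paths` are hypothetical names, and the loader stands in for whatever actually parses a file's Page Index. Each file is parsed the first time a query touches it and reused afterwards, so files in partitions that no query reads are never parsed.

```python
from typing import Any, Callable, Dict, List


class LazyPageIndexCache:
    """Parse each file's Page Index at most once, on first access.

    Hypothetical sketch; `loader` is an assumed callback that parses
    the Page Index of the file at the given path.
    """

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._entries: Dict[str, Any] = {}

    def get(self, path: str) -> Any:
        # Only the first request for a path pays the parsing cost.
        if path not in self._entries:
            self._entries[path] = self._loader(path)
        return self._entries[path]

    def cached_paths(self) -> List[str]:
        # Files that have been parsed so far, in sorted order.
        return sorted(self._entries)


def fake_loader(path: str) -> dict:
    # Stand-in for real parsing; returns a placeholder object.
    print(f"parsing Page Index of {path}")
    return {"path": path}


cache = LazyPageIndexCache(fake_loader)
cache.get("part=1/a.parquet")  # parsed here
cache.get("part=1/a.parquet")  # served from the cache
print(cache.cached_paths())
```

A real implementation would also need to handle invalidation when a file changes, concurrent access, and a bound on memory use, none of which this sketch addresses.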

Describe alternatives you've considered

The approach shown in the advanced_parquet_index.rs example:

https://github.com/apache/datafusion/blob/3b93cc952b889cec2364ad2490ae18ecddb3ca49/datafusion-examples/examples/advanced_parquet_index.rs

However, it requires using the low-level API and is not available through the SQL or Python interfaces.

Additional context

No response
