Allow disabling the writing of Parquet Offset Index #6778

Closed
etseidl opened this issue Nov 22, 2024 · 5 comments · Fixed by #6797
Labels: enhancement, parquet

Comments


etseidl commented Nov 22, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As of now, when writing Parquet files, the offset index structures are populated and written regardless of whether statistics and column indexes are written. It is unclear if this behavior is intended or not.

Describe the solution you'd like
Add a writer option to disable the collection and writing of the offset index. Of course the offset index is required if the column index is written, so this option would probably only be useful when not writing column indexes (i.e. when the statistics level is None or Chunk).

Describe alternatives you've considered
Alternatively, the writing of the offset index could be disabled whenever the column index is disabled (i.e. when the stats level is not Page). This solution assumes the current behavior is not intentional.
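The two policies above can be compared in a minimal sketch. It does not use the parquet crate: `StatsLevel` is a stand-in for the writer's statistics level, and the function names and the `offset_index_disabled` flag are hypothetical names chosen for illustration, not the crate's API.

```rust
/// Stand-in for the writer's statistics level (illustrative, not the crate's type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum StatsLevel {
    None,
    Chunk,
    Page,
}

/// The column index is only written when page-level statistics are collected.
fn writes_column_index(level: StatsLevel) -> bool {
    level == StatsLevel::Page
}

/// Proposed solution: an explicit opt-out flag (hypothetical name).
/// The flag is ignored when the column index is written, because the
/// column index requires the offset index.
fn writes_offset_index_with_option(level: StatsLevel, offset_index_disabled: bool) -> bool {
    writes_column_index(level) || !offset_index_disabled
}

/// Alternative: write the offset index only when the column index is written.
fn writes_offset_index_coupled(level: StatsLevel) -> bool {
    writes_column_index(level)
}
```

Under the explicit option, the default keeps today's behavior (offset index always written) and only an opt-out at level None or Chunk drops it. Under the coupled alternative, levels None and Chunk never get an offset index.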

Additional context

etseidl added the enhancement label Nov 22, 2024

etseidl commented Nov 23, 2024

After some git dumpster diving, it seems like at one point the offset indexes were not written if the column indexes were not valid. https://github.com/apache/arrow-rs/blame/fba19b0142daed54c181cdb8f634f29cf7d37f8d/parquet/src/column/writer/mod.rs#L503-L510

Writing of the page indexes was decoupled in #4567. Since the linked code handles a special case where page indexes are desired, but the column index cannot be written due to all NaNs, it seems the original intent was to not write the offset index. I think parquet-rs should then not write offset indexes if page statistics are not enabled.


etseidl commented Nov 23, 2024

take

tustvold commented

> I think parquet-rs should then not write offset indexes if page statistics are not enabled.

There are valid use-cases where the offset index is beneficial, but the page statistics might not be desired. For example, if using some external index or statistics. The offset index is critical to being able to efficiently perform pushdown.

I think we can add an option to disable offset index generation, but we should make this an explicit option and stick a big warning on it that it may severely degrade read performance.
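To illustrate why the offset index matters for pushdown even without page statistics, here is a rough sketch of how a reader might use it to fetch only the pages covering a row selection. `PageLocation` mirrors the fields of the offset index entry in the Parquet format; the function `pages_for_rows` and its signature are hypothetical and are not the parquet crate's reader API.

```rust
/// Mirrors the fields of a Parquet offset index entry.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PageLocation {
    /// Byte offset of the page within the file.
    offset: i64,
    /// Size of the page in bytes, including its header.
    compressed_page_size: i32,
    /// Index of the page's first row within the row group.
    first_row_index: i64,
}

/// Return the byte ranges `(start, end)` of the pages that hold any row in
/// the half-open range `rows`. `num_rows` is the row count of the row group.
fn pages_for_rows(
    pages: &[PageLocation],
    num_rows: i64,
    rows: std::ops::Range<i64>,
) -> Vec<(i64, i64)> {
    let mut byte_ranges = Vec::new();
    for (i, page) in pages.iter().enumerate() {
        // A page's rows end where the next page starts, or at the end of the row group.
        let page_end_row = pages
            .get(i + 1)
            .map_or(num_rows, |next| next.first_row_index);
        // Keep the page if its row span overlaps the requested rows.
        if page.first_row_index < rows.end && rows.start < page_end_row {
            let start = page.offset;
            byte_ranges.push((start, start + i64::from(page.compressed_page_size)));
        }
    }
    byte_ranges
}
```

Without the offset index, a reader has no such table of page boundaries and generally has to read through the column chunk's pages in order to find the rows it needs. An external index that identifies matching rows can still drive this kind of selective read as long as the offset index is present.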


etseidl commented Nov 23, 2024

Thanks @tustvold, sounds reasonable. I'll start on adding an option next week.


alamb commented Dec 17, 2024

label_issue.py automatically added labels {'parquet'} from #6797
