Allow disabling the writing of Parquet Offset Index #6778
Comments
After some git dumpster diving, it seems that at one point the offset indexes were not written if the column indexes were not valid. https://github.com/apache/arrow-rs/blame/fba19b0142daed54c181cdb8f634f29cf7d37f8d/parquet/src/column/writer/mod.rs#L503-L510 Writing of the page indexes was decoupled in #4567. Since that was a special case where page indexes are desired but the column index cannot be written because the values are all NaN, it seems the original intent was to not write the offset index. I think parquet-rs should therefore not write offset indexes if page statistics are not enabled.
take
There are valid use cases where the offset index is beneficial but page statistics might not be desired, for example when using some external index or statistics. The offset index is critical to being able to perform pushdown efficiently. I think we can add an option to disable offset index generation, but we should make it an explicit option and attach a big warning that it may severely degrade read performance.
Thanks @tustvold, sounds reasonable. I'll start on adding an option next week.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As of now, when writing Parquet files, the offset index structures are populated and written regardless of whether statistics and column indexes are written. It is unclear if this behavior is intended or not.
Describe the solution you'd like
Add a writer option to disable the collection and writing of the offset index. Of course the offset index is required if the column index is written, so this option would probably only be useful when not writing column indexes (i.e. when the statistics level is `None` or `Chunk`).
Describe alternatives you've considered
Alternatively, the writing of the offset index could be disabled whenever the column index is disabled (i.e. when the stats level is not `Page`). This solution assumes the current behavior is not intentional.
Additional context
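For illustration, the two policies discussed above (an explicit opt-out versus tying the offset index to the column index) could be sketched roughly as follows. This is a standalone model, not parquet-rs code: `StatsLevel`, `SketchWriterOptions`, `offset_index_disabled`, and the method names are hypothetical stand-ins, and the sketch assumes the column index is written exactly when the statistics level is `Page` (it ignores the all-NaN case mentioned in the comments).

```rust
/// Hypothetical stand-in for the writer's statistics level (None, Chunk, Page).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StatsLevel {
    None,
    Chunk,
    Page,
}

/// Hypothetical writer options for this sketch; not the real `WriterProperties`.
#[derive(Clone, Copy, Debug)]
pub struct SketchWriterOptions {
    pub stats_level: StatsLevel,
    /// Assumed name for the proposed explicit opt-out flag.
    pub offset_index_disabled: bool,
}

impl SketchWriterOptions {
    /// Simplifying assumption: a column index is written only at `Page` level.
    pub fn writes_column_index(&self) -> bool {
        self.stats_level == StatsLevel::Page
    }

    /// Proposed solution: an explicit opt-out that is ignored whenever the
    /// column index is written, because the column index requires the offset index.
    pub fn writes_offset_index(&self) -> bool {
        self.writes_column_index() || !self.offset_index_disabled
    }

    /// Alternative considered: the offset index simply follows the column index.
    pub fn writes_offset_index_alternative(&self) -> bool {
        self.writes_column_index()
    }
}
```

Under the explicit-flag policy the default (flag unset) keeps today's behavior of always writing the offset index, which matches the concern that dropping it silently could degrade read performance.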