Preload page index for async ParquetObjectReader #4090
Comments
The design of AsyncFileReader already allows this. In particular, implementations may override get_metadata and return a Metadata that already has the page index loaded.
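The override described in this comment might look roughly like the sketch below. It is a simplified model, not the parquet crate's API: `Metadata`, `FileReader`, `FooterOnlyReader`, and `PreloadedReader` are hypothetical stand-ins, and the trait is synchronous to keep the example dependency-free, whereas the real `AsyncFileReader` is async. The point is only the shape: a wrapper builds metadata with the page index attached once, then returns it from `get_metadata`.

```rust
use std::ops::Range;
use std::sync::Arc;

// Hypothetical stand-in for the crate's metadata type.
#[derive(Debug, Default)]
struct Metadata {
    num_row_groups: usize,
    // Both stay `None` until the page index has been read.
    column_index: Option<Vec<String>>,
    offset_index: Option<Vec<u64>>,
}

// Hypothetical, synchronous stand-in for `AsyncFileReader`.
trait FileReader {
    fn get_bytes(&mut self, range: Range<usize>) -> Vec<u8>;
    fn get_metadata(&mut self) -> Arc<Metadata>;
}

// A reader that only decodes the footer, so its metadata lacks the page index.
struct FooterOnlyReader {
    data: Vec<u8>,
}

impl FileReader for FooterOnlyReader {
    fn get_bytes(&mut self, range: Range<usize>) -> Vec<u8> {
        self.data[range].to_vec()
    }

    fn get_metadata(&mut self) -> Arc<Metadata> {
        Arc::new(Metadata { num_row_groups: 2, ..Default::default() })
    }
}

// Wrapper whose `get_metadata` returns metadata prepared at construction time.
struct PreloadedReader<R> {
    inner: R,
    metadata: Arc<Metadata>,
}

impl<R: FileReader> PreloadedReader<R> {
    // The caller supplies an already-decoded page index (how it was decoded is
    // outside this sketch).
    fn new(mut inner: R, column_index: Vec<String>, offset_index: Vec<u64>) -> Self {
        let base = inner.get_metadata();
        let metadata = Arc::new(Metadata {
            num_row_groups: base.num_row_groups,
            column_index: Some(column_index),
            offset_index: Some(offset_index),
        });
        Self { inner, metadata }
    }
}

impl<R: FileReader> FileReader for PreloadedReader<R> {
    // Byte reads are delegated unchanged.
    fn get_bytes(&mut self, range: Range<usize>) -> Vec<u8> {
        self.inner.get_bytes(range)
    }

    // No extra request here: the cached metadata is shared via `Arc`.
    fn get_metadata(&mut self) -> Arc<Metadata> {
        Arc::clone(&self.metadata)
    }
}
```

Because the metadata sits behind an `Arc`, every later call to `get_metadata` is a cheap pointer clone rather than another round trip.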
This is true, but there is no easy way to deserialize the page index asynchronously. Currently the easiest way I have found to do this is to fetch the relevant page index offsets, create a special implementation of

I have found that the above approach works, although it is extremely hacky, but I'd ask that the maintainers of this library at least consider exposing a built-in way to deserialize the page index in async code. Again, something like
👍 I will spend some time working out an async API for reading metadata. #3851 is also related.
PR #4216
Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently the ParquetMetaData object has optional fields for the column and offset indexes, which are unpopulated at first. When the ArrowReaderBuilder is created using ArrowReaderOptions::with_page_index(true), it loads the page index at query time. However, this is potentially suboptimal, as it incurs additional latency by making an extra request (typically to object storage, which is high-latency) for each query.

Describe the solution you'd like
A new method for the ParquetObjectReader that toggles loading the page index at construction time, something like this:

which would trigger conditional logic in the get_metadata function to return metadata with the page index already loaded.

Describe alternatives you've considered
A public async API for deserializing the column and offset index, similar to index_reader but with async support and integrated with AsyncFileReader to enable coalescing of multiple fetches.
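To illustrate what coalescing multiple fetches could mean for the page index, here is a minimal sketch that merges nearby byte ranges, such as the column index and offset index ranges of several row groups, into fewer requests. The function name `coalesce_ranges` and the `max_gap` parameter are assumptions made for this example, not part of the parquet crate.

```rust
use std::ops::Range;

/// Merge byte ranges that overlap or lie within `max_gap` bytes of each other,
/// so that one request can cover several small index reads.
fn coalesce_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    // Sort by start offset so neighbours in the file are adjacent in the list.
    ranges.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::with_capacity(ranges.len());
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: extend it instead of
            // issuing a separate fetch.
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

A caller would fetch each merged range once and then slice the individual column and offset index buffers out of the returned bytes. A larger `max_gap` trades some wasted bytes for fewer round trips, which usually matters more on high-latency object storage.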