Make it possible to only scan part of a parquet file in a partition #1990
Conversation
cc @tustvold
I'm not sure I follow how this will work; parquet files have a block structure internally that is not amenable to seeking. In particular, with RLE data it is common for a column chunk to consist of a single page. Could you maybe expand a bit on this? On a more holistic level, is there any prior art on parallelising parquet reads? I've only ever encountered file-level and, rarely, row-group parallelism.
Hi @tustvold, the filter is based on row-group midpoint position, which was introduced recently in the parquet crate with apache/arrow-rs@2bca71e. The midpoint filtering is modeled after ParquetSplit and MetadataConverter; parquet row-group-level parallelism is used in the same way in MapReduce and Spark. Currently, this PR is still WIP, since only the physical plan changes are implemented. We translate the Spark physical plan to a DataFusion physical plan to run natively in DataFusion: https://github.com/blaze-init/spark-blaze-extension/blob/master/src/main/scala/org/apache/spark/sql/blaze/plan/NativeParquetScanExec.scala#L57-L63
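For illustration, here is a minimal Rust sketch of the midpoint rule using simplified stand-in types (`RowGroupInfo` and `FileRange` are made up for this example and are not the parquet crate's actual API): a row group is scanned by a partition only when the midpoint of its byte range falls inside the byte range assigned to that partition.

```rust
/// Simplified stand-in for the real parquet row-group metadata, used only to
/// illustrate the midpoint rule; not the actual parquet crate API.
struct RowGroupInfo {
    /// Byte offset of the row group within the parquet file.
    start: i64,
    /// Total compressed size of the row group in bytes.
    compressed_size: i64,
}

/// Hypothetical byte range assigned to one task/partition (end exclusive).
struct FileRange {
    start: i64,
    end: i64,
}

/// Keep only the row groups whose midpoint lies inside the assigned range.
/// Since each midpoint falls into exactly one of a set of disjoint ranges,
/// every row group is scanned by exactly one partition and none is scanned twice.
fn filter_row_groups(row_groups: &[RowGroupInfo], range: &FileRange) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| {
            let midpoint = rg.start + rg.compressed_size / 2;
            midpoint >= range.start && midpoint < range.end
        })
        .map(|(idx, _)| idx)
        .collect()
}

fn main() {
    // Two row groups of 100 bytes each; a range covering the first half of the
    // file selects only the first row group.
    let row_groups = vec![
        RowGroupInfo { start: 0, compressed_size: 100 },
        RowGroupInfo { start: 100, compressed_size: 100 },
    ];
    let range = FileRange { start: 0, end: 100 };
    assert_eq!(filter_row_groups(&row_groups, &range), vec![0]);
}
```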
Oh I think I misunderstood, this is using byte ranges to filter the row groups to scan, not to filter the rows within the row groups? That makes sense, and sounds like a useful addition 👍
Yes, to filter row groups, based on the RowGroup metadata as well.
@yjshen I am very interested in discussing and participating in the parallelization of physical execution. I have some questions about this task.
@liukun4515 there are already some configuration settings related to parallelism.
I am not sure how well these two parameters are respected in all DataFusion operators, but I think the configuration settings are reasonable.
Since @tustvold is working on the new task scheduler in DataFusion, I have kept this PR to physical plan changes only, leaving the planner and API unchanged. The current changes are still helpful for downstream projects like Ballista or our Blaze, where query planning is done separately, no matter what approach we take later in DataFusion core. In Blaze, we use the Spark way of deciding InputSplits based on total dataset size and assigning parts of a big parquet file to multiple tasks. @alamb @tustvold @liukun4515, please let me know what you think about this.
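As a rough sketch of that Spark-style splitting (the function name and the fixed target split size are illustrative, not the actual Blaze or Spark code), a large file is cut into contiguous byte ranges and each range becomes the input of one task:

```rust
/// Hypothetical byte range handed to a single task (end exclusive).
#[derive(Debug)]
struct FileRange {
    start: i64,
    end: i64,
}

/// Split a file of `file_size` bytes into contiguous ranges of at most
/// `target_split_size` bytes each. Row groups are later matched to a range
/// by their midpoint, so every row group is scanned by exactly one task.
fn split_file(file_size: i64, target_split_size: i64) -> Vec<FileRange> {
    let mut splits = Vec::new();
    let mut start = 0;
    while start < file_size {
        let end = (start + target_split_size).min(file_size);
        splits.push(FileRange { start, end });
        start = end;
    }
    splits
}

fn main() {
    // A 1 GiB file cut into 128 MiB splits yields 8 ranges.
    let splits = split_file(1 << 30, 128 << 20);
    assert_eq!(splits.len(), 8);
    println!("{splits:?}");
}
```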
Makes sense to me. Regardless of what happens with scheduling, having a mechanism to cheaply subdivide the input streams directly, as opposed to streaming the output through a repartitioning operator, seems like a useful feature to have. My expectation is that scheduling will help with over-provisioned parallelism in the plan, but we will still need mechanisms to express that parallelism in the first place 👍 If I have time, I'll take a look at this later today if nobody gets there first.
Looks like a good change to me. All that is missing is tests, I think.
My summary of this setting is that it would allow a user to get more parallelism in a plan by explicitly creating more partitions.
I believe that @tustvold is working on an alternate approach in #2079 and elsewhere that would decouple the plan's parallelism from its declared number of partitions, which might make this setting less valuable
Co-authored-by: Andrew Lamb <[email protected]>
This looks good to me, sorry for the delay re-reviewing
🎉
Which issue does this PR close?
Part of #944
Rationale for this change
Open up the possibility of scanning only part of a parquet file in a task/partition.
What changes are included in this PR?
Add FileRange to PartitionedFile.
Use the parquet crate to filter row groups according to row groups' mid-point positions in the parquet file.
Are there any user-facing changes?
No.
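For concreteness, here is a hedged sketch of what a PartitionedFile carrying an optional FileRange might look like; the actual DataFusion structs may differ in names and fields. Two partitions point at different byte ranges of the same large parquet file, and the scan in each partition reads only the row groups whose midpoint falls in its range.

```rust
/// Hypothetical byte range (end exclusive); the real FileRange may differ.
#[derive(Debug, Clone)]
struct FileRange {
    start: i64,
    end: i64,
}

/// Hypothetical shape of the change: a partitioned file optionally carries a
/// byte range, so several partitions can scan parts of the same parquet file.
#[derive(Debug, Clone)]
struct PartitionedFile {
    path: String,
    file_size: u64,
    /// None means "scan the whole file", as before this change.
    range: Option<FileRange>,
}

fn main() {
    // One 512 MiB file split across two partitions by byte range.
    let partitions = vec![
        vec![PartitionedFile {
            path: "data/large.parquet".into(),
            file_size: 512 << 20,
            range: Some(FileRange { start: 0, end: 256 << 20 }),
        }],
        vec![PartitionedFile {
            path: "data/large.parquet".into(),
            file_size: 512 << 20,
            range: Some(FileRange { start: 256 << 20, end: 512 << 20 }),
        }],
    ];
    println!("{partitions:#?}");
}
```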