-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet parallel scan #5057
Parquet parallel scan #5057
Conversation
Only had time to take a brief look at this PR, and so I'm likely missing something but please bear with me 😄 This PR modifies I have two suggestions that may be stupid:
|
}; | ||
|
||
if collect_file_ranges { | ||
let file_ranges = parquet_metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW the way these ranges are applied in parquet is based on if the row group's midpoint lies within the given range, as a result there is no requirement that these ranges exactly delimit boundaries.
For example you could take a parquet file of 2GB and blindly chop it into 4x 512MB slices. This makes the assumption that there are at least 4 row groups and the row groups are similarly sized, in practice this is probably fine. This is what Spark does and avoids needing the file's metadata to do the optimisation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, cutting file on N even parts will allow to read only row groups having their start offset inside corresponding ranges without any duplicate reads or skipped row groups - so, splitting could be much easier without using metadata (except for ObjectMetadata for the size)
@tustvold , thank you for the comments! Initially my intention was to handle scan planning as early as possible, so But, yeah, now I see that physical optimizer, and especially its I guess I'll convert this PR to draft and come up a bit later with updated version of this optimizer rule. |
I've reworked this PR by utilizing @tustvold, thank you for the tip about using physical optimizer! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really cool, left some minor comments
@@ -846,6 +871,182 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the test coverage
9cc44c4
to
9caa62a
Compare
53e8fc5
to
d6e95f7
Compare
d6e95f7
to
997b63e
Compare
I plan to review this carefully either later today or tomorrow. Very exciting @korowa -- thank you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the code carefully and I really like it. Thank you @korowa -- do you have any performance benchmarks you can share? I think this will mostly help when scanning single large parquet files.
I would like to explore turning this feature on by default (perhaps we can have a separate ticket to track that)
I already feel bad that we don't have other parquet options enabled by default.
My measurements suggest this setting can improve the performance with single large parquet files significantly (over 2x in my measurement). 👨🍳 👌 -- very nice I tested this out by making a 9G parquet file from https://github.com/tustvold/access-log-gen/ Then using datafusion-cli: ❯ select avg(request_bytes), avg(response_bytes), avg(response_status), host from '/Users/alamb/Software/access-log-gen/logs.9G.parquet' group by host;
...
927 rows in set. Query took 2.313 seconds. And then I enabled this setting: ❯ set datafusion.optimizer.repartition_file_scans = true;
0 rows in set. Query took 0.000 seconds.
❯ select avg(request_bytes), avg(response_bytes), avg(response_status), host from '/Users/alamb/Software/access-log-gen/logs.9G.parquet' group by host;
927 rows in set. Query took 0.962 seconds. 😮 |
I have only these in my notes
which doesn't look like "official" benchmark at all 🙃 . These are results for this query from clickbench over 14GB parquet file on 2.6 GHz 6-Core Intel Core i7 |
Co-authored-by: Andrew Lamb <[email protected]>
As it works for now - yes, use case is mostly "relatively large files less than number of target_partitions" -- I guess it could be improved / reworked later to something like "perform repartitioning even for target_partitions in case there is significant skew in current partitioning"
I don't mind enabling parallelism by default and it seems to be the fastest way to deliver this feature, but (I'm not sure, just a suggestion) maybe better time for this will be in 1 (or 2) releases after the setting itself will be released? |
I think to make this effective we will need to have more runtime dynamics (aka using a morsel driven scheduler)
I agree -- let's get this PR merged in (default to off) and then plan to enable it by default in a few weeks (we just need to remember to do so!) |
"CoalescePartitionsExec", | ||
"AggregateExec: mode=Partial, gby=[], aggr=[]", | ||
// Multiple source files splitted across partitions | ||
"ParquetExec: limit=None, partitions={4 groups: [[x:0..75], [x:75..100, y:0..50], [y:50..125], [y:125..200]]}, projection=[c1]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is quite clever that the partitions have different parts of the same file 👍
I plan to leave this open for the rest of the weekend so others have a chance to comment if they want, and then merge on Monday |
Thanks again @korowa ❤️ |
Benchmark runs are scheduled for baseline = 74b05fa and contender = 67b1da8. 67b1da8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Thanks @korowa this is really cool! 👍 nice work! from the code
How could we make sure the file is dived into row group? 🤔 |
I filed #5125 to track turning this on by default |
Without the parquet file metadata we can't reliably, but it isn't important to correctness that we do so. Not needing the metadata significantly simplifies the planning and avoids potentially costly round trips to object store whilst planning. FWIW I believe this is a similar approach as taken by Spark. |
I believe what happens is that the file is divided into byte ranges and then the row groups whose data falls within those ranges are scanned. It would be good to double check this undertanding though |
More specifically it is the row groups with a midpoint that falls into the range, this means that so long as the ranges are disjoint there is no risk of reading the same row group twice |
@Ted-Jiang here is the exact place where DF decides to read/not to read RowGroup depending on range. So it actually isn't required to split files on ranges with boundaries same as RowGroups boundaries. |
Thanks for all kindly reply ! ❤️
So I think i miss this part cause the misunderstanding😂 |
Yeah, I can't find any places where range is used, except for row_group pruning |
Thanks again for everyone ✌️ |
Which issue does this PR close?
Closes #137.
Rationale for this change
Improved performance for reading single parquet file / parquet files in quantity less than number of cores in multicore runtime
What changes are included in this PR?
repartition_file_scans
&repartition_file_min_size
optimizer settings - by default repartitioning of file scans disabled, and performed if total size of files to scan greater than 10MB (to avoid splitting small amount of small files)get_repartitioned
in ParquetExec - returns cloned object with rangedPartitionedFile
s redistributed file groups inbase_config
repartition.rs
- now callsget_repartitioned
forParquetExec
in case repartitioning is allowed (upstream operator benefit from it / no data ordering violations / etc.)As any other repartitioning operation, parallelization is applied only in case ParquetExec is underloaded in terms of partitions -- two files scan will be distributed over 4 partitions, but no redistribution will be performed in case of "2 files - 2 target partitions".
Are these changes tested?
Tests for
ParquetExec.get_repartitioned
added, and more tests added forrepartition
rule -- mostly copies of existing tests for cases when parallelization should be ignored, to ensure it won't break physical plans.Are there any user-facing changes?
New configuration settings
repartition_file_scans
&repartition_file_min_size