Default datafusion.optimizer.prefer_existing_sort to true #8572
Comments
I believe it was a deliberate change, per the comment you reference: #7671 (comment)
In IOx it is better for our use case to use pre-existing sort orders, but I can see how for other use cases it may not be. If there is a consensus that changing the default setting would be less surprising, it would be fine with me to change it.
At the time of the original PR, we ran a benchmark for comparison, with the following results.
PLAN V1
PLAN V2
After these results, we decided to keep the existing default behavior (where V2 is preferred over V1) so as not to hurt performance for others. Frankly, changing the default would be better for our use cases. As @alamb suggests, if there is consensus we can change the default. I wonder what @ozankabak's opinion is regarding this change.
Right.
Focusing on our use cases, we would be fine with changing the default (it fits our use cases much better). However, it may result in a somewhat noticeable OOTB performance hit for some groups of users that use (or will use) DF for batch compute jobs. It may also hurt OOTB performance in certain batch benchmarks. I think it'd be a good idea to discuss this with a wider audience to better understand the implications.
Perhaps I am missing something, but in the example plan in the issue, setting this option to true causes the optimiser to remove the SortExec, with no modification to the rest of the plan. I struggle to see how this would lead to a performance regression, and by extension why this would not be the default behaviour? Perhaps the setting is overly restrictive on the optimizer?
Actually, it sets
But it only has one input partition, why does it need a streaming merge to do order preserving repartition?
By second I mean from bottom to top. In your original example
"SortPreservingMergeExec: [a@0 ASC NULLS LAST]",
"  SortExec: expr=[a@0 ASC NULLS LAST]",
"    RepartitionExec: partitioning=Hash([c@1], 8), input_partitions=8",
"      RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1",
"        CsvExec: file_groups={1 group: [[file_path]]}, projection=[a, c, d], output_ordering=[a@0 ASC NULLS LAST], has_header=true",

"SortPreservingMergeExec: [a@0 ASC NULLS LAST]",
"  RepartitionExec: partitioning=Hash([c@1], 8), input_partitions=8, preserve_order=true",
"    RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1",
"      CsvExec: file_groups={1 group: [[file_path]]}, projection=[a, c, d], output_ordering=[a@0 ASC NULLS LAST], has_header=true",
In this case
Aah makes sense. 👍 Tbh in that case I'd hope DF would strip all the repartitioning and merges out, they're all unnecessary and will just make it slower, but I suspect that's a different issue. Thanks for the discussion, closing this for now.
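For readers following along, the two plan shapes quoted earlier in the thread can be sketched as a toy model. This is only an illustration: the `Plan` enum, `build_plan`, and `contains_sort` are hypothetical names invented here, not DataFusion's API, and the sketch assumes the flag does nothing more than swap the `SortExec` for `preserve_order=true` on the hash repartition, which is what the two quoted plans show.

```rust
// Toy model of the two plans quoted in the thread.
// Hypothetical types: this is NOT DataFusion's API or optimizer logic.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    CsvScan { sorted_on: &'static str },
    RoundRobin { input: Box<Plan> },
    HashRepartition { preserve_order: bool, input: Box<Plan> },
    Sort { on: &'static str, input: Box<Plan> },
    SortPreservingMerge { on: &'static str, input: Box<Plan> },
}

// Build the plan shape for a given value of the (modelled) flag.
// true  -> order-preserving hash repartition, no Sort node
// false -> plain hash repartition followed by a Sort node
fn build_plan(prefer_existing_sort: bool) -> Plan {
    let scan = Plan::CsvScan { sorted_on: "a" };
    let round_robin = Plan::RoundRobin { input: Box::new(scan) };
    let hash = Plan::HashRepartition {
        preserve_order: prefer_existing_sort,
        input: Box::new(round_robin),
    };
    let below_merge = if prefer_existing_sort {
        hash
    } else {
        Plan::Sort { on: "a", input: Box::new(hash) }
    };
    Plan::SortPreservingMerge { on: "a", input: Box::new(below_merge) }
}

// Walk the tree and report whether any Sort node is present.
fn contains_sort(plan: &Plan) -> bool {
    match plan {
        Plan::CsvScan { .. } => false,
        Plan::Sort { .. } => true,
        Plan::RoundRobin { input }
        | Plan::HashRepartition { input, .. }
        | Plan::SortPreservingMerge { input, .. } => contains_sort(input),
    }
}
```

Calling `contains_sort(&build_plan(false))` finds the sort node, while `build_plan(true)` has none. The model deliberately says nothing about which shape is faster; that trade-off is what the benchmark discussion above is about.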
Is your feature request related to a problem or challenge?
Whilst working on #8540 I was surprised to see removing unbounded causing the DataFusion optimizer to not remove the SortExec from the below plan:
Doing some spelunking, this appears to be a regression introduced by #7671 (comment)
Describe the solution you'd like
I can't see an obvious reason not to enable this by default, as it seems like the more reasonable default, and is also consistent with how I historically remember DataFusion behaving.
Describe alternatives you've considered
No response
Additional context
FYI @alamb @ozankabak