Unbounded SortExec (and Top-K) Implementation When Req's Are Satisfied #12174
LGTM thank you
I recommend we add a test for this change (so that we don't accidentally break it in some future refactor).
pub fn with_fetch(&self, fetch: Option<usize>) -> Self {
    let mut cache = self.cache.clone();
    if fetch.is_some() {
        // When a theoretically unnecessary sort becomes a top-K (which
I don't fully understand how a top-k sort would become bounded. I may misunderstand what the ExecutionMode trait means, but it seems like TopK could not complete until its input completed, so if its input was unbounded, the sort itself would also be unbounded.
In general you are right -- however, you are missing that we turn it into Bounded only when the sort requirement is already satisfied. This happens when a sort "becomes" unnecessary during one of the plan optimization steps (and it will eventually get removed).
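A minimal sketch of the rule described in this reply. The `Mode` enum and `sort_mode` function are hypothetical stand-ins written for illustration, not DataFusion's actual `ExecutionMode` or `SortExec` API, and the exact mapping is an assumption based on this discussion:

```rust
/// Hypothetical stand-in for a plan node's execution mode
/// (an assumption for illustration, not DataFusion's own type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Bounded,
    Unbounded,
    PipelineBreaking,
}

/// Derives the mode of a sort node from its input mode, whether the
/// input already satisfies the required ordering, and the optional fetch.
fn sort_mode(input: Mode, sort_satisfied: bool, fetch: Option<usize>) -> Mode {
    match (input, sort_satisfied) {
        // A sort over bounded input always finishes.
        (Mode::Bounded, _) => Mode::Bounded,
        // Unbounded input that is already ordered: no real sort is needed.
        // With a fetch, the node can stop after `fetch` rows, so it is bounded;
        // without one, it streams rows through indefinitely.
        (Mode::Unbounded, true) => {
            if fetch.is_some() {
                Mode::Bounded
            } else {
                Mode::Unbounded
            }
        }
        // Anything else needs a real sort and cannot emit until its input ends.
        _ => Mode::PipelineBreaking,
    }
}
```

Note that the mapping is one-way: several combinations of inputs produce `Bounded`, so the mode alone does not reveal whether the sort requirement was satisfied.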
I made a wrong assumption about the implementation of top-k. Could you please take a look at the new idea? I will update the PR title and body.
FWIW I think the comments help explain what is going on here well. Thank you
Thanks for improving the corner cases to match the theory. IMO this is ready to go -- @alamb a quick look would be appreciated
Thanks @berkaysynnada and @ozankabak
I am sorry, but I am still struggling to understand this PR
It seems like this PR changes the execution plan for a Sort(limit = 5) into a Limit if the data is already sorted correctly (aka doesn't need an additional sort).
But in this case I would have expected the Sort to have been removed by one of the optimizer passes rather than the SortExec implementing a limit. 🤔
        execution_options.sort_in_place_threshold_bytes,
        &self.metrics_set,
        context.runtime_env(),
let sort_satisfied = self
This is the same calculation as self.execution_mode(), right? Maybe we could call self.execution_mode here instead to be more efficient and ensure the calculations remain in sync
It is similar but not exactly the same (i.e. execution mode is derived from sort_satisfied but AFAIK the reverse is not possible). I think @berkaysynnada tried this but it didn't work
I tried that, but it didn't work. Knowing whether the sort is bounded or unbounded does not tell you whether the sort requirement is satisfied.
    }

    fn fetch(&self) -> Option<usize> {
        self.fetch
    }
}

struct TopKStream {
I think it would help to add documentation to this struct, specifically documentation that explains how it differs from TopK
    fetch: usize,
}

impl Stream for TopKStream {
This looks very similar to LimitStream -- https://docs.rs/datafusion-physical-plan/41.0.0/src/datafusion_physical_plan/limit.rs.html#434 though limit stream has metrics and some other features
Yes, maybe we can reuse it. We will take a look
You are right -- sent a commit to reuse LimitStream
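The idea behind this exchange, that a top-K over input which is already in the required order reduces to a plain limit, can be sketched without any DataFusion types. The function below is a hypothetical illustration (plain `Vec<i32>` batches instead of record batches, no metrics), not the `LimitStream` implementation:

```rust
/// Returns the first `fetch` rows from a stream of batches that is assumed
/// to be already sorted, pulling no more batches once enough rows are collected.
/// Illustrative only: real record batches and metrics are omitted.
fn limit_sorted_batches<I>(batches: I, fetch: usize) -> Vec<i32>
where
    I: IntoIterator<Item = Vec<i32>>,
{
    let mut out = Vec::new();
    let mut iter = batches.into_iter();
    // Check the row count before pulling the next batch, so an unbounded
    // input is never polled after the limit has been reached.
    while out.len() < fetch {
        match iter.next() {
            Some(batch) => {
                let remaining = fetch - out.len();
                out.extend(batch.into_iter().take(remaining));
            }
            // The input ended before `fetch` rows were seen.
            None => break,
        }
    }
    out
}
```

Because the loop stops pulling input as soon as `fetch` rows are collected, it terminates even when the input never ends, which is why such a node is not pipeline-breaking.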
Right. The sort will actually be entirely removed in later passes. However, in the meantime (before its removal), any rule looking at the execution mode will still think this TopK is pipeline-breaking (while it is not), which results in them behaving incorrectly. We caught this behavior downstream in the context of a custom rule. Basically, this PR does two things:
Thanks for taking another look -- incorporated your feedback to leverage
I will go ahead and merge this soon since it is a small, localized change that improves corner cases without interfering with the main use case of SortExec. If there are any lingering concerns, we will address with a quick follow-on PR
Thank you for responding to the feedback @berkaysynnada and @ozankabak -- sorry for my delay -- I have been out. This PR now looks good to me
Which issue does this PR close?
Closes #.
Rationale for this change
SortExec (with or without fetch) can work without an actual sort if the existing input order already satisfies the sort requirement.
What changes are included in this PR?
Are these changes tested?
Yes. I could not exercise the newly added stream types via an .slt test, but a unit test is added.
Are there any user-facing changes?