Introduce StageManager for managing tasks stage by stage #1983
Conversation
Hi @alamb and @thinkharderdev, could you help review this PR? The detailed design can be seen in #1936.
) -> Result<Option<SchedulerServerEvent>> {
    let mut available_executors = self.state.get_available_executors_data();
async fn offer_resources(&self, n: u32) -> Result<Option<SchedulerServerEvent>> {
    let mut available_executors = if self.is_test {
IMO, setting flags specific to tests in application logic is usually an anti-pattern: it makes the code harder to read and adds unnecessary runtime overhead. If I understand it correctly, the main problem is that we don't have executor heartbeats in tests, causing get_alive_executors to return an empty set? If so, I would recommend adding a helper method to set artificial executor heartbeats directly from tests, so we can keep the core logic agnostic to tests.
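For illustration, a minimal sketch of what such a test-only helper could look like. The names here (SchedulerStateSketch, executors_heartbeat) are placeholders for this sketch, not the actual Ballista types:

use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{SystemTime, UNIX_EPOCH};

// Trimmed-down stand-in for the scheduler state: only the heartbeat map matters here.
struct SchedulerStateSketch {
    executors_heartbeat: RwLock<HashMap<String, u64>>,
}

impl SchedulerStateSketch {
    // Test-only helper: record an artificial heartbeat "now" for the executor,
    // so liveness checks treat it as alive without running a real heartbeat loop.
    #[cfg(test)]
    fn set_artificial_heartbeat(&self, executor_id: &str) {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system time before UNIX epoch")
            .as_secs();
        self.executors_heartbeat
            .write()
            .unwrap()
            .insert(executor_id.to_owned(), now);
    }
}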
It's not just the heartbeat issue. For push-based task scheduling, the scheduler will launch tasks to the corresponding executors, which need to have their gRPC service started up. It's a bit too heavy for a unit test.
Could we use a cfg to achieve the same result? So, something like:
#[cfg(not(test))]
pub(crate) fn get_available_executors_data(&self) -> Vec<ExecutorData> {
let mut res = {
let alive_executors = self.get_alive_executors_within_one_minute();
let executors_data = self.executors_data.read();
executors_data
.iter()
.filter_map(|(exec, data)| {
(data.available_task_slots > 0 && alive_executors.contains(exec))
.then(|| data.clone())
})
.collect::<Vec<ExecutorData>>()
};
res.sort_by(|a, b| Ord::cmp(&b.available_task_slots, &a.available_task_slots));
res
}
#[cfg(test)]
pub(crate) fn get_available_executors_data(&self) -> Vec<ExecutorData> {
let mut res: Vec<ExecutorData> =
self.executors_data.read().values().cloned().collect();
res.sort_by(|a, b| Ord::cmp(&b.available_task_slots, &a.available_task_slots));
res
}
and then
#[cfg(not(test))]
if num_tasks > 0 {
self.launch_tasks(&executors_delta_data, tasks_assigment)
.await?;
}
This still makes the code slightly harder to read but it removes any runtime overhead and also localizes the additional flags to two places.
With this PR, the load-testing performance is improved a lot. Cluster: one scheduler + one executor with 4 task slots. Benchmark of load testing:
It seems there's some bug in the master branch when doing the load testing for push-based task scheduling. Will check it later. It's fixed by #2006.
Single Query Performance
As the comparison standard, we need to know the single-query performance.
Have been travelling, but I will spend time reviewing this week.
Hi @houqp, @thinkharderdev, @alamb, @yjshen, I've just fixed the bug in the master branch's push-based task scheduling in #2006. Could you help review that PR? Now we can have a full view of the benchmark results.
self.state
    .executor_manager
    .update_executor_data(executor_delta_data);
// TODO check whether launching task is successful or not
client.launch_task(LaunchTaskParams { task: tasks }).await?;
Should this be in the reverse order? It seems like a potential issue: if launching the task fails, the executor's task slots will never be freed.
Doing it in reverse (launch tasks -> update state) could also be a problem, so maybe a write lock needs to be acquired before launching tasks?
Hi @thinkharderdev, I didn't think too much about the error handling here previously. I think your suggestion of doing it in reverse makes sense.
Before updating the executor data, the task statuses have already been updated to Running. Whether the task launch succeeds or not, these tasks will be left dangling in the Running status. It's better for us to leverage speculative task execution in the future, and then we don't need to worry too much about the case where only some of the tasks launch successfully.
Why do you think we need a write lock?
It was just a thought; I'm not sure whether the performance cost would be justified. But the idea is that updating the executor state and launching the tasks should, in principle, be an atomic transaction. If you update first and then the launch fails, the resources will never be released; but if you launch first and then update state, you have a concurrency issue where another thread may try to launch tasks after the tasks are launched and before the state is updated, so you end up overcommitted.
The second issue seems less of a problem because it seems pretty straightforward to just have the executor reject the task if it has no available task slots, so it may be enough to just reverse the order here (launch tasks -> update state) and let the existing error-handling mechanisms deal with it. Since launching requires a network call, holding a write lock on the entire state for that operation could be a major performance bottleneck, so if we did need a lock it would probably have to be at the individual executor level.
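To make that ordering concrete, here is a rough, hypothetical sketch (simplified types, not the real scheduler API) of "launch first, then update state", where scheduler-side slot accounting is only adjusted for launches that actually succeeded:

use std::collections::HashMap;

// Placeholder types for this sketch; the real scheduler/executor types differ.
struct Executor { id: String, available_slots: u32 }
struct Task;

// Pretend launch RPC; a real implementation would call the executor's gRPC service.
async fn launch_on_executor(_executor: &Executor, _tasks: &[Task]) -> Result<(), String> {
    Ok(())
}

// Launch first, then update the scheduler-side slot map only for executors whose
// launch succeeded, so a failed launch never leaks task slots.
async fn launch_then_update(
    assignments: Vec<(Executor, Vec<Task>)>,
    slots: &mut HashMap<String, u32>,
) {
    for (executor, tasks) in assignments {
        match launch_on_executor(&executor, &tasks).await {
            Ok(()) => {
                let entry = slots
                    .entry(executor.id.clone())
                    .or_insert(executor.available_slots);
                // Reflect consumed slots only after a successful launch.
                *entry = entry.saturating_sub(tasks.len() as u32);
            }
            Err(e) => {
                // Failed launch: leave slots untouched and let the normal
                // task-status handling deal with the affected tasks.
                eprintln!("failed to launch tasks on {}: {}", executor.id, e);
            }
        }
    }
}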
Thanks @thinkharderdev. Actually, with the current design, the method launch_tasks will only be invoked on the single consumer side of the channel, which means there will be no concurrency issue.
However, as you mentioned, the executor's available task slots on the scheduler side are just a kind of guidance for task scheduling. The real source of truth is on the executor side.
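As a small illustration of "the executor is the source of truth" (hypothetical types, not the real executor code), the executor can simply refuse a launch request when it has no free slots left, regardless of what the scheduler believed:

use std::sync::atomic::{AtomicU32, Ordering};

// Placeholder for an executor's local slot accounting.
struct LocalSlots {
    free: AtomicU32,
}

impl LocalSlots {
    // Try to reserve `n` slots atomically; returns false (reject the launch)
    // when fewer than `n` slots are actually free on this executor.
    fn try_reserve(&self, n: u32) -> bool {
        self.free
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |free| free.checked_sub(n))
            .is_ok()
    }
}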
Need to spend some more time to understand it fully, but I had a couple of comments based on a quick review.
Thanks @thinkharderdev for your suggestion. I used cfg at the beginning; however, I didn't know whether that way would be acceptable or not. If we all prefer it, I'll change it back to the cfg way. By the way, if we use cfg, how do we avoid the warning messages from cargo test, like "warning: unused variable: …"?
I think the ExecutionContext should not be a member of QueryStageScheduler, since we will support session contexts soon.
Yeah, I'm relatively new to this project (and Rust in general), so I'm not sure whether this is an acceptable approach either, but it seems better to me for three reasons:
For the warnings, I think we can use …
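Purely as an illustration (an assumption about one possible approach, not necessarily the one meant above), a targeted allow attribute applied only to test builds can silence such warnings:

// Illustrative only: when the real use of a variable sits behind #[cfg(not(test))],
// the test build can silence the resulting "unused variable" warning with a
// conditional allow on the surrounding function.
#[cfg_attr(test, allow(unused_variables))]
fn schedule(num_tasks: usize, executors_delta_data: Vec<u32>) {
    #[cfg(not(test))]
    if num_tasks > 0 {
        println!("launching with {:?}", executors_delta_data);
    }
}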
Thanks @mingmwang. It makes sense not to include the context in the QueryStageScheduler. Will refactor it.
Thanks @thinkharderdev. Just fixed as you suggested.
Hi @thinkharderdev, @houqp, @yjshen and @alamb, are there any more concerns about this PR? If not, how about merging it first and refining it step by step in the future?
LGTM
Very nice benchmark result @yahoNanJing! I suggest we wait for @thinkharderdev's review since he was actively reviewing the PR.
Awesome work @yahoNanJing!
Looks good to me. I'm cool with merging this and then iterating on any issues that may come up.
Thanks @yahoNanJing @houqp @thinkharderdev @mingmwang!
Which issue does this PR close?
Closes #1936, #1704.
Rationale for this change
Previously, the task-fetching time complexity was O(n^2). By introducing stage-based scheduling, it is reduced to O(n). It will also make task error recovery and stage priority-based scheduling easier in the future.
What's more, with this PR the performance of pull-staged task scheduling is also improved a lot, since there is no need to check the status of all of the tasks in the whole system.
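As a rough illustration of the idea behind stage-based task fetching (simplified, hypothetical types rather than the actual StageManager API): instead of scanning every task of every job on each scheduling round, the scheduler only walks the pending tasks of stages that are currently runnable:

use std::collections::HashMap;

// Simplified stand-ins for the real job/stage/task structures.
struct TaskRef {
    job_id: String,
    stage_id: u32,
    partition: u32,
}

struct Stage {
    // Tasks of this stage that have not been scheduled yet.
    pending_tasks: Vec<TaskRef>,
}

// Only stages whose inputs are complete are kept in `running_stages`, so fetching
// schedulable tasks is a single pass over those stages rather than a scan over
// every task in the system.
struct StageManagerSketch {
    running_stages: HashMap<(String, u32), Stage>,
}

impl StageManagerSketch {
    // Pop up to `max_tasks` schedulable tasks from the currently running stages.
    fn fetch_schedulable_tasks(&mut self, max_tasks: usize) -> Vec<TaskRef> {
        let mut picked = Vec::with_capacity(max_tasks);
        for stage in self.running_stages.values_mut() {
            while picked.len() < max_tasks {
                match stage.pending_tasks.pop() {
                    Some(task) => picked.push(task),
                    None => break,
                }
            }
            if picked.len() == max_tasks {
                break;
            }
        }
        picked
    }
}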
What changes are included in this PR?
See #1936 for details.
Are there any user-facing changes?
It's blocked by #1934.