Support for multi-scheduler deployments #59
Conversation
string job_id = 1;
string session_id = 2;
JobStatus status = 3;
repeated ExecutionGraphStage stages = 4;
It isn't clear to me how these stages form a DAG. Are the dependencies between stages stored elsewhere?
No, sorry, I should have explained that. I'll add better docs to this struct, but for now each stage has an output_link: Option<usize> which specifies where it sends its output. If output_link is None then the stage is final and it sends its output to the ExecutionGraph's output_locations. Likewise, each stage has an inputs: HashMap<usize, StageOutput> which "collects" input locations from its input stages.
Got it. Yes, some comments in the structs here would be great.
I ran the integration tests (…)
@thinkharderdev wow, a lot of changes 👍. I have some questions:
@@ -59,15 +58,14 @@ impl DistributedPlanner {
     /// Returns a vector of ExecutionPlans, where the root node is a [ShuffleWriterExec].
     /// Plans that depend on the input of other plans will have leaf nodes of type [UnresolvedShuffleExec].
     /// A [ShuffleWriterExec] is created whenever the partitioning changes.
-    pub async fn plan_query_stages<'a>(
+    pub fn plan_query_stages<'a>(
Is there some reason to remove the async? 🤔 I think there is some I/O work in plan_query_stages, like saving status in the DB.
I think it was doing that at one time, but the implementation now is not doing IO, so I changed it back to sync.
No, it is one-to-multi (or really multi-to-multi in principle). The executor available slots are stored in the state backend (either etcd or Sled). But when an executor publishes a task status update to the scheduler, the particular scheduler which receives that request can re-assign the executor slots without reading the backend state, because the slot hasn't been returned to the pool yet. The session issue is interesting. For right now, we just save the session properties in the backend state and whenever we need to get a …
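To make the slot-reuse idea concrete, here is a hedged sketch (function and parameter names are hypothetical, not the PR's actual code): on a status update the scheduler tries to fill the freed slot directly, and only returns it to the shared pool if nothing is runnable.

```rust
// Hypothetical sketch of the slot-reuse flow described above.
fn on_task_status_update(
    executor_id: &str,
    // Stand-in for looking up a task that is ready to run for this executor.
    next_ready_task: impl FnOnce() -> Option<String>,
    // Stand-in for returning the slot to the shared pool in etcd/Sled.
    return_slot_to_pool: impl FnOnce(&str),
) {
    match next_ready_task() {
        // Re-assign the slot locally; the backend state is never read here
        // because the slot was never returned to the pool.
        Some(task_id) => println!("launching task {task_id} on executor {executor_id}"),
        // No runnable task found: only now give the slot back to the pool.
        None => return_slot_to_pool(executor_id),
    }
}

fn main() {
    on_task_status_update(
        "executor-1",
        || Some("job-a/stage-2/partition-0".to_string()),
        |id| println!("returning slot on {id} to the pool"),
    );
}
```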
Thanks @yahoNanJing for the review.
Hi @thinkharderdev,
Maybe I did not make my point clear. My point is that for one job, there will always be only one active scheduler, with the others on standby. However, different jobs can have their own specific active scheduler. For the physical plan generation for a query, I think it's dealt with in one scheduler, right?
Sorry, yes I think that makes sense. If each job was "owned" by a single scheduler then we could avoid a lot of overhead and synchronization on the DAG state.
@yahoNanJing Made a small change in the event loop. It will now eagerly attempt to schedule additional pending tasks for a job on update. I think this should address your point from 1 above.
Hi @thinkharderdev, just left a few comments.
Hi @yahoNanJing I don't see any comments. Did you submit them?
Hi @thinkharderdev, it's alongside the code.
I don't see any comments :)
@andygrove @yahoNanJing What do we think of this PR?
Thanks @thinkharderdev. Sorry for the late response. I think we can merge this PR first to avoid dealing with the conflicts repeatedly. Then we can continue refining step by step.
Thank you @thinkharderdev and @yahoNanJing !
@thinkharderdev Looks like there is a merge conflict that needs resolving.
Scheduler fixes
/// Map from stage ID -> List of child stage IDs
stage_dependencies: HashMap<usize, Vec<usize>>,
/// Map from Stage ID -> output link
output_links: HashMap<usize, usize>,
One stage may be the input of multiple sub-stages. Therefore, the data structure of output_links should be HashMap<usize, Vec<usize>>.
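For illustration, the suggested change would look roughly like this (type and method names here are illustrative, not code from the PR):

```rust
use std::collections::HashMap;

// Sketch of the reviewer's suggestion: since one stage's output may be read
// by several sub-stages, the value side of output_links becomes a list.
#[derive(Default)]
struct StageLinks {
    // Map from stage ID -> IDs of the stages that consume its output.
    output_links: HashMap<usize, Vec<usize>>,
}

impl StageLinks {
    // Record that `consumer` reads the output of `producer`.
    fn add_link(&mut self, producer: usize, consumer: usize) {
        self.output_links.entry(producer).or_default().push(consumer);
    }
}

fn main() {
    let mut links = StageLinks::default();
    // Stage 1 feeds both stage 2 and stage 3.
    links.add_link(1, 2);
    links.add_link(1, 3);
    println!("{:?}", links.output_links);
}
```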
decode_into::<protobuf::ExecutorData, ExecutorData>(&value)?;
let take = std::cmp::min(data.available_task_slots, desired);

for _ in 0..take {
It's a different policy from the previous round-robin one. Not sure whether it's better for tasks to be scheduled evenly across the executors?
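For comparison, a round-robin reservation pass over the executors, which spreads tasks evenly instead of draining one executor's slots first, could be sketched like this (ExecutorData is trimmed down to the two fields relevant here; this is not the previous implementation verbatim):

```rust
// Trimmed-down stand-in for the executor metadata in the snippet above.
struct ExecutorData {
    executor_id: String,
    available_task_slots: u32,
}

// Reserve up to `desired` slots one executor at a time, cycling over the
// executors so tasks are spread evenly rather than filling one executor first.
fn reserve_round_robin(executors: &mut [ExecutorData], desired: u32) -> Vec<String> {
    let mut reserved = Vec::new();
    while (reserved.len() as u32) < desired {
        let mut progressed = false;
        for executor in executors.iter_mut() {
            if (reserved.len() as u32) >= desired {
                break;
            }
            if executor.available_task_slots > 0 {
                executor.available_task_slots -= 1;
                reserved.push(executor.executor_id.clone());
                progressed = true;
            }
        }
        // No executor had a free slot on this pass, so stop rather than spin.
        if !progressed {
            break;
        }
    }
    reserved
}

fn main() {
    let mut executors = vec![
        ExecutorData { executor_id: "e1".into(), available_task_slots: 4 },
        ExecutorData { executor_id: "e2".into(), available_task_slots: 4 },
    ];
    // With desired = 3 this yields e1, e2, e1 instead of e1, e1, e1.
    println!("{:?}", reserve_round_robin(&mut executors, 3));
}
```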
@@ -105,9 +105,14 @@ impl<T: 'static + AsLogicalPlan, U: 'static + AsExecutionPlan>
                .executor_manager
                .cancel_reservations(free_list)
                .await?;
            Ok(None)
        } else if pending_tasks > 0 {
Why do we need the else here? ❓
    .state
    .task_manager
    .fill_reservations(&reservations)
    .await
{
-    Ok((assignments, mut unassigned_reservations)) => {
+    Ok((assignments, mut unassigned_reservations, pending_tasks)) => {
        for (executor_id, task) in assignments.into_iter() {
Would it be better to group the tasks for the same executor so that they can be launched only once?
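A small sketch of the grouping this comment suggests, so that each executor receives a single launch request containing all of its assigned tasks (the task type is a placeholder):

```rust
use std::collections::HashMap;

// Placeholder for whatever task definition is launched on an executor.
type TaskDefinition = String;

// Group (executor_id, task) assignments by executor so one launch request
// per executor can carry all of that executor's tasks.
fn group_by_executor(
    assignments: Vec<(String, TaskDefinition)>,
) -> HashMap<String, Vec<TaskDefinition>> {
    let mut grouped: HashMap<String, Vec<TaskDefinition>> = HashMap::new();
    for (executor_id, task) in assignments {
        grouped.entry(executor_id).or_default().push(task);
    }
    grouped
}

fn main() {
    let assignments = vec![
        ("e1".to_string(), "job-a/1/0".to_string()),
        ("e1".to_string(), "job-a/1/1".to_string()),
        ("e2".to_string(), "job-a/1/2".to_string()),
    ];
    // e1 gets two tasks in one batch, e2 gets one.
    println!("{:?}", group_by_executor(assignments));
}
```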
Which issue does this PR close?
Closes #39
Posting this draft PR for review and feedback. There are some more TODO items still in progress (mostly around cleanup and unit test coverage), but this change passes the integration tests as is, so it should be ready for a "test-drive" now.
TODO (before merging)
- Rework ExecuteQuery so we don't do all the planning as part of the gRPC handler (I changed this to simplify things while testing but need to revert back to the old implementation where we do the planning in an async task in the event loop)

Rationale for this change
See #39 for a complete description, but this change addresses the following issues:
What changes are included in this PR?
This is quite a large refactor. See the key points below:
Event Loop
The core event loop remains mostly unchanged. The biggest change is that stage scheduling has been mostly internalized by the ExecutionGraph, so we only publish JobSubmitted and JobCompleted query stage events. Likewise, the actual structure of SchedulerServerEvents is changed (see details below).
ExecutionGraph
In the current implementation, the state of a job is spread across a number of different data structures, so it would be difficult to move to external state in the current form, especially in light of requirements around distributed locks.
In this change, the TaskScheduler and StageManager have been replaced with an ExecutionGraph data structure which is serializable (so it can be persisted in the state backend easily) and internalizes both the underlying execution plan and the dependencies between query stages. This data structure internalizes the order of execution and availability of tasks and presents an interface that allows the scheduler to see a job as a stream of discrete tasks, which I believe makes the logic in the scheduler loop more straightforward.
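To illustrate the "stream of discrete tasks" idea, here is a hedged sketch of the kind of interface such a structure might expose; the method and field names are assumptions, not the exact API introduced in this PR:

```rust
use std::collections::HashMap;

// Placeholders standing in for the real stage and task definitions.
struct ExecutionStage;
#[derive(Debug)]
struct TaskDefinition {
    stage_id: usize,
    partition: usize,
}

// Sketch of a serializable, job-level DAG that the scheduler can persist in
// the state backend and drive as a stream of runnable tasks.
struct ExecutionGraph {
    job_id: String,
    // Map from stage ID -> stage (plan plus input/output bookkeeping).
    stages: HashMap<usize, ExecutionStage>,
    // Tasks whose inputs are all available and can run right now.
    runnable: Vec<TaskDefinition>,
}

impl ExecutionGraph {
    // How many tasks are currently ready to run.
    fn available_tasks(&self) -> usize {
        self.runnable.len()
    }

    // Pop the next runnable task, if any; the scheduler loop just keeps
    // asking for tasks without needing to reason about stage boundaries.
    fn pop_next_task(&mut self) -> Option<TaskDefinition> {
        self.runnable.pop()
    }
}

fn main() {
    let mut graph = ExecutionGraph {
        job_id: "job-a".to_string(),
        stages: HashMap::new(),
        runnable: vec![TaskDefinition { stage_id: 1, partition: 0 }],
    };
    println!("{}: {} task(s) ready", graph.job_id, graph.available_tasks());
    println!("next: {:?}", graph.pop_next_task());
}
```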
ExecutorReservation
This data structure represents a reserved task slot on a given executor which may optionally be tied to a specific job ID. The idea here is that to avoid lock contention on the backend data store, we "hang on" to task slots through the scheduler loop instead of immediately returning them to the pool. So when a scheduler gets a task status update, it has a new task slot reservation that it can try to fill and will only return that task slot to the pool if it cannot find a task that is ready.
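A minimal sketch of what such a reservation might carry (constructor names are illustrative):

```rust
// A reserved task slot on one executor. If job_id is set, the scheduler
// tries to fill the reservation with a task from that job first, and only
// returns the slot to the pool if no task is ready.
#[derive(Clone, Debug)]
struct ExecutorReservation {
    executor_id: String,
    job_id: Option<String>,
}

impl ExecutorReservation {
    // A slot not yet tied to any job.
    fn free(executor_id: String) -> Self {
        Self { executor_id, job_id: None }
    }

    // A slot earmarked for a specific job, e.g. after one of its tasks
    // completed on this executor.
    fn assigned(executor_id: String, job_id: String) -> Self {
        Self { executor_id, job_id: Some(job_id) }
    }
}

fn main() {
    let r1 = ExecutorReservation::free("executor-1".to_string());
    let r2 = ExecutorReservation::assigned("executor-2".to_string(), "job-a".to_string());
    println!("{r1:?} {r2:?}");
}
```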
TaskManager
The TaskManager encapsulates task/job management within the scheduler. The two most important things the TaskManager does are:
- fill_reservations, which will take a list of ExecutorReservations and try to assign a task to each one (with preference given to the reservation's job_id if present). See the doc string for details about the implementation.
- update_task_statuses, which will apply task status updates received from the executors and return a list of QueryStageSchedulerEvents along with a list of ExecutorReservations to be filled by the scheduler. See the doc string for details about the implementation.
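Roughly, the shape of those two entry points (signatures simplified and synchronous here; the real methods are async and use the crate's own types, so treat this purely as a sketch):

```rust
// Simplified stand-ins for the types referenced above.
#[derive(Clone)]
struct ExecutorReservation {
    executor_id: String,
    job_id: Option<String>,
}
struct TaskDefinition;
struct TaskStatus;
enum QueryStageSchedulerEvent {
    JobSubmitted(String),
    JobCompleted(String),
}

struct TaskManager;

impl TaskManager {
    // Try to assign a task to each reservation, preferring the reservation's
    // job_id when present. Returns the (executor_id, task) assignments, the
    // reservations that could not be filled, and the number of still-pending tasks.
    fn fill_reservations(
        &self,
        _reservations: &[ExecutorReservation],
    ) -> (Vec<(String, TaskDefinition)>, Vec<ExecutorReservation>, usize) {
        unimplemented!("sketch only: see the doc string on the real method")
    }

    // Apply task status updates from executors, returning the query stage
    // events they trigger plus new reservations for the freed slots.
    fn update_task_statuses(
        &self,
        _statuses: Vec<TaskStatus>,
    ) -> (Vec<QueryStageSchedulerEvent>, Vec<ExecutorReservation>) {
        unimplemented!("sketch only: see the doc string on the real method")
    }
}
```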
StateBackendClient
I've changed this trait slightly to help with this use case:
- Keyspace, which is just a namespace for keys (we already use "namespace" for something else, so I didn't want to overload the term). This mostly encodes a desired structure for the storage layer, as Keyspace is an enum, and helps remove boilerplate elsewhere in the codebase.
- A lock method scoped to a Keyspace and key so we can lock individual resources. Using a single global mutex on the state is probably not scalable.
- A scan method which will return all key/values in a particular keyspace (with an optional limit).
- scan_keys, which will do the same as scan but only return keys.
- A put_txn method which allows atomically updating multiple key/values. Since we do batch updates in many places, this simplifies error handling. It is implemented in both Sled (using batch operations) and etcd (using transactions).
- mv (because move is a reserved word in Rust :)) which will atomically move a key from one keyspace to another.
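A rough, synchronous sketch of what the revised trait surface looks like after these additions; the Keyspace variants and return types here are illustrative, and the real trait is async with the crate's own error and lock types:

```rust
// Example keyspaces; the real enum's variants may differ.
#[derive(Clone, Copy, Debug)]
enum Keyspace {
    Executors,
    ActiveJobs,
    CompletedJobs,
}

// Placeholder for the guard returned by lock(); dropping it releases the lock.
struct LockGuard;

// Simplified sketch of the revised StateBackendClient surface.
trait StateBackendClient {
    // Lock a single (keyspace, key) pair instead of taking a global mutex
    // over all scheduler state.
    fn lock(&self, keyspace: Keyspace, key: &str) -> LockGuard;

    // Return all key/value pairs in a keyspace, with an optional limit.
    fn scan(&self, keyspace: Keyspace, limit: Option<usize>) -> Vec<(String, Vec<u8>)>;

    // Like scan, but return only the keys.
    fn scan_keys(&self, keyspace: Keyspace) -> Vec<String>;

    // Atomically put multiple key/value pairs (Sled batch / etcd transaction).
    fn put_txn(&self, ops: Vec<(Keyspace, String, Vec<u8>)>) -> Result<(), String>;

    // Atomically move a key from one keyspace to another ("move" is a
    // reserved word in Rust, hence "mv").
    fn mv(&self, from: Keyspace, to: Keyspace, key: &str) -> Result<(), String>;
}
```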
The Scheduler Flow
Putting this all together, the conceptual flow in the scheduler works like so:
Push Scheduling
- TaskManager::update_task_statuses is called to apply updates and get back a set of ExecutorReservations.
- A SchedulerServerEvent::Offer event is published with these reservations.
- TaskManager::fill_reservations is called to attempt to assign tasks to each reservation, giving priority to jobs which were being updated.

When a new job is submitted, we will try and reserve task slots up to the number of tasks in the job's initial stage and launch them.
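A compact sketch of the push path end to end, with closures standing in for TaskManager::update_task_statuses and the event publication (all names here are stand-ins, not the scheduler's actual signatures):

```rust
// Illustrative payload of a task status update from an executor.
struct TaskStatus {
    executor_id: String,
    job_id: String,
}

// Sketch of the push path: status updates become reservations, and the
// reservations are offered to the event loop to be filled.
fn handle_status_updates(
    updates: Vec<TaskStatus>,
    // Stand-in for TaskManager::update_task_statuses: returns one
    // (executor_id, job_id) reservation per freed slot.
    update_task_statuses: impl FnOnce(Vec<TaskStatus>) -> Vec<(String, Option<String>)>,
    // Stand-in for publishing SchedulerServerEvent::Offer; the event loop
    // then calls TaskManager::fill_reservations with these reservations.
    publish_offer: impl FnOnce(Vec<(String, Option<String>)>),
) {
    let reservations = update_task_statuses(updates);
    publish_offer(reservations);
}

fn main() {
    let updates = vec![TaskStatus { executor_id: "e1".into(), job_id: "job-a".into() }];
    handle_status_updates(
        updates,
        |statuses| statuses.into_iter().map(|s| (s.executor_id, Some(s.job_id))).collect(),
        |reservations| println!("offering {} reservation(s) to the event loop", reservations.len()),
    );
}
```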
Pull Scheduling
- The executor sends a PollWorkRequest.
- The scheduler calls TaskManager::update_task_statuses.
- A reservation is created and TaskManager::fill_reservations is called to try and fill it.
- A PollWorkResponse is returned with the task that was assigned (if any).
Are there any user-facing changes?
This change is mostly to the internal mechanics of the scheduler. The only user-facing change is to the StateBackendClient trait.

Yes, the StateBackendClient contract is changed.