Refactor scheduler state with different management policy for volatile and stable states #1810
Conversation
.receive_heart_beat(HeartBeatParams {
    executor_id: self.executor.metadata.id.clone(),
    state: Some(self.get_executor_state().await.into()),
})
.await
Maybe it would be good to rename the method to heart_beat_from_executor().
Agree. It's clearer than before.
async fn get_executor_state(&self) -> SExecutorState {
    SExecutorState {
        available_memory_size: u64::MAX,
    }
}
Why is this method async?
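For illustration, a sketch of the reviewer's point (with SExecutorState reduced to the single field shown in the snippet above): since the body does not await anything yet, the method could be a plain synchronous fn.

```rust
// Sketch only: struct simplified to the one field used in the diff above.
struct SExecutorState {
    available_memory_size: u64,
}

// With no .await in the body, async buys nothing here; a plain fn would do.
fn get_executor_state() -> SExecutorState {
    SExecutorState {
        available_memory_size: u64::MAX,
    }
}

fn main() {
    let state = get_executor_state();
    println!("available memory: {}", state.available_memory_size);
}
```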
        tokio_stream::wrappers::TcpListenerStream::new(listener),
    ),
);

let executor_meta = ExecutorRegistration {
    id: Uuid::new_v4().to_string(), // assign this executor a unique ID
    optional_host: Some(OptionalHost::Host("localhost".to_string())),
    port: addr.port() as u32,
    // TODO Make it configurable
    grpc_port: 50020,
Please do not hard-code the grpc_port; please make it configurable, otherwise we will be unable to start up multiple executor instances on one node.
It's mainly for unit tests. For production execution, it's already configurable.
Note that this PR doesn't introduce the hard-coded port -- perhaps it is worth a ticket to track making the value configurable.
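One common way to avoid a fixed test port, sketched here as a suggestion rather than existing Ballista code (bind_free_port is an illustrative helper): bind to port 0 so the OS assigns a free port, which also allows several executor instances to run on the same node during tests.

```rust
use tokio::net::TcpListener;

// Bind to port 0 so the OS picks a free port; return both the listener and
// the chosen port so it can be passed on (e.g. as the executor's grpc_port).
async fn bind_free_port() -> std::io::Result<(TcpListener, u16)> {
    let listener = TcpListener::bind("localhost:0").await?;
    let port = listener.local_addr()?.port();
    Ok((listener, port))
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let (_listener, port) = bind_free_port().await?;
    println!("OS-assigned free port: {}", port);
    Ok(())
}
```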
This doesn't actually compile currently. It looks like the CI build doesn't compile/test with the standalone feature, so I broke this with my PR from a few days ago.
I patched up the compile errors in #1839 -- looks like a test is also failing https://github.com/apache/arrow-datafusion/issues/1840
Hi @alamb and @thinkharderdev, should we fix the standalone unit test issue here or in another PR?
@yahoNanJing @alamb already pushed a PR (#1839) to address the compilation issue, so you're good.
    .into_iter()
    .filter(|e| alive_executors.contains_key(&e.executor_id))
    .collect();
let mut available_executors = self.state.get_available_executors_data();

// In case there are not enough resources, reschedule the tasks of the job
if available_executors.is_empty() {
Why spawn another future just to sleep? I think you can simply sleep in the schedule_job() method if there are not enough resources.
Agree. Since the inner sleep is async, it will not block the execution threads. There's no need to spawn another task for this.
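A minimal sketch of this suggestion (schedule_job and fetch_available_executors are illustrative names, not the real scheduler API): because tokio's sleep is async, the scheduling loop can back off inline without blocking a worker thread or spawning a separate future.

```rust
use std::time::Duration;

// Placeholder for the real lookup of executors with free task slots.
async fn fetch_available_executors() -> Vec<String> {
    vec!["executor-1".to_string()]
}

async fn schedule_job(job_id: String) {
    loop {
        let available_executors = fetch_available_executors().await;
        if available_executors.is_empty() {
            // Not enough resources right now: yield asynchronously and retry,
            // instead of spawning another future whose only job is to sleep.
            tokio::time::sleep(Duration::from_millis(100)).await;
            continue;
        }
        println!(
            "scheduling job {} on {} executor(s)",
            job_id,
            available_executors.len()
        );
        break;
    }
}

#[tokio::main]
async fn main() {
    schedule_job("job-1".to_string()).await;
}
```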
#[derive(Clone)]
pub(super) struct SchedulerState {
struct VolatileSchedulerState {
    executors_heartbeat: Arc<std::sync::RwLock<HashMap<String, ExecutorHeartbeat>>>,
Should we use RwLock from parking_lot?
Agree. It should be consistent with the other parts, and the performance may also be better.
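A sketch of what the switch could look like, assuming parking_lot is already a dependency and with the heartbeat value simplified to a timestamp (the real field holds the protobuf ExecutorHeartbeat): parking_lot's RwLock has no poisoning, so its guards come back directly rather than wrapped in a Result.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use parking_lot::RwLock;

#[derive(Clone, Default)]
struct VolatileSchedulerState {
    // Value simplified to a seconds-since-epoch timestamp for this sketch.
    executors_heartbeat: Arc<RwLock<HashMap<String, u64>>>,
}

impl VolatileSchedulerState {
    fn save_heartbeat(&self, executor_id: String, timestamp: u64) {
        // No .unwrap(): parking_lot locks cannot be poisoned.
        self.executors_heartbeat.write().insert(executor_id, timestamp);
    }

    fn last_heartbeat(&self, executor_id: &str) -> Option<u64> {
        self.executors_heartbeat.read().get(executor_id).copied()
    }
}

fn main() {
    let state = VolatileSchedulerState::default();
    state.save_heartbeat("executor-1".to_string(), 1_700_000_000);
    assert_eq!(state.last_heartbeat("executor-1"), Some(1_700_000_000));
}
```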

    // job -> stage -> partition
    tasks: Arc<std::sync::RwLock<HashMap<String, JobTasks>>>,
}
job -> stage -> task
let stage_tasks = job_tasks
    .entry(partition_id.stage_id)
    .or_insert_with(HashMap::new);
stage_tasks.insert(partition_id.partition_id, status.clone());
Maybe we should rename the partition_id to task_id.
Agree. It's better to rename the partition_id in protobuf::TaskStatus to task_id.
Since this PR has conflicts with the master code, I'll squash and merge first. Then I'll do the rebase.
Cache volatile state just in memory without storing them in db
Fix ut
Keep volatile state just in memory rather than store them in db
Cache stable state in memory
Fix ut
Fix for mingmwang's comments
Rename partition_id to task_id in protobuf::TaskStatus
Rename the names of the in-memory structs
Hi @yahoNanJing I will look at this tomorrow. Also FYI I think @houqp may be delayed in responding for a while.
I will look at this later.
I am not an expert in this code, and I can't say I understand all of the changes, but the basics made sense to me and the tests all still pass, so 👍 from my end.
If @liukun4515 is good with this change I think it can be merged.
cc @edrevo @andygrove @thinkharderdev and @realno in case you would like to review
Otherwise I am happy to merge this PR tomorrow
Thank you for the contributions @yahoNanJing, as well as your patience in the review process.
Also cc @Ted-Jiang (I am sorry for the wide set of cc's but I don't know which of these contributors are working / coordinating together and I want to keep the information flowing between you all).
    executors_data.get(executor_id).cloned()
}

/// There are too checks:
- /// There are too checks:
+ /// There are two checks:
tokio::spawn(async move {
    info!("Starting the scheduler state watcher");
    loop {
        let task_status = rx_task.recv().await.unwrap();
There are a few things here that may panic, which will terminate the thread. I am wondering how this is handled?
Thanks @realno. Agree. It's better to just print an error message in many cases. I will push a commit to enhance this error handling.
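A rough sketch of the direction discussed here (the String channel payload and handle_task_status are illustrative stand-ins for the real protobuf types and update logic): a closed channel ends the loop cleanly instead of panicking, and per-message failures are logged, so a single error cannot kill the whole watcher task.

```rust
use log::{error, info};
use tokio::sync::mpsc;

async fn handle_task_status(_status: String) -> Result<(), String> {
    // Placeholder for persisting the status and firing job-level updates.
    Ok(())
}

async fn run_state_watcher(mut rx_task: mpsc::Receiver<String>) {
    info!("Starting the scheduler state watcher");
    loop {
        match rx_task.recv().await {
            // All senders dropped: stop the watcher instead of panicking.
            None => {
                info!("Task status channel closed; stopping the state watcher");
                break;
            }
            Some(task_status) => {
                if let Err(e) = handle_task_status(task_status).await {
                    // Log and keep going rather than terminating the loop.
                    error!("Failed to handle task status update: {}", e);
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx_task, rx_task) = mpsc::channel(16);
    let watcher = tokio::spawn(run_state_watcher(rx_task));
    tx_task.send("task-1".to_string()).await.ok();
    drop(tx_task); // closing the channel lets the watcher exit
    watcher.await.ok();
}
```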
Left a small question, otherwise looks good. Thanks @yahoNanJing!
ballista/rust/scheduler/src/lib.rs
}
let tx_job = self.scheduler_env.as_ref().unwrap().tx_job.clone();
for job_id in jobs {
    tx_job.send(job_id).await.unwrap();
- tx_job.send(job_id).await.unwrap();
+ tx_job.send(job_id).await?;
Should we handle panic here?
Thanks @Ted-Jiang. Agree. It's better to use ? instead of directly using unwrap, which panics.
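For context, a sketch of what the `?` form needs around it (enqueue_jobs and the String error type are illustrative, not the real scheduler types): the channel's SendError has to be converted into the function's error type, e.g. with map_err, before `?` can propagate it instead of panicking.

```rust
use tokio::sync::mpsc;

async fn enqueue_jobs(tx_job: mpsc::Sender<String>, jobs: Vec<String>) -> Result<(), String> {
    for job_id in jobs {
        // Convert SendError into this function's error type so `?` applies.
        tx_job
            .send(job_id)
            .await
            .map_err(|e| format!("Failed to enqueue job: {}", e))?;
    }
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), String> {
    let (tx_job, mut rx_job) = mpsc::channel(4);
    enqueue_jobs(tx_job, vec!["job-1".to_string(), "job-2".to_string()]).await?;
    while let Some(job_id) = rx_job.recv().await {
        println!("received {}", job_id);
    }
    Ok(())
}
```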
Hi @realno, could you help review this error handling commit?
Given how large this PR is getting (and thus its potential for accumulating conflicts) I am going to merge it to main and we can keep iterating in future PRs. I looked at the error handling commit at 12b1c73 and it looked better than
Thank you everyone who helped review and thanks to @yahoNanJing for the contribution!
Looks good, thanks!
Which issue does this PR close?
Closes #1703.
Rationale for this change
See the detailed discussion in #1703.
What changes are included in this PR?
Classify the states in SchedulerState into two categories: VolatileSchedulerState and StableSchedulerState.
According to #1703, VolatileSchedulerState maintains the following states in memory:
And StableSchedulerState maintains the following states in both memory and the storage db:
The stable states are stored in both memory and the storage db. The in-memory copies serve as a cache for fast reads and to reduce deserialization cost.
The previous Watch in ConfigBackendClient for task status is no longer used. Instead, we leverage an event channel to drive job status updates. Later, with more state info such as pending tasks, the job status update can be handled explicitly so that the current cumbersome update implementation will no longer be needed.
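A condensed sketch of the split described above (field and value types beyond those shown in the diffs are illustrative; the real code uses the Ballista protobuf types and a config backend client): volatile state lives only in memory, while stable state is written through to the backend and also cached in memory for deserialization-free reads.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Volatile: rebuilt from heartbeats and task status updates after a restart.
#[derive(Clone, Default)]
struct VolatileSchedulerState {
    // executor_id -> last heartbeat timestamp (simplified from ExecutorHeartbeat)
    executors_heartbeat: Arc<RwLock<HashMap<String, u64>>>,
    // job -> stage -> task -> status (status simplified to a String)
    tasks: Arc<RwLock<HashMap<String, HashMap<usize, HashMap<usize, String>>>>>,
}

// Stable: persisted to the storage db; the in-memory map is only a read cache.
#[derive(Clone, Default)]
struct StableSchedulerState {
    // executor_id -> serialized executor metadata (illustrative value type)
    executors_metadata: Arc<RwLock<HashMap<String, Vec<u8>>>>,
}

impl StableSchedulerState {
    fn save_executor_metadata(&self, executor_id: String, metadata: Vec<u8>) {
        // In the real implementation this would also be written to the backend db;
        // only the in-memory cache side is sketched here.
        self.executors_metadata
            .write()
            .unwrap()
            .insert(executor_id, metadata);
    }
}

fn main() {
    let volatile = VolatileSchedulerState::default();
    volatile
        .executors_heartbeat
        .write()
        .unwrap()
        .insert("executor-1".to_string(), 1_700_000_000);

    let stable = StableSchedulerState::default();
    stable.save_executor_metadata("executor-1".to_string(), b"meta".to_vec());
}
```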