
Refactor scheduler state with different management policy for volatile and stable states #1810

Merged

2 commits merged into apache:master on Feb 16, 2022

Conversation

yahoNanJing
Contributor

Which issue does this PR close?

Closes #1703.

Rationale for this change

See the detailed discussion in #1703.

What changes are included in this PR?

Classify the states in the SchedulerState into two categories: VolatileSchedulerState and StableSchedulerState.
According to #1703, the VolatileSchedulerState maintains the following states in memory:

  • executor heartbeat
  • executor data, mainly for available resources
  • tasks

And StableSchedulerState maintains the following states in both memory and storage db:

  • executor metadata, mainly for the cluster topology info
  • jobs
  • stages

The stable states are stored in both memory and the storage db; the in-memory copies serve as a cache for fast reads and to reduce deserialization cost.

The previous Watch in ConfigBackendClient for task status is no longer used. Instead, we leverage an event channel to propagate job status updates. Later, with more state info such as pending tasks, the job status update can be handled explicitly, so the current cumbersome update implementation can be removed.
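The split described above can be sketched as follows. This is a minimal, std-only illustration: the names ExecutorHeartbeat, VolatileSchedulerState, and StableSchedulerState mirror the PR, but the fields and methods here are simplified stand-ins, not the actual implementation.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Simplified stand-in for the PR's heartbeat type.
#[derive(Clone, Debug)]
struct ExecutorHeartbeat {
    executor_id: String,
    timestamp: u64,
}

/// Volatile state lives only in memory: it is cheap to rebuild,
/// since executors re-register and re-send heartbeats after a restart.
#[derive(Clone, Default)]
struct VolatileSchedulerState {
    executors_heartbeat: Arc<RwLock<HashMap<String, ExecutorHeartbeat>>>,
}

/// Stable state is persisted to the storage db; the in-memory map is a
/// read cache that avoids repeated deserialization. The db handle is
/// omitted in this sketch.
#[derive(Clone, Default)]
struct StableSchedulerState {
    executors_metadata: Arc<RwLock<HashMap<String, String>>>,
}

impl VolatileSchedulerState {
    fn save_heartbeat(&self, hb: ExecutorHeartbeat) {
        // In-memory only: no write-through to the storage db.
        self.executors_heartbeat
            .write()
            .unwrap()
            .insert(hb.executor_id.clone(), hb);
    }

    fn get_heartbeat(&self, executor_id: &str) -> Option<ExecutorHeartbeat> {
        self.executors_heartbeat
            .read()
            .unwrap()
            .get(executor_id)
            .cloned()
    }
}

fn main() {
    let state = VolatileSchedulerState::default();
    state.save_heartbeat(ExecutorHeartbeat {
        executor_id: "exec-1".to_string(),
        timestamp: 1,
    });
    println!("{:?}", state.get_heartbeat("exec-1"));
    let _stable = StableSchedulerState::default(); // persisted state would live here
}
```

The design rationale is that heartbeats, available resources, and task status are safe to keep memory-only because they are naturally refreshed by the executors, while cluster topology, jobs, and stages must survive a scheduler restart.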

Comment on lines 155 to 167
.receive_heart_beat(HeartBeatParams {
executor_id: self.executor.metadata.id.clone(),
state: Some(self.get_executor_state().await.into()),
})
.await
Contributor

Maybe it would be good to rename the method to heart_beat_from_executor().

Contributor Author

Agree. It's clearer than before.

Comment on lines 206 to 227
async fn get_executor_state(&self) -> SExecutorState {
SExecutorState {
available_memory_size: u64::MAX,
}
}
Contributor

Why is this method async?

tokio_stream::wrappers::TcpListenerStream::new(listener),
),
);

let executor_meta = ExecutorRegistration {
id: Uuid::new_v4().to_string(), // assign this executor a unique ID
optional_host: Some(OptionalHost::Host("localhost".to_string())),
port: addr.port() as u32,
// TODO Make it configurable
grpc_port: 50020,
Contributor

Please do not hard-code the grpc_port; please make it configurable, otherwise we will be unable to start multiple executor instances on one node.

Contributor Author

It's mainly for unit tests. For production execution, it's already configurable.

Contributor

Note that this PR doesn't introduce the hard-coded port -- perhaps it is worth a ticket to track making the value configurable.

Contributor

This doesn't actually compile currently. Looks like the CI build doesn't compile/test with the standalone feature so I broke this with my PR from a few days ago.

Contributor

I patched up the compile errors in #1839 -- looks like a test is also failing https://github.com/apache/arrow-datafusion/issues/1840

Contributor Author

Hi @alamb and @thinkharderdev, should we fix the standalone unit test issue here or in another PR?

Contributor

@yahoNanJing @alamb already pushed a PR to address the compilation issue #1839 so you're good.

.into_iter()
.filter(|e| alive_executors.contains_key(&e.executor_id))
.collect();
let mut available_executors = self.state.get_available_executors_data();

// In case there are not enough resources, reschedule the tasks of the job
if available_executors.is_empty() {
Contributor

Why spawn another future just to sleep? I think you can simply sleep in the schedule_job() method if there are not enough resources.

Contributor Author

Agree. Since the inner sleep is async, it will not block the execution threads, so there's no need to spawn another task for this.
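The point can be illustrated with a small retry loop. This is a hypothetical, std-only sketch: schedule_job, the injected available_slots closure, and the retry bound are all illustrative, and std::thread::sleep stands in for the tokio::time::sleep(...).await the real async code would use (which yields to the runtime rather than blocking a thread).

```rust
use std::thread;
use std::time::Duration;

// When no executor has free slots, sleep in place and re-check instead
// of spawning a separate task just to wait.
fn schedule_job(
    mut available_slots: impl FnMut() -> usize,
    max_retries: u32,
) -> Result<usize, String> {
    for _ in 0..max_retries {
        let slots = available_slots();
        if slots > 0 {
            return Ok(slots);
        }
        // No resources yet: back off briefly and poll again.
        // Async version: tokio::time::sleep(...).await
        thread::sleep(Duration::from_millis(1));
    }
    Err("no executors with available task slots".to_string())
}

fn main() {
    // The injected closure models the cluster freeing up on the third poll.
    let mut polls = 0;
    let result = schedule_job(
        || {
            polls += 1;
            if polls < 3 { 0 } else { 4 }
        },
        10,
    );
    println!("{:?}", result);
}
```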

#[derive(Clone)]
pub(super) struct SchedulerState {
struct VolatileSchedulerState {
executors_heartbeat: Arc<std::sync::RwLock<HashMap<String, ExecutorHeartbeat>>>,
Contributor

Should we use RwLock from parking_lot?

Contributor Author

Agree. It should be consistent with the other parts, and the performance may also be better.
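For context on the suggestion: std::sync::RwLock can be poisoned if a thread panics while holding the guard, so read()/write() return a Result that must be handled at every call site, whereas parking_lot::RwLock has no poisoning (its write() returns the guard directly) and is generally faster. A small std-only sketch, where bump is a hypothetical helper:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Increment a named counter behind a std RwLock.
fn bump(counters: &RwLock<HashMap<String, u64>>, key: &str) -> u64 {
    // std: write() yields Result<Guard, PoisonError<_>>.
    // With parking_lot this line would just be `counters.write()`.
    let mut guard = counters.write().expect("lock poisoned");
    let v = guard.entry(key.to_string()).or_insert(0);
    *v += 1;
    *v
}

fn main() {
    let counters = RwLock::new(HashMap::new());
    bump(&counters, "heartbeats");
    println!("{}", bump(&counters, "heartbeats"));
}
```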


// job -> stage -> partition
tasks: Arc<std::sync::RwLock<HashMap<String, JobTasks>>>,
}
Contributor

job -> stage -> task

let stage_tasks = job_tasks
.entry(partition_id.stage_id)
.or_insert_with(HashMap::new);
stage_tasks.insert(partition_id.partition_id, status.clone());
Contributor

Maybe we should rename the partition_id to task_id.

Contributor Author

Agree. It's better to rename partition_id in protobuf::TaskStatus to task_id.

@yahoNanJing
Contributor Author

Since this PR has conflicts with the master code, I'll squash and merge first. Then I'll do the rebase.

Cache volatile state just in memory without storing them in db

Fix ut

Keep volatile state just in memory rather than store them in db

Cache stable state in memory

Fix ut

Fix for mingmwang's comments

Rename partition_id to task_id in protobuf::TaskStatus

Rename the names of the in-memory structs
@yahoNanJing
Contributor Author

Hi @alamb, @houqp, could you review this PR and give some comments?

@alamb
Contributor

alamb commented Feb 14, 2022

Hi @yahoNanJing I will look at this tomorrow. Also FYI I think @houqp may be delayed in responding for a while.

@liukun4515
Contributor

I will look at this later.

Contributor

@alamb alamb left a comment

I am not an expert in this code, and I can't say I understand all of the changes but the basics made sense to me and the tests all still pass, so 👍 from my end.

If @liukun4515 is good with this change I think it can be merged.

cc @edrevo @andygrove @thinkharderdev and @realno in case you would like to review

Otherwise I am happy to merge this PR tomorrow

@alamb
Contributor

alamb commented Feb 15, 2022

Thank you for the contributions @yahoNanJing as well as your patience in the review process

@alamb
Contributor

alamb commented Feb 15, 2022

Also cc @Ted-Jiang

(I am sorry for the wide set of cc's but I don't know which of these contributors are working / coordinating together and I want to keep the information flowing between you all)

executors_data.get(executor_id).cloned()
}

/// There are too checks:
Contributor

Suggested change
/// There are too checks:
/// There are two checks:

tokio::spawn(async move {
info!("Starting the scheduler state watcher");
loop {
let task_status = rx_task.recv().await.unwrap();
Contributor

There are a few things here that may panic, which would terminate the thread. I am wondering how this is handled?

Contributor Author

Thanks @realno. Agree. In many cases it's better to just log an error message. I will push a commit to improve this error handling.
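As an illustration of the non-panicking pattern, here is a std-only sketch using std::sync::mpsc; drain_statuses is a hypothetical helper, and the tokio version would instead match on the Option returned by recv().await.

```rust
use std::sync::mpsc;

// Watcher loop that does not unwrap() each receive: with std's mpsc,
// recv() returns Err once every sender has been dropped, and the loop
// exits cleanly instead of panicking.
fn drain_statuses(rx: mpsc::Receiver<String>) -> Vec<String> {
    let mut seen = Vec::new();
    loop {
        match rx.recv() {
            Ok(status) => seen.push(status),
            Err(_) => {
                // All senders are gone: log and stop instead of panicking.
                eprintln!("task status channel closed; stopping watcher");
                break;
            }
        }
    }
    seen
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send("running".to_string()).unwrap();
    tx.send("completed".to_string()).unwrap();
    drop(tx);
    println!("{:?}", drain_statuses(rx));
}
```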

@realno
Contributor

realno commented Feb 16, 2022

Left a small question, otherwise looks good. Thanks @yahoNanJing !

}
let tx_job = self.scheduler_env.as_ref().unwrap().tx_job.clone();
for job_id in jobs {
tx_job.send(job_id).await.unwrap();
Member

Suggested change
tx_job.send(job_id).await.unwrap();
tx_job.send(job_id).await?;

Member

Should we handle panic here?

Contributor Author

Thanks @Ted-Jiang. Agree. It's better to use ? instead of calling unwrap directly, which can panic.
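A minimal sketch of the suggested pattern, with a plain String standing in for the real error type and resubmit_jobs as a hypothetical helper:

```rust
use std::sync::mpsc::{channel, Sender};

// Propagate send failures with `?` instead of unwrap(): the SendError is
// mapped into the caller's error type and returned early, where unwrap()
// would have panicked and killed the surrounding task.
fn resubmit_jobs(tx_job: &Sender<String>, jobs: Vec<String>) -> Result<usize, String> {
    let mut sent = 0;
    for job_id in jobs {
        tx_job
            .send(job_id)
            .map_err(|e| format!("failed to queue job: {e}"))?;
        sent += 1;
    }
    Ok(sent)
}

fn main() {
    let (tx, _rx) = channel();
    println!("{:?}", resubmit_jobs(&tx, vec!["job-1".to_string()]));
}
```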

@yahoNanJing
Contributor Author

Hi @realno, could you help review this error handling commit?

@alamb
Contributor

alamb commented Feb 16, 2022

Given how large this PR is getting (and thus its potential for accumulating conflicts) I am going to merge it to main and we can keep iterating in future PRs.

I looked at the error handling commit at 12b1c73 and it looked better than unwrap() to me (though there is likely still significant room for UX improvement with error handling).

@alamb alamb merged commit 407adc0 into apache:master Feb 16, 2022
@alamb
Contributor

alamb commented Feb 16, 2022

Thank you everyone who helped review and thanks to @yahoNanJing for the contribution!

@realno
Contributor

realno commented Feb 16, 2022

> Hi @realno, could you help review this error handling commit?

Looks good, thanks!

Successfully merging this pull request may close these issues.

[Ballista] Support to better manage cluster state, like alive executors, executor available task slots, etc
8 participants