
Task level retry and Stage level retry #261

Merged · 10 commits · Oct 2, 2022

Conversation

@mingmwang (Contributor) commented Sep 21, 2022

Which issue does this PR close?

Closes #140.

Rationale for this change

What changes are included in this PR?

1. Task level retry
a) Add task attempt num; use a job-level unique task_id to represent the task.
b) Define a couple of failure reasons for FailedTask:

pub enum FailedReason {
        #[prost(message, tag="4")]
        ExecutionError(super::ExecutionError),
        #[prost(message, tag="5")]
        FetchPartitionError(super::FetchPartitionError),
        #[prost(message, tag="6")]
        IoError(super::IoError),
        #[prost(message, tag="7")]
        ExecutorLost(super::ExecutorLost),
        /// A successful task's result is lost due to executor lost
        #[prost(message, tag="8")]
        ResultLost(super::ResultLost),
        #[prost(message, tag="9")]
        TaskKilled(super::TaskKilled),
    }

c) Determine the real task failure reason when returning the TaskStatus to the Scheduler.
d) Based on the failure reason, the scheduler decides whether to reschedule the task and bumps the attempt number of the failed task (see the retry-decision sketch after this list).

2. Stage level retry and shuffle read failure handling
a) Add stage attempt num.
b) When there are shuffle partition fetch failures, the current running stage will be rolled back and the map stages will be resubmitted (see the rollback sketch below).
c) Handle delayed fetch-failure task updates.
d) Cancel the running tasks if the Scheduler decides to fail the stage/job.
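To make 1c/1d concrete, below is a minimal Rust sketch of such a retry decision. It is illustrative only: the simplified FailedReason mirrors the variants above without the prost annotations, and should_retry_task / max_task_failures are hypothetical names rather than the actual scheduler API.

enum FailedReason {
    ExecutionError,
    FetchPartitionError,
    IoError,
    ExecutorLost,
    ResultLost,
    TaskKilled,
}

fn should_retry_task(reason: &FailedReason, attempt: usize, max_task_failures: usize) -> bool {
    match reason {
        // Transient losses: reschedule on another executor with a bumped attempt number.
        FailedReason::IoError | FailedReason::ExecutorLost | FailedReason::ResultLost => {
            attempt < max_task_failures
        }
        // Fetch failures trigger a stage-level rollback instead of a plain task retry.
        FailedReason::FetchPartitionError => false,
        // In this sketch, execution errors and explicit kills are not retried.
        FailedReason::ExecutionError | FailedReason::TaskKilled => false,
    }
}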

The following two items are not covered in this PR, and we may need to revisit the related logic in the future:

  1. If the plans/expressions in the stage are not deterministic, we need to revisit the resubmit logic.
  2. If we have a map stage whose shuffle output can be reused by multiple reduce stages, we need to revisit the stage retry logic.
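And a similarly hedged sketch of the stage rollback in 2b; StageState and roll_back_for_fetch_failure are invented stand-ins for the real ExecutionGraph logic:

use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum StageState {
    Unresolved,
    Resolved,
    Running,
    Successful,
}

// On a shuffle fetch failure, roll the running reduce stage back to Unresolved,
// bump its stage attempt, and resubmit the map stage whose output was lost
// (Successful -> Running).
fn roll_back_for_fetch_failure(
    stages: &mut HashMap<usize, StageState>,
    stage_attempts: &mut HashMap<usize, usize>,
    running_stage_id: usize,
    map_stage_id: usize,
) {
    stages.insert(running_stage_id, StageState::Unresolved);
    *stage_attempts.entry(running_stage_id).or_insert(0) += 1;
    stages.insert(map_stage_id, StageState::Running);
    *stage_attempts.entry(map_stage_id).or_insert(0) += 1;
}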

Are there any user-facing changes?

@yahoNanJing marked this pull request as draft on September 22, 2022 19:15.
@mingmwang marked this pull request as ready for review on September 25, 2022 13:59.
@mingmwang (Contributor, Author):

@thinkharderdev @andygrove @yahoNanJing
Please help review my PR.

@mingmwang (Contributor, Author):

Added 10+ unit tests to cover the different cases.

@mingmwang changed the title from "Failed task retry" to "Task level retry and Stage level retry" on Sep 25, 2022.
@yahoNanJing (Contributor):

With this PR, the state machine for the stage becomes as follows:

[StageStateMachine diagram]

The transitions marked in red are newly added to deal with error-task recovery.
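Since the diagram is an image, here is a rough textual reconstruction as a Rust sketch, based only on the states and transitions described in this thread (the Failed state and any other transitions are omitted):

enum StageState { Unresolved, Resolved, Running, Successful }

// The newly added (red) recovery transitions, per the example below:
//   Running    -> Unresolved  (running stage hits shuffle fetch failures)
//   Successful -> Running     (map stage resubmitted to regenerate lost output)
//   Resolved   -> Unresolved  (downstream stage loses a ready input)
fn is_recovery_transition(from: StageState, to: StageState) -> bool {
    matches!(
        (from, to),
        (StageState::Running, StageState::Unresolved)
            | (StageState::Successful, StageState::Running)
            | (StageState::Resolved, StageState::Unresolved)
    )
}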

@yahoNanJing (Contributor) commented Sep 28, 2022:

For example, one SQL stage graph snapshot is as follows:

                       Stage 4 (Resolved)
                     ↗                    ↘
 Stage 1 (Successful)                       Stage 5 (Unresolved)
                     ↘                    ↗
                       Stage 2 (Successful) -> Stage 3 (Running)

Both Stage1 and Stage2 have output shuffle data residing on Executor1. Unluckily, Executor1 gets lost. Tasks for the running Stage3 will then fail because they cannot fetch shuffle data. For this kind of scenario and error, with this PR the ExecutionGraph will be able to continue running after a proper task reset.

  1. First, Stage3 will be converted from Running to Unresolved, and it will ask Stage2, which it depends on, to rerun the related tasks to prepare its input data.
  2. Then Stage2 will be converted from Successful to Running. Unluckily, the related rerun tasks for Stage2 will also fail because they cannot fetch shuffle data. Stage2 will then be converted from Running to Unresolved and will ask Stage1, which it depends on, to rerun the related tasks to prepare its input data.
  3. Then Stage1 will be converted from Successful to Running.
  4. Then Stage4, which depends on Stage1, will be converted from Resolved to Unresolved.

The stage graph snapshot will become as follows:

                       Stage 4 (Unresolved)
                     ↗                    ↘
 Stage 1 (Running)                          Stage 5 (Unresolved)
                     ↘                    ↗
                       Stage 2 (Unresolved) -> Stage 3 (Unresolved)

Once Stage1 finishes successfully, the graph snapshot will become as follows:

                       Stage 4 (Resolved)
                     ↗                    ↘
 Stage 1 (Successful)                       Stage 5 (Unresolved)
                     ↘                    ↗
                       Stage 2 (Resolved) -> Stage 3 (Unresolved)

...

The purpose of introducing the task attempt and stage attempt is as follows:

  • Stage attempt:
    For a running stage, all of the stored task statuses should belong to the same stage attempt. Task status updates carrying an older attempt number will be ignored.
  • Task attempt:
    Some errors, like shuffle write IO errors, are retryable. It is feasible to reschedule such a task to another executor for a new attempt. The task attempt number is reset to 0 when the stage starts a new attempt.
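A tiny sketch of the stage-attempt check (hypothetical field names, not the actual types):

struct TaskStatusUpdate {
    task_id: usize,
    stage_attempt: usize,
}

// Updates carrying an older stage attempt number belong to a superseded attempt
// whose task info has already been discarded, so they are ignored.
fn accept_update(update: &TaskStatusUpdate, current_stage_attempt: usize) -> bool {
    update.stage_attempt == current_stage_attempt
}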

);
let mut should_ignore = true;
// handle delayed failed tasks if the stage's next attempt is still in UnResolved status.
if let Some(task_status::Status::Failed(failed_task)) =
Contributor:

Here, only failed tasks are dealt with. Once a running stage is converted into an unresolved one, it discards all of its task info, so updating the successful task info would be meaningless.

@thinkharderdev (Contributor):

@mingmwang I'm pretty swamped this week so won't have time to review this until this weekend.

events.push(StageEvent::StageCompleted(stage_id));
// if this stage is completed, we want to combine the stage metrics to plan's metric set and print out the plan
let is_final_successful = running_stage.is_successful()
&& !reset_running_stages.contains_key(&stage_id);
Contributor:

When this running stage was converted from a successful stage, there may be in-flight tasks of the stages that depend on it. When we find those tasks failing due to a data fetch failure, we also need to reset this running stage. Therefore the reset_running_stages check here is necessary.

@yahoNanJing (Contributor):

Later, maybe we can make this error task recovery feature configurable.

@yahoNanJing (Contributor):

The overall design of this PR looks good to me, and it will be really useful for dealing with executor-lost and executor bad-disk issues. @thinkharderdev, @andygrove, @avantgardnerio, could you help review this PR?

@andygrove (Member):

Thanks, @mingmwang. I plan on starting to review this tomorrow.

.await
/// Partition reader trait; different partition readers can have different implementations
#[async_trait]
trait PartitionReader: Send + Sync + Clone {
@yahoNanJing (Contributor) commented Sep 29, 2022:

A trait is much more extensible than conditional compilation 👍

@@ -424,54 +424,87 @@ message ExecutionGraph {
uint64 output_partitions = 5;
repeated PartitionLocation output_locations = 6;
string scheduler_id = 7;
uint32 tid_gen = 8;
Member:

nit: tid_gen isn't a descriptive name. I am guessing this is short for task_id_gen?

@mingmwang (Contributor, Author) commented Sep 30, 2022:

Sure, will change to task_id_gen.

Comment on lines 558 to 559
uint32 map_partition_id = 1;
PartitionId partition_id = 2;
Member:

Could you add comments explaining these different partition ids?

Contributor (Author):

Sure, will add more comments in the code. The newly added map_partition_id is the partition_id of the map stage that produces this shuffle data, while the original PartitionId partition_id has different meanings in different places: sometimes it stands for a task_id, and here it stands for a shuffle partition id (a composition of map_stage_id + reduce partition id + job_id), mixing map and reduce info together.
So if we do not need to consider backward compatibility, I would suggest unnesting this and making PartitionLocation a plain struct like below:

message PartitionLocation {
  string job_id = 1;
  uint32 map_stage_id = 2;
  uint32 map_partition_id = 3;
  uint32 partition_id = 4;
  ExecutorMetadata executor_meta = 5;
  PartitionStats partition_stats = 6;
  string path = 7;
}

Member:

It is late here for me, but I think that makes sense. I will review it again tomorrow.

@andygrove (Member):

@mingmwang I have taken a first pass through this PR and I think it looks good. I am going to spend some time testing out the PR locally. It would be good to wait for Dan to review as well before we merge this.

@andygrove (Member):

I tested this locally, and it worked really well!

I ran one scheduler and two executors, and I could kill one executor and still see a query complete successfully. This is not the case on the master branch.

I am happy to approve this once feedback has been addressed.

@mingmwang (Contributor, Author):

> I tested this locally, and it worked really well!
>
> I ran one scheduler and two executors, and I could kill one executor and still see a query complete successfully. This is not the case on the master branch.
>
> I am happy to approve this once feedback has been addressed.

Thank you. We will also start chaos-monkey testing next month to verify the recent changes.

@andygrove (Member) left a comment:

Thanks @mingmwang. This is a very nice improvement in stability.

@thinkharderdev (Contributor) left a comment:

Had a few questions but this looks really great! Thanks for your work on this @mingmwang!

Comment on lines +734 to +738
message ExecutorLost {
}

message ResultLost {
}
Contributor:

Not sure I understand the difference between these two errors

Contributor (Author):

> Not sure I understand the difference between these two errors

Yes, good question. In the current code base, neither of these two errors is used directly by the executor tasks.
They are used by the Scheduler: when we see a FetchPartitionError task update from a reduce task, the related map task's status is changed to ResultLost. Of course, most of the time ResultLost will be caused by ExecutorLost.
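A small sketch of that scheduler-side mapping (names invented for illustration, not the actual Ballista types):

enum TaskFailure {
    FetchPartitionError { map_task_id: usize },
    ExecutorLost,
    ResultLost,
}

// A reduce-side fetch failure implies the map task's output is gone, so the
// scheduler marks the corresponding map task as ResultLost.
fn derive_map_task_status(failure: &TaskFailure) -> Option<(usize, TaskFailure)> {
    match failure {
        TaskFailure::FetchPartitionError { map_task_id } => {
            Some((*map_task_id, TaskFailure::ResultLost))
        }
        _ => None,
    }
}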

Comment on lines +262 to +263
retryable: true,
count_to_failures: true,
Contributor:

I'm not sure I understand retryable and count_to_failures. If a task is retryable, wouldn't it count toward failures in all cases? Conversely, if it's not retryable then we don't need to count the failure at all.

Contributor (Author):

Yes, having both retryable and count_to_failures is to support the case where we might have some specific error that we want to retry forever until the task succeeds.
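For illustration, a sketch of how the two flags could combine (hypothetical names):

struct RetryPolicy {
    retryable: bool,
    count_to_failures: bool,
}

fn should_reschedule(policy: &RetryPolicy, failure_count: usize, max_failures: usize) -> bool {
    if !policy.retryable {
        return false;
    }
    if policy.count_to_failures {
        // Normal case: retry until the failure budget is exhausted.
        failure_count < max_failures
    } else {
        // Special case: an error we want to retry forever until it succeeds.
        true
    }
}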

}

#[allow(dead_code)]
// TODO
struct LocalPartitionReader {}
Contributor:

Nice! I assume where this is going is making the executors smart enough to read partitions on the same physical machine directly from disk.

Contributor (Author):

> Nice! I assume where this is going is making the executors smart enough to read partitions on the same physical machine directly from disk.

Yes, exactly.
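For illustration, a speculative sketch of what LocalPartitionReader might grow into, assuming tokio and async_trait; the read_partition signature is invented here and is not the trait's actual shape:

use async_trait::async_trait;

#[async_trait]
trait PartitionReader: Send + Sync {
    async fn read_partition(&self, path: &str) -> std::io::Result<Vec<u8>>;
}

struct LocalPartitionReader;

#[async_trait]
impl PartitionReader for LocalPartitionReader {
    async fn read_partition(&self, path: &str) -> std::io::Result<Vec<u8>> {
        // Same-host case: bypass the Flight/network path and read the shuffle
        // file directly from local disk.
        tokio::fs::read(path).await
    }
}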

task_id_gen: usize,
/// Failed stage attempts; record the failed stage attempts to limit the retry times.
/// Map from Stage ID -> Set<STAGE_ATTEMPT_NUM>
failed_stage_attempts: HashMap<usize, HashSet<usize>>,
Contributor:

I don't get why we are saving a HashSet of attempts. Shouldn't it just be usize?

@mingmwang (Contributor, Author) commented Oct 2, 2022:

> I don't get why we are saving a HashSet of attempts. Shouldn't it just be usize?

The purpose of using a HashSet is to record exactly which distinct attempts failed. With a plain usize we would lose that detail.
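A short sketch of the bookkeeping (max_stage_failures is a hypothetical config value):

use std::collections::{HashMap, HashSet};

// Returns true if the stage still has retry budget left. Recording distinct
// attempt numbers in a set means duplicate failure reports for the same
// attempt do not burn the budget twice.
fn record_failed_attempt(
    failed_stage_attempts: &mut HashMap<usize, HashSet<usize>>,
    stage_id: usize,
    attempt: usize,
    max_stage_failures: usize,
) -> bool {
    let attempt_set = failed_stage_attempts.entry(stage_id).or_default();
    attempt_set.insert(attempt);
    attempt_set.len() < max_stage_failures
}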

@yahoNanJing (Contributor):

Thanks @mingmwang for this huge step in error recovery, and thanks @andygrove and @thinkharderdev for reviewing this PR. Since we have all approved it, I'll merge.

@yahoNanJing merged commit f5bfef0 into apache:master on Oct 2, 2022.