Task level retry and Stage level retry #261
Conversation
@thinkharderdev @andygrove @yahoNanJing
Added 10+ UTs to cover different cases.
For example, one SQL stage graph snapshot is as follows:
Both Stage1 and Stage2 have output shuffle data residing on Executor1. Unluckily, Executor1 gets lost. Tasks for the running Stage3 will then fail because they cannot fetch the shuffle data. For this kind of scenario and error, with this PR the ExecutionGraph is able to continue running after the proper task reset.
The stage graph snapshot will become as follows:
Once Stage1 finishes successfully, the graph snapshot will become as follows:
... The purpose of introducing task attempt and stage attempt is as follows:
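The detailed explanation is elided here; as a rough, hypothetical illustration (the names below are illustrative, not the PR's actual types), attempt numbers let the scheduler distinguish updates belonging to the current stage attempt from delayed updates of an attempt that has already been rolled back:

// Illustrative sketch only: a task update is tagged with the stage attempt it
// belongs to as well as its own task attempt number.
struct TaskIdentity {
    job_id: String,
    stage_id: usize,
    stage_attempt_num: usize,
    partition_id: usize,
    task_attempt_num: usize,
}

// An update carrying an older stage attempt number refers to an attempt that
// was already rolled back, so it must not be applied to the current attempt.
fn is_stale_update(update: &TaskIdentity, current_stage_attempt: usize) -> bool {
    update.stage_attempt_num < current_stage_attempt
}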
);
let mut should_ignore = true;
// handle delayed failed tasks if the stage's next attempt is still in UnResolved status.
if let Some(task_status::Status::Failed(failed_task)) =
Here, only failed tasks are dealt with. Once a running stage is converted back into an unresolved stage, it discards all of its task info, so updating successful task info would be meaningless.
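A minimal sketch of that logic, assuming the generated task_status::Status and FailedTask protobuf types from the quoted diff (the helper name is hypothetical):

// Sketch: for a stage rolled back to UnResolved, a delayed update is only
// interesting if it reports a failure; running/successful updates are ignored
// because the unresolved stage no longer tracks per-task info.
fn delayed_failed_task(status: task_status::Status) -> Option<FailedTask> {
    match status {
        task_status::Status::Failed(failed_task) => Some(failed_task),
        _ => None,
    }
}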
@mingmwang I'm pretty swamped this week so won't have time to review this until this weekend.
events.push(StageEvent::StageCompleted(stage_id));
// if this stage is completed, we want to combine the stage metrics to plan's metric set and print out the plan
let is_final_successful = running_stage.is_successful()
    && !reset_running_stages.contains_key(&stage_id);
When this running stage was converted from a successful stage, there may still be in-flight running tasks in the stages that depend on it. When we find that those tasks fail due to a data-fetching failure, we also need to reset this running stage. Therefore, the reset_running_stages check here is necessary.
Later, maybe we can make this task error recovery feature configurable.
The overall design of this PR looks good to me, and it will be really useful for dealing with the executor-lost issue and the executor bad-disk issue. @thinkharderdev, @andygrove, @avantgardnerio, could you help review this PR?
Thanks, @mingmwang. I plan on starting to review this tomorrow.
.await
/// Partition reader Trait, different partition reader can have
#[async_trait]
trait PartitionReader: Send + Sync + Clone {
A trait is much more extensible than conditional compilation 👍
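For reference, a sketch of what the trait-based design enables; the method name and signature below are guesses, not this PR's actual API:

use async_trait::async_trait;

// Each fetch strategy implements the same async trait, so new strategies can
// be added without conditional compilation.
#[async_trait]
trait PartitionReader: Send + Sync + Clone {
    // fetch one shuffle partition and return its raw bytes (illustrative signature)
    async fn fetch_partition(&self, path: &str) -> Result<Vec<u8>, String>;
}

// A Flight-based reader would fetch shuffle data from remote executors ...
#[derive(Clone)]
struct FlightPartitionReader {}

// ... while a local reader could read shuffle files on the same machine from disk.
#[derive(Clone)]
struct LocalPartitionReader {}

#[async_trait]
impl PartitionReader for LocalPartitionReader {
    async fn fetch_partition(&self, path: &str) -> Result<Vec<u8>, String> {
        tokio::fs::read(path).await.map_err(|e| e.to_string())
    }
}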
@@ -424,54 +424,87 @@ message ExecutionGraph {
  uint64 output_partitions = 5;
  repeated PartitionLocation output_locations = 6;
  string scheduler_id = 7;
  uint32 tid_gen = 8;
nit: tid_gen isn't a descriptive name. I am guessing this is short for task_id_gen?
Sure, will change to task_id_gen.
uint32 map_partition_id = 1;
PartitionId partition_id = 2;
Could you add comments explaining these different partition ids?
Sure, will add more comments in the code. Here the newly added map_partition_id is the partition_id of the map stage that produces this shuffle data. But the original PartitionId partition_id has different meanings in different places. Sometimes it stands for a task_id, and here it stands for a shuffle partition id (a composition of map_stage_id + reduce partition id + job_id), mixing map and reduce info together. So if we do not consider backward compatibility, I would suggest unnesting this and making PartitionLocation a plain struct like below:
message PartitionLocation {
  string job_id = 1;
  uint32 map_stage_id = 2;
  uint32 map_partition_id = 3;
  uint32 partition_id = 4;
  ExecutorMetadata executor_meta = 5;
  PartitionStats partition_stats = 6;
  string path = 7;
}
It is late here for me, but I think that makes sense. I will review it again tomorrow.
@mingmwang I have taken a first pass through this PR and I think it looks good. I am going to spend some time testing out the PR locally. It would be good to wait for Dan to review as well before we merge this.
I tested this locally, and it worked really well! I ran one scheduler and two executors, and I could kill one executor and still see a query complete successfully. This is not the case on the master branch. I am happy to approve this once feedback has been addressed.
Thank you. We will also start chaos-monkey testing next month to verify the recent changes.
Thanks @mingmwang. This is a very nice improvement in stability.
Had a few questions but this looks really great! Thanks for your work on this @mingmwang!
message ExecutorLost {
}

message ResultLost {
}
Not sure I understand the difference between these two errors
Not sure I understand the difference between these two errors
Yes, good question. In the current code base, neither of the two errors is used directly by the executor tasks. They are used by the Scheduler. When we see a 'FetchPartitionError' task update from a reduce task, the related map task's status is changed to 'ResultLost'. Of course, most of the time ResultLost should be caused by ExecutorLost.
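A hedged sketch of that scheduler-side mapping (hypothetical names, not the exact code):

// The scheduler translates a FetchPartitionError reported by a reduce task
// into ResultLost on the map task that produced the missing shuffle data;
// executor tasks never emit ResultLost or ExecutorLost themselves.
enum MapTaskStatus {
    Successful,
    ResultLost,
}

fn on_fetch_partition_error(map_task_status: &mut MapTaskStatus) {
    // The map output is gone (most often because its executor was lost), so
    // mark the map task ResultLost and let its stage be re-run.
    *map_task_status = MapTaskStatus::ResultLost;
}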
retryable: true,
count_to_failures: true,
I'm not sure I understand retryable and count_to_failures. If a task is retryable, wouldn't it count toward failures in all cases? Conversely, if it's not retryable then we don't need to count the failure at all.
Yes, having both retryable and count_to_failures is to support the case where we might have some specific error that we want to retry forever until it succeeds.
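A sketch of how the two flags can combine (hypothetical helper, not the PR's exact logic):

// retryable controls whether the task may be rescheduled at all;
// count_to_failures controls whether the attempt is charged against the retry
// budget. retryable = true with count_to_failures = false models an error that
// should be retried indefinitely until it succeeds.
struct RetryPolicy {
    retryable: bool,
    count_to_failures: bool,
}

fn should_retry(policy: &RetryPolicy, counted_failures: &mut usize, max_failures: usize) -> bool {
    if !policy.retryable {
        return false;
    }
    if policy.count_to_failures {
        *counted_failures += 1;
        return *counted_failures <= max_failures;
    }
    // retryable but not counted: retry without consuming the failure budget
    true
}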
}

#[allow(dead_code)]
// TODO
struct LocalPartitionReader {}
Nice! I assume where this is going is making the executors smart enough to read partitions on the same physical machine directly from disk.
Nice! I assume where this is going is making the executors smart enough to read partitions on the same physical machine directly from disk.
Yes, exactly.
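As a hypothetical sketch of how such a reader might be selected (not code from this PR):

// Partitions produced by an executor on the same host could skip the network
// hop entirely and be read straight from the local file system.
enum ReaderKind {
    Local,  // read the shuffle file directly from local disk
    Flight, // fetch the shuffle data over the network via Arrow Flight
}

fn choose_reader(partition_host: &str, my_host: &str) -> ReaderKind {
    if partition_host == my_host {
        ReaderKind::Local
    } else {
        ReaderKind::Flight
    }
}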
task_id_gen: usize,
/// Failed stage attempts, record the failed stage attempts to limit the retry times.
/// Map from Stage ID -> Set<Stage_ATTEMPT_NUM>
failed_stage_attempts: HashMap<usize, HashSet<usize>>,
I don't get why we are saving a HashSet of attempts. Shouldn't it just be usize?
I don't get why we are saving a HashSet of attempts. Shouldn't it just be usize?
The purpose of using a HashSet is to record exactly which distinct attempts failed. Using a usize, we would lose that detail.
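A short sketch of how a set of distinct attempts can bound retries (the helper and the configurable limit are assumptions, not the PR's exact code):

use std::collections::{HashMap, HashSet};

// Record each distinct failed attempt number per stage; another attempt is
// only allowed while the number of distinct failed attempts stays below the
// limit. A plain counter could double-count duplicate or delayed updates for
// the same attempt, whereas the set keeps the exact attempts.
fn record_failed_attempt(
    failed_stage_attempts: &mut HashMap<usize, HashSet<usize>>,
    stage_id: usize,
    attempt_num: usize,
    max_stage_attempts: usize,
) -> bool {
    let attempts = failed_stage_attempts.entry(stage_id).or_default();
    attempts.insert(attempt_num);
    // true means the stage may still be retried
    attempts.len() < max_stage_attempts
}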
Thanks @mingmwang for this huge step forward in error recovery. Thanks @andygrove and @thinkharderdev for reviewing this PR. Since we have all approved this PR, I'll merge it.
Which issue does this PR close?
Closes #140.
Rationale for this change
What changes are included in this PR?
1. Task level retry
a) Add a task attempt number and use a job-level unique task_id to represent the Task.
b) Define a couple of failure reasons for FailedTask.
c) Determine the real task failure reason when returning the TaskStatus back to the Scheduler.
d) Based on the failure reason, the Scheduler decides whether to reschedule the task and bumps the attempt number of the failed task (see the sketch after this list).
2. Stage level retry and shuffle read failure handling
a) Add a stage attempt number.
b) When there are shuffle partition fetch failures, the current running stage will be rolled back and the map stages will be resubmitted.
c) Handle delayed fetch-failure task updates.
d) Cancel the running tasks if the Scheduler decides to fail the stage/job.
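As referenced in the list above, here is a hedged sketch of the scheduler-side decision flow; the names and variants are illustrative, not this PR's exact types:

// Given a failed task's reason, either retry the task, roll the stage back so
// the map stages are re-run, or fail the job.
enum FailureReason {
    ExecutionError { retryable: bool, count_to_failures: bool },
    FetchPartitionError { map_stage_id: usize },
    TaskKilled,
}

enum RecoveryAction {
    RetryTask,            // bump the task attempt number and reschedule
    RollBackStage(usize), // reset the map stage whose output was lost
    FailJob,
}

fn decide(reason: FailureReason, failed_attempts: usize, max_attempts: usize) -> RecoveryAction {
    match reason {
        FailureReason::FetchPartitionError { map_stage_id } => {
            RecoveryAction::RollBackStage(map_stage_id)
        }
        FailureReason::ExecutionError { retryable: true, .. }
            if failed_attempts < max_attempts =>
        {
            RecoveryAction::RetryTask
        }
        _ => RecoveryAction::FailJob,
    }
}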
The below two items are not covered in this PR, and we may need to revisit the related logic in the future:
Are there any user-facing changes?