Add some resiliency to lost executors #568
Conversation
cc @andygrove

    }
    Err(_) => vec![],
};
// TODO: Display last seen information in UI
cc @msathis: I am not super familiar with the UI side of the project, but it would be nice to display which executors died. I'll open an issue for this once this PR is merged.
Thanks @edrevo, great work 👍 I can raise the follow-up PR once this is merged! 👌
Codecov Report
@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
- Coverage   76.12%   76.05%   -0.07%
==========================================
  Files         156      156
  Lines       27074    27067       -7
==========================================
- Hits        20609    20585      -24
- Misses       6465     6482      +17
Continue to review full report at Codecov.
Merge conflicts resolved.
Rationale for this change
If an executor dies, the current behavior is that the job either fails or gets stuck forever. This PR is a first step towards making it possible to recover from lost executors: if a dead executor was running tasks, or held materialized partitions that still need to be read, we simply re-schedule those tasks.
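To make that concrete, here is a minimal sketch of the heartbeat-based liveness check this kind of recovery relies on; the names (ExecutorHeartbeat, EXECUTOR_TIMEOUT, dead_executors) are hypothetical and not the actual scheduler API:

```rust
// Illustrative sketch only: treat an executor as lost when its last
// heartbeat is older than a timeout. Type and constant names are made up.
use std::time::{Duration, SystemTime, UNIX_EPOCH};

const EXECUTOR_TIMEOUT: Duration = Duration::from_secs(60);

struct ExecutorHeartbeat {
    executor_id: String,
    /// Seconds since the Unix epoch at the executor's last check-in.
    last_seen_ts: u64,
}

/// Return the ids of executors whose heartbeat has expired.
fn dead_executors(heartbeats: &[ExecutorHeartbeat]) -> Vec<&str> {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system time before Unix epoch")
        .as_secs();
    heartbeats
        .iter()
        .filter(|hb| now.saturating_sub(hb.last_seen_ts) > EXECUTOR_TIMEOUT.as_secs())
        .map(|hb| hb.executor_id.as_str())
        .collect()
}
```

An executor whose last-seen timestamp stops advancing is the "last seen" information referenced in the UI TODO above, and its tasks are the ones that need to be re-scheduled.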
There are many more failure cases than the ones covered in this PR. I'll probably start opening issues for the different failure cases so we can track them.
What changes are included in this PR?
assign_next_schedulable_task will now re-schedule any tasks that were handled by executors that died.
I have tested this manually by killing an executor while running TPC-H query 12, and the query was able to finish and produce the correct result.
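Conceptually, the re-scheduling boils down to resetting tasks tied to a lost executor so the next assignment pass hands them out again. The sketch below is illustrative only, with made-up types (Task, TaskStatus, reset_tasks_on_dead_executors) rather than the real scheduler state:

```rust
// Illustrative sketch of the re-scheduling idea, not the actual
// assign_next_schedulable_task implementation.
use std::collections::HashSet;

enum TaskStatus {
    Pending,
    Running { executor_id: String },
    Completed { executor_id: String },
}

struct Task {
    status: TaskStatus,
}

/// Reset tasks tied to executors that are no longer alive back to Pending,
/// so they become schedulable again on the next assignment pass. Completed
/// tasks are included because their materialized output lived on the lost
/// executor and can no longer be fetched.
fn reset_tasks_on_dead_executors(tasks: &mut [Task], alive: &HashSet<String>) {
    for task in tasks.iter_mut() {
        let lost = match &task.status {
            TaskStatus::Running { executor_id }
            | TaskStatus::Completed { executor_id } => !alive.contains(executor_id),
            TaskStatus::Pending => false,
        };
        if lost {
            task.status = TaskStatus::Pending;
        }
    }
}
```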
Are there any user-facing changes?
No