Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some resiliency to lost executors #568

Merged
merged 2 commits into from
Jun 26, 2021
Merged

Conversation

edrevo
Copy link
Contributor

@edrevo edrevo commented Jun 16, 2021

Rationale for this change

If an executor dies, the current behavior is that the job will fail or it might get stuck forever. This PR is a first step towards making it possible to recover from lost executors: if an executor is dead and it was running tasks or contained materialized partitions that need to be read, we just re-schedule the task.

There are many many more failure cases that the ones covered in this PR. I'll probably start opening issues with the different failure cases so we can track them.

What changes are included in this PR?

assign_next_schedulable_task will not re-schedule any tasks that were handled by executors that died.

I have tested this manually by killing an executor while running tpch query 12, and the query was able to finish and produce the correct result.

Are there any user-facing changes?

No

@edrevo
Copy link
Contributor Author

edrevo commented Jun 16, 2021

cc @andygrove

}
Err(_) => vec![],
};
// TODO: Display last seen information in UI
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @msathis: I am not super familiar with the UI side of the project, but it would be nice to display which executors died. I'll open an issue for this once this PR is merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @edrevo Great work 👍 I can raise the follow up PR once this is merged! 👌

@codecov-commenter
Copy link

Codecov Report

Merging #568 (a4855ac) into master (e3e7e29) will decrease coverage by 0.06%.
The diff coverage is 38.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
- Coverage   76.12%   76.05%   -0.07%     
==========================================
  Files         156      156              
  Lines       27074    27067       -7     
==========================================
- Hits        20609    20585      -24     
- Misses       6465     6482      +17     
Impacted Files Coverage Δ
ballista/rust/executor/src/execution_loop.rs 0.00% <0.00%> (ø)
ballista/rust/scheduler/src/api/handlers.rs 0.00% <0.00%> (ø)
ballista/rust/scheduler/src/lib.rs 22.02% <ø> (+1.21%) ⬆️
ballista/rust/scheduler/src/state/etcd.rs 0.00% <0.00%> (ø)
ballista/rust/scheduler/src/state/mod.rs 67.05% <39.34%> (-3.44%) ⬇️
ballista/rust/scheduler/src/state/standalone.rs 76.23% <100.00%> (-0.91%) ⬇️
...afusion/src/physical_plan/expressions/nth_value.rs 78.78% <0.00%> (-11.69%) ⬇️
datafusion/src/physical_plan/expressions/count.rs 84.04% <0.00%> (-5.32%) ⬇️
datafusion/src/physical_plan/expressions/sum.rs 78.20% <0.00%> (-4.49%) ⬇️
...atafusion/src/physical_plan/expressions/average.rs 81.73% <0.00%> (-4.35%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e3e7e29...a4855ac. Read the comment docs.

@alamb alamb added the ballista label Jun 16, 2021
@alamb alamb requested a review from andygrove June 23, 2021 13:10
@edrevo
Copy link
Contributor Author

edrevo commented Jun 23, 2021

merge conflicts resolved

@andygrove andygrove merged commit ccb7520 into apache:master Jun 26, 2021
@houqp houqp added the enhancement New feature or request label Jul 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants