Add some resiliency to lost executors #568
Conversation
cc @andygrove

    }
    Err(_) => vec![],
};
// TODO: Display last seen information in UI
cc @msathis: I am not super familiar with the UI side of the project, but it would be nice to display which executors died. I'll open an issue for this once this PR is merged.
Thanks @edrevo, great work 👍 I can raise the follow-up PR once this is merged! 👌
Codecov Report
@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
- Coverage   76.12%   76.05%   -0.07%
==========================================
  Files         156      156
  Lines       27074    27067       -7
==========================================
- Hits        20609    20585      -24
- Misses       6465     6482      +17
Continue to review full report at Codecov.
Merge conflicts resolved.
Rationale for this change
If an executor dies, the current behavior is that the job either fails or gets stuck forever. This PR is a first step towards making it possible to recover from lost executors: if a dead executor was running tasks, or held materialized partitions that still need to be read, we simply re-schedule those tasks.
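To make that concrete, here is a minimal sketch of the heartbeat-based liveness check this kind of recovery relies on; the names (ExecutorHeartbeat, EXECUTOR_TIMEOUT, dead_executors) are hypothetical and not the actual scheduler API:

```rust
// Illustrative sketch only: treat an executor as lost when its last
// heartbeat is older than a timeout. Type and constant names are made up.
use std::time::{Duration, SystemTime, UNIX_EPOCH};

const EXECUTOR_TIMEOUT: Duration = Duration::from_secs(60);

struct ExecutorHeartbeat {
    executor_id: String,
    /// Seconds since the Unix epoch at the executor's last check-in.
    last_seen_ts: u64,
}

/// Return the ids of executors whose heartbeat has expired.
fn dead_executors(heartbeats: &[ExecutorHeartbeat]) -> Vec<&str> {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system time before Unix epoch")
        .as_secs();
    heartbeats
        .iter()
        .filter(|hb| now.saturating_sub(hb.last_seen_ts) > EXECUTOR_TIMEOUT.as_secs())
        .map(|hb| hb.executor_id.as_str())
        .collect()
}
```

An executor whose last-seen timestamp stops advancing is the "last seen" information referenced in the UI TODO above, and its tasks are the ones that need to be re-scheduled.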
There are many more failure cases than the ones covered in this PR. I'll probably start opening issues for the different failure cases so we can track them.
What changes are included in this PR?
assign_next_schedulable_task will now re-schedule any tasks that were handled by executors that died.
I have tested this manually by killing an executor while running TPC-H query 12, and the query was able to finish and produce the correct result.
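Conceptually, the re-scheduling boils down to resetting tasks tied to a lost executor so the next assignment pass hands them out again. The sketch below is illustrative only, with made-up types (Task, TaskStatus, reset_tasks_on_dead_executors) rather than the real scheduler state:

```rust
// Illustrative sketch of the re-scheduling idea, not the actual
// assign_next_schedulable_task implementation.
use std::collections::HashSet;

enum TaskStatus {
    Pending,
    Running { executor_id: String },
    Completed { executor_id: String },
}

struct Task {
    status: TaskStatus,
}

/// Reset tasks tied to executors that are no longer alive back to Pending,
/// so they become schedulable again on the next assignment pass. Completed
/// tasks are included because their materialized output lived on the lost
/// executor and can no longer be fetched.
fn reset_tasks_on_dead_executors(tasks: &mut [Task], alive: &HashSet<String>) {
    for task in tasks.iter_mut() {
        let lost = match &task.status {
            TaskStatus::Running { executor_id }
            | TaskStatus::Completed { executor_id } => !alive.contains(executor_id),
            TaskStatus::Pending => false,
        };
        if lost {
            task.status = TaskStatus::Pending;
        }
    }
}
```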
Are there any user-facing changes?
No