Graceful Task Termination #15478
Conversation
Can you briefly summarize in the PR description what you are working on here? :)
```java
public void checkTaskTermination()
{
    if (liveCreatedDrivers.get() == 0 && taskStateMachine.getState().isTerminating()) {
        taskStateMachine.terminationComplete();
    }
}
```
Here is what I was thinking. I don't know if this actually happens, but I think the code allows for it:

- thread 1: `tryCreateNewDriver` calls `incrementAndGet`
- thread 1: the `isTerminatingOrDone` branch is not entered
- thread x: terminates the task
- thread 2: `liveCreatedDrivers.get()` equals 1 and the state is terminating, so `terminationComplete` is not called
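To make the suspected interleaving concrete, here is a minimal self-contained sketch. The method and field names mirror the discussion above, but the bodies are illustrative assumptions, not the actual Trino code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the suspected race; not the actual implementation.
class TerminationRaceSketch
{
    private final AtomicLong liveCreatedDrivers = new AtomicLong();
    private volatile boolean terminating;

    void tryCreateNewDriver()
    {
        liveCreatedDrivers.incrementAndGet();   // thread 1: count the driver first
        if (terminating) {
            // thread 1: termination not yet requested, so this branch is
            // skipped and the increment stands
            liveCreatedDrivers.decrementAndGet();
            return;
        }
        // ... driver creation would proceed here
    }

    void terminate()
    {
        terminating = true;                     // thread x: request termination
    }

    void checkTaskTermination()
    {
        // thread 2: observes liveCreatedDrivers == 1 while terminating is
        // true, so terminationComplete() below is never reached
        if (liveCreatedDrivers.get() == 0 && terminating) {
            terminationComplete();
        }
    }

    void terminationComplete()
    {
        // would transition the task to its final done state
    }
}
```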
```java
public void checkTaskTermination()
{
    if (liveCreatedDrivers.get() == 0 && taskStateMachine.getState().isTerminating()) {
        taskStateMachine.terminationComplete();
    }
}
```
For safety, I typically check `<= 0`. It can mask bugs (you could log when it happens), but it prevents bugs from locking up resources. Also, consider calling the termination check periodically after termination until the task is actually all cleaned up. Again, this can mask bugs, but it is better than locking up a server.
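A hedged sketch of the defensive variant being suggested; the logging call and the `terminationComplete()` name are assumptions based on the surrounding discussion, not the PR's exact code:

```java
public void checkTaskTermination()
{
    long drivers = liveCreatedDrivers.get();
    if (drivers < 0) {
        // A negative count means a bookkeeping bug somewhere; log it so the
        // bug stays visible instead of being silently masked...
        log.error("liveCreatedDrivers is negative: %s", drivers);
    }
    // ...but still complete termination, so a miscount cannot leave the task
    // stuck in a terminating state holding worker resources forever
    if (drivers <= 0 && taskStateMachine.getState().isTerminating()) {
        taskStateMachine.terminationComplete();
    }
}
```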
This is a very large commit, but unfortunately it's an all-or-nothing change that fundamentally changes how tasks report their state throughout the engine, on both workers and the coordinator. Before this change, tasks would immediately change to some final state when cancelled, failed, or told to abort, even while drivers were still running. Additionally, the coordinator would forcibly set the task state locally during an abort or some kinds of failure without the remote worker acknowledging that command. The result was that the coordinator's view of the worker state was inconsistent for the purposes of scheduling (since drivers for tasks that were told to terminate may still be running for some amount of time), and the final task stats may have been incomplete. This change introduces CANCELING, ABORTING, and FAILING states that are considered "terminating" but not yet "done", and awaits the final task driver to stop executing before transitioning to the corresponding "done" state.
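To illustrate the pairing, here is a rough sketch of how the terminating and done states could relate; the actual `TaskState` enum in the PR carries more states and machinery than shown here:

```java
// Illustrative sketch only; the real enum and its helpers may differ.
public enum TaskState
{
    PLANNED, RUNNING, FLUSHING,
    // "terminating": termination requested, but drivers may still be running
    CANCELING, ABORTING, FAILING,
    // "done": all drivers have observed the termination signal and exited
    FINISHED, CANCELED, ABORTED, FAILED;

    public boolean isTerminating()
    {
        return this == CANCELING || this == ABORTING || this == FAILING;
    }

    public boolean isDone()
    {
        return this == FINISHED || this == CANCELED || this == ABORTED || this == FAILED;
    }
}
```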
Avoids redundant TaskExecutor scheduling work when `TaskExecutor#removeTask` is called more than once as a result of terminating and done task state transition races.
Looks good
A little late to the party, but it would be better to pick a term other than … Also, does this need docs?
I'm open to renaming the state, but whatever new name we would choose should equally be applied to the "final" state name to maintain the symmetry of …

I can't find any docs describing the existing task states (let me know if I missed something there), so I don't think that there's an immediate need to document the new ones, although documenting task, stage, and query state semantics might be worthwhile to do at some point.
One other request... ideally, could we fix the grammar and change from CANCELING to CANCELLING?
And with regards to ABORTING, could we change it to TERMINATING? Although I do see the problem of having the existing status of "ABORTED", so maybe we have to leave that at ABORTING.
The problem I see with that is that failing, canceling, and aborting could all be considered "terminating", which is a useful concept by itself that unifies the three variations of "stopping the task early" that I've already used fairly heavily in this PR (e.g. `isTerminating()`).
Fair... I think the current setup works. I just wish we could fix the spelling ;-)
Description
At a high level, this PR introduces a new concept of "terminating states" to `TaskState`, representing that a task has begun terminating but not all drivers that were created have fully exited yet. The new states (`CANCELING`, `ABORTING`, and `FAILING`) are the "terminating" equivalents of `CANCELED`, `ABORTED`, and `FAILED`, which are the final states that tasks will transition into after all drivers have observed the termination signal and been destroyed.

The motivation for this change is two-fold:

- The coordinator's view of worker task state could be inconsistent for scheduling purposes, since drivers for tasks that were told to terminate could still be running for some amount of time.
- Final task stats could be incomplete in the `TaskInfo` response that coordinators would use to produce stage and query stats.

Additional context and related issues
Assumptions to note in the implementation here:

- Completing termination (moving from a terminating state to its done state) is handled by `SqlTaskExecution` if one has been created. Otherwise, `SqlTask` will transition the state machine from terminating to termination complete (since no execution means no drivers can possibly be running); see the sketch after this list.
- Logic that waits for `TaskState#isDone()` or checks for `state == TaskState.FAILED` will now only fire when termination is fully complete. In general this behavior is still correct, but potentially slower to react than before, and there may be opportunities to more eagerly react to terminating states like `TaskState.FAILING` in fault tolerant execution.
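The sketch referenced in the first bullet, under the assumption of hypothetical method names like `transitionToTerminating` and `terminationComplete`; the actual PR code likely differs in detail:

```java
// Hypothetical sketch of the termination hand-off; names are illustrative.
public void terminate(TaskState terminatingState)
{
    taskStateMachine.transitionToTerminating(terminatingState);
    SqlTaskExecution execution = taskExecution.get();
    if (execution == null) {
        // No SqlTaskExecution was ever created, so no drivers can possibly
        // be running: SqlTask completes termination immediately.
        taskStateMachine.terminationComplete();
    }
    else {
        // Otherwise SqlTaskExecution waits for the last live driver to exit
        // before moving the state machine to the corresponding done state.
        execution.checkTaskTermination();
    }
}
```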
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: