
Graceful Task Termination #15478

Merged: 5 commits from task-termination-states into trinodb:master, Mar 14, 2023

Conversation

@pettyjamesm (Member) commented Dec 20, 2022

Description

At a high level, this PR introduces a new concept of "terminating states" to TaskState, which represent that a task has begun terminating but not all drivers that were created have fully exited yet. The new states (CANCELING, ABORTING, and FAILING) are the "terminating" equivalents of CANCELED, ABORTED, and FAILED, which are the final states that tasks transition into after all drivers have observed the termination signal and been destroyed.
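The pairing between terminating and final states described above can be sketched roughly as follows. This is an illustrative reduction, not the actual Trino enum: the real TaskState has additional states (e.g. PLANNED, RUNNING, FLUSHING), and the helper name toDoneState() is an assumption here; isTerminating()/isTerminatingOrDone() are the names used in this PR.

```java
// Illustrative sketch only; the real TaskState enum has more states and details differ.
enum TaskState
{
    RUNNING(false, false),
    CANCELING(true, false),
    ABORTING(true, false),
    FAILING(true, false),
    CANCELED(false, true),
    ABORTED(false, true),
    FAILED(false, true),
    FINISHED(false, true);

    private final boolean terminating;
    private final boolean done;

    TaskState(boolean terminating, boolean done)
    {
        this.terminating = terminating;
        this.done = done;
    }

    public boolean isTerminating()
    {
        return terminating;
    }

    public boolean isDone()
    {
        return done;
    }

    public boolean isTerminatingOrDone()
    {
        return terminating || done;
    }

    // each terminating state maps to its corresponding final state (assumed helper)
    public TaskState toDoneState()
    {
        switch (this) {
            case CANCELING: return CANCELED;
            case ABORTING: return ABORTED;
            case FAILING: return FAILED;
            default: return this;
        }
    }
}
```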

The motivation for this change is two-fold:

  1. Previously, task termination would immediately transition the task into a terminal state, which means that any in-flight drivers would not record their stats in the final TaskInfo response that coordinators would use to produce stage and query stats.
  2. Because those tasks were considered immediately "finished", in-flight drivers could still be running on the worker node while remaining invisible to the coordinator for the purposes of scheduling splits fairly across worker nodes for concurrent tasks.

Additional context and related issues

Assumptions to note in the implementation here:

  • The responsibility for deciding when to transition from terminating to terminated lies with SqlTaskExecution if one has been created. Otherwise, SqlTask transitions the state machine from terminating to termination complete (since no execution means no drivers can possibly be running).
  • Existing logic that uses TaskState#isDone() or checks for state == TaskState.FAILED will now only fire when termination is fully complete. In general this behavior is still correct, but it is potentially slower to react than before, and there may be opportunities to react more eagerly to terminating states like TaskState.FAILING in fault-tolerant execution.
  • When remote tasks fail to report full termination for a long period of time (e.g., the worker is unresponsive or dead), we have to make a decision about how to proceed. This implementation will time out and fail the task on the coordinator after the maximum error duration has elapsed, but I'm open to other approaches.
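The hand-off in the first bullet can be sketched as below. This is a minimal illustration under assumed names (TaskTerminationHandoff, executionCreated), not the PR's actual code; the real SqlTask/SqlTaskExecution classes carry much more state.

```java
// Hypothetical sketch of the terminating -> done hand-off; names are
// illustrative, not Trino's actual implementation.
class TaskTerminationHandoff
{
    public enum State { RUNNING, FAILING, FAILED }

    private State state = State.RUNNING;
    private final boolean executionCreated; // stands in for "an SqlTaskExecution exists"

    TaskTerminationHandoff(boolean executionCreated)
    {
        this.executionCreated = executionCreated;
    }

    // SqlTask's side: termination is requested
    void terminate()
    {
        state = State.FAILING;
        if (!executionCreated) {
            // no execution means no drivers can possibly be running, so
            // SqlTask completes the transition itself
            terminationComplete();
        }
        // otherwise the execution calls terminationComplete() once the
        // last live driver has been destroyed
    }

    // SqlTaskExecution's side: called after the last driver exits
    void terminationComplete()
    {
        state = State.FAILED;
    }

    State getState()
    {
        return state;
    }
}
```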

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* Add CANCELING, ABORTING, and FAILING task statuses for tasks waiting on running drivers to terminate fully

@cla-bot added the cla-signed label Dec 20, 2022
@pettyjamesm force-pushed the task-termination-states branch from 1e33d08 to a5fb804 on December 20, 2022
@pettyjamesm force-pushed the task-termination-states branch from a5fb804 to 4499bfe on December 22, 2022
@pettyjamesm force-pushed the task-termination-states branch 10 times, most recently from 9be92d5 to 66e2687 on January 9, 2023
@pettyjamesm force-pushed the task-termination-states branch 3 times, most recently from e4a8704 to 019ac2f on January 10, 2023
@losipiuk (Member)

Can you briefly summarize in the PR description what you are working on here? :)

@pettyjamesm force-pushed the task-termination-states branch from 019ac2f to a2e0b9d on January 10, 2023
@pettyjamesm changed the title from "WIP: Graceful Task Termination" to "Graceful Task Termination" on Jan 10, 2023
@pettyjamesm marked this pull request as ready for review January 10, 2023
@pettyjamesm force-pushed the task-termination-states branch 6 times, most recently from 59e9f7b to 735827d on January 17, 2023
@pettyjamesm force-pushed the task-termination-states branch 4 times, most recently from dffec68 to f72da51 on March 2, 2023
@pettyjamesm force-pushed the task-termination-states branch 3 times, most recently from 02c3a8e to 5b39d6e on March 8, 2023

public void checkTaskTermination()
{
    if (liveCreatedDrivers.get() == 0 && taskStateMachine.getState().isTerminating()) {
(Member)

Here is what I was thinking. I don't know if this actually happens, but I think the code allows for it:

  1. thread 1: tryCreateNewDriver() calls incrementAndGet()
  2. thread 1: the isTerminatingOrDone() branch is not entered
  3. thread x: terminates the task
  4. thread 2: liveCreatedDrivers.get() equals 1 and the state is terminating, so terminationComplete() is not called
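The interleaving above can be made concrete with a small sequential sketch. The names mirror the snippet, but this is illustrative code run on one thread to make the window explicit, not Trino's actual implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sequential illustration of the race window described in the review comment.
class TerminationRace
{
    final AtomicInteger liveCreatedDrivers = new AtomicInteger();
    volatile boolean terminating;
    boolean terminationCompleteCalled;

    // the check from the snippet above
    void checkTaskTermination()
    {
        if (liveCreatedDrivers.get() == 0 && terminating) {
            terminationCompleteCalled = true;
        }
    }

    void demonstrate()
    {
        // thread 1: tryCreateNewDriver() increments the counter...
        liveCreatedDrivers.incrementAndGet();
        // ...but has not yet executed its isTerminatingOrDone() check

        // thread x: the task is terminated
        terminating = true;

        // thread 2: sees one live driver, so terminationComplete is skipped;
        // unless something re-runs checkTaskTermination() later, the task
        // stays in a terminating state
        checkTaskTermination();
    }
}
```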


public void checkTaskTermination()
{
    if (liveCreatedDrivers.get() == 0 && taskStateMachine.getState().isTerminating()) {
(Member)

For safety, I typically check <= 0. It can mask bugs (you could log when the count goes negative), but it prevents those bugs from locking up resources. Also, consider calling checkTaskTermination() periodically after termination until the task is actually all cleaned up. Again, this can mask bugs, but it is better than locking up a server.
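A sketch of the defensive variant suggested above; the field and method names are assumptions for illustration, not the real checkTaskTermination():

```java
import java.util.concurrent.atomic.AtomicInteger;

// Defensive termination check: <= 0 instead of == 0, with logging, so a
// bookkeeping bug cannot strand the task in a terminating state forever.
class DefensiveTermination
{
    final AtomicInteger liveCreatedDrivers = new AtomicInteger();
    volatile boolean terminating;
    boolean terminationComplete;

    void checkTaskTermination()
    {
        int live = liveCreatedDrivers.get();
        if (live < 0) {
            // a negative count indicates a bug: log it, but still complete
            // termination rather than lock up the server's resources
            System.err.println("BUG: liveCreatedDrivers is negative: " + live);
        }
        if (live <= 0 && terminating && !terminationComplete) {
            terminationComplete = true;
        }
    }
}
```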

This is a very large commit, but unfortunately it's an all-or-nothing change that fundamentally changes how tasks report their state throughout the engine, on both workers and the coordinator. Before this change, tasks would immediately change to some final state when canceled, failed, or told to abort, even while drivers were still running. Additionally, the coordinator would forcibly set the task state locally during an abort or some kinds of failure without the remote worker acknowledging that command.

The result was that the coordinator's view of the worker state was inconsistent for the purposes of scheduling (since drivers for tasks that were told to terminate may still be running for some amount of time), and the final task stats may have been incomplete.

This change introduces CANCELING, ABORTING, and FAILING states that are considered "terminating" but not yet "done", and which wait for the final task driver to stop executing before transitioning to their corresponding "done" state.
Avoids redundant TaskExecutor scheduling work when TaskExecutor#removeTask is called more than once as a result of races between the terminating and done task state transitions.
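One common way to get the idempotence this commit message describes is a compare-and-set guard. This is a hedged sketch with assumed names (TaskHandle, rescheduleRemainingWork), not the actual TaskExecutor change:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Idempotent removal via compare-and-set: only the first caller performs
// the scheduling work; later racing calls become no-ops.
class TaskHandle
{
    private final AtomicBoolean destroyed = new AtomicBoolean();
    int rescheduleCount;

    void removeTask()
    {
        // the first caller flips the flag and wins; subsequent calls from
        // racing terminating/done transitions skip the work entirely
        if (destroyed.compareAndSet(false, true)) {
            rescheduleRemainingWork();
        }
    }

    private void rescheduleRemainingWork()
    {
        rescheduleCount++; // stands in for the real scheduling work
    }
}
```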
@pettyjamesm force-pushed the task-termination-states branch from 5b39d6e to 2257571 on March 13, 2023
@dain (Member) left a comment:

Looks good

@dain dain merged commit 1ca635e into trinodb:master Mar 14, 2023
@pettyjamesm pettyjamesm deleted the task-termination-states branch March 14, 2023 00:27
@github-actions github-actions bot added this to the 411 milestone Mar 14, 2023
@colebow (Member) commented Mar 14, 2023

A little late to the party, but it would be better to pick a term other than ABORTING if it can be helped.

Also, does this need docs?

@pettyjamesm (Member, Author)

> A little late to the party, but it would be better to pick a term other than ABORTING if it can be helped.

I’m open to renaming the state, but whatever new name we choose should equally be applied to the “final” state name, to maintain the symmetry of FAILING -> FAILED and CANCELING -> CANCELED. We also already have meaningful semantics that separate aborts from cancels, so we’d need a new and different word.

> Also, does this need docs?

I can’t find any docs describing the existing task states (let me know if I missed something there), so I don’t think there’s an immediate need to document the new ones, although documenting task, stage, and query state semantics might be worthwhile at some point.

@mosabua (Member) commented Mar 14, 2023

One other request... ideally.. could we fix the grammar and change from CANCELING to CANCELLING ?

@mosabua (Member) commented Mar 14, 2023

And with regards to ... ABORTING .. could we change to TERMINATING ? Although I do see the problem of having the existing status of "ABORTED" .. so maybe we have to leave that at ABORTING

@pettyjamesm (Member, Author)

> And with regards to ... ABORTING .. could we change to TERMINATING ?

The problem I see with that is that failing, canceling, and aborting could all be considered “terminating”, which is a useful concept by itself: it unifies the three variations of “stopping the task early”, and I’ve already used it fairly heavily in this PR (e.g., TaskState#isTerminatingOrDone()).

@mosabua
Copy link
Member

mosabua commented Mar 15, 2023

Fair ... I think the current setup works. I just wish we could fix the spelling ;-)
