status | title | creation-date | last-updated | authors
---|---|---|---|---
implemented | Graceful Pipeline Run Termination | 2021-03-18 | 2021-12-15 |

- Summary
- Motivation
- Requirements
- Proposal
- Design Details
- Test Plan
- Design Evaluation
- Drawbacks
- Alternatives
- References (optional)
Marking a `PipelineRun` as cancelled stops all running tasks and deletes associated Pods. That prevents final tasks, specified in a Pipeline under the `finally` section, from being run.
There should be a way to terminate a `PipelineRun` gracefully and wait for cleanup actions triggered by final tasks.
Tasks commonly trigger external activities or request resources in foreign systems while they run. Final tasks are well suited for any cleanup operations that have to be performed when execution completes. Cleanup is required in case of success, failure, and pipeline run cancellation.
Currently, there is no way to terminate a `PipelineRun` gracefully.
The existing cancellation capability is problematic in real use cases.
This is especially important for users of Kubeflow Pipelines with the Tekton backend.
Kubeflow Pipelines supports an exit handler, which guarantees that selected operations are triggered whenever a pipeline run is completed.
Those actions are executed in case of a pipeline run cancellation as well.
Given that Kubeflow Pipelines' `ExitHandler` is implemented in Tekton using final tasks, there is a significant inconsistency in the described behaviour.
Lastly, final tasks should be triggered on a pipeline run timeout, which is a standard error scenario. Running final tasks infinitely should be prevented with the additional configuration of a finalization timeout. There is a separate proposal: TEP-0046 that covers this part.
Related issues:
- Add support for graceful pipeline run termination / stop that waits for final tasks to be completed before termination.
- There is no intention to change the existing pipeline run cancellation behaviour, but rather provide an alternative one that would support graceful run termination.
- As a Kubeflow user, I run a training pipeline that spawns a `TFJob`. Once the pipeline execution is stopped, the training job should be terminated to limit resource usage.
- As a CodeEngine user, I run a pipeline that submits a batch job. Once the pipeline execution is stopped, the batch job should be terminated to limit the service costs.
- As a Kubernetes user, I run a pipeline that provisions a new cluster. The pipeline is executed and stopped. The k8s cluster resource should be freed up.
- As a Kubeflow user, I run a pipeline that executes some ETL actions in a sequence. At a time when an action is being processed, I want to cancel processing of following actions, but still analyze the results from the current one.
- Users should be able to gracefully terminate (cancel) a pipeline run and clean up external resources.
- Users may want to wait for running tasks to be finished before stopping a pipeline run.
In this proposal the following 2 actions are differentiated:
- cancel means kill running tasks
- stop means let running tasks finish but no new tasks are scheduled
To gracefully terminate a `PipelineRun` that's currently executing, but wait for final tasks to be run first, users update its definition with states:
- "CancelledRunFinally" - cancel `PipelineRun` and ensure `finally` is run
- "StoppedRunFinally" - stop `PipelineRun` and ensure `finally` is run
To gracefully cancel a `PipelineRun` that's currently executing, users update its definition to mark it as cancelled, but request final tasks to be run first.
When you do so, the spawned non-final `TaskRuns` are marked as cancelled and all associated Pods are deleted.
In parallel the final tasks are triggered.
For example:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: go-example-git
spec:
  # […]
  status: "CancelledRunFinally"
```
In the second scenario, users want to wait for running tasks to be completed.
To gracefully terminate a `PipelineRun`, users update its definition to mark it as stopped.
When you do so, the spawned `TaskRuns` are not cancelled.
The final tasks are triggered when all running tasks have finished.
For example:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: go-example-git
spec:
  # […]
  status: "StoppedRunFinally"
```
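Updating `spec.status` on a live run is typically done with a patch rather than by re-applying the whole manifest. The following is a minimal sketch, assuming cluster access through `kubectl` and a JSON merge patch; the helper names `graceful_patch` and `kubectl_command` are hypothetical and not part of Tekton. The command is only built, never executed.

```python
import json
import shlex

# spec.status values this proposal introduces for graceful termination.
GRACEFUL_STATUSES = {"CancelledRunFinally", "StoppedRunFinally"}


def graceful_patch(status: str) -> str:
    """Return a JSON merge patch body that sets spec.status on a PipelineRun."""
    if status not in GRACEFUL_STATUSES:
        raise ValueError(f"not a graceful termination status: {status!r}")
    return json.dumps({"spec": {"status": status}})


def kubectl_command(run_name: str, status: str) -> str:
    """Build (but do not run) a kubectl invocation that applies the patch."""
    args = [
        "kubectl", "patch", "pipelinerun", run_name,
        "--type", "merge",
        "-p", graceful_patch(status),
    ]
    return shlex.join(args)  # shell-quoted, safe to paste into a terminal


if __name__ == "__main__":
    print(kubectl_command("go-example-git", "StoppedRunFinally"))
```

Restricting the helper to the two graceful values keeps a typo from silently producing a patch that the controller would ignore or reject.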
Support for an additional command or flags in the `tkn` CLI should be considered.
No impact on performance.
In this proposal the list of statuses users can set in `spec.status` is extended to:

| | don't run finally tasks | run finally tasks |
|---|---|---|
| cancel | Cancelled | CancelledRunFinally |
| stop | | StoppedRunFinally |

The existing state "PipelineRunCancelled" is deprecated and replaced by "Cancelled".
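The status table can be read as two independent flags per value: whether running tasks are cancelled, and whether `finally` tasks still run. A small sketch of that reading follows; the `SpecStatus` enum, its attribute names, and `parse_spec_status` are illustrative assumptions, not Tekton's types.

```python
from enum import Enum


class SpecStatus(Enum):
    """spec.status values as (label, cancels running tasks, runs finally tasks)."""

    CANCELLED = ("Cancelled", True, False)
    CANCELLED_RUN_FINALLY = ("CancelledRunFinally", True, True)
    STOPPED_RUN_FINALLY = ("StoppedRunFinally", False, True)

    def __init__(self, label: str, cancels_running: bool, runs_finally: bool):
        self.label = label
        self.cancels_running = cancels_running
        self.runs_finally = runs_finally


# Deprecated spelling mapped to its replacement, as stated in the proposal.
DEPRECATED_ALIASES = {"PipelineRunCancelled": "Cancelled"}


def parse_spec_status(raw: str) -> SpecStatus:
    """Resolve a raw spec.status string, accepting the deprecated alias."""
    label = DEPRECATED_ALIASES.get(raw, raw)
    for status in SpecStatus:
        if status.label == label:
            return status
    raise ValueError(f"unsupported spec.status: {raw!r}")
```

The empty cell in the table (stop without running `finally` tasks) has no enum member here, so a value such as "Stopped" is rejected rather than guessed at.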
We need to consider the following cases:

- User sets `spec.status` to "CancelledRunFinally" in `PipelineRun` with running tasks, but the `finally` section is empty.

  In this case a graceful termination would act the same as a pipeline run cancellation.
  - `spec.status` in all running `TaskRun` is patched to "TaskRunCancelled"
  - `spec.status` in all running `Run` is patched to "RunCancelled"
  - `PipelineRun` condition after graceful termination is: `{ type: Succeeded, Status: False, Reason: Cancelled }`
- User sets `spec.status` to "CancelledRunFinally" in `PipelineRun` with running tasks and non-empty `finally` section.

  In this case a graceful termination cancels all running task runs and waits for final tasks to be processed.
  - `spec.status` in all running `TaskRun` is patched to "TaskRunCancelled"
  - `spec.status` in all running `Run` is patched to "RunCancelled"
  - `PipelineRun` condition just after cancellation is: `{ type: Succeeded, Status: Unknown, Reason: PipelineRunStopping }`
  - when final tasks are completed, `PipelineRun` condition is: `{ type: Succeeded, Status: False, Reason: PipelineRunCancelled }`
- User sets `spec.status` to "StoppedRunFinally" in `PipelineRun` with running tasks.

  In this case a graceful termination waits for all tasks to be completed.
  - `PipelineRun` condition just after graceful stop is: `{ type: Succeeded, Status: Unknown, Reason: PipelineRunStopping }`
  - when final tasks are completed, `PipelineRun` condition is: `{ type: Succeeded, Status: False, Reason: PipelineRunCancelled }`
- User sets `spec.status` to "CancelledRunFinally" or "StoppedRunFinally" in `PipelineRun` with running final tasks.

  In this case a graceful termination does not change a pipeline run state, which waits for all final tasks to be completed.
  - `PipelineRun` condition is unchanged.
- User sets `spec.status` to "CancelledRunFinally" or "StoppedRunFinally" in `PipelineRun` with tasks not scheduled yet.

  When a pipeline run is gracefully terminated (in the way described above), any unscheduled non-final task is skipped and listed in `status.skippedTasks` in `PipelineRun`.

  Final tasks, if present, are scheduled normally.
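The cases above can be condensed into two small decision functions: one for the `PipelineRun` condition and one for what happens to an individual task. This is a sketch of the rules as written in this proposal, not the controller's code; `Condition`, `condition_for`, `plan_task`, and the `work_remaining` flag are hypothetical names, and `is_running=False` is taken to mean "not scheduled yet" (completed tasks are out of scope). The reasons "Cancelled" and "PipelineRunCancelled" are reproduced exactly as the cases list them.

```python
from dataclasses import dataclass

GRACEFUL = ("CancelledRunFinally", "StoppedRunFinally")


@dataclass(frozen=True)
class Condition:
    type: str
    status: str
    reason: str


def condition_for(spec_status: str, has_finally: bool, work_remaining: bool) -> Condition:
    """Condition of a PipelineRun after a graceful termination request.

    work_remaining is True while running or final tasks have not yet completed.
    """
    if spec_status not in GRACEFUL:
        raise ValueError(f"not a graceful termination status: {spec_status!r}")
    if spec_status == "CancelledRunFinally" and not has_finally:
        # Nothing to wait for: behaves like a plain cancellation.
        return Condition("Succeeded", "False", "Cancelled")
    if work_remaining:
        return Condition("Succeeded", "Unknown", "PipelineRunStopping")
    return Condition("Succeeded", "False", "PipelineRunCancelled")


def plan_task(spec_status: str, is_final: bool, is_running: bool) -> str:
    """Action for one task: 'cancel', 'wait', 'skip', or 'schedule'."""
    if spec_status not in GRACEFUL:
        raise ValueError(f"not a graceful termination status: {spec_status!r}")
    if is_final:
        # Running final tasks are left alone; pending ones are scheduled normally.
        return "wait" if is_running else "schedule"
    if not is_running:
        # Unscheduled non-final tasks are skipped (status.skippedTasks).
        return "skip"
    # Running non-final tasks: cancel kills them, stop lets them finish.
    return "cancel" if spec_status == "CancelledRunFinally" else "wait"
```

For instance, `plan_task("StoppedRunFinally", is_final=False, is_running=True)` yields `"wait"`, while the same task under "CancelledRunFinally" yields `"cancel"`.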
Relationship among `PipelineRun` states:

- If `PipelineRun` has stopped executing (i.e. the Succeeded Condition is False or True), then modifications to `spec.status` should be rejected. Currently, such a validation is missing for "PipelineRunCancelled" state (replaced by "Cancelled").
- "Cancelled" - if this state is set when the `PipelineRun` is already in "PipelineRunStopping" state, active final tasks should be cancelled and no task should be scheduled anymore. That way users can forcefully terminate final tasks.

"CancelledRunFinally" and "StoppedRunFinally" states change the `finally` behaviour, which becomes an exit handler responsible for cleanup actions.
In the future, somebody may be interested in support for a "Stopped" state, which could allow stopping a `PipelineRun`, letting active tasks finish but with no new tasks being scheduled (including final tasks). That requires a separate TEP.
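The two state relationships listed earlier (rejecting `spec.status` changes on a finished run, and "Cancelled" overriding a run that is already stopping) could be checked along these lines. This is only a sketch under the assumption that the Succeeded condition status is available as the string "True", "False", or "Unknown"; both function names are hypothetical.

```python
# Succeeded condition values that mean the run has stopped executing.
FINISHED = ("True", "False")


def validate_status_update(succeeded: str, new_status: str) -> None:
    """Raise ValueError if spec.status is modified after the run has stopped."""
    if succeeded in FINISHED:
        raise ValueError(
            f"cannot set spec.status to {new_status!r}: "
            "PipelineRun has stopped executing"
        )


def forces_final_task_cancellation(reason: str, new_status: str) -> bool:
    """True when "Cancelled" is set while the run is in PipelineRunStopping.

    In that situation active final tasks are cancelled and nothing new is scheduled.
    """
    return reason == "PipelineRunStopping" and new_status == "Cancelled"
```

A run whose Succeeded condition is still "Unknown" passes validation, which is what allows a user to escalate from a graceful request to a forced "Cancelled".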
The new API value (non-breaking change).
1. Change the current behaviour of a pipeline run cancellation to be graceful by default (invoke finally tasks).
   - Pros: no need for a new API value.
   - Cons: that would introduce a breaking change and, even more importantly, users would lose control over the expected behaviour, while forced termination is still useful in some cases.
2. Decide the termination strategy as an additional property of `finally`. A pipeline author would say whether final tasks should be run on cancel.
   - Pros: a pipeline author can specify expected behaviour.
   - Cons: this would give too little control at runtime over the expected behaviour.
3. A variant of 2. with the ability to override the termination strategy at runtime.
   - Pros: the default strategy can be specified by an author and changed at runtime.
   - Cons: a bit more complex.