From 6264e61d74dfe645c5dbfd01bc89ccedf9992144 Mon Sep 17 00:00:00 2001
From: Prashant Sharma
Date: Fri, 7 May 2021 16:07:24 +0530
Subject: [PATCH] TEP-0065: Retry failed tasks on demand in a pipeline

KFP's use case.

Co-authored-by: Tommy Li
---
 teps/0065-retry-failed-tasks-on-demand.md | 245 ++++++++++++++++++++++
 teps/README.md                            |   1 +
 2 files changed, 246 insertions(+)
 create mode 100644 teps/0065-retry-failed-tasks-on-demand.md

diff --git a/teps/0065-retry-failed-tasks-on-demand.md b/teps/0065-retry-failed-tasks-on-demand.md
new file mode 100644
index 000000000..8cbdb9a98
--- /dev/null
+++ b/teps/0065-retry-failed-tasks-on-demand.md
@@ -0,0 +1,245 @@
+---
+status: proposed
+title: Retry failed tasks on-demand in a pipeline
+creation-date: '2021-05-07'
+last-updated: '2021-05-07'
+authors:
+- '@Tomcli'
+- '@ScrapCodes'
+---
+
+# TEP-0065: Retry failed tasks on-demand, in a pipeline
+
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+  - [Use Cases (optional)](#use-cases-optional)
+- [Requirements](#requirements)
+- [Proposal](#proposal)
+  - [Notes/Caveats (optional)](#notescaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+  - [User Experience (optional)](#user-experience-optional)
+  - [Performance (optional)](#performance-optional)
+- [Design Details](#design-details)
+- [Test Plan](#test-plan)
+- [Design Evaluation](#design-evaluation)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
+- [Upgrade & Migration Strategy (optional)](#upgrade--migration-strategy-optional)
+- [References (optional)](#references-optional)
+
+
+## Summary
+
+Presently, a pipeline has a mechanism for `retry`, which a pipeline
+author can configure at the time of creation of a `Pipeline` or a
+`PipelineRun`.
+This TEP explores the benefits of adding a new on-demand `retry` mechanism
+that lets a user retry a failed `pipelineRun`. Some or all tasks of a failed
+`pipelineRun` may have failed; a retry reruns only the failed tasks and
+skips the tasks that completed successfully.
+
+## Motivation
+
+**Optimal use of cluster resources.**
+
+The ability to `retry` failed tasks is especially useful where `tekton` is a
+backend for running machine learning pipelines. A machine learning pipeline
+may consist of tasks that move large amounts of data and then train ML models.
+All of this can be very resource consuming, and without retry a user has to
+start the entire pipeline over. Sometimes the failure is due to a temporary
+service outage. For example, after training the model, a task that reports
+the metrics fails because a service is temporarily unavailable. A retry after
+some time could easily fix it.
+
+A pipeline may be defined with various tasks, and some tasks might move a
+large amount of data and incur cost. This `retry` mechanism has substantial
+value where each task of the pipeline consumes significant computing resources,
+e.g. when `tekton` is used as a backend for ML pipelines.
+
+_Why do we need a new `retry` mechanism when we already support retry in
+`Pipeline` tasks?_
+
+The present `retry` field can only be defined at the time the pipeline is
+created. This is not suitable for use cases where manual intervention
+is necessary to decide whether a rerun is required.
+For example, if a service outage is causing a particular task to fail, then
+retrying `n` times won't help unless we wait for the service to come back
+and then retry. For such manual interventions, we need an on-demand `retry`
+mechanism.
+
+As another, contrived example, suppose a `Pipeline` represents a CI/CD job whose
+tasks are test suites, stress tests and benchmarks.
+We need a way to know whether a failure was due to a regression, to flakiness
+of the jobs themselves, or to a temporary service outage. In this case, simply
+retrying `n` times does not help with optimal resource consumption.
+
+### Goals
+
+1. Explore the merits and demerits of a new mechanism for retrying, on demand,
+   _only a failed_ pipeline.
+2. A pipeline may have failed either due to failures in its tasks or due to a
+   user-invoked cancel request. Retry only the failed/canceled tasks of a
+   failed `pipelineRun`.
+
+### Non-Goals
+
+1. Retrying successful pipeline runs, or anything other than a failed pipeline/task
+   run.
+2. Changing the existing retry mechanism.
+3. Managing checkpointing of pipeline state, workspaces, etc. A `pipelineRun`'s
+   state stored in etcd is used as is.
+4. Determining a failed task's dependencies, i.e. figuring out which
+   dependent tasks are needed to rerun the failed task.
+
+### Use Cases (optional)
+
+1. A `PipelineRun` can be very resource consuming, and is sometimes susceptible to
+   failure due to transient conditions, for example an outage of a particular
+   service. In such cases, retrying `n` times automatically is not enough;
+   a manual invocation of retry is required.
+
+2. It will be possible to cancel (e.g. for preemption) any running `PipelineRun` and
+   resume it at a later point.
+
+3. In [Kubeflow pipelines with tekton backend](https://github.com/kubeflow/kfp-tekton)
+   we rerun the pipeline with a new `pipelineRun`. One of the main problems we see is
+   that some users use `pipelineRun.uid` and `pipelineRun.name` to distinguish their
+   jobs. So we want a retry feature that keeps these Tekton context variables
+   (like `pipelineRun.name` and `pipelineRun.uid`) the same when users retry a job.
+
+## Requirements
+
+1. On retry, we want to reuse the exact same `pipelineRun`, rather than creating
+   a new one and possibly deleting the old one.
+   Our users rely on `pipelineRun.uid` and `pipelineRun.name` to distinguish jobs.
+
+## Proposal
+
+When a `PipelineTask` configured with retries fails, the controller resets
+the status and start time of that task so that it can begin again.
+
+On-demand invocation can take place by signalling a failed `PipelineRun` to
+`retry`. On receiving that signal, the `pipelinerun` controller will begin to retry
+by resetting the status of the failed task. Apart from the signalling part, this
+is the same as the current implementation of retry.
+
+### Notes/Caveats (optional)
+
+1. What happens if the pipeline has finally tasks that do the cleanup?
+
+   For example, a cluster is deleted at the clean-up step in finally. In
+   cases such as this, the pipeline author can define the pipeline so that it
+   does not support manual retry. Or, if that support is required, redesign
+   the finally task so that the clean-up is not done if the pipeline failed.
+
+2. What happens if the failed task depends on the side effect of another task?
+   For example, in a simple pipeline `(A) ---> (B)`, (A) may create some
+   "side effect" state in the test cluster that will not be there if we execute
+   (B) alone. To overcome these challenges, we could implement this as a kind of
+   `opt-in` behaviour: a pipeline or task author will have the ability to
+   declare that the task or pipeline supports a `retry`.
+
+### Risks and Mitigations
+
+There is some risk associated with retrying non-idempotent tasks. The risk exists
+with both `on-demand` invocation of retry and a configured `retries` count.
+
+Argo mitigates this risk by not supporting retry for finally tasks.
+
+We can mitigate this by introducing an opt-in behaviour, i.e. tasks declared as
+non-idempotent will not be retried.
+
+### User Experience (optional)
+
+This support can extend to the `tkn` CLI as well. However, it is out of scope of this TEP.
+
+For example,
+
+`tkn pipelinerun retry pr-name -n namespace-name`
+
+### Performance (optional)
+
+
+
+## Design Details
+
+
+
+## Test Plan
+
+
+
+## Design Evaluation
+
+
+## Drawbacks
+
+
+
+## Alternatives
+
+
+
+## Infrastructure Needed (optional)
+
+
+
+## Upgrade & Migration Strategy (optional)
+
+
+
+## References (optional)
+
+[Kubeflow pipelines with tekton backend](https://github.com/kubeflow/kfp-tekton)
\ No newline at end of file
diff --git a/teps/README.md b/teps/README.md
index 11e1b3745..78e1395f7 100644
--- a/teps/README.md
+++ b/teps/README.md
@@ -212,6 +212,7 @@ This is the complete list of Tekton teps:
 |[TEP-0061](0061-allow-custom-task-to-be-embedded-in-pipeline.md) | Allow custom task to be embedded in pipeline | implemented | 2021-05-26 |
 |[TEP-0062](0062-catalog-tags-and-hub-categories-management.md) | Catalog Tags and Hub Categories Management | implementable | 2021-03-30 |
 |[TEP-0063](0063-workspace-dependencies.md) | Workspace Dependencies | proposed | 2021-04-23 |
+|[TEP-0065](0065-retry-failed-tasks-on-demand.md) | Retry failed tasks on-demand in a pipeline | proposed | 2021-05-07 |
 |[TEP-0066](0066-dogfooding-tekton.md) | Dogfooding Tekton | proposed | 2021-05-16 |
 |[TEP-0067](0067-tekton-catalog-pipeline-organization.md) | Tekton Catalog Pipeline Organization | implementable | 2021-02-22 |
 |[TEP-0070](0070-tekton-catalog-task-platform-support.md) | Platform support in Tekton catalog | proposed | 2021-06-02 |