From d62425ea6cf02fd6690fe27099a7e4a7f1423ec2 Mon Sep 17 00:00:00 2001
From: Prashant Sharma
Date: Fri, 7 May 2021 16:07:24 +0530
Subject: [PATCH] TEP-0065: Retry failed tasks on demand in a pipeline

KFP's use case.

Co-authored-by: Tommy Li
---
 teps/0065-retry-failed-tasks-on-demand.md | 251 ++++++++++++++++++++++
 teps/README.md                            |   1 +
 2 files changed, 252 insertions(+)
 create mode 100644 teps/0065-retry-failed-tasks-on-demand.md

diff --git a/teps/0065-retry-failed-tasks-on-demand.md b/teps/0065-retry-failed-tasks-on-demand.md
new file mode 100644
index 000000000..40a2ceeae
--- /dev/null
+++ b/teps/0065-retry-failed-tasks-on-demand.md
@@ -0,0 +1,251 @@
+---
+status: proposed
+title: Retry failed tasks on-demand in a pipeline
+creation-date: '2021-05-07'
+last-updated: '2021-05-07'
+authors:
+- '@Tomcli'
+- '@ScrapCodes'
+---
+
+# TEP-0065: Retry failed tasks on-demand in a pipeline
+
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+  - [Use Cases (optional)](#use-cases-optional)
+- [Requirements](#requirements)
+- [Proposal](#proposal)
+  - [Notes/Caveats (optional)](#notescaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+  - [User Experience (optional)](#user-experience-optional)
+  - [Performance (optional)](#performance-optional)
+- [Design Details](#design-details)
+- [Test Plan](#test-plan)
+- [Design Evaluation](#design-evaluation)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
+- [Upgrade & Migration Strategy (optional)](#upgrade--migration-strategy-optional)
+- [References (optional)](#references-optional)
+
+
+## Summary
+
+Presently, a pipeline has a mechanism to `retry` a task, which a pipeline
+author can configure when creating a `Pipeline` or a `PipelineRun`.
+In this TEP, we explore the benefits of adding a new `retry` mechanism that
+allows a user to retry a failed pipeline run on demand. A failed
+`PipelineRun` may have some or all of its tasks failed; a retry would run
+only the failed tasks again and skip the tasks that completed successfully.
+
+This will be an opt-in behaviour for pipelines and tasks: a pipeline or task
+author will be able to declare whether their pipeline or task supports a
+retry.
+
+## Motivation
+
+**Optimise the use of cluster resources.**
+
+_Why do we need a new `retry` mechanism when we already support retry in
+`Pipeline` tasks?_
+
+The present `retry` field can only be defined when the pipeline is created;
+it cannot be invoked on demand. This is not suitable for use cases where
+manual intervention is necessary to decide whether a rerun is required. For
+example, if a `Pipeline` represents a CI/CD job, its tasks may represent a
+test suite, stress tests and benchmarks. Someone then needs to determine
+whether a failure was caused by a regression or by the flakiness of the jobs
+themselves. In this case, simply retrying `n` times does not help with
+optimal resource consumption.
+
+In practice, a set of CI/CD jobs is not represented by a single `Pipeline`
+because of current limitations. For example, a single task failure fails the
+entire `PipelineRun` (see [TEP-0050](0050-ignore-task-failures.md)), it is
+not possible to retry a single task of a pipeline, and there are also GitHub
+requirements.
+
+The ability to `retry` failed tasks is even more important to those using
+Tekton as a backend for running machine learning pipelines. A machine
+learning pipeline may consist of tasks that move large amounts of data and
+then train ML models, all of which can be very resource-intensive; without a
+retry, a user must start the entire pipeline over. Sometimes the failure is
+due to a temporary service outage.
+A retry after some time could easily fix it.
+
+### Goals
+
+1. Explore both the merits and demerits of a new mechanism for retrying, on
+   demand, _only a failed_ pipeline.
+2. A pipeline may have failed either because of failures in its tasks or
+   because of a user-invoked cancel request. Retry only the failed/canceled
+   tasks of a failed `PipelineRun`.
+
+### Non-Goals
+
+1. Retrying successful pipeline runs, or anything other than a failed
+   pipeline run.
+2. Changing the existing retry mechanism, or discussing the possibility of
+   doing so.
+3. Managing checkpointing of pipeline state, workspaces, etc. A
+   `PipelineRun`'s state stored in etcd is used as is.
+
+### Use Cases (optional)
+
+1. A `PipelineRun` can be very resource-intensive and is sometimes
+   susceptible to failure due to transient conditions, for example the
+   outage of a particular service. In such cases, retrying `n` times is not
+   enough; a manual invocation of retry is required.
+
+2. It will be possible to cancel (e.g. for preemption) any running
+   `PipelineRun` and resume it at a later point.
+
+## Requirements
+
+
+
+## Proposal
+
+
+
+### Notes/Caveats (optional)
+
+
+1. What happens if the pipeline has `finally` tasks that do the cleanup?
+   If such a pipeline is retried, the failed task could fail again, since
+   the cleanup may have removed state that the task depends on.
+
+2. What happens if the failed task depends on the side effects of another
+   task? In a simple pipeline `(A) ---> (B)`, (A) may create some
+   "side-effect" state in the test cluster that will not be there if we
+   execute (B) alone.
+
+To overcome these challenges, we could implement this as an `opt-in`
+behaviour: a pipeline or task author will be able to declare that their
+task or pipeline supports a `retry`.
+
+### Risks and Mitigations
+
+
+
+### User Experience (optional)
+
+
+
+### Performance (optional)
+
+
+
+## Design Details
+
+
+
+## Test Plan
+
+
+
+## Design Evaluation
+
+
+## Drawbacks
+
+
+
+## Alternatives
+
+
+
+## Infrastructure Needed (optional)
+
+
+
+## Upgrade & Migration Strategy (optional)
+
+
+
+## References (optional)
+
+
diff --git a/teps/README.md b/teps/README.md
index f822e31de..bce28a414 100644
--- a/teps/README.md
+++ b/teps/README.md
@@ -185,3 +185,4 @@ This is the complete list of Tekton teps:
 |[TEP-0059](0059-skip-guarded-task-only.md) | Skip Guarded Task Only | proposed | 2021-03-24 |
 |[TEP-0061](0061-allow-custom-task-to-be-embedded-in-pipeline.md) | Allow custom task to be embedded in pipeline | implementable | 2021-04-28 |
 |[TEP-0063](0063-workspace-dependencies.md) | Workspace Dependencies | proposed | 2021-04-23 |
+|[TEP-0065](0065-retry-failed-tasks-on-demand.md) | Retry failed tasks on-demand in a pipeline | proposed | 2021-05-07 |