From a32168c9282b37b43b17edc2e812e252a94caaf1 Mon Sep 17 00:00:00 2001
From: Prashant Sharma
Date: Fri, 7 May 2021 16:07:24 +0530
Subject: [PATCH] TEP-0065: Retry failed tasks on demand in a pipeline

---
 teps/0065-retry-failed-tasks-on-demand.md | 230 ++++++++++++++++++++++
 teps/README.md                            |   1 +
 2 files changed, 231 insertions(+)
 create mode 100644 teps/0065-retry-failed-tasks-on-demand.md

diff --git a/teps/0065-retry-failed-tasks-on-demand.md b/teps/0065-retry-failed-tasks-on-demand.md
new file mode 100644
index 000000000..a7e264112
--- /dev/null
+++ b/teps/0065-retry-failed-tasks-on-demand.md
@@ -0,0 +1,230 @@
+---
+status: proposed
+title: Retry failed tasks on-demand in a pipeline
+creation-date: '2021-05-07'
+last-updated: '2021-05-07'
+authors:
+- '@Tomcli'
+- '@ScrapCodes'
+---
+
+# TEP-0065: Retry failed tasks on-demand in a pipeline
+
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+  - [Use Cases (optional)](#use-cases-optional)
+- [Requirements](#requirements)
+- [Proposal](#proposal)
+  - [Notes/Caveats (optional)](#notescaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+  - [User Experience (optional)](#user-experience-optional)
+  - [Performance (optional)](#performance-optional)
+- [Design Details](#design-details)
+- [Test Plan](#test-plan)
+- [Design Evaluation](#design-evaluation)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
+- [Upgrade & Migration Strategy (optional)](#upgrade--migration-strategy-optional)
+- [References (optional)](#references-optional)
+
+
+## Summary
+
+Presently, a pipeline has a mechanism to `retry` a task, which a pipeline author
+can configure when creating a `Pipeline` or a `PipelineRun`. In this TEP, we
+explore the benefits of adding a new `retry` API that allows a user to retry a
+failed pipeline run on demand.
+In other words, this is the `rerun failed tests` operation from the CI/CD world.
+
+## Motivation
+
+**Optimise the use of cluster resources.**
+
+_Why do we need a new `retry` API when we already support retry in `Pipeline`
+tasks?_
+The existing retry field can only be defined when a pipeline is created; it
+cannot be invoked on demand. This is unsuitable for use cases where manual
+intervention is needed to decide whether a rerun is required. For example, a
+pull request may have test suite or stress test failures that include recently
+known flakes. We then need a way to tell whether a failure was caused by a
+regression in the patch the pull request proposes or by the flakiness of the
+jobs themselves. In this case, simply retrying `n` times does not help with
+optimal resource consumption.
+
+For example, `/retest` in the kubernetes/kubernetes repo currently reruns only
+the failed jobs; a new API for retrying a failed `pipelineRun` would provide
+this out of the box. We hope they use `tektoncd` as the backend for their CI at
+some point.
+
+Without this support, a pull request author or reviewer currently has to mark
+tests for rerun individually.
+
+### Goals
+
+1. Explore the merits and demerits of a new API for retrying, on demand,
+   _only a failed_ pipeline.
+2. A pipeline may have failed either because some of its tasks failed or
+   because a user requested cancellation. Retry only the failed/canceled tasks
+   of a failed `pipelineRun`.
+3. Document the feature in the Tekton documentation, explaining the use cases.
+
+### Non-Goals
+
+1. Retrying successful pipeline runs, or anything other than a failed pipeline
+   run.
+2. Changing the existing retry mechanism, or discussing the possibility of
+   doing so.
+3. Discussing checkpointing of pipeline state, workspaces, etc. A
+   `pipelineRun`'s state stored in etcd is used as is.
+
+### Use Cases (optional)
+
+1. CI/CD: manually rerun all the failed jobs for a particular pull request.
+2. As a backend for Kubeflow: manually retry a failed `pipelineRun` to make
+   optimal use of resources.
+3. If a pipeline failed due to external factors such as a cluster node
+   failure or an image registry disconnect, the user can retry from the same
+   stage at a later time.
+
+## Requirements
+
+
+
+## Proposal
+
+
+
+### Notes/Caveats (optional)
+
+
+
+### Risks and Mitigations
+
+
+
+### User Experience (optional)
+
+
+
+### Performance (optional)
+
+
+
+## Design Details
+
+
+
+## Test Plan
+
+
+
+## Design Evaluation
+
+
+## Drawbacks
+
+
+
+## Alternatives
+
+
+
+## Infrastructure Needed (optional)
+
+
+
+## Upgrade & Migration Strategy (optional)
+
+
+
+## References (optional)
+
+
diff --git a/teps/README.md b/teps/README.md
index 63ffa8150..6a01b2449 100644
--- a/teps/README.md
+++ b/teps/README.md
@@ -184,3 +184,4 @@ This is the complete list of Tekton teps:
 |[TEP-0058](0058-graceful-pipeline-run-termination.md) | Graceful Pipeline Run Termination | proposed | 2021-03-18 |
 |[TEP-0059](0059-skip-guarded-task-only.md) | Skip Guarded Task Only | proposed | 2021-03-24 |
 |[TEP-0061](0061-allow-custom-task-to-be-embedded-in-pipeline.md) | Allow custom task to be embedded in pipeline | implementable | 2021-04-28 |
+|[TEP-0065](0065-retry-failed-tasks-on-demand.md) | Retry failed tasks on-demand in a pipeline | proposed | 2021-05-07 |