TEP-0065: Retry failed tasks on demand in a pipeline

KFP's use case. Co-authored-by: Tommy Li <[email protected]>
tektoncd · May 11, 2021 · d62425e
1 parent ccad1e4
commit d62425e

Showing 2 changed files with 252 additions and 0 deletions.
diff --git a/teps/0065-retry-failed-tasks-on-demand.md b/teps/0065-retry-failed-tasks-on-demand.md
@@ -0,0 +1,251 @@
+---
+status: proposed
+title: Retry failed tasks on-demand in a pipeline
+creation-date: '2021-05-07'
+last-updated: '2021-05-07'
+authors:
+- '@Tomcli'
+- '@ScrapCodes'
+---
+
+# TEP-0065: Retry failed tasks on-demand in a pipeline
+
+<!-- toc -->
+- [Summary](#summary)
+- [Motivation](#motivation)
+    - [Goals](#goals)
+    - [Non-Goals](#non-goals)
+    - [Use Cases (optional)](#use-cases-optional)
+- [Requirements](#requirements)
+- [Proposal](#proposal)
+    - [Notes/Caveats (optional)](#notescaveats-optional)
+    - [Risks and Mitigations](#risks-and-mitigations)
+    - [User Experience (optional)](#user-experience-optional)
+    - [Performance (optional)](#performance-optional)
+- [Design Details](#design-details)
+- [Test Plan](#test-plan)
+- [Design Evaluation](#design-evaluation)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
+- [Upgrade & Migration Strategy (optional)](#upgrade--migration-strategy-optional)
+- [References (optional)](#references-optional)
+<!-- /toc -->
+
+## Summary
+
+Presently, a pipeline has a mechanism to `retry` a task, which a pipeline
+author can configure at the time of creation of a `Pipeline` or a
+`PipelineRun`. In this TEP, we explore the benefits of adding a new `retry`
+mechanism that allows a user to retry a failed pipeline run on demand. A
+failed `PipelineRun` may have some or all of its tasks failed. A retry
+would run only the failed tasks again; the successfully completed tasks
+are skipped.
+
+This will be an opt-in behaviour for pipelines and tasks: a pipeline or
+task author will be able to declare whether their pipeline or task
+supports a retry.
+
+## Motivation
+
+**Optimise the use of cluster resources.**
+
+_Why do we need a new `retry` mechanism when we already support retry in
+`Pipeline` tasks?_ 
+
+The present `retry` field can only be defined when the pipeline is created,
+not as an on-demand invocation. This is not suitable for use cases where
+manual intervention is necessary to decide whether a rerun is required. For
+example, if a `Pipeline` represents a CI/CD job, its tasks may represent
+test suites, stress tests and benchmarks. We then need a way to know whether
+a failure was due to a regression or to flakiness of the jobs themselves.
+In this case, simply retrying `n` times does not help with optimal resource
+consumption.
+
+In reality, a set of CI/CD jobs is not represented by a single `Pipeline`
+because of current limitations: a single failure should not fail the entire
+`PipelineRun` (see [TEP-0050](0050-ignore-task-failures.md)), it is not
+possible to retry a single task of a pipeline, and GitHub has its own requirements.
+
+The ability to `retry` failed tasks is of even greater importance to those
+using Tekton as a backend for running machine learning pipelines. A machine
+learning pipeline may consist of tasks that move large amounts of data and
+then train ML models. All of this can be very resource consuming, and the
+inability to retry would require a user to start the entire pipeline over.
+Sometimes the failure is due to a temporary service outage, and a retry after
+some time could easily fix it.
+
+### Goals
+
+1. Explore the merits and demerits of a new mechanism for retrying, on demand,
+   _only a failed_ pipeline.
+2. A pipeline may have failed either because some of its tasks failed or because
+   a user requested cancellation. Retry only the failed/canceled tasks of a
+   failed `PipelineRun`.
+
+### Non-Goals
+
+1. Retrying successful pipeline runs, or anything other than a failed pipeline
+   run.
+2. Changing the existing retry mechanism, or discussing the possibility of doing so.
+3. Managing checkpointing of pipeline state, workspaces, etc. A `PipelineRun`'s
+   state stored in etcd is used as is.
+
+### Use Cases (optional)
+
+1. A `PipelineRun` can be very resource consuming and is sometimes susceptible
+   to failure due to transient conditions, for example an outage of a
+   particular service. In such cases, retrying `n` times is not enough;
+   a manual invocation of retry is required.
+
+2. It will be possible to cancel (e.g. for preemption) any running `PipelineRun`
+   and resume it at a later point.
+
+## Requirements
+
+<!--
+Describe constraints on the solution that must be met. Examples might include
+performance characteristics that must be met, specific edge cases that must
+be handled, or user scenarios that will be affected and must be accommodated.
+-->
+
+## Proposal
+
+<!--
+This is where we get down to the specifics of what the proposal actually is.
+This should have enough detail that reviewers can understand exactly what
+you're proposing, but should not include things like API designs or
+implementation.  The "Design Details" section below is for the real
+nitty-gritty.
+-->
+
+### Notes/Caveats (optional)
+
+<!--
+What are the caveats to the proposal?
+What are some important details that didn't come across above?
+Go into as much detail as necessary here.
+This might be a good place to talk about core concepts and how they relate.
+-->
+1. What happens if the pipeline has `finally` tasks that do the cleanup?
+   If such a pipeline is retried, the failed task could fail again.
+
+2. What happens if the failed task depends on the side effects of another task?
+   In a simple pipeline `(A) ---> (B)`, (A) may create some "side-effect"
+   state in the test cluster that will not be there if we execute (B) alone.
+
+To overcome these challenges, we could implement this as an opt-in
+behaviour: a pipeline or task author will be able to declare that their task
+or pipeline supports a `retry`.
+
+### Risks and Mitigations
+
+<!--
+What are the risks of this proposal and how do we mitigate. Think broadly.
+For example, consider both security and how this will impact the larger
+kubernetes ecosystem.
+
+How will security be reviewed and by whom?
+
+How will UX be reviewed and by whom?
+
+Consider including folks that also work outside the WGs or subproject.
+-->
+
+### User Experience (optional)
+
+<!--
+Consideration about the user experience. Depending on the area of change,
+users may be task and pipeline editors, they may trigger task and pipeline
+runs or they may be responsible for monitoring the execution of runs,
+via CLI, dashboard or a monitoring system.
+
+Consider including folks that also work on CLI and dashboard.
+-->
+
+### Performance (optional)
+
+<!--
+Consideration about performance.
+What impact does this change have on the start-up time and execution time
+of task and pipeline runs? What impact does it have on the resource footprint
+of Tekton controllers as well as task and pipeline runs?
+
+Consider which use cases are impacted by this change and what are their
+performance requirements.
+-->
+
+## Design Details
+
+<!--
+This section should contain enough information that the specifics of your
+change are understandable.  This may include API specs (though not always
+required) or even code snippets.  If there's any ambiguity about HOW your
+proposal will be implemented, this is the place to discuss them.
+
+If it's helpful to include workflow diagrams or any other related images,
+add them under "/teps/images/". It's up to the TEP author to choose the name
+of the file, but general guidance is to include at least TEP number in the
+file name, for example, "/teps/images/NNNN-workflow.jpg".
+-->
+
+## Test Plan
+
+<!--
+**Note:** *Not required until targeted at a release.*
+
+Consider the following in developing a test plan for this enhancement:
+- Will there be e2e and integration tests, in addition to unit tests?
+- How will it be tested in isolation vs with other components?
+
+No need to outline all of the test cases, just the general strategy.  Anything
+that would count as tricky in the implementation and anything particularly
+challenging to test should be called out.
+
+All code is expected to have adequate tests (eventually with coverage
+expectations).
+-->
+
+## Design Evaluation
+<!--
+How does this proposal affect the reusability, simplicity, flexibility 
+and conformance of Tekton, as described in [design principles](https://github.com/tektoncd/community/blob/master/design-principles.md)
+-->
+
+## Drawbacks
+
+<!--
+Why should this TEP _not_ be implemented?
+-->
+
+## Alternatives
+
+<!--
+What other approaches did you consider and why did you rule them out?  These do
+not need to be as detailed as the proposal, but should include enough
+information to express the idea and why it was not acceptable.
+-->
+
+## Infrastructure Needed (optional)
+
+<!--
+Use this section if you need things from the project/SIG.  Examples include a
+new subproject, repos requested, github details.  Listing these here allows a
+SIG to get the process for these resources started right away.
+-->
+
+## Upgrade & Migration Strategy (optional)
+
+<!--
+Use this section to detail whether this feature needs an upgrade or
+migration strategy. This is especially useful when we modify a
+behavior or add a feature that may replace and deprecate a current one.
+-->
+
+## References (optional)
+
+<!--
+Use this section to add links to GitHub issues, other TEPs, design docs in Tekton
+shared drive, examples, etc. This is useful to refer back to any other related links
+to get more details.
+-->
diff --git a/teps/README.md b/teps/README.md
@@ -185,3 +185,4 @@ This is the complete list of Tekton teps:
 |[TEP-0059](0059-skip-guarded-task-only.md) | Skip Guarded Task Only | proposed | 2021-03-24 |
 |[TEP-0061](0061-allow-custom-task-to-be-embedded-in-pipeline.md) | Allow custom task to be embedded in pipeline | implementable | 2021-04-28 |
 |[TEP-0063](0063-workspace-dependencies.md) | Workspace Dependencies | proposed | 2021-04-23 |
+|[TEP-0065](0065-retry-failed-tasks-on-demand.md) | Retry failed tasks on-demand in a pipeline | proposed | 2021-05-07 |