TEP-0065: Retry failed tasks on demand in a pipeline

KFP's use case. Co-authored-by: Tommy Li <[email protected]>
tektoncd · May 17, 2021 · 7bac645
1 parent ccad1e4
commit 7bac645
Showing 2 changed files with 263 additions and 0 deletions.
diff --git a/teps/0065-retry-failed-tasks-on-demand.md b/teps/0065-retry-failed-tasks-on-demand.md
@@ -0,0 +1,262 @@
+---
+status: proposed
+title: Retry failed tasks on-demand in a pipeline
+creation-date: '2021-05-07'
+last-updated: '2021-05-07'
+authors:
+- '@Tomcli'
+- '@ScrapCodes'
+---
+
+# TEP-0065: Retry failed tasks on-demand in a pipeline
+
+<!-- toc -->
+- [Summary](#summary)
+- [Motivation](#motivation)
+    - [Goals](#goals)
+    - [Non-Goals](#non-goals)
+    - [Use Cases (optional)](#use-cases-optional)
+- [Requirements](#requirements)
+- [Proposal](#proposal)
+    - [Notes/Caveats (optional)](#notescaveats-optional)
+    - [Risks and Mitigations](#risks-and-mitigations)
+    - [User Experience (optional)](#user-experience-optional)
+    - [Performance (optional)](#performance-optional)
+- [Design Details](#design-details)
+- [Test Plan](#test-plan)
+- [Design Evaluation](#design-evaluation)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
+- [Upgrade & Migration Strategy (optional)](#upgrade--migration-strategy-optional)
+- [References (optional)](#references-optional)
+<!-- /toc -->
+
+## Summary
+
+Presently, a pipeline has a `retry` mechanism, which a pipeline author
+can configure when creating a `Pipeline` or a `PipelineRun`. This TEP
+explores the benefits of adding a new `retry` mechanism that allows a
+user to retry a failed `pipelineRun` on demand. Some or all tasks of a
+failed `pipelineRun` may have failed; an on-demand retry would run only
+the failed tasks again and skip the tasks that had completed
+successfully.
+
+This will be an opt-in behaviour for pipelines and tasks: a pipeline or
+task author will be able to declare whether their pipeline or task
+supports a retry.
+
+## Motivation
+
+**Optimal use of cluster resources.**
+
+The ability to `retry` failed tasks is especially useful where `tekton` is
+the backend for running machine learning pipelines. A machine learning
+pipeline may consist of tasks that move large amounts of data and then train
+ML models. All of this can be very resource intensive, and without on-demand
+retry a user has to start the entire pipeline over. Sometimes the failure is
+due to a temporary service outage. For example, after the model is trained,
+the task reporting the metrics fails because of a temporary service outage.
+A retry after some time could easily fix it.
+
+A pipeline may be defined with various tasks, and some tasks might move a
+large amount of data and incur cost. This `retry` mechanism has substantial
+value where each task of the pipeline consumes significant computing
+resources, e.g. when `tekton` is used as a backend for ML pipelines.
+
+_Why do we need a new `retry` mechanism when we already support retry in
+`Pipeline` tasks?_
+
+The present `retry` field can only be defined at the time of creation of
+the pipeline. This is not suitable for use cases where manual intervention
+is necessary to decide whether a rerun is required.
+For example, if a service outage is causing a particular task to fail, then
+retrying `n` times won't help unless we wait for the service to come back
+and then retry. Such manual interventions need an on-demand `retry`
+mechanism.
+
+Another, contrived example: if a `Pipeline` represents a CI/CD job, then its
+tasks represent test suites, stress tests and benchmarks. We need a way to
+know whether a failure was due to a regression, to flakiness of the jobs
+themselves, or to a temporary service outage. In this case, simply retrying
+`n` times does not help with optimal resource consumption.
+
+### Goals
+
+1. Explore both the merits and demerits of having a new mechanism for retrying
+   _only a failed_ pipeline on demand.
+2. A pipeline may have failed either because of failures in its tasks or
+   because of a user-invoked cancel request. Retry only the failed/canceled
+   tasks of a failed `pipelineRun`.
+
+### Non-Goals
+
+1. Retrying successful pipeline runs or anything other than a failed
+   pipeline/task run.
+2. Changing the existing retry mechanism.
+3. Managing checkpointing of pipeline state, workspaces, etc. A `pipelineRun`'s
+   state stored in etcd is used as is.
+4. Determining a failed task's dependencies, i.e. figuring out which
+   dependent tasks are needed to rerun the failed task.
+
+### Use Cases (optional)
+
+1. A `PipelineRun` can be very resource intensive, and is sometimes susceptible
+   to failure due to transient conditions, for example an outage of a
+   particular service. In such cases, retrying `n` times is not enough;
+   a manual invocation of retry is required.
+
+2. It will be possible to cancel (e.g. preemption) any running `PipelineRun` and
+   resume it at a later point.
+
+## Requirements
+
+<!--
+Describe constraints on the solution that must be met. Examples might include
+performance characteristics that must be met, specific edge cases that must
+be handled, or user scenarios that will be affected and must be accommodated.
+-->
+
+## Proposal
+
+<!--
+This is where we get down to the specifics of what the proposal actually is.
+This should have enough detail that reviewers can understand exactly what
+you're proposing, but should not include things like API designs or
+implementation.  The "Design Details" section below is for the real
+nitty-gritty.
+-->
+
+### Notes/Caveats (optional)
+
+<!--
+What are the caveats to the proposal?
+What are some important details that didn't come across above?
+Go into as much detail as necessary here.
+This might be a good place to talk about core concepts and how they relate.
+-->
+1. What happens if the pipeline has finally tasks that do the cleanup?
+
+   For example, the clean-up step in finally deletes a cluster. For
+   cases such as this, the pipeline author can define the pipeline so that
+   it does not support a manual retry. Or, if the support is a requirement,
+   redesign the finally task so that the clean-up is not done if the pipeline failed.
+
+2. What happens if the failed task depends on the side effect of another task?
+   E.g. in the case of a simple pipeline `(A) ---> (B)`, (A) may create some
+   "side effect" state in the test cluster that will not be there if we execute
+   (B) alone. To overcome these challenges, we could implement this as a kind of
+   `opt-in` behaviour: a pipeline or task author will have the ability to
+   declare that their task or pipeline supports a `retry`.
+
+### Risks and Mitigations
+
+<!--
+What are the risks of this proposal and how do we mitigate. Think broadly.
+For example, consider both security and how this will impact the larger
+kubernetes ecosystem.
+
+How will security be reviewed and by whom?
+
+How will UX be reviewed and by whom?
+
+Consider including folks that also work outside the WGs or subproject.
+-->
+
+### User Experience (optional)
+
+<!--
+Consideration about the user experience. Depending on the area of change,
+users may be task and pipeline editors, they may trigger task and pipeline
+runs or they may be responsible for monitoring the execution of runs,
+via CLI, dashboard or a monitoring system.
+
+Consider including folks that also work on CLI and dashboard.
+-->
+
+### Performance (optional)
+
+<!--
+Consideration about performance.
+What impact does this change have on the start-up time and execution time
+of task and pipeline runs? What impact does it have on the resource footprint
+of Tekton controllers as well as task and pipeline runs?
+
+Consider which use cases are impacted by this change and what are their
+performance requirements.
+-->
+
+## Design Details
+
+<!--
+This section should contain enough information that the specifics of your
+change are understandable.  This may include API specs (though not always
+required) or even code snippets.  If there's any ambiguity about HOW your
+proposal will be implemented, this is the place to discuss them.
+
+If it's helpful to include workflow diagrams or any other related images,
+add them under "/teps/images/". It's up to the TEP author to choose the name
+of the file, but general guidance is to include at least TEP number in the
+file name, for example, "/teps/images/NNNN-workflow.jpg".
+-->
+
+## Test Plan
+
+<!--
+**Note:** *Not required until targeted at a release.*
+
+Consider the following in developing a test plan for this enhancement:
+- Will there be e2e and integration tests, in addition to unit tests?
+- How will it be tested in isolation vs with other components?
+
+No need to outline all of the test cases, just the general strategy.  Anything
+that would count as tricky in the implementation and anything particularly
+challenging to test should be called out.
+
+All code is expected to have adequate tests (eventually with coverage
+expectations).
+-->
+
+## Design Evaluation
+<!--
+How does this proposal affect the reusability, simplicity, flexibility 
+and conformance of Tekton, as described in [design principles](https://github.com/tektoncd/community/blob/master/design-principles.md)
+-->
+
+## Drawbacks
+
+<!--
+Why should this TEP _not_ be implemented?
+-->
+
+## Alternatives
+
+<!--
+What other approaches did you consider and why did you rule them out?  These do
+not need to be as detailed as the proposal, but should include enough
+information to express the idea and why it was not acceptable.
+-->
+
+## Infrastructure Needed (optional)
+
+<!--
+Use this section if you need things from the project/SIG.  Examples include a
+new subproject, repos requested, github details.  Listing these here allows a
+SIG to get the process for these resources started right away.
+-->
+
+## Upgrade & Migration Strategy (optional)
+
+<!--
+Use this section to detail whether this feature needs an upgrade or
+migration strategy. This is especially useful when we modify a
+behavior or add a feature that may replace and deprecate a current one.
+-->
+
+## References (optional)
+
+<!--
+Use this section to add links to GitHub issues, other TEPs, design docs in Tekton
+shared drive, examples, etc. This is useful to refer back to any other related links
+to get more details.
+-->
diff --git a/teps/README.md b/teps/README.md
@@ -185,3 +185,4 @@ This is the complete list of Tekton teps:
 |[TEP-0059](0059-skip-guarded-task-only.md) | Skip Guarded Task Only | proposed | 2021-03-24 |
 |[TEP-0061](0061-allow-custom-task-to-be-embedded-in-pipeline.md) | Allow custom task to be embedded in pipeline | implementable | 2021-04-28 |
 |[TEP-0063](0063-workspace-dependencies.md) | Workspace Dependencies | proposed | 2021-04-23 |
+|[TEP-0065](0065-retry-failed-tasks-on-demand.md) | Retry failed tasks on-demand in a pipeline | proposed | 2021-05-07 |