TEP-0065: Retry failed tasks on demand in a pipeline
KFP's use case.

Co-authored-by: Tommy Li <[email protected]>
ScrapCodes and Tomcli committed May 11, 2021
1 parent ccad1e4 commit d62425e
Showing 2 changed files with 252 additions and 0 deletions.
251 changes: 251 additions & 0 deletions teps/0065-retry-failed-tasks-on-demand.md
@@ -0,0 +1,251 @@
---
status: proposed
title: Retry failed tasks on-demand in a pipeline
creation-date: '2021-05-07'
last-updated: '2021-05-07'
authors:
- '@Tomcli'
- '@ScrapCodes'
---

# TEP-0065: Retry failed tasks on-demand in a pipeline

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Use Cases (optional)](#use-cases-optional)
- [Requirements](#requirements)
- [Proposal](#proposal)
- [Notes/Caveats (optional)](#notescaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [User Experience (optional)](#user-experience-optional)
- [Performance (optional)](#performance-optional)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Design Evaluation](#design-evaluation)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
- [Upgrade & Migration Strategy (optional)](#upgrade--migration-strategy-optional)
- [References (optional)](#references-optional)
<!-- /toc -->

## Summary

Presently, a pipeline has a `retry` mechanism for a task, which a pipeline
author can configure when creating a `Pipeline` or a `PipelineRun`. In this
TEP, we explore the benefits of adding a new `retry` mechanism that allows a
user to retry a failed pipeline run on demand. A failed `pipelineRun` may have
some or all of its tasks failed; a retry would run only the failed tasks again
and skip the tasks that completed successfully.

This will be opt-in behaviour for pipelines and tasks: a pipeline or task
author will be able to declare whether their pipeline or task supports a
retry.

## Motivation

**Optimise the use of cluster resources.**

_Why do we need a new `retry` mechanism when we already support retry in
`Pipeline` tasks?_

The present `retry` field can only be defined when the pipeline is created,
and not as an on-demand invocation. This is not suitable for use cases where
manual intervention is necessary to decide whether a rerun is required. For
example, if a Pipeline represents a CI/CD job, its tasks represent test
suites, stress tests and benchmarks. We then need a way to know whether a
failure was due to a regression or to flakiness of the jobs themselves. In
this case, simply retrying `n` times does not help with optimal resource
consumption.

In practice, a set of CI/CD jobs is not represented by a single Pipeline
because of current limitations: for example, a single failure should not fail
the entire `pipelineRun` ([TEP-0050](0050-ignore-task-failures.md)), and it is
not possible to retry a single task of a pipeline. There are also GitHub
requirements.

The ability to `retry` failed tasks is even more important to those using
Tekton as a backend for running machine learning pipelines. A machine learning
pipeline may consist of tasks that move large amounts of data and then train
ML models. All of this can be very resource-intensive, and without a retry a
user has to start the entire pipeline over. Sometimes the failure is due to a
temporary service outage, and a retry after some time could easily fix it.

### Goals

1. Explore the merits and demerits of a new mechanism for retrying, on demand,
_only a failed_ pipeline.
2. A pipeline may have failed either because of failures in its tasks or
because of a user-invoked cancel request. Retry only the failed/canceled tasks
of a failed `pipelineRun`.
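
One way to read goal 2 is as a filter over the recorded task states of a failed run. The following is only a sketch under stated assumptions: `TaskState`, `TaskResult` and `tasks_to_retry` are hypothetical names invented for illustration and are not part of the Tekton API, and this TEP does not yet define how a retry is requested or implemented.

```python
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    # Hypothetical terminal states of a task inside a finished pipeline run.
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    CANCELED = "Canceled"


@dataclass
class TaskResult:
    # Hypothetical record of one task's outcome in a pipeline run.
    name: str
    state: TaskState


def tasks_to_retry(results):
    """Return the names of tasks a retry would run again, in original order.

    Failed and canceled tasks are selected; successful tasks are skipped.
    """
    retryable = {TaskState.FAILED, TaskState.CANCELED}
    return [r.name for r in results if r.state in retryable]


run = [
    TaskResult("fetch-data", TaskState.SUCCEEDED),
    TaskResult("train-model", TaskState.FAILED),
    TaskResult("publish", TaskState.CANCELED),
]
print(tasks_to_retry(run))  # ['train-model', 'publish']
```

The sketch deliberately ignores task ordering and dependencies between tasks, which the caveats later in this document raise as open questions.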

### Non-Goals

1. Retrying successful pipeline runs, or anything other than a failed pipeline
run.
2. Changing the existing retry mechanism, or discussing such a possibility.
3. Managing checkpointing of pipeline state, workspaces, etc. A `pipelineRun`'s
state stored in etcd is used as is.

### Use Cases (optional)

1. A `PipelineRun` can be very resource-intensive and is sometimes susceptible
to failure due to transient conditions, for example an outage of a particular
service. In such cases, retrying `n` times is not enough; a manual invocation
of retry is required.

2. It will be possible to cancel (e.g. preemption) any running `PipelineRun`
and resume it at a later point.

## Requirements

<!--
Describe constraints on the solution that must be met. Examples might include
performance characteristics that must be met, specific edge cases that must
be handled, or user scenarios that will be affected and must be accommodated.
-->

## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. The "Design Details" section below is for the real
nitty-gritty.
-->

### Notes/Caveats (optional)

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above.
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->
1. What happens if the pipeline has `finally` tasks that do the cleanup?
If such a pipeline is retried, the failed task could fail again, for instance
because the cleanup removed state it depends on.

2. What happens if the failed task depends on the side effects of another task?
In a simple pipeline `(A) ---> (B)`, (A) may create some "side-effect"
state in the test cluster that will not be there if we execute (B) alone.

To overcome these challenges, we could implement this as `opt-in` behaviour:
a pipeline or task author will be able to declare that their task or pipeline
supports a `retry`.
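
The opt-in idea can be sketched as a guard that runs before any retry is attempted. This is an illustration only: the function name `can_retry`, its parameters, and the rule it applies (the pipeline must opt in, and so must every task that would be re-run) are assumptions made for this example. The TEP does not yet specify how the opt-in is declared or how pipeline-level and task-level opt-ins combine.

```python
def can_retry(pipeline_supports_retry, failed_tasks, task_supports_retry):
    """Decide whether an on-demand retry is allowed.

    Assumed rule (not defined by the TEP): the pipeline must have opted in,
    and every task that would be re-run must have opted in as well.
    A task missing from `task_supports_retry` is treated as not opted in.
    """
    if not pipeline_supports_retry:
        return False
    return all(task_supports_retry.get(name, False) for name in failed_tasks)


# Hypothetical opt-in declarations for a pipeline (A) ---> (B).
opt_in = {"A": False, "B": True}

print(can_retry(True, ["B"], opt_in))   # True: pipeline and B both opted in
print(can_retry(True, ["A"], opt_in))   # False: A did not opt in
print(can_retry(False, ["B"], opt_in))  # False: pipeline did not opt in
```

Treating a missing declaration as "not opted in" keeps the default conservative, so a task with unknown side effects is never re-run by accident.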

### Risks and Mitigations

<!--
What are the risks of this proposal and how do we mitigate. Think broadly.
For example, consider both security and how this will impact the larger
kubernetes ecosystem.
How will security be reviewed and by whom?
How will UX be reviewed and by whom?
Consider including folks that also work outside the WGs or subproject.
-->

### User Experience (optional)

<!--
Consideration about the user experience. Depending on the area of change,
users may be task and pipeline editors, they may trigger task and pipeline
runs or they may be responsible for monitoring the execution of runs,
via CLI, dashboard or a monitoring system.
Consider including folks that also work on CLI and dashboard.
-->

### Performance (optional)

<!--
Consideration about performance.
What impact does this change have on the start-up time and execution time
of task and pipeline runs? What impact does it have on the resource footprint
of Tekton controllers as well as task and pipeline runs?
Consider which use cases are impacted by this change and what are their
performance requirements.
-->

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
If it's helpful to include workflow diagrams or any other related images,
add them under "/teps/images/". It's up to the TEP author to choose the name
of the file, but general guidance is to include at least TEP number in the
file name, for example, "/teps/images/NNNN-workflow.jpg".
-->

## Test Plan

<!--
**Note:** *Not required until targeted at a release.*
Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.
All code is expected to have adequate tests (eventually with coverage
expectations).
-->

## Design Evaluation
<!--
How does this proposal affect the reusability, simplicity, flexibility
and conformance of Tekton, as described in [design principles](https://github.com/tektoncd/community/blob/master/design-principles.md)
-->

## Drawbacks

<!--
Why should this TEP _not_ be implemented?
-->

## Alternatives

<!--
What other approaches did you consider and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

## Infrastructure Needed (optional)

<!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, github details. Listing these here allows a
SIG to get the process for these resources started right away.
-->

## Upgrade & Migration Strategy (optional)

<!--
Use this section to detail whether this feature needs an upgrade or
migration strategy. This is especially useful when we modify a
behavior or add a feature that may replace and deprecate a current one.
-->

## References (optional)

<!--
Use this section to add links to GitHub issues, other TEPs, design docs in Tekton
shared drive, examples, etc. This is useful to refer back to any other related links
to get more details.
-->
1 change: 1 addition & 0 deletions teps/README.md
@@ -185,3 +185,4 @@ This is the complete list of Tekton teps:
|[TEP-0059](0059-skip-guarded-task-only.md) | Skip Guarded Task Only | proposed | 2021-03-24 |
|[TEP-0061](0061-allow-custom-task-to-be-embedded-in-pipeline.md) | Allow custom task to be embedded in pipeline | implementable | 2021-04-28 |
|[TEP-0063](0063-workspace-dependencies.md) | Workspace Dependencies | proposed | 2021-04-23 |
|[TEP-0065](0065-retry-failed-tasks-on-demand.md) | Retry failed tasks on-demand in a pipeline | proposed | 2021-05-07 |
