TEP-0065: Retry failed tasks on demand in a pipeline #422

ScrapCodes · 2021-05-07T10:46:32Z

Proposing only the problem statement.

tekton-robot · 2021-05-07T10:46:34Z

@ScrapCodes: GitHub didn't allow me to request PR reviews from the following users: Tomcli.

Note that only tektoncd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Proposing only the problem statement.

/cc @Tomcli @afrittoli

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Tomcli

Thanks @ScrapCodes I updated the use cases to give a little bit more background for KFP.

teps/0065-retry-failed-tasks-on-demand.md

vdemeester

@ScrapCodes Thanks for the TEP. This is a really tricky problem/subject and one that vaires a lot depending on the users and the pipeline they run.

Running naively only the Task in a Pipeline that failed is a really britle and error-prone I think.

Let's take a example, and let's say I have a pipeline that build an image from source-code, provision a cluster, push the image to the cluster internal registry, run some test, and clean the cluster at the end (in a finaly). If the "run some test" task failed, re-running only that task would fail instantaneously as, the cluster is gone, we did not provision a new one, …

How do we know which task we need to re-run in addition to the failed one ?
We could think of adding a way to declare such required dependency, so that we can re-run all the required task before the one that failed and that we want to re-run. This, however, add more complexity and verbosity to the API, for something that "might" happen from time to time.

Taking the example of kubernetes/kubernetes (from the TEP), I understand the need, but as commented, the /retest re-running only the failed jobs is something that is handled on its own, by a specific component. In addition, to compare prow's /retest with Tekton we need to make some assumption: /retest re-run jobs (failed one) with the assumption (it's a given) that a job stands on its own, aka does not depend on anything else than itself. If we were to compare, each prow job should be seen either as a single TaskRun (running one-off task) or a PipelineRun (running a pipeline). And thus, one kubernetes/kubernetes, if you have 10 jobs (that generate 10 ci status), you would have 10 pipelines ; re-running the failed one works just fine as each pipeline are standing on their own. Even better, the component that handles /rerun for prow could be re-used for tekton pipelines as well (and even if not using prow, a component that handle this, could work with a lot of different construct in addition to tekton pipelines).

vdemeester · 2021-05-10T10:10:13Z

teps/0065-retry-failed-tasks-on-demand.md

+TEP, we are exploring the benefits of adding a new API `retry` which will allow a
+user to "on-demand", retry a failed pipeline run. In other words, 
+`rerun failed tests` from CI/CD world.


What do we mean by a new API here ? In tektoncd/pipeline or tektoncd/triggers, … we don't really control what we expose, it's a standard REST API, completely managed by k8s api server. In this TEP, do we want to create a new component or do we want to add new fields to the current object in the current components ?

At this point, we would like to focus on the problem statement itself, so will change this. It is not yet clear, what it would look like yet, it can be covered as part of proposal/design discussion.

We can implement it similar to "cancel" and "pending" on pipelinerun where we change the spec.status from failed to retry to trigger this action.

vdemeester · 2021-05-10T10:13:30Z

teps/0065-retry-failed-tasks-on-demand.md

+For example, At present `/retest` at kubernetes/kubernetes repo reruns only
+the failed jobs, a new api for retrying failed `pipelineRun` will give out
+of the box support. Hope they use `tektoncd` as backend for their CI at some
+point.


Note that this is achieve in kubernetes/kubernetes (and in any other project using prow ) by a prow plugin, and not directly by the "controller" of the jobs. What this means is that it is not defined in any job configuration by the user.

vdemeester · 2021-05-10T10:27:16Z

teps/0065-retry-failed-tasks-on-demand.md

+Without this support, at present a pull-request author or reviewer has to
+individually, mark tests for rerun.


I don't really follow this too. As commented a bit, this is something that can (should probably) be handled externally from pipeline.

Will change this as per feedback !

afrittoli

Thanks for this TEP!
I think it's a very interesting discussion to have, and things discussed here might be relevant for other features too, such as checkpointing of pipelines partial state and partial execution of pipelines.
Some comments inline.

afrittoli · 2021-05-08T11:52:19Z

teps/0065-retry-failed-tasks-on-demand.md

+For example, At present `/retest` at kubernetes/kubernetes repo reruns only
+the failed jobs, a new api for retrying failed `pipelineRun` will give out
+of the box support. Hope they use `tektoncd` as backend for their CI at some
+point.


We do use Tekton to run some of our CI job in Tekton today.
And we do support retest of a single CI job through Tekton today.

A failure in any task in a pipeline today causes the entire pipeline to fail and to stop, so we need dedicated pipelines for each CI jobs to keep the CI jobs independent from each other. Besides, this approach helps with the notifications to GitHub.

This proposal might work nicely together with TEP 0050 - task failure, since that would allow a failed task to not fail an entire pipeline.

afrittoli · 2021-05-08T11:56:57Z

teps/0065-retry-failed-tasks-on-demand.md

+2. A pipeline may either have failed due to some failures in the tasks or may
+   be user invoked cancel request. Retry only the failed/canceled tasks for a
+   failed `pipelineRun`.
+3. Document the feature in the tekton documentation, explaining the use cases.


Documentation is definitely something we need to do, and should be part of the TEP.
I don't think it's a goal of this work though?

will remove redundant goal

afrittoli · 2021-05-08T12:00:06Z

teps/0065-retry-failed-tasks-on-demand.md

+
+Presently, a pipeline has a mechanism for `retry` a task, which a pipeline author
+can configure at the time of creation of a `Pipeline` or a `PipelineRun`. In this
+TEP, we are exploring the benefits of adding a new API `retry` which will allow a


NIT: Whether this will require a new API or not feels like a design decision / implementation detail. I think we should focus first on identifying the functionality that we want to provide, goals and use cases.

Will fix this, instead of explicitly stating API, we can say mechanism. And at the time of design discussion, we can define what that mechanism will look like precisely.

afrittoli · 2021-05-08T12:02:16Z

teps/0065-retry-failed-tasks-on-demand.md

+user to "on-demand", retry a failed pipeline run. In other words, 
+`rerun failed tests` from CI/CD world.


I'm not sure this is 100% fitting with the way we run pipelines in Tekton.
If a task in a pipeline fails, all tasks currently running finish running and then no new task is scheduled.

In this example:

A (ok) ---> B (skipped) C (failed) ---> D (skipped)

My expectation for "retry failed pipeline" would be that (A) is skipped because it was successful before, so (B) and (C) start right away, and (D) is executed if (C) is successful.

afrittoli · 2021-05-10T10:21:15Z

teps/0065-retry-failed-tasks-on-demand.md

+to flakiness of jobs itself. In this case, simply retrying `n` number of times
+does not seem to help with optimal resource consumption.


For CI use cases at least, unless we're 100% sure of that the different parts of the pipeline are perfectly independent, retrying a part of a pipeline might be a not a good practise, as it could lead to success for a pipeline that would normally have failed.

In case of a simple pipeline (A) ---> (B), (A) may create some "side-effect" state in the test cluster that will not be there if we execute (B) alone.

I think this would be very useful for pipelines that tasks that consume a lot of resources (CPU, network bandwidth or else). I suspect this might be implemented as an opt-in behaviour (even in the long run) for pipelines, as persisting the intermediate pipeline status might be expensive, both in terms of execution time as well as storage.

retrying a part of a pipeline might be a not a good practise, as it could lead to success for a pipeline that would normally have failed.

This could happen, if the pipeline author writes tasks unaware that they could be retried on a failure later. E.g. in your example, if (B) has a if check, that relied on the side effect that (A) produced and if A's side effect has somehow vanished. Then (B)'s course of execution that leads to success could be changed.

It is possible, that new course of action is desirable by the actor who retried. In other words, the responsibility of outcome, is shared between, the author of the pipeline/task and the actor who retried.

tektoncd's liability is limited to naively retrying failed tasks, if that leads to failure due to missing dependencies etc.. then, those errors are reported as is.

Therefore, opt-in behaviour makes perfect sense.

afrittoli · 2021-05-10T10:23:24Z

teps/0065-retry-failed-tasks-on-demand.md

+   Allow setting the status of failed pipeline to something like:
+   [TEP-0015 Pending](0015-pending-pipeline.md).


Could you elaborate a bit more on this? I'm not sure understand how this would work?

This is more suited for our discussion during design, may be we can drop this section for now (i.e. current stage of discussing the problem statement).

afrittoli · 2021-05-10T10:25:07Z

teps/0065-retry-failed-tasks-on-demand.md

+1. Retry of successful pipeline runs or anything other than a failed pipeline
+   run.


I suspect that implementing this will require us to implement features like checkpointing of a pipeline, that might be used to re-run a pipeline or parts of a pipeline even if it did not fail, but I think it's fine to restrict the focus of the TEP to the failed case only.

Focus of our TEP is retry failed tasks only.

afrittoli · 2021-05-10T10:49:24Z

teps/0065-retry-failed-tasks-on-demand.md

+3. Discuss checkpointing of pipeline state or workspaces, etc. A `pipelineRun`'s
+   state stored in etcd is used as is.


For this feature to work, the pipeline and taskruns in the retry execution will need access to the state of the previous run, so in case any of the state is deleted, the re-run will fail:

writable workspaces needs to be still available:

those provisioned by tekton will stick around until the pipelinerun is not deleted

those provided by users are not controller by tekton

results will be store in etcd in the original pipelinerun

Even workspaces + results might not be enough though, since the initial tasks or a pipeline might setup state outside of the workspace (e.g. create a test cluster, provision a test service), and that state will be usually cleaned-up by the finally part of the pipeline. If the URI of the resources provisioned by initial tasks is exposed as results, those URIs can be passed to the new execution and if still valid the new execution could succeed.

We put checkpoint as non-goal for now because we understand storing the state of the pipeline could be very complicated. Thus in this TEP we want to focus on how to manually retry the same pipelinerun just similar to how users cancel the pipelinerun.

If the information in etcd is not enough to retry the pipelinerun, then we may consider adding checkpoint as one of the requirements.

ScrapCodes · 2021-05-10T11:38:06Z

@ScrapCodes Thanks for the TEP. This is a really tricky problem/subject and one that vaires a lot depending on the users and the pipeline they run.

Thank you @vdemeester :)

Running naively only the Task in a Pipeline that failed is a really britle and error-prone I think.

Let's take a example, and let's say I have a pipeline that build an image from source-code, provision a cluster, push the image to the cluster internal registry, run some test, and clean the cluster at the end (in a finaly). If the "run some test" task failed, re-running only that task would fail instantaneously as, the cluster is gone, we did not provision a new one, …

I guess if a pipeline has a clean up in the finally, then rerun is ought to fail with suitable error. And that is expected, as a outcome of attempting a retry.

How do we know which task we need to re-run in addition to the failed one ?
We could think of adding a way to declare such required dependency, so that we can re-run all the required task before the one that failed and that we want to re-run. This, however, add more complexity and verbosity to the API, for something that "might" happen from time to time.

Similarly, at present, a task may miss an input, which was let's say the output of the previous task. Then, it is expected that the retry will fail as well. Idea of having a retry as on-demand, is a user makes a manual judgement and invokes a retry. It is part of Non goal, to manage the state of the failed pipeline by tektoncd.

Taking the example of kubernetes/kubernetes (from the TEP), I understand the need, but as commented, the /retest re-running only the failed jobs is something that is handled on its own, by a specific component. In addition, to compare prow's /retest with Tekton we need to make some assumption: /retest re-run jobs (failed one) with the assumption (it's a given) that a job stands on its own, aka does not depend on anything else than itself. If we were to compare, each prow job should be seen either as a single TaskRun (running one-off task) or a PipelineRun (running a pipeline). And thus, one kubernetes/kubernetes, if you have 10 jobs (that generate 10 ci status), you would have 10 pipelines ; re-running the failed one works just fine as each pipeline are standing on their own. Even better, the component that handles /rerun for prow could be re-used for tekton pipelines as well (and even if not using prow, a component that handle this, could work with a lot of different construct in addition to tekton pipelines).

This is correct, somehow, I could not communicate, that this support is to be used to invoke retry failed tasks manually. And thus, intention is not to make somehow the failed pipeline pass. If for example, a task run is not of the type "one-off", and needs some dependencies to be met before it can begin, then it is well within the reason that it fails again if those dependencies are not met. The idea here is, suppose the failure was merely due to a fetch failure or something trivial, then the entire progress of pipeline which was very resource consuming is not lost. And because such reattempts could be unsuccessful in some cases (perpetually), does it indicate that we should not support retry at all?

Somehow some pipelines are very resource consuming, for example they may be running some stress tests or benchmarks and may or may not have interdependency between each other. The manual invocation of retrying the failed tasks can save us the resources that may have consumed, if the whole thing has to be rerun. For example, after running a set of benchmarks, losing all the progress because there was connect failures while executing the taskRun that posted the result. A retry would fix that. ( A very crude example, just to illustrate the point, actually one would normally add retry for such a step.)

Starting a pipeline all over again from the last run's configuration is already supported by tektoncd cli (tkn pipeline start --use-pipelinerun string ).

ghost · 2021-05-10T12:48:47Z

/kind tep

ScrapCodes · 2021-05-10T14:51:15Z

I will update the TEP soon, on the basis of the discussions above.

chhsia0 · 2021-05-10T16:16:11Z

/assign @afrittoli

chhsia0 · 2021-05-10T16:16:31Z

@vdemeester Would you like to be assigned to this TEP?

tekton-robot · 2021-05-11T10:34:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign afrittoli
You can assign the PR to them by writing /assign @afrittoli in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

teps/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ScrapCodes · 2021-05-11T11:12:46Z

Hello @afrittoli and @vdemeester , thank you for reviewing the TEP and helping with the feedback, all of that was helpful in steering the direction that this TEP will move.

I have tried to apply the feedback and updated the TEP, please take a look again?

bobcatfish · 2021-05-17T16:18:19Z

/assign

bobcatfish · 2021-05-21T16:34:12Z

hey @ScrapCodes !! I want to +1 to @afrittoli 's point about how this is similar to partial pipeline execution (#422 (review))

Partial pipeline execution (tektoncd/pipeline#50) is one of our oldest issues and is in our roadmap and @jerop previously created a proposal around disabling a task in a pipeline which at the time we tabled b/c not enough folks were convinced that we needed to tackle the underlying problem, but im wondering if being able to partially execute a pipeline would solve the problem you describe. for example imagine a scenario like this:

you run a long running pipeline with "pipelinerun bar", "task foo" fails
via a command line tool (e.g. tkn) you could run something like tkn pipelinerun copy --skip-failed bar or something (design very tbd haha) which would look at the pipelinerun bar and generate a new pipelinerun with the same pipeline but disabling the failed tasks

so TL;DR i agree with the problem statement you are describing, but also im thinking we could include the requirements and use cases from tektoncd/pipeline#50 and the disabling a task in a pipeline proposal (it even mentions a use case in the problem statement which sounds very much like yours!! "When running a very long Pipeline, there may be a transient failure and the user may want to only rerun the Tasks that failed..."), what do you think?

ScrapCodes · 2021-05-22T06:43:59Z

Hello @bobcatfish, thanks for linking that discussion(shame on me, that I was unable to find it), it does seem to be a prior art 😄, and use cases can be borrowed too.

AFAIU, there is one subtle difference in our requirements, when we say partial pipeline execution or checkpointing, it does seem that tekton's liability is slightly more than just naively rerunning a task. That brings in the issue of what happens to all the other things(pods,pvcs,... which may have vanished even if pipeline state is intact) that a tasks depends on, when it is started at a later time than the rest of the pipeline. Managing all that is beyond our scope, because a task can be written in so many ways.

However, in this TEP's case, e.g. a pipeline is running a task which "supports retry" , and it fails. Since, it supports retry the liability is of the "pipeline/task" author to ensure that it can be retried on a failure. In this case, tekton has a limited liability of rerunning the task naively. In this way, we can support a reasonably common use case, without solving a very hard problem.

Do you think this subtle difference make sense?

Disabled task, is an interesting topic, can I suggest it is significant enough to stand on it's own as an independent TEP. I can work on it, if you like it that way. But, by keeping the surface of this TEP a bit constrained, we may support a narrower set of use cases, and at the same time easier(may be quicker, could be included in current release) to get feedback, approval and implement as well. Currently retries are already supported, we just need to support an on-demand invocation of the same. Copying existing pipeline run with some tasks disabled, can this be achieved by skipping the task with some when expression as well? For those not using tkn cli, it can be some extra work to copy all the pipelineRun tasks that failed.

I may be wrong in my understanding, just sharing so that I can be corrected. Since you have already worked and invested a lot of thoughts on this, would you and/or @jerop like to co author this?

afrittoli · 2021-07-02T13:08:35Z

Oh, indeed, we arbitrarily delete history, I did not expect that.
I think the reasoning behind that is that the TaskRun status is stored in the PipelineRun before we reset it.

you mean, history is preserved in RetriesStatus field.

Oh, I see, RetriesStatus is part of the TaskRun itself, I got it wrong again 😅
Thank you for clarifying!

ScrapCodes · 2021-07-02T13:36:54Z

@afrittoli Would like to take another look! Even if your overall stance remain unchanged, you can resolve those comments that are clarified 😄

vdemeester · 2021-07-02T14:04:38Z

I think I'm convinced that it makes sense to introduce a "retry by resuming after failure" feature at the level of a pipelinerun!
[…]
I'm thinking we could still use partial execution to make this happen

I am still not convinced this is at the pipelinerun level we need this, even at all tbh. This feature can be "done" through well written tasks and / or might not even be needed in most cases. Point being, the question we have to ask ourselves is how much do we want to add complexity to the Pipeline/PipelineRun reconciler for what gain ? (how many user will benefit from this, how hard will it be to use this feature, will there be requirement for this to work correctly, …)

How do we ensure we re-run only the tasks that need to be run ?
Does this require task to be idempotent to be able to skip them ?
How can we mark some task as not skipable ?
Can the use cases this TEP solves be handled better with a different design ? multiple pipeline, event driven, … I think the ML use case could be solved that way. On multiple pipeline, event driven, etc. I see a parallel with gating of pipeline for example.
Related to above, are we solving this because we are writing huge pipeline that do a lot (and thus when something is flakey in the middle, all have to run again) ?

The more I think about it, the more I am convinced this could / should be tackle by a higher construct, something like Tekton Workflows #464 for example. I'd rather keep Pipeline as simple and flexible as it can be, and build on top 👼🏼 .

bobcatfish · 2021-07-02T19:41:47Z

Oh, indeed, we arbitrarily delete history, I did not expect that.

haha me too @afrittoli - i was surprised to realize how taskrun retries were implemented XD

Perhaps @joseblas or @bobcatfish or @abayer have more context on the reason behind that choice?

im guessing it was just b/c it was easy - i dont remember realizing at the time how this was actually being implemented. (this isnt the only case of taskrun behavior being added just to support being executed via pipelinerun, i.e. a hidden coupling b/w taskruns and pipelineruns, for example tektoncd/pipeline#1069)

The more I think about it, the more I am convinced this could / should be tackle by a higher construct, something like Tekton Workflows #464 for example.

I like that idea a lot @vdemeester !! maybe implementing this in workflows is the way to go.

For the questions you raised re. how this functionality would work, I'd like to point again toward tektoncd/pipeline#50 and the disabling/partial execution proposal - we'd address these questions by adding an interface for explicitly disabling tasks in a pipeline, with the ability to explicitly provide params and workspaces as needed.

Somehow, I could not understand the benefit of having partial execution as part of the proposal. I can still add it, do you think we can add?

@ScrapCodes the reason we need partial execution is b/c this is the feature that this proposal is requesting.

From my understanding (let me know if this is wrong) this proposal is saying:
a. 'i need a way to execute just part of my pipeline (i.e. the failed tasks and everything after it)'
b. also 'i want to be able to do this within the context of one pipelinerun'

The proposal is combining these together into one feature request, where the functionality in (a) is implicit - i.e. the user says "retry this pipelinerun from the point it fails at" and then gets (a) automatically.

I'm suggesting we start by adding the functionality in (a) as part of the pipelinerun interface - because we're going to need it regardless of whether or not this is a feature of pipelineruns - and if this going to be a feature of a layer above pipelineruns (such as workspaces), we absolutely need it to be explicit.

For example, let's say we want to use this feature for this example pipeline. The pipeline looks kinda like this:

fetch-from-git: provides 'output' workspace with git source
skaffold-unit-tests: uses 'output' workspace
build-skaffold-web & build-skaffold-app run after the unit tests, both also use the workspace and build images, providing IMAGE_DIGEST results
deploy-web & deploy-app use IMAGE_DIGEST results from the build tasks (and also the output workspace)

Let's say we had 'retry from failure' for this pipeline, and build-skaffold-web failed, but build-skaffold-app succeeded. The PipelineRun would have a state like this:

Succeeded tasks: fetch-from-git, skaffold-unit-tests, build-skaffold-app
Failed tasks: build-skaffold-app
Tasks that didn't run: deploy-web (its a bit of a race, if build-skaffold-app was faster than build-skaffold-web, it may have run), deploy-app

What would happen when the task is re-run from the failure point?

The resulting run would only partially execute the pipeline:

it wouldn't run the tasks the succeeded
it would need to resume from a state that looks like the tasks that succeeded had all just completed, then determine what to run next (in this case, build-skaffold-app and probably deploy-web as well)
it would need to provide all of the results + workspace contents that were provided by the successful tasks

Let me know if this still isn't clear: what im trying to say is that the requirements for this feature include the requirements for partial execution, so we're going to need that feature either way.

Tomcli · 2021-07-03T00:57:18Z

@ScrapCodes Maybe we can break it down into multiple smaller TEPs. In our original TEP, one of the key features we want to get out of this TEP is to retry individual task on demand in a pipeline. It seems like the current task retry is not a very good implementation, so we can use a new TEP to focus on that.

As of some other features like partial pipeline execution would be helpful but it could get very complicated on figure out the correct I/O if the partial pipeline is running on a new pipelineRun. So I would suggest let's break it down into these steps.

TEP for how to make individual task to retry on demand. Also figure out what should be the proper way to retry a task.
TEP for updating the pipelinerun reconciler to resume the pipeline status from failed to running if a taskRun is retried on demand.
TEP for partial pipeline
TEP for pipeline retry where the pipeline controller is responsible for figuring out which failed tasks need to be trigger retry.

From KFP-Tekton, we care more about 1 and 2 from above because most of our use cases are to retry one or two tasks on demand. 3 and 4 is more like nice to have to improve the user experience, but we don't really need it in most scenarios.

ScrapCodes · 2021-07-08T14:57:23Z

My understanding is to begin with 3. and then work on 1 & 2. Soon, I will start the TEP for it.

Thanks @jerop and @bobcatfish, since you have already worked on it, would you like to co-author / author? (@bobcatfish has already indicated her preference.) 😄

bobcatfish · 2021-07-08T18:14:39Z

Sounds like a good approach to me @ScrapCodes !! @Tomcli I'm a bit less clear on definitely pursuing 1 + 2 as you described but like you said maybe they warrant their own TEPs and I'll be able to understand the need a bit better if we pursue them separately.

@ScrapCodes I'm very happy to co-author as long as I'm not driving it haha XD If you want to collaborate in a Google doc or via hackmd I'm happy to contribute!

ScrapCodes · 2021-07-09T11:14:54Z

Added comments to: Disabling a Task in a Pipeline

Tomcli · 2021-07-09T19:13:19Z

Sounds like a good approach to me @ScrapCodes !! @Tomcli I'm a bit less clear on definitely pursuing 1 + 2 as you described but like you said maybe they warrant their own TEPs and I'll be able to understand the need a bit better if we pursue them separately.

@ScrapCodes I'm very happy to co-author as long as I'm not driving it haha XD If you want to collaborate in a Google doc or via hackmd I'm happy to contribute!

Thanks @bobcatfish, 1 + 2 is the building blocks for how to retry a taskRun on demand. There are some discussions on how the taskRun retry could be done better, so we think is better to discuss that on a separate TEP.

Once we agreed on how taskRun retry should be done and the corresponding reconcile logics, we can apply them to 3 + 4 since pipelineRun retry needs to retry the failed taskRun on demand.

jerop · 2021-07-26T15:20:07Z

@ScrapCodes @Tomcli is this TEP on hold as we explore TEP-0077: Partial pipeline execution, or are you pursuing both of them separately?

afrittoli · 2021-07-26T16:07:56Z

/hold

Tomcli · 2021-07-26T16:54:42Z

@ScrapCodes @Tomcli is this TEP on hold as we explore TEP-0077: Partial pipeline execution, or are you pursuing both of them separately?

Yes we are breaking down this TEP into smaller pieces, so we can keep the TEP on hold until we solve all the prerequisites.

pritidesai · 2021-08-02T16:16:59Z

on hold while we persue TEP-0077

tekton-robot · 2021-10-31T16:18:27Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot · 2021-11-30T16:31:27Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

dibyom · 2021-12-06T17:13:12Z

@ScrapCodes are you still actively working on this TEP? Or should we close this for now and re-open it later?

ScrapCodes · 2021-12-09T09:36:39Z

@dibyom , I am currently not pursuing this TEP. Happy to close it for now.

tekton-robot requested a review from afrittoli May 7, 2021 10:46

tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 7, 2021

ScrapCodes force-pushed the tep-65 branch 4 times, most recently from a32168c to 2721ae0 Compare May 7, 2021 12:30

Tomcli reviewed May 7, 2021

View reviewed changes

teps/0065-retry-failed-tasks-on-demand.md Outdated Show resolved Hide resolved

teps/0065-retry-failed-tasks-on-demand.md Outdated Show resolved Hide resolved

teps/0065-retry-failed-tasks-on-demand.md Outdated Show resolved Hide resolved

vdemeester reviewed May 10, 2021

View reviewed changes

afrittoli reviewed May 10, 2021

View reviewed changes

tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label May 10, 2021

ScrapCodes force-pushed the tep-65 branch from 9d849a9 to c2c7bae Compare May 10, 2021 14:39

tekton-robot assigned afrittoli May 10, 2021

ScrapCodes force-pushed the tep-65 branch from c2c7bae to 13f6f2d Compare May 11, 2021 10:34

tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2021

ScrapCodes force-pushed the tep-65 branch from 13f6f2d to 7841b91 Compare May 11, 2021 10:37

tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2021

ScrapCodes force-pushed the tep-65 branch from 7841b91 to d62425e Compare May 11, 2021 10:39

ScrapCodes force-pushed the tep-65 branch from d62425e to 7bac645 Compare May 17, 2021 05:56

tekton-robot assigned bobcatfish May 17, 2021

ScrapCodes force-pushed the tep-65 branch from ed2987c to bee2cf1 Compare July 2, 2021 13:17

Review feedback.

4029214

ScrapCodes force-pushed the tep-65 branch from bee2cf1 to 4029214 Compare July 2, 2021 13:29

bobcatfish mentioned this pull request Jul 20, 2021

TEP-0069: Support retries for custom task in a pipeline. #441

Merged

tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 26, 2021

bobcatfish mentioned this pull request Aug 10, 2021

TEP-0077: Partial pipeline execute. #484

Closed

tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2021

tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 30, 2021

ScrapCodes closed this Dec 9, 2021

jerop mentioned this pull request Feb 13, 2023

Feature to rerun task tektoncd/pipeline#6152

Closed

		Without this support, at present a pull-request author or reviewer has to
		individually, mark tests for rerun.

		user to "on-demand", retry a failed pipeline run. In other words,
		`rerun failed tests` from CI/CD world.

		to flakiness of jobs itself. In this case, simply retrying `n` number of times
		does not seem to help with optimal resource consumption.

		Allow setting the status of failed pipeline to something like:
		[TEP-0015 Pending](0015-pending-pipeline.md).

		1. Retry of successful pipeline runs or anything other than a failed pipeline
		run.

		3. Discuss checkpointing of pipeline state or workspaces, etc. A `pipelineRun`'s
		state stored in etcd is used as is.

TEP-0065: Retry failed tasks on demand in a pipeline #422

TEP-0065: Retry failed tasks on demand in a pipeline #422

Conversation

ScrapCodes commented May 7, 2021

tekton-robot commented May 7, 2021

Tomcli left a comment

Choose a reason for hiding this comment

vdemeester left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

afrittoli left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ScrapCodes commented May 10, 2021 • edited Loading

ghost commented May 10, 2021

ScrapCodes commented May 10, 2021

chhsia0 commented May 10, 2021

chhsia0 commented May 10, 2021

tekton-robot commented May 11, 2021

ScrapCodes commented May 11, 2021

bobcatfish commented May 17, 2021

bobcatfish commented May 21, 2021

ScrapCodes commented May 22, 2021 • edited Loading

afrittoli commented Jul 2, 2021

ScrapCodes commented Jul 2, 2021

vdemeester commented Jul 2, 2021

bobcatfish commented Jul 2, 2021 • edited Loading

Tomcli commented Jul 3, 2021

ScrapCodes commented Jul 8, 2021

bobcatfish commented Jul 8, 2021

ScrapCodes commented Jul 9, 2021 • edited Loading

Tomcli commented Jul 9, 2021

jerop commented Jul 26, 2021

afrittoli commented Jul 26, 2021

Tomcli commented Jul 26, 2021

pritidesai commented Aug 2, 2021

tekton-robot commented Oct 31, 2021

tekton-robot commented Nov 30, 2021

dibyom commented Dec 6, 2021

ScrapCodes commented Dec 9, 2021

ScrapCodes commented May 10, 2021 •

edited

Loading

ScrapCodes commented May 22, 2021 •

edited

Loading

bobcatfish commented Jul 2, 2021 •

edited

Loading

ScrapCodes commented Jul 9, 2021 •

edited

Loading