-
Notifications
You must be signed in to change notification settings - Fork 221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TEP-0015 - Add a pending setting to Tekton PipelineRun and TaskRuns #203
Conversation
teps/0015-pause-pipeline.md
Outdated
The primary motivation for this is to support operators who need to control the | ||
start up of PipelineRuns and TaskRuns in clusters that are under heavy load. | ||
By allowing PipelineRuns to be created without actually starting them, one | ||
could implement a custom controller to implement any PipelineRun scheduling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One can already implement custom scheduling today at the Kubernetes level, which would let you prioritize Pod startup, which is roughly equivalent to TaskRun startup.
I think I'd want to see more motivation about why we should implement this at the Tekton level as opposed to at the Kubernetes level, where there's already been a ton of work done to support basically anything you might want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do want to use the advanced scheduling options in Kubernetes, in addition to some other pipeline-aware features.
My primary use-case here is to allow us to create a parallelism limit for PipelineRuns. We will also utilize pod priority to order the tasks in the pipelines that do run, which will help somewhat (also this is not something we can do until Kubernetes 1.19, as currently it is not possible to disable pre-emption). But ordering of pods that run and limiting the actual pipelines that are submitted to the scheduler are separate use-cases.
The issue that we are trying to solve is that when you create a new PipelineRun, if there are many other PipelineRuns with complex pipelines you can have many PipelineRuns competing over the same resources - causing the scheduler to be saturated and for individual tasks to take a very long time to start.
For example, if we were to use ResourceQuotas to limit the number of pods in a namespace, we only limit the number of tasks that can run simultaneously, not the number of pipelines.
So we want to be able to limit the number of PipelineRuns that are started at any given time. We do not want to implement the actual parallelism limit at the Tekton level because we want to have custom limits for the PipelineRuns (e.g., based on the namespace or for a particular repository's pipelines), so this feature allows us to submit inert PipelineRuns and then have Tekton act on them once we're ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this design is more complex than I intended and kind of obfuscates my intentions. The only feature that I need here is to be able to create inert PipelineRuns and start them later. The rest about pausing TaskRuns and supporting pausing after the PipelineRun has already been started is only because those features seemed natural to include at the same time for consistency/least surprise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's start with the minimal use case you want to add then when we get to an agreement about it we can discuss how to extend it. Going the other way can lead to unused features that were "easy to add" that add support burden.
My main question is: Why not have whatever system is creating-and-unpausing PipelineRuns just hold off on creating them until they're safe to run? TaskRuns/Pods that don't have schedulable resources will already wait until resources become available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay cool, I'll slim the scope of this document and update it.
My main question is: Why not have whatever system is creating-and-unpausing PipelineRuns just hold off on creating them until they're safe to run? TaskRuns/Pods that don't have schedulable resources will already wait until resources become available.
I tried to address this in the alternatives section, but basically this would require us to have a non-Kubernetes-native data store (say, Redis) that would be a lot harder for our users to have visibility into or to create a CRD that is an abstraction over the PipelineRun, which we are hesitant to do since it would add quite a bit of complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I said previously, we need to be able to do our own custom limits, not related strictly to PipelineRun
What kinds of limits are you considering? It sounded like from #203 (comment) it was user-defined concurrency limits (I realize now this presumably means end-users, not Operators, that changes requirements a bit)
If this is supported already I'm certainly happy to use it.
This is actually possible today, I think. You could have a mutating admission webhook force new PipelineRuns into the Succeeded=Unknown
state (so Tekton doesn't try to start them immediately) and a reconciling controller with permission to update PipelineRun statuses to push them into a running state, at which point Tekton's normal reconciliation would take over and start any TaskRuns.
This should be prototypeable today, though I don't think anybody's done anything like it yet. If you're interested in exploring it I'd be happy to help, and if you find any bugs while investigating I'd be happy to fix them.
This is a disadvantage not a benefit.
Eye of the beholder I suppose 😆 . The thought of multiple systems separately judging the same pool of PipelineRuns seems like a real mess to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually possible today, I think. You could have a mutating admission webhook force new PipelineRuns into the Succeeded=Unknown state (so Tekton doesn't try to start them immediately) and a reconciling controller with permission to update PipelineRun statuses to push them into a running state, at which point Tekton's normal reconciliation would take over and start any TaskRuns.
Yeah, I'll try this out and report back. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kinds of limits are you considering? It sounded like from #203 (comment) it was user-defined concurrency limits (I realize now this presumably means end-users, not Operators, that changes requirements a bit)
Depends on use cases, the granularity of applying the limits may not be aligned with namespaces. For example, some may want to set concurrency limits per service account or per "repository."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like it supports this status yet as it immediately runs:
status:
conditions:
- lastTransitionTime: "2020-09-15T18:15:29Z"
message: 'Not started yet.'
reason: ConcurrencyCapped
status: "Unknown"
type: Succeeded
I went ahead and narrowed the scope of the document to just PipelineRuns. I didn't change the API interface yet, but if everyone prefers it as a condition I am happy to change it to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it looks like #3223 uses the PipelineRunPaused
setting to tell Tekton to pause, which sets the Status condition to Paused. I'll change the TEP to that.
Re: Kubernetes' many scheduling options:
There's also tektoncd/pipeline#3052 to discuss customizing Tekton's Pod scheduling, which you could design in such a way to give hooks that could enable someone to implement a "pause"-like functionality. |
Thanks for putting this together and explaining the backgorund @jbarrick-mesosphere ! I have 2 main thoughts:
|
teps/0015-pause-pipeline.md
Outdated
|
||
## Alternatives | ||
|
||
* Instead of adding this setting directly to the PipelineRun or TaskRun object, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spit balling here. This might be an opportunity make to make custom interceptor with something like rabbitMQ. So, when a event is made you can broker it yourself. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* PipelineRuns that are not ready to be started could be stored in a
non-Kubernetes-native queue, such as Redis, and only submitted once they are
ready to run. This adds complexity in that a new data store must be run and
it is not easy to provide users visibility into the queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I should clarify. Tekton triggers can take events and map them to pipelineruns. There a list of interceptors that can manipulate payloads and do extra processing. I agree this approach isn't the best solution. I just think it would be easier to schedule the events rather than pipeline/tasks themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just wondering if this is an event issue. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually possible today, I think. You could have a mutating admission webhook force new PipelineRuns into the Succeeded=Unknown state (so Tekton doesn't try to start them immediately) and a reconciling controller with permission to update PipelineRun statuses to push them into a running state, at which point Tekton's normal reconciliation would take over and start any TaskRuns.
This actually just a variation on this comment. nvm. lol
PipelineRuns encodes information that the controller handling concurrencies doesn't need to be aware of and as I pointed out in alternatives we would either have to add new types or introduce a new datastore to accommodate that, both of which add unneeded complexity and make things more complex for the user from a UI perspective.
Our concurrency limit requirements are more complex than should be implemented in Tekton itself, so we need an integration point. |
teps/0015-pause-pipeline.md
Outdated
|
||
This TEP proposes a method for pausing PipelineRuns and TaskRuns that are then | ||
ignored by the Tekton reconciler. These paused resources can be used by platform | ||
implementers to control PipelineRun and TaskRun resources, for example, in cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my case, I need the pause
status so that the user can do some manual work after a certain task has been completed. For example, once the preprocessing task is complete, we'd like to pause the pipeline, do some label work manually, and after that we can cancel the pause
status and let the pipeline run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My case is quite different from @jbarrick-mesosphere 's and my PR is based on this "manual work needed" situation 🤔
teps/0015-pause-pipeline.md
Outdated
|
||
PipelineRuns and TaskRuns can be created paused so that they are not started at | ||
all or, in the case of a PipelineRun, the pause setting can be used to prevent | ||
further TaskRuns from being created. As it is not possible to "pause" a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we only need pause
in PipelineRun
but not TaskRun
since runAfter
only available in PipelineRun
. Once a PipelineRun
is paused, after the completion of the current pipeline, the next pipeline that runs after the previous one will be paused, but other pipelines won't be affected.
teps/0015-pause-pipeline.md
Outdated
|
||
## Alternatives | ||
|
||
* Instead of adding this setting directly to the `PipelineRun` object, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be any problems if the client-go requeue operation is used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, not sure I understand the question
teps/0015-pause-pipeline.md
Outdated
|
||
### Non-Goals | ||
|
||
* Support pausing of a running `TaskRun` as that would requiring pausing the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we really need pause taskrun.
We can watch the resource status of taskrun, just like the design of pipelinerun, if taskrunpaused is set, it will be suspended.
The following process:
watch taskrun ---> find where taskrun(Pod) is on the node --->find Pod StepUp(containerID) ---> find processID by containerID---> use daemonset excute “kill -STOP Processid”.
it is so hacking.
what do you think of this idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that sounds very hacky and I don't have a use-case for that feature myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chhsia0 has mentioned renaming this particular TEP to something like "Pending" instead of "paused" - if we want to pursue a "pause" feature maybe a separate TEP would make sense?
I agree with @FogDong idea, we only pause pipelinerun. BTW,if you want to complete other advanced feature. which should try to open another pr to perfect it. otherwise I think it will be very complicated. |
ping /cc @imjasonh @bobcatfish |
The purpose of this PR is to provide a way for a custom controller to decide when to start a PipelineRun, so users could implement arbitrary scheduling policy. Both of this PR and #222 are trying to resolve resource contentions in the cluster. It doesn't aim for pausing a running PipelineRun. If there's a need for this, let's open another TEP for discussion. In today's API WG, no one opposes this direction, but the term "Paused" is probably a bit misleading. Let's change to another name, e.g., "Pending" maybe? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally in favor of this! And I like @chhsia0 's suggestion to rename it to something like "pending"
I have 1 question: what happens if someone tries to set this value once execution of the Run has started? (I'm thinking we should fail that in the webhook validation, which would prevent this from ever happening)
teps/0015-pause-pipeline.md
Outdated
- "@jbarrick-mesosphere" | ||
creation-date: 2020-09-10 | ||
last-updated: 2020-09-10 | ||
status: draft |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usually these start as "proposed" or go straight to "implementable"
we can definitely merge as "proposed" pretty quickly since we've discussed this a bunch, if you want to go straight to "implementable" maybe we can get a couple more folks to review @tektoncd/core-maintainers
teps/0015-pause-pipeline.md
Outdated
|
||
### Non-Goals | ||
|
||
* Support pausing of a running `TaskRun` as that would requiring pausing the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chhsia0 has mentioned renaming this particular TEP to something like "Pending" instead of "paused" - if we want to pursue a "pause" feature maybe a separate TEP would make sense?
teps/0015-pause-pipeline.md
Outdated
* `PipelineRuns` that are not ready to be started could be stored in a | ||
non-Kubernetes-native queue, such as Redis, and only submitted once they are | ||
ready to run. This adds complexity in that a new data store must be run and | ||
it is not easy to provide users visibility into the queue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another alternative we could add here is instead of adding a field to spec.status
we add a new condition type, e.g. in addition to condition "succeeded" we have a condition "pending" or something
i think that the current proposal is better because I think it reflects the purpose of spec vs. status better:
- spec = the desired behavior (e.g. don't actually run it)
- status = what is actually happening (e.g. it is not running)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree withe the status Pending
.
Actually the way I achieve Paused
is add PipelinePaused
to spec.status
just like the way we achieve PipelineCancelled
. Ref: tektoncd/pipeline#3223
Maybe I should write another TEP and change the name Pause
to Pending
in my PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FogDong tho they are similar I see these as 2 fairly different sets of functionality:
- Pausing Runs = no matter what state the Run currently is in, stop it and don't allow it to progress further (which I think is what Feat: Add pause condition in pipeline run pipeline#3223 is about?)
- Introduce a way to control initial scheduling of runs = Introduce a way to "freeze" Runs before they execute and make it possible for an external system to control when Runs are actually executed (which is what this "pending" TEP is about)
I would see having two different "levers" (e.g. 2 different values in spec.status) to control this. Or even if we use the same spec.status value for both, I think we'd likely approve and move forward with the proposal here, which has a much smaller scope.
To move forward with the pausing functionality we'll probably need to discuss in a further TEP - and especially to understand the use cases around this. (note that a "pause" of sorts could be introduced via a custom task tep 0002
Thanks all, I have updated the TEP to reflect our "pending" setting instead of "pause". I also added a note that the validating webhook should reject updates that set pending after the PipelineRun has started. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal looks reasonable to me! I'm happy to approve from my perspective - need to get some more non-google approvals as well if possible @tektoncd/core-maintainers
/approve
|
||
If the `PipelineRunPending` setting is set after a PipelineRun has started, the | ||
validating webhook should reject the update. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be worth being explicit around how this state relates to the PipelineRun timing out, i.e. does the timeout start counting down while the PipelineRun is still pending or only once it starts running? (I'm guessing it doesn't start until the PipelineRun actually starts running?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pinging this - id like to be explicit about how this interacts with timeouts in the proposal 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timeouts use the .status.startTime
field which is set once the PipelineRun has actually started running, not the PipelineRun creation time, so there is no impact. I have added a note. This is ready for merge now.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bobcatfish The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
teps/0015-pending-pipeline.md
Outdated
|
||
To start a PipelineRun, the operator should set the `Status` field to empty, | ||
the Tekton controller will see the change and proceed with the Pipeline as | ||
normal. Upon starting, the `Succeeded` `Condition` `Reason` will be changed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add more details what "normal" means here?
What is reconciler doing after resuming back? Is there any specific validation reconciler
needs to implement/check to resume back?
Cancellation
is an easy job, cancel task, update status, and be done. But I am guessing resuming
is not. What are the possible pipelineRun states which can allow pipelineRun
going in pending
mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I have tried to clarify this in the document, let me know if it helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Justin, its not clear to me but thats fine, its just me, please squash the commits before we can merge. We can discuss further in the implementation PR.
I am having hard time understanding, when can I send a pipelineRun
in pending mode? As soon as a pipelineRun
is instantiated, it initializes the pr.Status.Conditions
with status
set to unknown
and reason to started
with the start time. Can a started
pipeline be assigned pending
state?
I am assuming here the answer is no, since the pipeline is already marked as started and have started validating resources.
After all the validation is done, pr.Status
is set to running
from started
. How is webhook
validating that the pipelinerun has not been created/running?
When a status field is set to empty (to resume from pending state), what state is being assigned to pr.Status
? started
, running
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pritidesai if this helps, the webhook will validate before the reconciler does anything
the sequence of events that i understand will be (e.g. imagining you're doing something like kubectl create/apply -f pipelinerun.yaml
or a tool is doing the equivalent):
- make call to API server to create PipelineRun
- API server calls out to any validating/mutating webhooks
- If the webhooks indicate the PipelineRun is invalid, the creation fails and an error is returned to the user
- API server stores the pipelinerun in etcd
- API server notifies any watching connections of the creation of the new pipelinerun
- pipeline controller is notified of new pipelinerun, then starts reconcling
At (6), the controller would see the "pending" spec.status and would update the conditions to "unknown pipelinerun is pending"
Later on, you'd actually start the PipelineRun by removing spec.status.pending. At that point this would happen:
- make call to API server to modify PipelineRun
- API server calls out to any validating/mutating webhook
- Removing spec.status.pending is valid, so the mutation is allowed
- API server stores the pipelinerun in etcd
- API server notifies any watching connections of the modification of the pipelinerun
- pipeline controller is notified of modification, then starts reconcling
Then in (6) the controller would see that "pending" had been removed and know to transition from "unknown pipelinerun is pending" to actually starting to execute the pipeline.
After that point, if anyone tried to modify spec.status.pending, the webhook would see in the status of the PipelineRun that it is already running and deny the mutation; the change would never even be stored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a bunch @bobcatfish for the walkthrough 😻
so between 5 and 6, the process watching these connections of creating a new pipelineRun
will have to update the conditions to pending
and looking at the controller logs (also depends on how the monitoring process is designed/implemented and the platform its running on), not much time (in may be part of micro seconds) before a pipelineRun
actually starts running after getting created.
If the PipelineRunPending setting is set after a PipelineRun has been created, the validating webhook should reject the update.
Also, this statement is very confusing, compared to transitioning from 5 to 6, if pipelineRun creation is considered invalid going into pending state, how would the watching process know a pipelineRun
was initiated?
Overall approach seems reasonable. This will be very interesting to see being implemented 😜
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the new service, e.g. let's call it Concurrency Service, will need to add the "pending" label? You're right that the window to do that in would be impossibly small and racy.
Yup, the Concurrency Service having a very small window.
@jbarrick-mesosphere is already saying this but to be totally clear: we cannot rely on a watching service to add the pending label (b/c the window would be so small and racy, if its even possible) - using this feature would require creating PipelineRuns with spec.status.pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a bunch @bobcatfish and @jbarrick-mesosphere
so, run is created in pending mode and watching service is responsible for clearing this label so that the run can be scheduled/executed.
What happens if the pipeline or task definition changes between the time it was sent to pending vs the time it was activated to run? The run continues with the old or new definition? or its discarded saying the definition changed?
When it says the status is set to empty to activate the run, is it the same status
which was used to send it to pending
? In that case, isn't it equivalent to creating a new run from the syntax perspective? of course, the controller will check to create a new run or activate the pending run. I am not objecting this but just making it explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the pipeline or task definition changes between the time it was sent to pending vs the time it was activated to run? The run continues with the old or new definition? or its discarded saying the definition changed?
That's an interesting question - afaik today you can update a PipelineRun in progress and the changes will be picked up and used - e.g. say. you were running a PipelineRun with param X set to "foo", used by Task A, then Task B (which runs after Task A). If, while Task A was running, you changed the value of the param X to "bar", I think Task B would end up using Bar.
I might be wrong but I don't think we have anything in place that would prevent that from happening today, so I think this question applies globally to Run execution and not just to this feature.
In that case, isn't it equivalent to creating a new run from the syntax perspective?
I think it's very similar to "cancelling" a pipeline run, which is also done by modifying the value of spec.status (i.e. we do not treat this as creating a run, we treat this as a way to communicate the desired state of the existing run)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the pipeline or task definition changes between the time it was sent to pending vs the time it was activated to run? The run continues with the old or new definition? or its discarded saying the definition changed?
This is a very good question. Depending on what we wanna do, we could prevent that. At the "creation" time and first "reconciliation" loop, we get the PipelineSpec
(from whatever it comes from) and store it into the status (also known as dereference). We could base the rest of the code and reconciliation loop to look at the status's pipelineSpec (and taskSpec) as source of truth.
This would fix a bunch of potential problems too (that exists today) : what happens when the def of reference change (also applies to bundles), and it would probably reduce api calls (as we would have all we need at hand).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong but I don't think we have anything in place that would prevent that from happening today, so I think this question applies globally to Run execution and not just to this feature.
I agree the specified example is not specific to this feature but applies globally. Yup, we work on modified spec
so the changes are picked up. I am more concerned with one scenario with this feature is permanentError
i.e. Task A existed at the time of creating a run in pending mode but at the time executing, if Task A is missing, controller would exit with missingTaskDef
but the pr.Status.Spec
would actually have that definition. This is something very rare and applies more to this feature.
I like what @vdemeester is suggesting to consider pr.Status.Spec
as a source of truth which is more visible in this feature and bundles
(hopefully not fetching bundles on every reconciliation loop, though I am not able to recall how its implemented).
I think it's very similar to "cancelling" a pipeline run, which is also done by modifying the value of spec.status (i.e. we do not treat this as creating a run, we treat this as a way to communicate the desired state of the existing run)
Right, but the existing proposal does not have any special status to cancel pending state and resume executing the run. spec.Status
is set to cancel
to cancel existing run, spec.Status
is set to pending
to create run in pending state, spec.Status to empty ``, to resume pending run?
Also, something more to consider is how long a pipelineRun
can stay in pending
state? Controller can create many such runs and forget about them. What happens if those runs
are never resumed back? This might seem out of scope here but this can fill up etcd
quickly and can impact the controller.
These are all very important and useful discussions and I appreciate all of your patience 🙏
Overall, I would like understand the state transition/movement in the proposal, otherwise looks good. Also, I do not quite understand the benefit of creating pending |
update motivations text Narrow document scope to just PipelineRuns Specify changing the Succeeded Condition's Reason to Paused. change paused to pending Add clarity on state transitions add note about pipelinerun timeouts
57ef595
to
afdd939
Compare
I've put up a PR to implement this here: tektoncd/pipeline#3522 |
This TEP proposes a method for pausing PipelineRuns and TaskRuns that are then
ignored by the Tekton reconciler. These paused resources can be used by platform
implementers to control PipelineRun and TaskRun resources, for example, in cases
where the cluster is under heavy load and not ready to process a PipelineRun.
PipelineRuns and TaskRuns can be created paused so that they are not started at
all or, in the case of a PipelineRun, the pause setting can be used to prevent
further TaskRuns from being created. As it is not possible to "pause" a
Kubernetes pod, the pause setting would not affect TaskRuns after they are
started.