Adding support to delete completed TaskRun or PipelineRun automatically #1486
Conversation
Add field "ExpirationSecondsTTL" in TaskRunSpec and PipelineRunSpec. If the field "ExpirationSecondsTTL" is set, after the TaskRun finishes, it is eligible to be automatically deleted. When the TaskRun or PipelineRun is being deleted, its lifecycle guarantees will be honored. If this field is unset, the TaskRun won't be automatically deleted. If this field is set to zero, the TaskRun or PipelineRun becomes eligible to be deleted immediately after it finishes. And if taskRun is produced by PipelineRun(checked via ownerReference), this ExpirationSecondsTTL will be ignored, then tekton will check PipelineRun's ExpirationSecondsTTL to determine if PipelineRun should be deleted.
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
Hi @anxinyf. Thanks for your PR. I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Adding taskrun_expired.go and pipelinerun_expired.go, which implement the logic to delete completed TaskRuns or PipelineRuns via the "ExpirationSecondsTTL" field.
/ok-to-test
Thanks for breaking up your PR from #1429!!
Can you describe a bit about what the total set of PRs would look like? One thing I notice from this PR is that it doesn't include the docs and the tests that are present in #1429 - we usually try to include docs and tests with every PR when possible (see development guidelines), so before this particular PR would be mergeable we'd want it to include docs + tests as well.
It's also a bit hard from this PR to get a sense for how some of these pieces would be used, e.g. AddTaskRun.
I'm wondering about a couple of alternative implementations (which maybe you have already considered!):
- Could we use the same logic that handles timing out Runs?
- What about making a totally new controller instead?
- It looks like the controller lib we're using (knative/pkg) has an EnqueueKeyAfter function - this makes me wonder if we could use the same workqueue we're using for everything else to handle cleanup as well? Then it would be the job of our main reconcile function to check if the Run needs to be deleted (see the sketch after this list).
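A rough sketch of that third idea, assuming access to the knative controller.Impl and the proposed ExpirationSecondsTTL spec field; note that the EnqueueKeyAfter signature shown here is the string-key form and may differ depending on the vendored knative.dev/pkg version:

```go
package taskrun

import (
	"knative.dev/pkg/controller"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1alpha1"
)

// scheduleTTLCleanup is a hypothetical helper: when a finished TaskRun has a
// TTL set, re-enqueue its key so the main Reconcile runs again once the TTL
// has elapsed and can decide whether to delete the run.
func scheduleTTLCleanup(impl *controller.Impl, tr *v1alpha1.TaskRun) {
	if !tr.IsDone() || tr.Spec.ExpirationSecondsTTL == nil {
		return
	}
	impl.EnqueueKeyAfter(tr.Namespace+"/"+tr.Name, tr.Spec.ExpirationSecondsTTL.Duration)
}
```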
// ListerSynced returns true if the TaskRun store has been synced at least once.
// Added as a member to the struct to allow injection for testing.
ListerSynced cache.InformerSynced
I'm wondering if there is some way we can avoid adding more attributes and functionality to the reconciler object itself - do you think it might be possible to create a separate object for managing the timeouts, which we could use and test in isolation?
If so then maybe one way to break up #1429 (which might be what you're already going for!) would be to have this PR add that object, complete with unit test coverage, and describe how it is intended to be hooked up (I think this is pretty close to what you're doing already, it's just a bit hard right now to see the overall structure).
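For what it's worth, here is a hypothetical sketch of what such a standalone object could look like; every name below is made up for illustration and is not taken from this PR:

```go
package ttl

import (
	"time"

	"go.uber.org/zap"
)

// Handler owns TTL bookkeeping separately from the Reconciler, so it can be
// constructed with fakes and unit-tested in isolation.
type Handler struct {
	logger *zap.SugaredLogger
	// enqueueAfter re-queues a run key once its TTL is expected to expire.
	enqueueAfter func(key string, after time.Duration)
}

// Check reports whether a run that finished at finishedAt with the given TTL
// has already expired; if not, it schedules a re-check for when it will.
func (h *Handler) Check(key string, finishedAt time.Time, ttl time.Duration) bool {
	remaining := time.Until(finishedAt.Add(ttl))
	if remaining > 0 {
		h.logger.Infof("run %s expires in %v, re-queueing", key, remaining)
		h.enqueueAfter(key, remaining)
		return false
	}
	return true
}
```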
Yes, in #1429 I just did this in a new file, ttl_controller.go, but for convenience I added these new attributes to the Reconciler object. I'll consider it again.
func (tc *Reconciler) UpdatePipelineRun(old, cur interface{}) {
    pr := cur.(*apispipeline.PipelineRun)
    klog.V(4).Infof("Updating PipelineRun %s/%s", pr.Namespace, pr.Name)
The reconciler object actually has a Logger object which we usually use for logging, e.g. pipeline/pkg/reconciler/taskrun/taskrun.go line 100 (commit 1921344):

    c.Logger.Infof("task run %q in work queue no longer exists", key)
Thx, I'll fix it
// its TTL hasn't expired, it will be added to the queue after the TTL is expected
// to expire.
// This function is not meant to be invoked concurrently with the same key.
func (tc *Reconciler) processTaskRunExpired(namespace, name string, tr *apispipeline.TaskRun) error {
I'm curious, do you think it's possible to use the same logic for TaskRuns and PipelineRuns, or do they definitely need to be distinct?
Because processTaskRunExpired needs to call HasPipelineRunOwnerReference but processPipelineRunExpired does not, and tc.processTrTTL calls taskRunCleanup, which is different from pipelineRunCleanup, I would like to implement them separately :)
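For reference, roughly how I understand the TaskRun-specific path; processTrTTL and HasPipelineRunOwnerReference are named in the discussion above, while cascadeDeleteTaskRun is just a placeholder for the actual delete step:

```go
// Sketch: TaskRuns owned by a PipelineRun are skipped here, because the
// PipelineRun's own TTL governs them.
func (tc *Reconciler) processTaskRunExpired(namespace, name string, tr *apispipeline.TaskRun) error {
	if tr.HasPipelineRunOwnerReference() {
		return nil
	}
	expired, err := tc.processTrTTL(tr)
	if err != nil || !expired {
		return err
	}
	// Placeholder for re-fetching the freshest copy of the TaskRun and
	// deleting it with foreground cascading.
	return tc.cascadeDeleteTaskRun(namespace, name)
}
```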
    return &finishAtUTC, &expireAtUTC, nil
}

func trTimeLeft(tr *apispipeline.TaskRun, since *time.Time) (*time.Duration, error) {
I like how this is broken up into lots of small functions - one thing that we'd definitely want to see in this PR is as much unit test coverage as we can get - the easiest way to do this (which usually leads to the best code!) is to create unit tests for individual functions like these
OK, it's necessary! In fact, I have tests for this and will upload them soon :)
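To make that concrete, a sketch of what a unit test for trTimeLeft might look like; it assumes trTimeLeft returns the TTL minus the time elapsed since completion, reuses the newTaskRun helper that appears later in this review, and would live in the same package as the code under test:

```go
func TestTrTimeLeft(t *testing.T) {
	now := time.Now()
	completed := apis.VolatileTime{Inner: metav1.NewTime(now.Add(-30 * time.Second))}

	// TaskRun that finished 30 seconds ago with a 60-second TTL.
	tr := newTaskRun(completed, apis.VolatileTime{}, &metav1.Duration{Duration: time.Minute})

	left, err := trTimeLeft(tr, &now)
	if err != nil {
		t.Fatalf("trTimeLeft() returned error: %v", err)
	}
	if want := 30 * time.Second; *left != want {
		t.Errorf("trTimeLeft() = %v, want %v", *left, want)
	}
}
```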
Thanks for the review. I have tests and docs, and will add them. About "Could we use the same logic that handles timing out Runs?" and "What about making a totally new controller instead?": about the third point, I noticed these functions before, but some may need to be modified; I'll check again.
1. Add field "ExpirationTime" in PipelineRunStatus and TaskRunStatus to display expired time. 2. Add unit tests file pipelinerun_expired_test.go and taskrun_expired_test.go
1. Add field "ExpirationTime" in PipelineRunStatus and TaskRunStatus to display expired time. 2. Add unit tests file pipelinerun_expired_test.go and taskrun_expired_test.go
3b60df7 to ae47bb8
/test tekton-pipeline-unit-tests
…into tektoncd-master Update unit tests.
ae47bb8 to 84fbae1
…into tektoncd-master Update unit tests.
…into tektoncd-master
modify some unit test files
…ktoncd-master Conflicts: go.mod vendor/modules.txt
…into tektoncd-master Conflicts: pkg/apis/pipeline/v1alpha1/zz_generated.deepcopy.go
- I like @bobcatfish's suggestion to extract c.AddPipelineRun and c.UpdatePipelineRun
- you'll need to squash your commits 👼

/hold
@anxinyf I left a bunch of comments on the code.
As I mentioned here, I would still really like to see this feature presented to the working group before we decide to merge it. It would be helpful for the community to discuss it and also for you to present the way that you / your company plan to use this feature once it's merged.
As a reminder, here are the instructions for joining the working group to present:
You can add this feature to discuss in the agenda here: https://docs.google.com/document/d/1rPR7m1Oj0ip3bpd_bcS1sjZyPgGi_g9asF5YrExeESc/edit?ts=5c917a0c#heading=h.t0kiqeqw4uo1
If you are not already a member of the tekton-dev google group then you'll first need to join here: https://groups.google.com/forum/#!forum/tekton-dev
And details of joining the working group call are on the tekton-dev calendar, which you can find here: https://github.com/tektoncd/community/blob/master/contact.md#calendar
}

func (tc *Reconciler) UpdatePipelineRun(old, cur interface{}) {
    pr := cur.(*apispipeline.PipelineRun)
should possibly check if this was a successful cast to avoid panics. something like:

    pr, ok := cur.(*apispipeline.PipelineRun)
    if !ok {
        return // maybe log an error?
    }

Also, same for AddPipelineRun on line 18.
} else if !expired {
    return nil
}
// Cascade deletes the PipelineRuns if TTL truly expires.
What is "Cascade" in this context?
This means foreground cascading deletion; see "Controlling how the garbage collector deletes dependents" in the Kubernetes garbage collection docs.
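In other words, the options passed to the Delete call in the diff below are built roughly like this (the Preconditions guard on the UID is a common extra safety check, not necessarily part of this PR):

```go
// Sketch: foreground cascading deletion, so dependents owned by the
// PipelineRun are removed before the PipelineRun itself disappears.
policy := metav1.DeletePropagationForeground
options := &metav1.DeleteOptions{
	PropagationPolicy: &policy,
	// Avoid deleting a newer object that reused the same name.
	Preconditions: &metav1.Preconditions{UID: &fresh.UID},
}
return tc.PipelineClientSet.TektonV1alpha1().PipelineRuns(fresh.Namespace).Delete(fresh.Name, options)
```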
    return tc.PipelineClientSet.TektonV1alpha1().PipelineRuns(fresh.Namespace).Delete(fresh.Name, options)
}

// processTTL checks whether a given PipelineRun's TTL has expired, and add it to the queue after the TTL is expected to expire
processTTL in the comment should match the function name below it.
// processTTL checks whether a given PipelineRun's TTL has expired, and add it to the queue after the TTL is expected to expire
// if the TTL will expire later.
func (tc *Reconciler) processPrTTL(pr *apispipeline.PipelineRun) (expired bool, err error) {
Typically initialisms are all uppercase in Go, which would make this processPRTTL. I suggest processPipelineRunTTL as maybe a slightly more readable alternative here.
}

// pipelineRunCleanup checks whether a PipelineRun has succeeded or failed and has a TTL set.
func pipelineRunCleanup(pr *apispipeline.PipelineRun) bool {
At the moment the name of this func suggests it will do some cleanup work, but it's just a check for whether a pipelinerun needs to have its expiration processed. I suggest isPipelineRunExpiring() or shouldCleanupPipelineRun() or a similar such name here.
func newTaskRun(completionTime, failedTime apis.VolatileTime, ttl *metav1.Duration) *apispipeline.TaskRun {
    tr := tb.TaskRun("test-task-run-with-expiration-ttl", "foo",
        //tb.TaskRunOwnerReference("PipelineRun", "test-pipeline-run-with-annotations",
Remove dead code.
    expectedTimeLeft *time.Duration
}{
    {
        name: "Error case: TaskRun unfinished",
Also suggest splitting this into two test funcs - one for success cases and one for error cases.
@@ -415,6 +415,17 @@ func TaskRunAffinity(affinity *corev1.Affinity) TaskRunSpecOp {
    }
}
func TaskRunExpirationSecondsTTL(duration time.Duration) TaskRunSpecOp { |
Add func comment here.
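For example, something like the following; the comment wording is only a suggestion and the body is a guess at what the builder does:

```go
// TaskRunExpirationSecondsTTL sets the ExpirationSecondsTTL field of a
// TaskRunSpec to the given duration.
func TaskRunExpirationSecondsTTL(duration time.Duration) TaskRunSpecOp {
	return func(spec *v1alpha1.TaskRunSpec) {
		spec.ExpirationSecondsTTL = &metav1.Duration{Duration: duration}
	}
}
```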
tc.Logger.Infof("Updating PipelineRun %s/%s if the PipelineRun has succeed or failed and has a TTL set.", pr.Namespace, pr.Name) | ||
|
||
if pr.DeletionTimestamp == nil && pipelineRunCleanup(pr) { | ||
controller.NewImpl(tc, tc.Logger, ControllerName).Enqueue(pr) |
I am confused about this. TTLExpiredController is an alpha-level kubernetes controller, is that right? And here we are creating an instance of that controller and enqueuing a pipelinerun for it. But the kubernetes docs for the controller say that it only works for Jobs. So what does this do exactly?
    return nil, err
}
if finishAt.UTC().After(since.UTC()) {
    tc.Logger.Warnf("Warning: Found PipelineRun %s/%s succeeded in the future. This is likely due to time skew in the cluster. PipelineRun cleanup will be deferred.", pr.Namespace, pr.Name)
I suggest adding a comment here that points to https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#time-skew
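i.e. the block in the diff above could carry something like:

```go
if finishAt.UTC().After(since.UTC()) {
	// A completion time in the future usually means clock skew between nodes;
	// defer cleanup rather than deleting early. See
	// https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#time-skew
	tc.Logger.Warnf("Warning: Found PipelineRun %s/%s succeeded in the future. This is likely due to time skew in the cluster. PipelineRun cleanup will be deferred.", pr.Namespace, pr.Name)
}
```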
Thanks for reviewing, I'll update the commit. As for the document, I had linked it in "Design docs in flight" before, but it looks like it was deleted.
This PR hasn't seen any updates for a month. Given its scope I feel strongly that we need to have this design presented to, and discussed by, the working group before we can move forward with it. @anxinyf it would be great if you want to do that but I would also be equally happy to see anyone else do so. I'm going to close this PR for now but strongly encourage that it be reopened once the design has been surfaced with the broader working group for review. Many thanks!
Sorry about delaying the PR. I originally wanted to update it yesterday. If needed, I could discuss it together with the working group. :D
That would be excellent, thank you! You can add an entry to the Pending PR Discussion section of this doc: https://docs.google.com/document/d/1rPR7m1Oj0ip3bpd_bcS1sjZyPgGi_g9asF5YrExeESc/edit# You'll need to be a member of the tekton-dev group to open that doc if you are not already: https://github.com/tektoncd/community/blob/master/contact.md#mailing-list
@anxinyf: The following tests failed, say /retest to rerun them all:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Also: the meeting this week is in a US-friendly timezone, at 12pm EST on Wednesday. But we also have a meeting that is more friendly to Europe and Asia which will run next Wednesday at 7:30 EST. Presenting to either one is totally fine! Thanks again.
In Slack, @csantanapr pointed out that maybe another way of doing this is to create a cron that checks for expired Runs and deletes them.
Building this TTL controller into the Tekton controller may be unnecessary. :( We could build a CronJob directly that uses ExpirationSecondsTTL (or another field); this is a more efficient approach.
We plan to start exploring the CronJob idea as a tool in experimental for use in our dogfooding deployment. Given that the idea shows some promise and would likely require no extra code in the pipelines controller, I'm going to close this PR for now in favor of that approach. Let's revisit this idea if the CronJob proposal doesn't work out well enough.
Yes! And I want to say thank you for the review!
Hello @sbwsg, are there any documents or discussions about the CronJob now? We have the same requirement to delete completed resources automatically. (Status may need to be considered; for example, a failed PipelineRun should be automatically deleted after a longer time than a succeeded PipelineRun.)
I don't have any documentation to link you to, but I've started an issue to help capture info: tektoncd/experimental#479. Feel free to add your own use cases there! It would be good to start collecting these together to help inform whoever picks up the work of designing the tool.
From this commit, I'll break PR #1429 up into smaller PRs.
Add field "ExpirationSecondsTTL" in TaskRunSpec and PipelineRunSpec