Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support to delete completed TaskRun or PipelineRun automatically #1486

Closed
wants to merge 16 commits into from

Conversation

anxinyf
Copy link

@anxinyf anxinyf commented Oct 28, 2019

From this commit, I'll break PR #1429 up into smaller PRs.

Add field "ExpirationSecondsTTL" in TaskRunSpec and PipelineRunSpec

Add field "ExpirationSecondsTTL" in TaskRunSpec and PipelineRunSpec.

If the field "ExpirationSecondsTTL" is set, after the TaskRun finishes, it is eligible to be automatically deleted. When the TaskRun or PipelineRun is being deleted, its lifecycle guarantees will be honored.

If this field is unset, the TaskRun won't be automatically deleted.

If this field is set to zero, the TaskRun or PipelineRun becomes eligible to be deleted immediately after it finishes. And if taskRun is produced by PipelineRun(checked via ownerReference), this ExpirationSecondsTTL will be ignored, then tekton will check PipelineRun's ExpirationSecondsTTL to determine if PipelineRun should be deleted.
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign imjasonh
You can assign the PR to them by writing /assign @imjasonh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 28, 2019
@tekton-robot
Copy link
Collaborator

Hi @anxinyf. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 28, 2019
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Oct 28, 2019
Adding taskrun_expired.go and pipelinerun_expired.go which implement
approaches to delete completed TaskRun or PipelineRun via field "ExpirationSecondsTTL".
@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 28, 2019
@bobcatfish
Copy link
Collaborator

/ok-to-test

@tekton-robot tekton-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 28, 2019
Copy link
Collaborator

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for breaking up your PR from #1429!!

Can you describe a bit about what the total set of PRs would look like? One thing I notice from this PR is that it doesn't include the docs and the tests that are present in #1429 - we usually try to include docs and tests with every PR when possible (see development guidelines), so before this particular PR would be mergeable we'd want it to include docs + tests as well.

It's also a bit hard from this PR to get a sense for how some of these pieces would be used, e.g. AddTaskRun.

I'm wondering about a couple of alternative implementations (which maybe you have already considered!):

  • Could we use the same logic that handles timing out Runs?
  • What about making a totally new controller instead?
  • It looks like the controller lib we're using (knative/pkg) has an EnqeueKeyAfter function - this makes me wonder if we could use the same workqueue we're using for everything else to handle cleanup as well? Then it would be the job of our main reconcile function to check if the Run needs to be deleted

https://github.com/knative/pkg/blob/29642b017b8bf1d3bba08b2b72850ff3cd00996a/controller/controller.go#L263-L268


// ListerSynced returns true if the TaskRun store has been synced at least once.
// Added as a member to the struct to allow injection for testing.
ListerSynced cache.InformerSynced
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im wondering if there is some way we can avoid adding more attributes and functionality to the reconciler object itself - do you think it might be possible to create a separate object for managing the timeouts, which we could use and test in isolation?
if so then maybe one way to break up #1429 (which might be what you're already going for!) would be to have this PR add that object, complete with unit test coverage, and describe how it is intended to be hooked up (I think this is pretty close to what you're doing already, it's just a bit hard right now to see the overall structure)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in #1429, I just do this in a new file ttl_controller.go, but for convenience, I add these new attibutes in Reconciler object. I'll consider it again.


func (tc *Reconciler) UpdatePipelineRun(old, cur interface{}) {
pr := cur.(*apispipeline.PipelineRun)
klog.V(4).Infof("Updating PipelineRun %s/%s", pr.Namespace, pr.Name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reconciler object actually has a Logger object which we usually use for logging, e.g.

c.Logger.Infof("task run %q in work queue no longer exists", key)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, I'll fix it

// its TTL hasn't expired, it will be added to the queue after the TTL is expected
// to expire.
// This function is not meant to be invoked concurrently with the same key.
func (tc *Reconciler) processTaskRunExpired(namespace, name string, tr *apispipeline.TaskRun) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, do you think it's possible to use the same logic for TaskRuns and PipelineRuns, or do they definitely need to be distinct?

Copy link
Author

@anxinyf anxinyf Dec 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because processTaskRunExpired needs to call HasPipelineRunOwnerReference but processPipelineRunExpired not, and tc.processTrTTL calls taskRunCleanup which is different from pipelineRunCleanup. So I would like to implement this separately :)

return &finishAtUTC, &expireAtUTC, nil
}

func trTimeLeft(tr *apispipeline.TaskRun, since *time.Time) (*time.Duration, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this is broken up into lots of small functions - one thing that we'd definitely want to see in this PR is as much unit test coverage as we can get - the easiest way to do this (which usually leads to the best code!) is to create unit tests for individual functions like these

Copy link
Author

@anxinyf anxinyf Oct 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, It's necessary! In fact, I have test for this, upload soon :)

@anxinyf
Copy link
Author

anxinyf commented Oct 29, 2019

Thx for review. I have tests and docs, and will add them.
Sorry firstly, functions like AddTaskRun, AddPipelineRun etc. are ued in controller.go file, I'll upload next. :)

About "Could we use the same logic that handles timing out Runs?",
Yes! This is next step I'll do. Separate this ttl controller from reconciler controller. Because it will be extended to handle other finishable resource types.

About "What about making a totally new controller instead?"
This mechanism can be TTL Controller about deleting completed, failed or other status resources when ExpirationSecondsTTL is expired.

About third point, I noticed these functions before, but some may need to be modified, I'll check again.

1. Add field "ExpirationTime" in PipelineRunStatus and TaskRunStatus to display expired time.

2. Add unit tests file pipelinerun_expired_test.go and taskrun_expired_test.go
@tekton-robot tekton-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 30, 2019
1. Add field "ExpirationTime" in PipelineRunStatus and TaskRunStatus to display expired time.

2. Add unit tests file pipelinerun_expired_test.go and taskrun_expired_test.go
@anxinyf
Copy link
Author

anxinyf commented Oct 31, 2019

/test tekton-pipeline-unit-tests

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/pipelinerun_validation.go 93.8% 84.2% -9.5
pkg/apis/pipeline/v1alpha1/taskrun_validation.go 95.3% 91.3% -4.0
pkg/status/taskrunpod.go 100.0% 98.6% -1.4
test/builder/pipeline.go 84.2% 84.1% -0.0
test/builder/task.go 83.4% 81.6% -1.9

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/pipelinerun_validation.go 93.8% 84.2% -9.5
pkg/apis/pipeline/v1alpha1/taskrun_validation.go 95.3% 91.3% -4.0
pkg/status/taskrunpod.go 100.0% 98.6% -1.4
test/builder/pipeline.go 84.2% 84.1% -0.0
test/builder/task.go 83.4% 81.6% -1.9

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/pipelinerun_validation.go 93.8% 84.2% -9.5
pkg/apis/pipeline/v1alpha1/taskrun_validation.go 95.3% 91.3% -4.0
pkg/status/taskrunpod.go 100.0% 98.6% -1.4
test/builder/pipeline.go 84.2% 84.1% -0.0
test/builder/task.go 83.4% 81.6% -1.9

modify some unit test files
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
test/builder/pipeline.go 87.3% 87.5% 0.2
test/builder/task.go 82.3% 82.1% -0.2

Bug fix for build test.
…ktoncd-master

Conflicts:
	go.mod
	vendor/modules.txt
…into tektoncd-master

Conflicts:
	pkg/apis/pipeline/v1alpha1/zz_generated.deepcopy.go
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
test/builder/pipeline.go 87.3% 87.5% 0.2
test/builder/task.go 82.3% 82.1% -0.2

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I like @bobcatfish's suggestion to extract c.AddPipelineRun and c.UpdatePipelineRun
  • you'll need to squash your commits 👼

/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2020
Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anxinyf I left a bunch of comments on the code.

As I mentioned here I would still really like to see this feature presented to the working group before we decide to merge it. It would be helpful for the community to discuss it and also for you to present the way that you / your company plan to use this feature once its merged.

As a reminder, here are the instructions for joining the working group to present:

You can add this feature to discuss in the agenda here: https://docs.google.com/document/d/1rPR7m1Oj0ip3bpd_bcS1sjZyPgGi_g9asF5YrExeESc/edit?ts=5c917a0c#heading=h.t0kiqeqw4uo1

If you are not already a member of the tekton-dev google group then you'll first need to join here: https://groups.google.com/forum/#!forum/tekton-dev

And details of joining the working group call are on the tekton-dev calendar, which you can find here: https://github.com/tektoncd/community/blob/master/contact.md#calendar

}

func (tc *Reconciler) UpdatePipelineRun(old, cur interface{}) {
pr := cur.(*apispipeline.PipelineRun)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should possibly check if this was a successful cast to avoid panics. something like:

pr, ok := cur.(*apispipeline.PipelineRun)
if !ok {
  return // maybe log an error?
}

Also, same for AddPipelineRun on line 18

} else if !expired {
return nil
}
// Cascade deletes the PipelineRuns if TTL truly expires.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "Cascade" in this context?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return tc.PipelineClientSet.TektonV1alpha1().PipelineRuns(fresh.Namespace).Delete(fresh.Name, options)
}

// processTTL checks whether a given PipelineRun's TTL has expired, and add it to the queue after the TTL is expected to expire
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processTTL in comment should match function name below it


// processTTL checks whether a given PipelineRun's TTL has expired, and add it to the queue after the TTL is expected to expire
// if the TTL will expire later.
func (tc *Reconciler) processPrTTL(pr *apispipeline.PipelineRun) (expired bool, err error) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically initialisms are all uppercase in Go, which would make this processPRTTL. I suggest processPipelineRunTTL as maybe a slightly more readable alternative here.

}

// pipelineRunCleanup checks whether a PipelineRun has succeeded or failed and has a TTL set.
func pipelineRunCleanup(pr *apispipeline.PipelineRun) bool {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment the name of this func suggests it will do some cleanup work but it's just a check for whether a pipelinerun needs to have its expiration processed. I suggest isPipelineRunExpiring() or shouldCleanupPipelineRun() or a similar such name here.


func newTaskRun(completionTime, failedTime apis.VolatileTime, ttl *metav1.Duration) *apispipeline.TaskRun {
tr := tb.TaskRun("test-task-run-with-expiration-ttl", "foo",
//tb.TaskRunOwnerReference("PipelineRun", "test-pipeline-run-with-annotations",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove dead code.

expectedTimeLeft *time.Duration
}{
{
name: "Error case: TaskRun unfinished",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also suggest splitting this into two test funcs - one for success cases and one for error cases.

@@ -415,6 +415,17 @@ func TaskRunAffinity(affinity *corev1.Affinity) TaskRunSpecOp {
}
}

func TaskRunExpirationSecondsTTL(duration time.Duration) TaskRunSpecOp {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add func comment here.

tc.Logger.Infof("Updating PipelineRun %s/%s if the PipelineRun has succeed or failed and has a TTL set.", pr.Namespace, pr.Name)

if pr.DeletionTimestamp == nil && pipelineRunCleanup(pr) {
controller.NewImpl(tc, tc.Logger, ControllerName).Enqueue(pr)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused about this. TTLExpiredController is an alpha-level kubernetes controller, is that right? And here we are creating an instance of that controller and enqueuing a pipelinerun for it. But the kubernetes docs for the controller say that it only works for Jobs. So what does this do exactly?

return nil, err
}
if finishAt.UTC().After(since.UTC()) {
tc.Logger.Warnf("Warning: Found PipelineRun %s/%s succeeded in the future. This is likely due to time skew in the cluster. PipelineRun cleanup will be deferred.", pr.Namespace, pr.Name)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding a comment here that points to https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#time-skew

@anxinyf
Copy link
Author

anxinyf commented Jan 8, 2020

Thanks for reviewing, I'll update commit. And for document, I have linked it in "Design docs in flight" before, but it looks deleted.
Doc: https://docs.google.com/document/d/1Zk4dtifxS88Lo17NR3rMb20M-14JHD4YAxRxBUKs0jk/edit

@ghost
Copy link

ghost commented Feb 7, 2020

This PR hasn't seen any updates for a month. Given its scope I feel strongly that we need to have this design presented to, and discussed by, the working group before we can move forward with it. @anxinyf it would be great if you want to do that but I would also be equally happy to see anyone else do so.

I'm going to close this PR for now but strongly encourage that it be reopened once the design has been surfaced with the broader working group for review. Many thanks!

@ghost ghost closed this Feb 7, 2020
@anxinyf
Copy link
Author

anxinyf commented Feb 8, 2020

Sorry about delaying the pr. I originally wanted to update it yesterday. If any need, I could discuss together with the working group. :D

@ghost
Copy link

ghost commented Feb 10, 2020

That would be excellent, thank you! You can add an entry to the Pending PR Discussion section of this doc: https://docs.google.com/document/d/1rPR7m1Oj0ip3bpd_bcS1sjZyPgGi_g9asF5YrExeESc/edit#

You'll need to be a member of the tekton-dev group to open that doc if you are not already: https://github.com/tektoncd/community/blob/master/contact.md#mailing-list

@ghost ghost reopened this Feb 10, 2020
@tekton-robot
Copy link
Collaborator

@anxinyf: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-tekton-pipeline-go-coverage bc61e9f link /test pull-tekton-pipeline-go-coverage
pull-tekton-pipeline-unit-tests bc61e9f link /test tekton-pipeline-unit-tests
pull-tekton-pipeline-build-tests bc61e9f link /test pull-tekton-pipeline-build-tests
pull-tekton-pipeline-integration-tests bc61e9f link /test pull-tekton-pipeline-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ghost
Copy link

ghost commented Feb 10, 2020

Also: the meeting this week is in a US-friendly timezone, at 12pm EST on Wednesday. But we also have a meeting that is more friendly to Europe and Asia which will run next Wednesday at 7:30 EST. Presenting to either one is totally fine! Thanks again.

@bobcatfish
Copy link
Collaborator

bobcatfish commented Feb 14, 2020

In slack @csantanapr pointed out that maybe another way of doing this is to create a cron that checks for expired Runs and deletes them

@anxinyf
Copy link
Author

anxinyf commented Feb 16, 2020

Building this ttl controller into the Tekton controller may be unnecessary. :(

We can build a CronJob directly with ExpirationSecondsTTL( or another field). And this is a more efficient approach.

@ghost
Copy link

ghost commented Feb 19, 2020

We plan to start exploring the CronJob idea as a tool in experimental for use in our dogfooding deployment. Given that idea shows some promise and would likely require no extra code in the pipelines controller I'm going to close this PR for now in favor of that approach. Let's revisit this idea if the CronJob proposal doesn't work out well enough.

@ghost ghost closed this Feb 19, 2020
@anxinyf
Copy link
Author

anxinyf commented Feb 19, 2020

Yes! And I wanna say "Thank you for review"

@assertion
Copy link
Contributor

assertion commented Feb 24, 2020

We plan to start exploring the CronJob idea as a tool in experimental for use in our dogfooding deployment. Given that idea shows some promise and would likely require no extra code in the pipelines controller I'm going to close this PR for now in favor of that approach. Let's revisit this idea if the CronJob proposal doesn't work out well enough.

Hello @sbwsg , any document or discussions about the CronJob now? We have the same requirement to delete the completed resources automatically. ( Status may need to be considered, for example Failed PipelineRun should be automatically deleted after a longer time than the Succeed PipelineRun).

@ghost
Copy link

ghost commented Feb 26, 2020

any document or discussions about the CronJob now?

I don't have any documentation to link you to but I've started an issue to help capture info: tektoncd/experimental#479

Feel free to add your own use cases there! It would be good to start collecting these together to help inform whoever picks up the work designing the tool.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants