Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write TaskRun.Status.TaskSpec with replaced spec on every reconcile run #5576

Merged

Conversation

abayer
Copy link
Contributor

@abayer abayer commented Sep 28, 2022

Changes

fixes #5574

With #5537, the call to applyParamsContextsResultsAndWorkspaces and setting of tr.Status.TaskSpec to the output of that call was moved from happening with every call to reconcile to only happening if there wasn't already a pod created. This didn't cause any problems with execution, but it sometimes resulted in tr.Status.TaskSpec not getting set properly or getting reset to nil on a subsequent reconcile run, with the end result that, sometimes, tr.Status.TaskSpec would contain the original TaskSpec without parameter, result, context, and workspace references being replaced with the corresponding value. This caused flaky failures in the TestPipelineRunStatusSpec/pipeline_status_spec_updated e2e test, and integration test failures in Chains (tektoncd/chains#577).

This change moves the PVC creation out of the if pod == nil { block that creates the pod if needed, while still checking if pod is nil before creating the PVC, and brings the applyParamsContextsResultsAndWorkspaces call and setting of tr.Status.TaskSpec back out of the pod creation block, but after the possible PVC creation.

Note that I never managed to figure out what conditions resulted in the flaky behavior - I was deep in that rabbit hole when @wlynch flagged the Chains issue and I realized what the root cause was. Since I never nailed down the flake trigger, I don't have a new e2e test to add here - but then again, since I started looking at this in the first place due to TestPipelineRunStatusSpec/pipeline_status_spec_updated flaking, we do know that it's covered to some extent. =)

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Fix TaskRun parameter etc replacement logic to persist in the TaskRun's Status properly

@abayer abayer added the kind/bug Categorizes issue or PR as related to a bug. label Sep 28, 2022
@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Sep 28, 2022
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 28, 2022
@abayer
Copy link
Contributor Author

abayer commented Sep 28, 2022

/assign @piyush-garg @wlynch @vdemeester @jerop

@abayer abayer added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Sep 28, 2022
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 81.3% 81.4% 0.1

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cherry-pick release-v0.40.x

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 28, 2022
Copy link
Member

@wlynch wlynch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth doing a deeper comparison here instead of just checking that the reconcile succeeded? 🤔

tr, err := clients.Pipeline.TektonV1beta1().TaskRuns(tc.taskRun.Namespace).Get(testAssets.Ctx, tc.taskRun.Name, metav1.GetOptions{})
if err != nil {
t.Fatalf("getting updated taskrun: %v", err)
}
condition := tr.Status.GetCondition(apis.ConditionSucceeded)
if condition == nil || condition.Status != corev1.ConditionUnknown {
t.Errorf("Expected invalid TaskRun to have in progress status, but had %v", condition)
}
if condition != nil && condition.Reason != v1beta1.TaskRunReasonRunning.String() {
t.Errorf("Expected reason %q but was %s", v1beta1.TaskRunReasonRunning.String(), condition.Reason)
}

Comment on lines +499 to +507
workspaceVolumes := workspace.CreateVolumes(tr.Spec.Workspaces)

ts, err := applyParamsContextsResultsAndWorkspaces(ctx, tr, rtr, workspaceVolumes)
if err != nil {
logger.Errorf("Error updating task spec parameters, contexts, results and workspaces: %s", err)
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need to be computed each time, or does this only need to happen if pod == nil?

This might help clean things up a little bit if we can preserve the old structure so we can keep all the if pod == nil logic together - i.e.

if pod == nil {
  if tr.HasVolumeClaimTemplate() {
    ...
  }
 
  workspaceVolumes := workspace.CreateVolumes(tr.Spec.Workspaces)
  ...

  pod, err = c.createPod(ctx, ts, tr, rtr, workspaceVolumes)
  ...
}

tr.Status.TaskSpec = ts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that we need workspaceVolumes as input to applyParamsContextsResultsAndWorkspaces, and the whole issue here stemmed from that not getting called and tr.Status.TaskSpec not getting updated, so I think we do need to the createVolumes call every time. That's what was being done before 9cf7590

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bleh - that's unfortunate. Might be something worth looking into refactoring (i.e. regrouping things that only need to be done once vs per reconcile), but won't hold up this PR for it.

@wlynch
Copy link
Member

wlynch commented Sep 28, 2022

/lgtm
/hold

Holding in case you wanted to make any changes to taskrun_test.go before submission, else feel free to remove and submit.

@tekton-robot tekton-robot added lgtm Indicates that a PR is ready to be merged. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Sep 28, 2022
@abayer
Copy link
Contributor Author

abayer commented Sep 28, 2022

Holding in case you wanted to make any changes to taskrun_test.go before submission, else feel free to remove and submit.

I think I'm going to write some new tests to exercise all of this properly, but that'll probably take me a day or two, so we can merge this before then if it's urgent enough. But let's see what I can get done in the next hour or two...

@abayer
Copy link
Contributor Author

abayer commented Sep 28, 2022

(the reason for new tests is that that particular test, for example, would always pass even without this fix, even if it was actually doing param/etc replacement, because it's only getting reconciled once and is creating a pod - the problem was reconciling when there isn't a pod)

@abayer abayer force-pushed the taskrun-status-taskspec-dont-overwrite branch from 3e6e4dd to 637026d Compare September 28, 2022 19:52
@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 28, 2022
name: myarg
type: string
steps:
- script: echo $(inputs.params.myarg)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only testing for replacement of a param here because that's good enough to prove the point, and is simplest to do. Note that this issue does not show up with an inline tr.Spec.TaskSpec, interestingly.

Copy link
Member

@wlynch wlynch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 28, 2022
value: foo
taskRef:
name: test-task-with-replacements
status:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially tried this with a pre-set tr.Status.TaskSpec, but that kept passing when it shouldn't have. I'm guessing it might have eventually flaked into failure, but it's very possible that the specific conditions that triggered the flaky behavior would never actually show up in a unit test, so I just went with testing to make sure that a nil tr.Status.TaskSpec gets written properly.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 81.3% 81.4% 0.1

@abayer
Copy link
Contributor Author

abayer commented Sep 28, 2022

@wlynch Ok, I think I'm good with this test. It's not exercising whatever the specific trigger is that sets off the flaky behavior where tr.Status.TaskSpec ends up nil on reconcile, but it is making sure that the replacements are being applied to tr.Status.TaskSpec when there already is a pod.

taskRunWorkspaces := applyVolumeClaimTemplates(tr.Spec.Workspaces, *kmeta.NewControllerRef(tr))
// This is used by createPod below. Changes to the Spec are not updated.
tr.Spec.Workspaces = taskRunWorkspaces
if pod == nil && tr.HasVolumeClaimTemplate() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be worth adding a comment here on the conditions to avoid the conditions moving around again when the next change happens in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

fixes tektoncd#5574

With tektoncd#5537, the call to `applyParamsContextsResultsAndWorkspaces` and setting of `tr.Status.TaskSpec` to the output of that call was moved from happening with every call to `reconcile` to only happening if there wasn't already a pod created. This didn't cause any problems with execution, but it sometimes resulted in `tr.Status.TaskSpec` not getting set properly or getting reset to `nil` on a subsequent `reconcile` run, with the end result that, sometimes, `tr.Status.TaskSpec` would contain the original `TaskSpec` without parameter, result, context, and workspace references being replaced with the corresponding value. This caused flaky failures in the `TestPipelineRunStatusSpec/pipeline_status_spec_updated` e2e test, and integration test failures in Chains (tektoncd/chains#577).

This change moves the PVC creation out of the `if pod == nil {` block that creates the pod if needed, while still checking if `pod` is `nil` before creating the PVC, and brings the `applyParamsContextsResultsAndWorkspaces` call and setting of `tr.Status.TaskSpec` back out of the pod creation block, but after the possible PVC creation.

Signed-off-by: Andrew Bayer <[email protected]>
@abayer abayer force-pushed the taskrun-status-taskspec-dont-overwrite branch from 637026d to 8bb5e52 Compare September 29, 2022 12:07
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Sep 29, 2022
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 81.3% 81.4% 0.1

@abayer
Copy link
Contributor Author

abayer commented Sep 29, 2022

@wlynch ping? =)

Copy link
Member

@wlynch wlynch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 29, 2022
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester, wlynch

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abayer
Copy link
Contributor Author

abayer commented Sep 29, 2022

@wlynch Sorry to bug you again, but I didn't want to remove the hold without your approval.

@wlynch
Copy link
Member

wlynch commented Sep 29, 2022

/hold cancel

Hold was more for you than for me. :)

/gogogo

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 29, 2022
@abayer
Copy link
Contributor Author

abayer commented Sep 29, 2022

Yeah, but I just wanted to be polite. =) Thanks!

@tekton-robot tekton-robot merged commit 4b53ae5 into tektoncd:main Sep 29, 2022
@abayer
Copy link
Contributor Author

abayer commented Sep 29, 2022

/cherry-pick release-v0.40.x

@tekton-robot
Copy link
Collaborator

@abayer: new pull request created: #5584

In response to this:

/cherry-pick release-v0.40.x

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TaskRun.Status.TaskSpec is sometimes overwritten without replacements applied
7 participants