Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update TaskRun Step States After Cancellation or Timeout #3088

Merged

Conversation

danielhelfand
Copy link
Member

Closes tektoncd/cli#1095

Changes

In the event a TaskRun is timed out or cancelled, pipelines currently leaves the state of steps as Running or PodInitializing depending on how far the TaskRun has proceeded. This pull request adds a feature to update the status of steps to TaskRunTimeout or TaskRunCancelled. This will help with properly reporting step statuses through tkn to end users to avoid confusion.

The current approach only updates a step status if the step state is Running or Waiting. If a step has already completed, the status will be preserved.

I am open to suggestions on a different status that can be returned for steps other than the TaskRun reason.

Submitter Checklist

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Update step statuses on TaskRun in event of cancellation or timeout

@tekton-robot tekton-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Aug 11, 2020
@tekton-robot tekton-robot requested review from dibyom and dlorenc August 11, 2020 13:28
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 11, 2020
@danielhelfand
Copy link
Member Author

/kind feature

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 11, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 78.3% 78.9% 0.6

@danielhelfand danielhelfand force-pushed the timeout_cancel_step_status branch from 2e6630e to 7190933 Compare August 11, 2020 13:44
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 78.3% 78.9% 0.6

@danielhelfand danielhelfand force-pushed the timeout_cancel_step_status branch from 7190933 to 35d5a6d Compare August 11, 2020 15:00
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 78.3% 78.9% 0.6

@danielhelfand danielhelfand force-pushed the timeout_cancel_step_status branch from 35d5a6d to 717ea47 Compare August 11, 2020 15:35
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 78.3% 78.9% 0.6

@danielhelfand danielhelfand force-pushed the timeout_cancel_step_status branch from 717ea47 to 8197294 Compare August 11, 2020 16:29
@danielhelfand danielhelfand changed the title WIP: Update TaskRun Step States After Cancellation or Timeout Update TaskRun Step States After Cancellation or Timeout Aug 11, 2020
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 11, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 78.3% 78.9% 0.6

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 12, 2020
@danielhelfand danielhelfand force-pushed the timeout_cancel_step_status branch from 8197294 to 8c8b3d0 Compare August 12, 2020 16:31
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 12, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 78.3% 78.9% 0.6

@pritidesai
Copy link
Member

pritidesai commented Aug 21, 2020

@danielhelfand the changes looks good and fixes the step status for a TaskRun but we have similar issue for a PipelineRun. We can merge this and open an issue or another PR for PipelineRun. WDYT?

When a task times out as part of a Pipeline, these changes does update the pipeline run status with an appropriate step status but tkn CLI still shows the taskrun and step both Running. Is this something we have to fix in CLI Repo?

            "status": {
                "taskRuns": {
                    "pipelinerun-with-timeout-echo-good-morning-758nt": {
                        "status": {
                            "conditions": [
                                {
                                    "lastTransitionTime": "2020-08-21T23:35:20Z",
                                    "message": "TaskRun \"pipelinerun-with-timeout-echo-good-morning-758nt\" failed to finish within \"30s\"",
                                    "reason": "TaskRunTimeout",
                                    "status": "False",
                                    "type": "Succeeded"
                                }
                            ],
                            "steps": [
                                {
                                    "terminated": {
                                        "exitCode": 1,
                                        "finishedAt": "2020-08-21T23:35:20Z",
                                        "reason": "TaskRunTimeout",
                                        "startedAt": "2020-08-21T23:34:58Z"
                                    }
                                }
                            ],

I tried on my local cluster for this sample pipeline run:

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: pipelinerun-with-timeout
spec:
  timeout: "0h0m3s"
  pipelineSpec:
    tasks:
      - name: echo-good-morning
        timeout: "0h0m30s"
        taskSpec:
          steps:
            - name: echo
              image: ubuntu
              script: |
                #!/usr/bin/env bash
                echo "Good Morning!"
                sleep 60
---

And here is the tkn describe pr and tr:

 [66] ->  tkn pr list
NAME                       STARTED         DURATION    STATUS
pipelinerun-with-timeout   5 seconds ago   3 seconds   Failed(PipelineRunTimeout)

 [67] -> tkn pr describe pipelinerun-with-timeout
Name:        pipelinerun-with-timeout
Namespace:   default
Timeout:     3s
Labels:
 tekton.dev/pipeline=pipelinerun-with-timeout

🌡️  Status

STARTED          DURATION    STATUS
14 seconds ago   3 seconds   Failed(PipelineRunTimeout)

💌 Message

PipelineRun "pipelinerun-with-timeout" failed to finish within "3s"

📦 Resources

 No resources

⚓ Params

 No params

🗂  Taskruns

 NAME                                                 TASK NAME           STARTED          DURATION   STATUS
 ∙ pipelinerun-with-timeout-echo-good-morning-758nt   echo-good-morning   14 seconds ago   ---        Running

 [68] -> tkn tr describe pipelinerun-with-timeout-echo-good-morning-758nt
Name:        pipelinerun-with-timeout-echo-good-morning-758nt
Namespace:   default
Timeout:     30s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/pipeline=pipelinerun-with-timeout
 tekton.dev/pipelineRun=pipelinerun-with-timeout
 tekton.dev/pipelineTask=echo-good-morning

🌡️  Status

STARTED          DURATION    STATUS
26 seconds ago   ---         Running

📨 Input Resources

 No input resources

📡 Output Resources

 No output resources

⚓ Params

 No params

🦶 Steps

 NAME     STATUS
 ∙ echo   Running

🚗 Sidecars

No sidecars

@danielhelfand danielhelfand force-pushed the timeout_cancel_step_status branch from 8c8b3d0 to 541625e Compare August 24, 2020 13:53
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 78.3% 78.9% 0.6

@danielhelfand
Copy link
Member Author

@pritidesai Looking into this a bit further, I think the status needs some time before it updates. I noticed the same issue you pointed out, but it eventually does update the statuses correctly:

$ tkn pr desc --last

Name:        pipelinerun-with-timeout
Namespace:   default
Timeout:     3s
Labels:
 tekton.dev/pipeline=pipelinerun-with-timeout

🌡️  Status

STARTED         DURATION    STATUS
9 minutes ago   3 seconds   Failed(PipelineRunTimeout)

💌 Message

PipelineRun "pipelinerun-with-timeout" failed to finish within "3s" (TaskRun "pipelinerun-with-timeout-echo-good-morning-hkvjs" failed to finish within "30s")

📦 Resources

 No resources

⚓ Params

 No params

🗂  Taskruns

 NAME                                                 TASK NAME           STARTED         DURATION     STATUS
 ∙ pipelinerun-with-timeout-echo-good-morning-hkvjs   echo-good-morning   9 minutes ago   30 seconds   Failed(TaskRunTimeout)
$ tkn tr desc --last

Name:        pipelinerun-with-timeout-echo-good-morning-hkvjs
Namespace:   default
Timeout:     30s
Labels:
 app.kubernetes.io/managed-by=tekton-pipelines
 tekton.dev/pipeline=pipelinerun-with-timeout
 tekton.dev/pipelineRun=pipelinerun-with-timeout
 tekton.dev/pipelineTask=echo-good-morning

🌡️  Status

STARTED          DURATION     STATUS
10 minutes ago   30 seconds   Failed(TaskRunTimeout)

Message

TaskRun "pipelinerun-with-timeout-echo-good-morning-hkvjs" failed to finish within "30s"

📨 Input Resources

 No input resources

📡 Output Resources

 No output resources

⚓ Params

 No params

🦶 Steps

 NAME     STATUS
 ∙ echo   TaskRunTimeout

🚗 Sidecars

No sidecars

@danielhelfand
Copy link
Member Author

Looking further into the issue, this looks like a case where the TaskRun is actually still Running and is not timed out despite the PipelineRun timeout. This is occurring in v0.15.2 with this same PipelineRun, so it might be related to issues around timeouts with pipelines. But I do think this is not related to the changes here.

@danielhelfand danielhelfand force-pushed the timeout_cancel_step_status branch from 541625e to f47169f Compare August 24, 2020 14:48
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 78.3% 78.9% 0.6

@pritidesai
Copy link
Member

Sounds good thanks @danielhelfand.

@bobcatfish was looking into timeout and we have an issue for it #3038.

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pritidesai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 24, 2020
@danielhelfand danielhelfand force-pushed the timeout_cancel_step_status branch from f47169f to 1018eaa Compare August 24, 2020 20:23
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 78.3% 78.9% 0.6

@danielhelfand
Copy link
Member Author

/cc @vdemeester @bobcatfish @afrittoli

@vdemeester
Copy link
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 27, 2020
@tekton-robot tekton-robot merged commit fb613dc into tektoncd:master Aug 27, 2020
AlanGreene added a commit to AlanGreene/dashboard that referenced this pull request Oct 15, 2020
Pipelines now updates the status of steps for cancelled or timed out
TaskRuns, see tektoncd/pipeline#3088

Update the styles and labels to display these as cancelled rather
than failed (which was the fallback for unknown status/reason combos)
AlanGreene added a commit to AlanGreene/dashboard that referenced this pull request Oct 15, 2020
Pipelines now updates the status of steps for cancelled or timed out
TaskRuns, see tektoncd/pipeline#3088

Update the styles and labels to display these as cancelled rather
than failed (which was the fallback for unknown status/reason combos)
tekton-robot pushed a commit to tektoncd/dashboard that referenced this pull request Oct 15, 2020
Pipelines now updates the status of steps for cancelled or timed out
TaskRuns, see tektoncd/pipeline#3088

Update the styles and labels to display these as cancelled rather
than failed (which was the fallback for unknown status/reason combos)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The status of step stays in Running state when the taskrun fails with TaskRunTimeout
4 participants