
TaskRun with InvalidImageName runs forever #6105

Closed
chmouel opened this issue Feb 3, 2023 · 10 comments · Fixed by #6982

chmouel commented Feb 3, 2023

Expected Behavior

When the image name of a step is invalid, the TaskRun/PipelineRun should fail quickly instead of waiting forever (or for as long as the timeout allows).

Actual Behavior

The TaskRun does not get cancelled; it fails only once the timeout expires.

Steps to Reproduce the Problem

  1. Create a Task with a step that uses an invalid image reference (e.g. a sha256 digest of the wrong length; see the sketch below)
  2. Launch and observe
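
For illustration, here is a rough sketch of the kind of reference that triggers this, using the github.com/google/go-containerregistry parser as a stand-in (the kubelet uses its own reference parsing, but it rejects the same strings):

```go
package main

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/name"
)

func main() {
	// Both a digest of the wrong length and a doubled tag fail to parse.
	for _, ref := range []string{
		"bash@sha256:1234",        // sha256 digest with the wrong length
		"bash:latest:invalid@foo", // the reference used in the events later in this thread
	} {
		if _, err := name.ParseReference(ref); err != nil {
			fmt.Printf("invalid image reference %q: %v\n", ref, err)
		}
	}
}
```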

Additional Info

Latest Tekton Pipelines on the latest kind.

@lbernick mentioned the right approach to handle this here:
#4846 (comment)

@chmouel chmouel added the kind/bug Categorizes issue or PR as related to a bug. label Feb 3, 2023
@pritidesai pritidesai added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 3, 2023
@jerop jerop added this to the Pipelines v0.48 milestone Apr 12, 2023
@pritidesai

/assign @vdemeester

@vdemeester to check if we have addressed this already!

@pritidesai

Pipelines WG - @jerop please move to the next milestone if we haven't heard back on this before releasing 0.48

@jerop jerop added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label May 30, 2023
@JeromeJu

WG
/assign @afrittoli

@afrittoli

I left a comment here as a start #4846 (comment) - I will check if there is a way we can make the Pod fail right away in case of InvalidImageName

@afrittoli

@chmouel I would argue that this is not critical-urgent. A faster fail would be nice to have; however, if I understood correctly, the k8s Pod controller will keep trying until a timeout, so Tekton honours that behaviour.

@lbernick lbernick removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 20, 2023
@afrittoli afrittoli reopened this Jul 11, 2023

afrittoli commented Jul 11, 2023

Testing the current behaviour. Pod events:

Events:
  Type     Reason         Age                From               Message
  ----     ------         ----               ----               -------
  Normal   Scheduled      36s                default-scheduler  Successfully assigned default/print-date-vhb8z-pod to tekton-worker
  Normal   Pulled         36s                kubelet            Container image "kind.local/entrypoint-bff0a22da108bc2f16c818c97641a296:6a255c5f9657edbd4799fb28106d5c86a2de99b20db04f28338eddee4ea4e4a4" already present on machine
  Normal   Created        36s                kubelet            Created container prepare
  Normal   Started        36s                kubelet            Started container prepare
  Normal   Pulling        35s                kubelet            Pulling image "cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791"
  Normal   Created        30s                kubelet            Created container place-scripts
  Normal   Pulled         30s                kubelet            Successfully pulled image "cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791" in 4.540179473s
  Normal   Started        30s                kubelet            Started container place-scripts
  Normal   Pulling        30s                kubelet            Pulling image "bash:latest"
  Normal   Pulled         26s                kubelet            Successfully pulled image "bash:latest" in 3.505431622s
  Normal   Created        26s                kubelet            Created container step-print-date-human-readable
  Normal   Started        26s                kubelet            Started container step-print-date-human-readable
  Warning  InspectFailed  14s (x4 over 30s)  kubelet            Failed to apply default image tag "bash:latest:invalid@foo": couldn't parse image reference "bash:latest:invalid@foo": invalid reference format
  Warning  Failed         14s (x4 over 30s)  kubelet            Error: InvalidImageName

TaskRun events:

Events:
  Type    Reason           Age    From     Message
  ----    ------           ----   ----     -------
  Normal  Started          2m31s  TaskRun
  Normal  Pending          2m31s  TaskRun  Pending
  Normal  Pending          2m31s  TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [prepare place-scripts]"
  Normal  Pending          2m30s  TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts]"
  Normal  Pending          2m25s  TaskRun  pod status "Ready":"False"; message: "containers with unready status: [step-print-date-unix-timestamp step-print-date-human-readable]"
  Normal  PullImageFailed  2m21s  TaskRun  build step "step-print-date-unix-timestamp" is pending with reason "Failed to apply default image tag \"bash:latest:invalid@foo\": couldn't parse image reference \"bash:latest:invalid@foo\": invalid reference format"

The InspectFailed from the kubelet turns into a PullImageFailed on the Tekton side, which could be a temporary issue, so we probably need to fix the mapping here. The behaviour of k8s is to retry over and over again; I believe that's because users are allowed to edit the pod and change the image, so the issue may go away.
Tekton never intended to reconcile changes to task/step definitions, so we are in a position to fail fast here.
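
A minimal sketch of what the detection side could look like, using only the core/v1 pod status; the helper name is illustrative, not an existing Tekton function:

```go
package pod

import corev1 "k8s.io/api/core/v1"

// invalidImageMessage is an illustrative helper: it reports whether any
// container (init or regular) is stuck waiting with the InvalidImageName
// reason, which the kubelet emits when it cannot parse the image reference.
// For Tekton this is permanent, because step images cannot change once the
// TaskRun's pod exists, so the reconciler could treat it as fatal instead of
// a transient PullImageFailed.
func invalidImageMessage(pod *corev1.Pod) (string, bool) {
	statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)
	for _, cs := range statuses {
		if w := cs.State.Waiting; w != nil && w.Reason == "InvalidImageName" {
			return w.Message, true
		}
	}
	return "", false
}
```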

@afrittoli afrittoli removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 11, 2023

afrittoli commented Jul 12, 2023

The TaskRun controller reviews the Pod status, and when its phase is Pending it updates the TaskRun to a similar pending state. Marking the TaskRun as failed when the InvalidImageName is found is easy; the problem, though, is that it would leave a failed TaskRun with a still-pending Pod, which could be an inconsistent state.

Is there a way other than deletion to terminate the pod?
One way could be to rewrite the pod spec to something that fails, but that might be confusing to the end user.
Alternatively, we could delete the pod similar to what we do for cancellation.

Once the pod is signalled for deletion or stop:

  • set the TaskRun status to failed and trust that the Pod will stop eventually; when that happens, if the Pod is marked as failed and the TaskRun has already failed, ignore the update (see the sketch below)
  • leave the TaskRun status pending; when the Pod processes the termination and stops, update the TaskRun to failed in the associated reconcile cycle. It may be difficult in this scenario to carry over the real reason for failure; I need to check whether the termination can be associated with a message
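
A rough sketch of the deletion route combined with the first option above, assuming a standard client-go clientset; markTaskRunFailed is a hypothetical stand-in for the reconciler's existing condition-setting logic, not a real Tekton API:

```go
package pod

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// failForInvalidImage deletes the pod (mirroring what is already done for
// cancellation) and records the failure on the TaskRun right away, trusting
// that the pod will eventually go away. A later reconcile that sees the pod
// failing or disappearing should not overwrite the already-failed condition.
func failForInvalidImage(ctx context.Context, kc kubernetes.Interface, namespace, podName, msg string,
	markTaskRunFailed func(reason, message string)) error {
	if err := kc.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("failed to delete pod %s/%s: %w", namespace, podName, err)
	}
	// Hypothetical helper: sets the Succeeded condition to False with the
	// original InvalidImageName message so the user sees the real cause.
	markTaskRunFailed("TaskRunImagePullFailed", msg)
	return nil
}
```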

@afrittoli

@chmouel @tektoncd/core-maintainers it seems to me that a good solution is more complex than the value it brings, but I'd be happy to continue working on this if you think it's worth it. We need to decide what the behaviour of the Pod will be though. One way to handle this with current logic would be to cancel the taskrun when an invalid image is detected.

@lbernick

@afrittoli thank you for looking into this! I think deleting the pod and failing the TaskRun makes the most sense. I don't see a problem with deletion if the image pull error isn't retryable (hopefully we have a way of determining if the error is retryable or not).

Cancelation might be a bit confusing but I think it would also be an OK solution. I just want to note that if we choose cancelation for invalid image pulls and end up implementing support for canceling the taskrun via the entrypoint (#3238), we may need to replace the user-provided image with a noop image or something like that, since the entrypoint can't start before the image is pulled.

@vdemeester

@afrittoli I also think deleting the pod and failing the TaskRun makes the most sense. We just need to make sure we provide enough information to users in the TaskRun so they are not confused that there is no Pod attached.

afrittoli added a commit to afrittoli/pipeline that referenced this issue Jul 27, 2023
The Kubernetes pod treats an invalid image failure as a potentially
ephemeral error, because even if the format of the image reference
is not syntactically correct, users may update the image without
recreating the Pod.

Tekton, however, uses Pods to provide workloads that run to completion,
and users are not allowed to change the specification of steps
during execution.

This commit changes the handling of the InvalidImageName pod reason,
so that the TaskRun is marked as failed and the Pod deleted.

Fixes: tektoncd#6105

Signed-off-by: Andrea Frittoli <[email protected]>
tekton-robot pushed a commit that referenced this issue Jul 31, 2023