
TaskRun with InvalidImageName runs forever #6105

Closed
chmouel opened this issue Feb 3, 2023 · 10 comments · Fixed by #6982

chmouel commented Feb 3, 2023

Expected Behavior

When the image name of a step is invalid, the TaskRun/PipelineRun should fail quickly instead of waiting forever (or for as long as the timeout allows).

Actual Behavior

The TaskRun does not get cancelled; it fails only once the timeout expires.

Steps to Reproduce the Problem

  1. Create a Task with a step that uses an invalid image reference (e.g. a sha256 digest of the wrong length; see the sketch below)
  2. Launch and observe
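
For illustration, here is a rough sketch of the kind of reference that triggers this, using the github.com/google/go-containerregistry parser as a stand-in (the kubelet uses its own reference parsing, but it rejects the same strings):

```go
package main

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/name"
)

func main() {
	// Both a digest of the wrong length and a doubled tag fail to parse.
	for _, ref := range []string{
		"bash@sha256:1234",        // sha256 digest with the wrong length
		"bash:latest:invalid@foo", // the reference used in the events later in this thread
	} {
		if _, err := name.ParseReference(ref); err != nil {
			fmt.Printf("invalid image reference %q: %v\n", ref, err)
		}
	}
}
```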

Additional Info

Latest Tekton Pipelines on the latest kind.

@lbernick mentioned the right approach to handle this here:
#4846 (comment)

@chmouel chmouel added the kind/bug Categorizes issue or PR as related to a bug. label Feb 3, 2023
@pritidesai pritidesai added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 3, 2023
@jerop jerop added this to the Pipelines v0.48 milestone Apr 12, 2023
@pritidesai

/assign @vdemeester

@vdemeester to check if we have addressed this already!

@pritidesai

Pipelines WG - @jerop please move to the next milestone if we haven't heard back on this before releasing 0.48

@jerop jerop added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label May 30, 2023
@JeromeJu

WG
/assign @afrittoli

@afrittoli

I left a comment here as a start #4846 (comment) - I will check if there is a way we can make the Pod fail right away in case of InvalidImageName

@afrittoli

@chmouel I would argue that this is not critical-urgent. A faster fail would be nice to have; however, if I understood correctly, the k8s Pod controller will keep trying until a timeout, so Tekton honours that behaviour.

@lbernick lbernick removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 20, 2023
@afrittoli afrittoli reopened this Jul 11, 2023

afrittoli commented Jul 11, 2023

Testing the current behaviour. Pod events:

Events:
  Type     Reason         Age                From               Message
  ----     ------         ----               ----               -------
  Normal   Scheduled      36s                default-scheduler  Successfully assigned default/print-date-vhb8z-pod to tekton-worker
  Normal   Pulled         36s                kubelet            Container image "kind.local/entrypoint-bff0a22da108bc2f16c818c97641a296:6a255c5f9657edbd4799fb28106d5c86a2de99b20db04f28338eddee4ea4e4a4" already present on machine
  Normal   Created        36s                kubelet            Created container prepare
  Normal   Started        36s                kubelet            Started container prepare
  Normal   Pulling        35s                kubelet            Pulling image "cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791"
  Normal   Created        30s                kubelet            Created container place-scripts
  Normal   Pulled         30s                kubelet            Successfully pulled image "cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791" in 4.540179473s
  Normal   Started        30s                kubelet            Started container place-scripts
  Normal   Pulling        30s                kubelet            Pulling image "bash:latest"
  Normal   Pulled         26s                kubelet            Successfully pulled image "bash:latest" in 3.505431622s
  Normal   Created        26s                kubelet            Created container step-print-date-human-readable
  Normal   Started        26s                kubelet            Started container step-print-date-human-readable
  Warning  InspectFailed  14s (x4 over 30s)  kubelet            Failed to apply default image tag "bash:latest:invalid@foo": couldn't parse image reference "bash:latest:invalid@foo": invalid reference format
  Warning  Failed         14s (x4 over 30s)  kubelet            Error: InvalidImageName

TaskRun events:

Events:
  Type    Reason           Age    From     Message
  ----    ------           ----   ----     -------
  Normal  Started          2m31s  TaskRun
  Normal  Pending          2m31s  TaskRun  Pending
  Normal  Pending          2m31s  TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [prepare place-scripts]"
  Normal  Pending          2m30s  TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts]"
  Normal  Pending          2m25s  TaskRun  pod status "Ready":"False"; message: "containers with unready status: [step-print-date-unix-timestamp step-print-date-human-readable]"
  Normal  PullImageFailed  2m21s  TaskRun  build step "step-print-date-unix-timestamp" is pending with reason "Failed to apply default image tag \"bash:latest:invalid@foo\": couldn't parse image reference \"bash:latest:invalid@foo\": invalid reference format"

The InspectFailed from the kubelet turns into a PullImageFailed on the Tekton side, which could be a temporary issue, so we probably need to fix the mapping here. The behaviour of k8s is to retry over and over again; I believe that's because users are allowed to edit the pod and change the image, so the issue may go away.
Tekton never intended to reconcile changes to task/step definitions, so we are in a position to fail fast here.
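
A minimal sketch of what the detection side could look like, using only the core/v1 pod status; the helper name is illustrative, not an existing Tekton function:

```go
package pod

import corev1 "k8s.io/api/core/v1"

// invalidImageMessage is an illustrative helper: it reports whether any
// container (init or regular) is stuck waiting with the InvalidImageName
// reason, which the kubelet emits when it cannot parse the image reference.
// For Tekton this is permanent, because step images cannot change once the
// TaskRun's pod exists, so the reconciler could treat it as fatal instead of
// a transient PullImageFailed.
func invalidImageMessage(pod *corev1.Pod) (string, bool) {
	statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)
	for _, cs := range statuses {
		if w := cs.State.Waiting; w != nil && w.Reason == "InvalidImageName" {
			return w.Message, true
		}
	}
	return "", false
}
```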

@afrittoli afrittoli removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 11, 2023

afrittoli commented Jul 12, 2023

The TaskRun controller reviews the Pod status, and when its phase is Pending it updates the TaskRun to a similar pending state. Marking the TaskRun as failed when the InvalidImageName is found is easy; the problem, though, is that it would leave a failed TaskRun with a still-pending Pod, which could be an inconsistent state.

Is there a way other than deletion to terminate the pod?
One way could be to rewrite the pod spec to something that fails, but that might be confusing to the end user.
Alternatively, we could delete the pod similar to what we do for cancellation.

Once the pod is signalled for deletion or stop:

  • set the TaskRun status to failed and trust that the Pod will stop eventually; when that happens, if the Pod is marked as failed and the TaskRun has already failed, ignore the update (see the sketch below)
  • leave the TaskRun status pending; when the Pod processes the termination and stops, update the TaskRun to failed in the associated reconcile cycle. It may be difficult in this scenario to carry over the real reason for failure; I need to check whether the termination can be associated with a message
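
A rough sketch of the deletion route combined with the first option above, assuming a standard client-go clientset; markTaskRunFailed is a hypothetical stand-in for the reconciler's existing condition-setting logic, not a real Tekton API:

```go
package pod

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// failForInvalidImage deletes the pod (mirroring what is already done for
// cancellation) and records the failure on the TaskRun right away, trusting
// that the pod will eventually go away. A later reconcile that sees the pod
// failing or disappearing should not overwrite the already-failed condition.
func failForInvalidImage(ctx context.Context, kc kubernetes.Interface, namespace, podName, msg string,
	markTaskRunFailed func(reason, message string)) error {
	if err := kc.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("failed to delete pod %s/%s: %w", namespace, podName, err)
	}
	// Hypothetical helper: sets the Succeeded condition to False with the
	// original InvalidImageName message so the user sees the real cause.
	markTaskRunFailed("TaskRunImagePullFailed", msg)
	return nil
}
```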

@afrittoli

@chmouel @tektoncd/core-maintainers it seems to me that a good solution is more complex than the value it brings, but I'd be happy to continue working on this if you think it's worth it. We need to decide what the behaviour of the Pod will be though. One way to handle this with current logic would be to cancel the taskrun when an invalid image is detected.

@lbernick

@afrittoli thank you for looking into this! I think deleting the pod and failing the TaskRun makes the most sense. I don't see a problem with deletion if the image pull error isn't retryable (hopefully we have a way of determining if the error is retryable or not).

Cancelation might be a bit confusing but I think it would also be an OK solution. I just want to note that if we choose cancelation for invalid image pulls and end up implementing support for canceling the taskrun via the entrypoint (#3238), we may need to replace the user-provided image with a noop image or something like that, since the entrypoint can't start before the image is pulled.

@vdemeester

@afrittoli I also think deleting the pod and failing the TaskRun makes the most sense. We just need to make sure we provide enough information to users in the TaskRun so they are not confused that there is no Pod attached.

afrittoli added a commit to afrittoli/pipeline that referenced this issue Jul 27, 2023
The Kubernetes pod treats an invalid image failure as a potentially
ephemeral error, because even if the format of the image reference
is not syntactically correct, users may update the image without
recreating the Pod.

Tekton, however, uses Pods to provide workloads that run to completion,
and users are not allowed to change the specification of steps
during execution.

This commit changes the handling of the InvalidImageName pod reason,
so that the TaskRun is marked as failed and the Pod deleted.

Fixes: tektoncd#6105

Signed-off-by: Andrea Frittoli <[email protected]>
tekton-robot pushed a commit that referenced this issue Jul 31, 2023