Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bubble up the image related error reason to taskrun status #4846

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 33 additions & 0 deletions pkg/pod/status.go
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,9 @@ const (
// to resource constraints on the node
ReasonExceededNodeResources = "ExceededNodeResources"

// ReasonPullImageFailed indicates that the TaskRun's pod failed to pull image
ReasonPullImageFailed = "PullImageFailed"

// ReasonCreateContainerConfigError indicates that the TaskRun failed to create a pod due to
// config error of container
ReasonCreateContainerConfigError = "CreateContainerConfigError"
Expand Down Expand Up @@ -314,6 +317,8 @@ func updateIncompleteTaskRunStatus(trs *v1beta1.TaskRunStatus, pod *corev1.Pod)
markStatusRunning(trs, ReasonExceededNodeResources, "TaskRun Pod exceeded available resources")
case isPodHitConfigError(pod):
markStatusFailure(trs, ReasonCreateContainerConfigError, "Failed to create pod due to config error")
case isPullImageError(pod):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if it would make sense to mark the taskrun status failed for some of these errors? e.g. "invalidImageName" is unlikely to succeed or be retryable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I had thought about this, I tried generating an invalidImageName error for the pod and trying to check the pod's phase. I found that the phase is Pending and not Failed.

image

I think this is because of the following reason.
"This happens because it a fixable error from the Pod standpoint (you can mutate the image of a container in a Pod), and thus doesn't seem to be considered as a "terminal" error."

So I thought maybe we could borrow the status of the pod and not mark the taskrun failed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that could definitely work, and I'm happy to approve the PR as is since that's a reasonable route, however the user isn't really meant to interact with the taskrun pod, i.e. the taskrun controller would be responsible for updating the pod and it wouldn't know how to do so here, so I think it's most reasonable to mark as failed.
Curious what others think about this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @lbernick do you think what kind of error happened we should mark the taskrun as failed?

I think the ErrInvalidImageName and ErrImageInspect is not fixable error so maybe we should mark the taskrun as failed when the pod has one of the two errors?

var (
	// ErrImagePullBackOff - Container image pull failed, kubelet is backing off image pull
	ErrImagePullBackOff = errors.New("ImagePullBackOff")

	// ErrImageInspect - Unable to inspect image
	ErrImageInspect = errors.New("ImageInspectError")

	// ErrImagePull - General image pull error
	ErrImagePull = errors.New("ErrImagePull")

	// ErrImageNeverPull - Required Image is absent on host and PullPolicy is NeverPullImage
	ErrImageNeverPull = errors.New("ErrImageNeverPull")

	// ErrRegistryUnavailable - Get http error when pulling image from registry
	ErrRegistryUnavailable = errors.New("RegistryUnavailable")

	// ErrInvalidImageName - Unable to parse the image name.
	ErrInvalidImageName = errors.New("InvalidImageName")
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this a bit more:

  • If we mark the TaskRun as pending still, it's not a great user experience because the user isn't intended to interact with the pod and modify the image name, and it'll eventually fail due to timeout (as I mentioned previously)
  • However, if we mark the TaskRun as failed, the pod will still be running, and if in theory someone edited the image, the pod could run to completion. If that happened, we wouldn't want the TaskRun to display as failed.

I think the best approach for right now is to mark the TaskRun as pending, preserving existing behavior. Then, we should add functionality that cancels the TaskRun when the pod encounters an error such as InvalidImageName (not as part of this PR-- can just create an issue to track). I think someone is implementing this functionality to cancel pods when the TaskRuns time out (#4618) which could probably be reused here.

What do you think? If you agree, happy to LGTM.

Copy link
Member

@chmouel chmouel Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have started to see this issue on clusters, when InvalidImageName end up having the PR running forever (or until default timeout of 60mn). I think it would be a better user experience if the taskruns would get canceled when encountering those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chmouel would you mind creating an issue to track this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep here: #6105

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, I don't think I would like to support in Tekton a case where the users edit the underlying Pod to fix an ErrInvalidImageName error. I would consider that as a permanent error from Tekton perspective.
However, I do agree that we should not let a Pod running when the TaskRun is marked as failed.
If k8s behaviour is to wait even in case of invalid image name, we would then need to stop the pod somehow because we could fail the TaskRun.

markStatusRunning(trs, ReasonPullImageFailed, getWaitingMessage(pod))
default:
markStatusRunning(trs, ReasonPending, getWaitingMessage(pod))
}
Expand Down Expand Up @@ -408,6 +413,34 @@ func isPodHitConfigError(pod *corev1.Pod) bool {
return false
}

// isPullImageError returns true if the Pod's status indicates there are any error when pulling image
func isPullImageError(pod *corev1.Pod) bool {
for _, containerStatus := range pod.Status.ContainerStatuses {
if containerStatus.State.Waiting != nil && isImageErrorReason(containerStatus.State.Waiting.Reason) {
return true
}
}
return false
}

func isImageErrorReason(reason string) bool {
// Reference from https://github.com/kubernetes/kubernetes/blob/a1c8e9386af844757333733714fa1757489735b3/pkg/kubelet/images/types.go#L26
imageErrorReasons := []string{
"ImagePullBackOff",
"ImageInspectError",
"ErrImagePull",
"ErrImageNeverPull",
"RegistryUnavailable",
"InvalidImageName",
}
for _, imageReason := range imageErrorReasons {
if imageReason == reason {
return true
}
}
return false
}

func getWaitingMessage(pod *corev1.Pod) string {
// First, try to surface reason for pending/unknown about the actual build step.
for _, status := range pod.Status.ContainerStatuses {
Expand Down
40 changes: 40 additions & 0 deletions pkg/pod/status_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -988,6 +988,46 @@ func TestMakeTaskRunStatus(t *testing.T) {
CompletionTime: &metav1.Time{Time: time.Now()},
},
},
}, {
desc: "when pod is pending because of pulling image then the error should bubble up to taskrun status",
pod: corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "pod",
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{{
Name: "step-first",
}},
},
Status: corev1.PodStatus{
Phase: corev1.PodPending,
ContainerStatuses: []corev1.ContainerStatus{{
Name: "step-first",
State: corev1.ContainerState{
Waiting: &corev1.ContainerStateWaiting{
Reason: "ImagePullBackOff",
Message: "Back-off pulling image xxxxxxx",
},
},
}},
},
},
want: v1beta1.TaskRunStatus{
Status: statusPending("PullImageFailed", `build step "step-first" is pending with reason "Back-off pulling image xxxxxxx"`),
TaskRunStatusFields: v1beta1.TaskRunStatusFields{
Steps: []v1beta1.StepState{{
ContainerState: corev1.ContainerState{
Waiting: &corev1.ContainerStateWaiting{
Reason: "ImagePullBackOff",
Message: "Back-off pulling image xxxxxxx",
},
},
Name: "first",
ContainerName: "step-first",
}},
Sidecars: []v1beta1.SidecarState{},
},
},
}} {
t.Run(c.desc, func(t *testing.T) {
now := metav1.Now()
Expand Down