
Abort consoles with more than one pod #264

Merged: 3 commits into master, Mar 14, 2022

Conversation

@ttamimi (Contributor) commented Mar 8, 2022

Currently we are setting a console's Job's `restartPolicy` to "Never", so if
any container within the Pod fails then the Job won't spawn another Pod. We
are also setting the Job's `backoffLimit` to 0, so if the Pod itself fails
for any reason then the Job won't spawn another Pod either. These two
settings together prevent a second Pod from being launched when a failure
occurs.
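These two settings correspond to fields on the Job spec and its Pod template. A minimal sketch (the name and image are illustrative, not from this project):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: console-example        # illustrative name
spec:
  backoffLimit: 0              # do not create a replacement Pod if the Pod fails
  template:
    spec:
      restartPolicy: Never     # do not restart a failed container in place
      containers:
        - name: console
          image: example/console:latest   # illustrative image
```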

However, these settings don't cover situations where a console’s Pod is
deleted. This can happen due to manual deletion, Pod eviction, or Pod
preemption. There are no Job settings to prevent relaunching a Pod that has
disappeared in one of these ways.

Launching a subsequent console Pod beyond the first one is problematic, even
if there is only one running Pod at any given time. A subsequent Pod causes
the workloads controller to enter its state machine logic in a way that it
wasn't designed to handle. It also causes the console to remain in a running
state for far longer than users expect.

With this change, the workloads controller stops a console and deletes its
resources if it detects that more than one Pod belongs (or belonged) to that
console.
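The guard this change adds can be sketched as a small predicate over the number of Pods the controller has observed for a console. This is a simplified illustration only; the real controller works with Kubernetes Pod objects and the console's status, not a bare count:

```go
package main

import "fmt"

// shouldAbortConsole reports whether a console must be stopped and its
// resources deleted: if the Job has ever produced more than one Pod
// (running or terminated), the console is in a state the controller's
// state machine was not designed to handle.
func shouldAbortConsole(podCount int) bool {
	return podCount > 1
}

func main() {
	fmt.Println(shouldAbortConsole(1)) // normal case: a single Pod, carry on
	fmt.Println(shouldAbortConsole(2)) // a replacement Pod appeared: abort
}
```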

@ttamimi force-pushed the CI-1233/abort-consoles-with-multiple-pods branch 4 times, most recently from 2253691 to cb910ad (March 8, 2022 17:24)
@ttamimi marked this pull request as draft (March 8, 2022 17:53)
@ttamimi force-pushed the CI-1233/abort-consoles-with-multiple-pods branch 5 times, most recently from 2c52ed5 to 492f299 (March 8, 2022 20:49)
@ttamimi force-pushed the CI-1233/abort-consoles-with-multiple-pods branch from 492f299 to 2924d94 (March 8, 2022 20:51)
@ttamimi marked this pull request as ready for review (March 8, 2022 21:03)
```go
		}
	}
	if podDeleteError != nil {
		return errors.Wrap(podDeleteError, "failed to delete pod(s)")
```
Contributor:
I'm not sure we need this, as the Job will own the pods. Deleting the job should remove all of the pods for us.

Contributor:
IIRC there's no guarantee that the job controller will receive the deletion event before being notified of its pods being removed, although this is unlikely.

Contributor Author:
Yes indeed, in theory we shouldn't have to delete pods. They are owned by the console job. A delete operation should cascade. However, when I tested manual deletion, for some reason the newly launched second pod lingers on indefinitely after the job is gone.
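The explicit pod cleanup being discussed can be sketched as a pure helper that selects the console's stray pods for deletion, rather than relying on cascading deletion from the Job. This is a hypothetical illustration: the `console` label key and the `Pod` type here are stand-ins, not the project's actual types or labels:

```go
package main

import "fmt"

// Pod is a minimal stand-in for a Kubernetes pod, carrying only
// the label this sketch needs.
type Pod struct {
	Name   string
	Labels map[string]string
}

// podsToDelete returns the names of pods labelled as belonging to the
// given console. Deleting them explicitly guards against the behaviour
// observed above, where a replacement pod outlives its deleted Job.
func podsToDelete(pods []Pod, console string) []string {
	var names []string
	for _, p := range pods {
		if p.Labels["console"] == console {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pods := []Pod{
		{Name: "console-abc-1", Labels: map[string]string{"console": "abc"}},
		{Name: "console-abc-2", Labels: map[string]string{"console": "abc"}},
		{Name: "other-pod", Labels: map[string]string{"console": "xyz"}},
	}
	fmt.Println(podsToDelete(pods, "abc")) // [console-abc-1 console-abc-2]
}
```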

Contributor Author:
As agreed, I’ve added a comment in the source code about this.

@theobarberbany (Contributor) commented Mar 10, 2022

This doesn't appear to work for eviction :(
[Screenshot: console pod state after eviction, 2022-03-10 11:54]

Edit: it takes ~1-2 minutes for the new pod to be removed after the old one is evicted; that delay is the wait for the reconcile loop to be triggered. For non-interactive commands, the replacement pod partially re-runs the command before being deleted.

@ttamimi (Contributor, Author) commented Mar 13, 2022

Yes, spot on. Because Kubernetes controllers are independent, there is no guaranteed way to prevent the job controller from spawning a replacement pod when the initial pod is evicted or preempted. The only solution is for the engineering teams to make their non-interactive consoles idempotent.

As discussed, we should probably re-think our approach of using jobs for consoles. In our code we are making a lot of effort to either prevent or circumvent retries, which kind of defeats the purpose of using a job.

@theobarberbany (Contributor):

> As discussed, we should probably re-think our approach of using jobs for consoles. In our code we are making a lot of effort to either prevent or circumvent retries, which kind of defeats the purpose of using a job.

Yep, it feels like this may require a re-think.

All things considered this will give us what we need for now!

@theobarberbany merged commit 42854e9 into master on Mar 14, 2022
@theobarberbany deleted the CI-1233/abort-consoles-with-multiple-pods branch (March 14, 2022 12:16)