
Abort consoles with more than one pod #264

Merged: 3 commits into master, Mar 14, 2022

Conversation

@ttamimi (Contributor) commented Mar 8, 2022

Currently we are setting a console's Job's `restartPolicy` to "Never", so if
any container within the Pod fails then the Job won't spawn another Pod. We
are also setting the Job's `backoffLimit` to 0, so if the Pod itself fails
for any reason then the Job won't spawn another Pod either. These two
settings together prevent a second Pod from being launched when a failure
occurs.
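These two settings correspond to fields on the Job spec and its Pod template. A minimal sketch (the name and image are illustrative, not from this project):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: console-example        # illustrative name
spec:
  backoffLimit: 0              # do not create a replacement Pod if the Pod fails
  template:
    spec:
      restartPolicy: Never     # do not restart a failed container in place
      containers:
        - name: console
          image: example/console:latest   # illustrative image
```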

However, these settings don't cover situations where a console’s Pod is
deleted. This can happen due to manual deletion, Pod eviction, or Pod
preemption. There are no Job settings to prevent relaunching a Pod that has
disappeared in one of these ways.

Launching a subsequent console Pod beyond the first one is problematic, even
if there is only one running Pod at any given time. A subsequent Pod causes
the workloads controller to enter its state machine logic in a way that it
wasn't designed to handle. It also causes the console to remain in a running
state for far longer than users expect.

With this change, the workloads controller stops a console and deletes its
resources if it detects that more than one Pod belongs (or belonged) to that
console.
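The guard this change adds can be sketched as a small predicate over the number of Pods the controller has observed for a console. This is a simplified illustration only; the real controller works with Kubernetes Pod objects and the console's status, not a bare count:

```go
package main

import "fmt"

// shouldAbortConsole reports whether a console must be stopped and its
// resources deleted: if the Job has ever produced more than one Pod
// (running or terminated), the console is in a state the controller's
// state machine was not designed to handle.
func shouldAbortConsole(podCount int) bool {
	return podCount > 1
}

func main() {
	fmt.Println(shouldAbortConsole(1)) // normal case: a single Pod, carry on
	fmt.Println(shouldAbortConsole(2)) // a replacement Pod appeared: abort
}
```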

@ttamimi force-pushed the CI-1233/abort-consoles-with-multiple-pods branch 4 times, most recently from 2253691 to cb910ad (March 8, 2022 17:24)
@ttamimi marked this pull request as draft (March 8, 2022 17:53)
@ttamimi force-pushed the CI-1233/abort-consoles-with-multiple-pods branch 5 times, most recently from 2c52ed5 to 492f299 (March 8, 2022 20:49)
@ttamimi force-pushed the CI-1233/abort-consoles-with-multiple-pods branch from 492f299 to 2924d94 (March 8, 2022 20:51)
@ttamimi marked this pull request as ready for review (March 8, 2022 21:03)
```go
		}
	}
	if podDeleteError != nil {
		return errors.Wrap(podDeleteError, "failed to delete pod(s)")
```
Contributor:
I'm not sure we need this, as the Job will own the pods. Deleting the job should remove all of the pods for us.

Contributor:
IIRC there's no guarantee that the job controller will receive the deletion event before being notified of its pods being removed, although this is unlikely.

Contributor Author:
Yes indeed, in theory we shouldn't have to delete pods. They are owned by the console job. A delete operation should cascade. However, when I tested manual deletion, for some reason the newly launched second pod lingers on indefinitely after the job is gone.
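The explicit pod cleanup being discussed can be sketched as a pure helper that selects the console's stray pods for deletion, rather than relying on cascading deletion from the Job. This is a hypothetical illustration: the `console` label key and the `Pod` type here are stand-ins, not the project's actual types or labels:

```go
package main

import "fmt"

// Pod is a minimal stand-in for a Kubernetes pod, carrying only
// the label this sketch needs.
type Pod struct {
	Name   string
	Labels map[string]string
}

// podsToDelete returns the names of pods labelled as belonging to the
// given console. Deleting them explicitly guards against the behaviour
// observed above, where a replacement pod outlives its deleted Job.
func podsToDelete(pods []Pod, console string) []string {
	var names []string
	for _, p := range pods {
		if p.Labels["console"] == console {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pods := []Pod{
		{Name: "console-abc-1", Labels: map[string]string{"console": "abc"}},
		{Name: "console-abc-2", Labels: map[string]string{"console": "abc"}},
		{Name: "other-pod", Labels: map[string]string{"console": "xyz"}},
	}
	fmt.Println(podsToDelete(pods, "abc")) // [console-abc-1 console-abc-2]
}
```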

Contributor Author:
As agreed, I’ve added a comment in the source code about this.

@theobarberbany (Contributor) commented Mar 10, 2022

This doesn't appear to work for eviction :(
[Screenshot: console pod state after eviction, 2022-03-10 11:54]

Edit: it takes ~1-2 minutes for the new pod to be removed after the old one is evicted; that delay is the wait for the reconcile loop to be triggered. For non-interactive commands, the replacement pod partially re-runs the command before being deleted.

@ttamimi (Contributor, Author) commented Mar 13, 2022

Yes, spot on. Because Kubernetes controllers are independent, there is no guaranteed way to prevent the job controller from spawning a replacement pod when the initial pod is evicted or preempted. The only solution is for the engineering teams to make their non-interactive consoles idempotent.

As discussed, we should probably re-think our approach of using jobs for consoles. In our code we are making a lot of effort to either prevent or circumvent retries, which kind of defeats the purpose of using a job.

@theobarberbany (Contributor):

> As discussed, we should probably re-think our approach of using jobs for consoles. In our code we are making a lot of effort to either prevent or circumvent retries, which kind of defeats the purpose of using a job.

Yep, it feels like this may require a re-think.

All things considered this will give us what we need for now!

@theobarberbany merged commit 42854e9 into master on Mar 14, 2022
@theobarberbany deleted the CI-1233/abort-consoles-with-multiple-pods branch (March 14, 2022 12:16)