Abort consoles with more than one pod #264
Conversation
Currently we are setting a console's Job's `restartPolicy` to "Never", so if any container within the Pod fails then the Job won't spawn another Pod. We are also setting the Job's `backoffLimit` to 0, so if the Pod itself fails for any reason then the Job won't spawn another Pod either. These two settings together prevent a second Pod from being launched when a failure occurs.

However, these settings don't cover situations where a console's Pod is deleted. This can happen due to manual deletion, Pod eviction, or Pod preemption. There are no Job settings to prevent relaunching a Pod that has disappeared in one of these ways.

Launching a subsequent console Pod beyond the first one is problematic, even if there is only one running Pod at any given time. A subsequent Pod causes the workloads controller to enter its state machine logic in a way that it wasn't designed to handle. It also causes the console to remain in a running state for far longer than users expect.

With this change, the workloads controller stops a console and deletes its resources if it detects that more than one Pod belongs (or belonged) to that console.
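For reference, a minimal sketch of the two Job settings described above, using the standard Kubernetes API types. The `buildConsoleJob` helper and the package name are hypothetical, not the controller's actual constructor; only the two fields are the point here.

```go
package workloads // hypothetical package name

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildConsoleJob shows the two settings described above: containers are never
// restarted within the Pod, and the Job never retries a failed Pod.
func buildConsoleJob(name, namespace string, template corev1.PodTemplateSpec) *batchv1.Job {
	backoffLimit := int32(0) // no replacement Pod if the first Pod fails

	// No container restarts within the Pod.
	template.Spec.RestartPolicy = corev1.RestartPolicyNever

	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template:     template,
		},
	}
}
```

Neither setting prevents the Job controller from creating a replacement Pod when the original Pod is deleted rather than failed, which is the gap this change addresses.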
```go
	}
}
if podDeleteError != nil {
	return errors.Wrap(podDeleteError, "failed to delete pod(s)")
```
I'm not sure we need this, as the Job will own the pods. Deleting the job should remove all of the pods for us.
IIRC there's no guarantee that the job controller will receive the deletion event before being notified that its pods have been removed, although this is unlikely.
Yes indeed, in theory we shouldn't have to delete pods. They are owned by the console job. A delete operation should cascade. However, when I tested manual deletion, for some reason the newly launched second pod lingers on indefinitely after the job is gone.
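For illustration, explicit clean-up along the lines of the diff above might look like the sketch below. It is only a sketch: the package, helper, and label names are hypothetical, it assumes the Job shares the console's name, and it assumes a recent client-go with context-aware methods.

```go
package workloads // hypothetical package name

import (
	"context"
	"fmt"

	"github.com/pkg/errors"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteConsoleResources deletes the console's Job and then explicitly deletes
// any Pods still carrying the console's label, so a lingering replacement Pod
// is removed even if the cascading delete does not catch it.
func deleteConsoleResources(ctx context.Context, client kubernetes.Interface, namespace, consoleName string) error {
	propagation := metav1.DeletePropagationBackground
	err := client.BatchV1().Jobs(namespace).Delete(ctx, consoleName,
		metav1.DeleteOptions{PropagationPolicy: &propagation})
	if err != nil && !apierrors.IsNotFound(err) {
		return errors.Wrap(err, "failed to delete job")
	}

	// Belt and braces: delete the console's Pods by label in case the
	// garbage collector has not (or will not) cascade the deletion.
	selector := fmt.Sprintf("console-name=%s", consoleName) // hypothetical label
	err = client.CoreV1().Pods(namespace).DeleteCollection(ctx,
		metav1.DeleteOptions{}, metav1.ListOptions{LabelSelector: selector})
	if err != nil && !apierrors.IsNotFound(err) {
		return errors.Wrap(err, "failed to delete pod(s)")
	}
	return nil
}
```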
As agreed, I’ve added a comment in the source code about this.
Yes, spot on: because Kubernetes controllers are independent, there is no guaranteed way to prevent the job controller from spawning a replacement pod when the initial pod is evicted or preempted. The only solution is for the engineering teams to make their non-interactive consoles idempotent. As discussed, we should probably re-think our approach of using jobs for consoles. In our code we are making a lot of effort to either prevent or circumvent retries, which kind of defeats the purpose of using a job.
Yep, it feels like this may require a re-think. All things considered, this will give us what we need for now!
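For reference, a minimal sketch of the kind of check this change is about, assuming the console's Pods can be listed via a label. The helper and label names are hypothetical; this is not the controller's actual implementation.

```go
package workloads // hypothetical package name

import (
	"context"
	"fmt"

	"github.com/pkg/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tooManyPods reports whether more than one Pod currently carries the
// console's label. Pods that have already disappeared will not show up in the
// list, so a real implementation would also need some record of Pods it has
// previously observed to cover the "or belonged" case described above.
func tooManyPods(ctx context.Context, client kubernetes.Interface, namespace, consoleName string) (bool, error) {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: fmt.Sprintf("console-name=%s", consoleName), // hypothetical label
	})
	if err != nil {
		return false, errors.Wrap(err, "failed to list console pods")
	}
	return len(pods.Items) > 1, nil
}
```

When this check returns true, the controller stops the console and deletes its resources instead of letting the replacement Pod run.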