Add the option to terminate pending kubernetes kernels if they have events preventing them from starting #1357
Comments
If I can get an update about this request, that would be great.
Hi @OrenZ1 - I apologize for the delay. Unfortunately, I'm unable to spend much time on EG (and Jupyter in general) these days. I think this would be a great addition. Ideally, if we can determine that a Pending state is going to remain pending until the prescribed (and long) timeout, it would be better to abort. The location where we can detect this during the startup sequence is in the … I hope you find that helpful, but I imagine you've probably poked around a bit already, so let me know if this isn't what you were looking for. Thank you for your interest and for helping out!
There are multiple ways you can go about this:
Also, having what @kevin-bates proposes above would not only help your use case but also fix a file-handler leak that I have seen in the past.
Hi! Sorry for the delay, but I managed to make a PR for the first thing we discussed here! I am still trying to think of a way to handle kernels that are stuck in the Pending state. I hope to make a separate PR for that soon :)
I just created a new PR that makes it possible to configure different timeouts for the different events that can occur during startup. This includes a "0 seconds" timeout, which means the startup is terminated immediately after such an event occurs.
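To make the per-event timeout idea concrete, here is a minimal sketch of how such a check could look. It is not the code from the PR: the `EVENT_TIMEOUTS` mapping, its values, and the `should_abort` helper are hypothetical names chosen for illustration, and the only behavior assumed is the one described above (a timeout per event, with 0 meaning "terminate immediately").

```python
from typing import Dict, Optional

# Hypothetical configuration: Kubernetes event reason -> seconds the startup
# may keep waiting after that event is first seen. 0 means abort immediately.
# The reasons and values here are illustrative, not defaults of any project.
EVENT_TIMEOUTS: Dict[str, float] = {
    "FailedScheduling": 30.0,
    "FailedMount": 0.0,
}


def should_abort(first_seen: Dict[str, float], now: float,
                 timeouts: Dict[str, float] = EVENT_TIMEOUTS) -> Optional[str]:
    """Return the reason of an event whose timeout has expired, else None.

    first_seen maps an event reason to the time (in seconds) at which it was
    first observed. Events without a configured timeout are ignored, so the
    overall launch timeout still applies to them.
    """
    for reason, seen_at in first_seen.items():
        limit = timeouts.get(reason)
        if limit is not None and now - seen_at >= limit:
            return reason
    return None
```

A startup loop could call `should_abort` on every poll and terminate the launch as soon as it returns a reason, instead of waiting for the full kernel launch timeout.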
Problem
I am facing a problem when using JEG on Kubernetes.
I have set the kernel launch timeout to 5 minutes (because I am using large images), and set MAX_KERNELS_PER_USER to 2 to prevent users from spamming kernels.
When a user submits a request to launch a kernel, it gets started in a remote pod. Sometimes the pod remains stuck in Pending, e.g. because the cluster currently lacks the requested resources. In this case, the user can't launch a new kernel (with a lower resource demand) and has to wait 5 minutes for the timeout to take effect before using another kernel. I even thought about setting up a service that watches pending kernel pods and, if they have events that prevent them from starting, sends a DELETE request to the gateway to kill the kernel. The problem is that while a kernel is pending, the gateway can't accept DELETE requests for it.
In addition, JEG is not aware of actions performed on the Kubernetes cluster, so I can't simply delete the pods using the Kubernetes API, because JEG would still wait for the timeout for this kernel.
Proposed Solution
For starters, I would expect JEG to be aware of the Kubernetes cluster it is running on, so that when kernel pods are deleted, it stops polling them.
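A rough sketch of that behavior, assuming a polling loop that can tell a deleted pod apart from a pending one. `wait_for_running` and the `get_phase` callable are hypothetical and do not correspond to JEG's actual startup code; in a real deployment `get_phase` would wrap a Kubernetes API lookup and return `None` when the pod is not found.

```python
import time
from typing import Callable, Optional


def wait_for_running(get_phase: Callable[[], Optional[str]],
                     launch_timeout: float,
                     poll_interval: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the pod is Running, has been deleted, or the timeout expires.

    get_phase is a hypothetical callable that returns the pod phase
    ("Pending", "Running", ...) or None when the pod no longer exists.
    Returns "running", "deleted", or "timeout".
    """
    deadline = clock() + launch_timeout
    while clock() < deadline:
        phase = get_phase()
        if phase is None:
            # The pod was removed out of band: stop polling right away.
            return "deleted"
        if phase == "Running":
            return "running"
        sleep(poll_interval)
    return "timeout"
```

With a loop shaped like this, deleting the pod through the Kubernetes API would end the wait on the next poll rather than after the full launch timeout.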
For the other issue I've described, I can see two possible solutions:
The first one (and in my opinion, the easier one) is to allow DELETE requests for kernels that are still pending.
The second one is to make it possible to configure JEG to kill pending kernels on its own when they have events (or certain events). But this seems a bit trickier to design properly.
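For the first option, one possible shape is a small registry of kernels that are still starting, so that a DELETE handler can signal the startup loop to give up. This is only a sketch under assumptions: `PendingKernels` and its methods are hypothetical, nothing here is taken from JEG's code, and it uses `threading.Event` for brevity even though an asynchronous server would more likely use its event loop's own primitives.

```python
import threading
from typing import Dict


class PendingKernels:
    """Tracks kernels that are still starting so they can be cancelled."""

    def __init__(self) -> None:
        self._cancel_flags: Dict[str, threading.Event] = {}
        self._lock = threading.Lock()

    def register(self, kernel_id: str) -> threading.Event:
        """Call when startup begins; the startup loop checks the returned flag."""
        with self._lock:
            return self._cancel_flags.setdefault(kernel_id, threading.Event())

    def cancel(self, kernel_id: str) -> bool:
        """Call from a DELETE handler; returns False if the kernel is not pending."""
        with self._lock:
            flag = self._cancel_flags.get(kernel_id)
        if flag is None:
            return False
        flag.set()
        return True

    def finish(self, kernel_id: str) -> None:
        """Call when startup ends, whether it succeeded, failed, or was cancelled."""
        with self._lock:
            self._cancel_flags.pop(kernel_id, None)
```

The startup loop would check `flag.is_set()` on each poll and clean up the pod when it is set, which would free the user's kernel slot without waiting for the launch timeout.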