Support Kubeflow Jobs type for resource quota reclaim #1503
Comments
@panpan0000 Do you have a sample job CR I could use for testing? I'd be happy to take a stab at this.
It's very easy to reproduce the issue, but it requires the training-operator from Kubeflow to be installed:
@panpan0000 I could not reproduce this issue. How did you install kueue and training-operator? Also, which versions of kueue, training-operator, and Kubernetes do you use?
kueue 0.5.1
Also, how did you install those components? Could you provide reproducing steps?
It's weird ... I just tried again and the issue is gone ... sorry again @tenzen-y
No problem :) /remove-kind feature
I tried in kubernetes
It's a great recording :) Thanks.
What would you like to be added:
Reclaim the resource quota when Kubeflow jobs (PyTorchJob, TFJob, etc.) complete.
So far, when the job CR still exists after its pods have completed, the queue's flavorUsage remains occupied and blocks incoming pods/jobs.
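To make the requested behavior concrete, here is a toy model of the quota accounting, not Kueue's actual code. The names ToyQueue, admit, and on_job_finished are hypothetical and exist only for this illustration. It assumes a single CPU quota and shows that a second job can only be admitted once the finished job's usage is released, even though the finished job object is still around.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueue:
    """Hypothetical stand-in for a queue with a single CPU quota (not a Kueue type)."""
    nominal_cpu: int
    # job name -> CPUs currently counted against the quota
    usage: dict = field(default_factory=dict)

    def used(self) -> int:
        return sum(self.usage.values())

    def admit(self, job: str, cpu: int) -> bool:
        """Admit the job only if it fits in the remaining quota."""
        if self.used() + cpu > self.nominal_cpu:
            return False
        self.usage[job] = cpu
        return True

    def on_job_finished(self, job: str) -> None:
        """Requested behavior: release the quota on completion,
        even though the job object itself is not deleted."""
        self.usage.pop(job, None)


queue = ToyQueue(nominal_cpu=4)
first = queue.admit("pytorchjob-a", 4)    # fills the quota
blocked = queue.admit("pytorchjob-b", 4)  # rejected while job a still counts
queue.on_job_finished("pytorchjob-a")     # pods completed, object kept
second = queue.admit("pytorchjob-b", 4)   # fits once the usage is released
print(first, blocked, second)             # True False True
```

Without the on_job_finished step, the second admit call keeps failing, which is the situation described above.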
Why is this needed:
It makes no sense to delete the job right after it finishes; we still need the PyTorchJob CR for other reasons.
However, kueue should reclaim the resources and reduce the flavorUsage after the job completes.
Relevant issue: #1149
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.