training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
Comments
nkflash changed the title from "training-operator can not get podgroup" to "training-operator can not get podgroup status(inqueue) with volcano when enable gang" on Jul 7, 2022
/assign @shinytang6
Thanks for the report @nkflash. It's a bug; would you like to fix it?
Thanks for the issue!
Sure, I will fix that later.
#1666 should fix this issue.
How long will it take to merge that PR? I'm also facing the same issue.
This is fixed by #1666.
I set up two new clusters, and both of them hit the same problem.
Enable gang scheduling in training-operator.
Submit the PyTorch job from the example directory.
The job does not create any pods.
I checked the pod group status:
The status seems correct: "inqueue".
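For reference, the decision the operator is expected to make from that status can be sketched in Python as follows. This is only an illustration: `pods_may_be_created` is a hypothetical helper, not code from training-operator, and it assumes the PodGroup has been read into a plain dictionary whose `status.phase` field carries Volcano's phase name (for example Pending, Inqueue, or Running).

```python
# Hypothetical helper for illustration; not taken from training-operator.
# Assumption: the PodGroup is a plain dict with Volcano's status.phase field.
SCHEDULABLE_PHASES = {"inqueue", "running"}


def pods_may_be_created(podgroup: dict) -> bool:
    """Return True once the PodGroup phase is Inqueue or Running."""
    status = podgroup.get("status") or {}
    # A PodGroup without a phase yet is treated as still pending.
    phase = str(status.get("phase", "Pending"))
    # Compare case-insensitively so "inqueue" and "Inqueue" both match.
    return phase.lower() in SCHEDULABLE_PHASES


if __name__ == "__main__":
    print(pods_may_be_created({"status": {"phase": "Inqueue"}}))
    print(pods_may_be_created({"status": {"phase": "Pending"}}))
```

With the status reported here, a check like this would allow pod creation, which suggests the problem is that the operator never re-evaluates the job after the phase changes.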
From the operator log: the job is in "unschedule" status, and this is logged only twice.
If I modify the job YAML, it triggers the job to run (the operator then receives the inqueue status and creates the pods).
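One possible reading of this behaviour, stated here as an assumption and not as a confirmed diagnosis, is that the job is reconciled only on events for the job object itself, so a later pod group transition to inqueue goes unnoticed until something touches the job. The toy simulation below models that reading in Python. `ToyOperator`, `run`, and the `watch_podgroups` switch are hypothetical names invented for this sketch and do not correspond to training-operator code or flags.

```python
from dataclasses import dataclass


@dataclass
class ToyOperator:
    """Toy model of an event-driven reconciler (illustrative only)."""

    watch_podgroups: bool          # hypothetical switch, not a real flag
    podgroup_phase: str = "Pending"
    pods_created: bool = False
    reconciles: int = 0

    def reconcile(self) -> None:
        # Pods are created only if the phase is Inqueue at reconcile time.
        self.reconciles += 1
        if self.podgroup_phase == "Inqueue":
            self.pods_created = True

    def on_job_event(self) -> None:
        # Any create or update of the job triggers a reconcile.
        self.reconcile()

    def on_podgroup_event(self, phase: str) -> None:
        # The phase always changes; a reconcile follows only when watched.
        self.podgroup_phase = phase
        if self.watch_podgroups:
            self.reconcile()


def run(watch_podgroups: bool, edit_job_afterwards: bool) -> bool:
    op = ToyOperator(watch_podgroups)
    op.on_job_event()                # job created, pod group still Pending
    op.on_job_event()                # second reconcile, still Pending
    op.on_podgroup_event("Inqueue")  # scheduler admits the pod group
    if edit_job_afterwards:
        op.on_job_event()            # editing the job YAML forces a reconcile
    return op.pods_created


if __name__ == "__main__":
    print(run(watch_podgroups=False, edit_job_afterwards=False))
    print(run(watch_podgroups=False, edit_job_afterwards=True))
    print(run(watch_podgroups=True, edit_job_afterwards=False))
```

In this model, only the first scenario leaves the job without pods, which matches the symptoms above: two reconciles while the pod group is pending, then nothing until the job is edited.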
If I disable gang scheduling with "--enable-gang-scheduling=false", it works well.