The "lost GPUs" problem has not been completely resolved
#1818
Comments
@eggiter: GitHub didn't allow me to assign the following users: zhiyuone. Note that only volcano-sh members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
Hello 👋 Looks like there was no activity on this issue for last 90 days.
I am confused: if you delete […], can the reproduction description be more accurate? Thanks, @eggiter
That's right. What confused me is how could […]
I think the reproduction instructions are based on an older version of Volcano (before #1685 was merged), when the scheduler would panic and restart if the amount of allocatable resources fell below the amount being used. After the fix in #1685, the node will be marked NotReady instead.
How to reproduce it (as minimally and precisely as possible):
@xkd045 Sorry, my bad. There is a KEY action missing between step 3 and step 4: restart the scheduler. The updated reproduction steps are:
/cc @kenoung
Hello 👋 Looks like there was no activity on this issue for last 90 days.
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗
What happened:
- the number of allocatable GPUs on `node1` is now 6;
- `pod1` requests 6 GPUs;
- `pod1` is scheduled to node `node1`;
- `pod1` turns `Failed` because of `UnexpectedAdmissionError`;
What you expected to happen:
- `node1` should not have any other pods scheduled to it;
How to reproduce it (as minimally and precisely as possible):
1. `node1` has 8 allocatable GPUs;
2. `pod0` requests 8 GPUs;
3. the number of allocatable GPUs on `node1` becomes 6;
4. `pod1` requests 6 GPUs;
5. `pod1` is scheduled to node `node1`, but `pod1` turns `Failed` because of `UnexpectedAdmissionError`;
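To make the final failure in the steps above concrete, here is a minimal Go sketch of the node-side view. It assumes a simplified model in which a pod is admitted only if its GPU request fits into the allocatable count minus the GPUs already held by running pods. The function `admit` and its parameters are hypothetical names for illustration; this is not kubelet or Volcano code.

```go
package main

import "fmt"

// admit is a hypothetical, simplified stand-in for the node-side device check.
// It is an assumption made for illustration, not the kubelet's implementation.
// It returns false and a failure reason when the request does not fit.
func admit(allocatable, inUse, requested int) (bool, string) {
	free := allocatable - inUse
	if free < 0 {
		// More GPUs are held than are currently allocatable: nothing is free.
		free = 0
	}
	if requested > free {
		return false, "UnexpectedAdmissionError"
	}
	return true, ""
}

func main() {
	// node1 now has 6 allocatable GPUs, while pod0 still holds 8.
	ok, reason := admit(6, 8, 6)
	fmt.Println(ok, reason) // prints: false UnexpectedAdmissionError
}
```

Under this model the node has no free GPUs at all, so any scheduler decision that treats 6 GPUs as idle on `node1` will be rejected at admission time.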
Anything else we need to know?:
Main cause:
[…] `Ready` when its resource is updated. But there is a problem: when `SetNode` is called and the node is updated, there might be tasks which were not added back to this node due to a previous `AllocateFailError`.
When will the node be changed to ready?
[…]), there are two possible cases in which the node will become `Ready`:
- […] `AllocateFailError`);
- […] `AllocateFailError`.
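The scheduler-side problem described above can be sketched in Go as follows. This is a toy model under stated assumptions, not Volcano's actual code: the `Task` and `Node` types, the `addTask` and `setNode` methods, and the `errAllocateFail` value are all hypothetical stand-ins that only mirror the names `SetNode` and `AllocateFailError` mentioned in the description.

```go
package main

import (
	"errors"
	"fmt"
)

// All types below are simplified, hypothetical stand-ins, not Volcano's real API.
type Task struct {
	Name string
	GPUs int
}

type Node struct {
	Name        string
	Allocatable int
	Idle        int
	Ready       bool
	Tasks       []Task
}

// errAllocateFail plays the role of AllocateFailError in this sketch.
var errAllocateFail = errors.New("AllocateFailError")

// addTask charges a task against the node's idle GPUs.
func (n *Node) addTask(t Task) error {
	if t.GPUs > n.Idle {
		return errAllocateFail
	}
	n.Idle -= t.GPUs
	n.Tasks = append(n.Tasks, t)
	return nil
}

// setNode rebuilds the node's accounting from a new allocatable count and the
// pods known to run on it. It returns the tasks that could not be added back.
func (n *Node) setNode(allocatable int, running []Task) []Task {
	n.Allocatable = allocatable
	n.Idle = allocatable
	n.Tasks = nil
	var dropped []Task
	for _, t := range running {
		if err := n.addTask(t); err != nil {
			dropped = append(dropped, t)
		}
	}
	// Problematic step in this model: the node is marked Ready
	// even though some tasks were not added back.
	n.Ready = true
	return dropped
}

func main() {
	node := &Node{Name: "node1"}
	dropped := node.setNode(6, []Task{{Name: "pod0", GPUs: 8}})
	fmt.Printf("ready=%v idle=%d dropped=%d\n", node.Ready, node.Idle, len(dropped))
	// prints: ready=true idle=6 dropped=1
}
```

In this model `node1` ends up `Ready` with 6 idle GPUs while `pod0` is missing from its accounting, which is the state that lets `pod1` be scheduled there.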
Environment: