Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: filter device instance IDs by constraints #18141

Merged
merged 1 commit into from
Aug 3, 2023
Merged

Conversation

tgross
Copy link
Member

@tgross tgross commented Aug 3, 2023

When the scheduler assigns a device instance, it iterates over the feasible devices and then picks the first instance with availability. If the jobspec uses a constraint on device ID, this can lead to buggy/surprising behavior where the node's device matches the constraint but then the individual device instance does not.

Add a second filter based on the ${device.ids} constraint after selecting a node's device to ensure the device instance ID falls within the constraint as well.

Fixes: #18112
See #18112 (comment) for reproduction steps.
Will get backported to 1.6.x, 1.5.x, and 1.4.x

When the scheduler assigns a device instance, it iterates over the feasible
devices and then picks the first instance with availability. If the jobspec uses
a constraint on device ID, this can lead to buggy/surprising behavior where the
node's device matches the constraint but then the individual device instance
does not.

Add a second filter based on the `${device.ids}` constraint after selecting a
node's device to ensure the device instance ID falls within the constraint as
well.

Fixes: #18112
Copy link
Member

@gulducat gulducat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tgross tgross merged commit b1742c7 into main Aug 3, 2023
@tgross tgross deleted the device-assign branch August 3, 2023 18:58
tgross added a commit that referenced this pull request Aug 3, 2023
When the scheduler assigns a device instance, it iterates over the feasible
devices and then picks the first instance with availability. If the jobspec uses
a constraint on device ID, this can lead to buggy/surprising behavior where the
node's device matches the constraint but then the individual device instance
does not.

Add a second filter based on the `${device.ids}` constraint after selecting a
node's device to ensure the device instance ID falls within the constraint as
well.

Fixes: #18112
tgross added a commit that referenced this pull request Aug 3, 2023
When the scheduler assigns a device instance, it iterates over the feasible
devices and then picks the first instance with availability. If the jobspec uses
a constraint on device ID, this can lead to buggy/surprising behavior where the
node's device matches the constraint but then the individual device instance
does not.

Add a second filter based on the `${device.ids}` constraint after selecting a
node's device to ensure the device instance ID falls within the constraint as
well.

Fixes: #18112
tgross added a commit that referenced this pull request Aug 3, 2023
When the scheduler assigns a device instance, it iterates over the feasible
devices and then picks the first instance with availability. If the jobspec uses
a constraint on device ID, this can lead to buggy/surprising behavior where the
node's device matches the constraint but then the individual device instance
does not.

Add a second filter based on the `${device.ids}` constraint after selecting a
node's device to ensure the device instance ID falls within the constraint as
well.

Fixes: #18112
@tgross
Copy link
Member Author

tgross commented Aug 3, 2023

Backported to 1.6.x, 1.5.x, and 1.4.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

Cannot set GPUs using UUID as constraint
3 participants