feat: support node selectors & affinities for Kubernetes resource pools #9428
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9428      +/-   ##
==========================================
+ Coverage   49.28%   49.30%   +0.01%
==========================================
  Files        1242     1242
  Lines      161444   161471     +27
  Branches     2868     2868
==========================================
+ Hits        79570    79606     +36
+ Misses      81702    81693      -9
  Partials      172      172
@@ -677,6 +677,7 @@ func (p k8sJobResource) Start(
 	spec:  spec,
 	slots: p.slots,
 	rank:  rri.AgentRank,
+	resourcePool: p.req.ResourcePool,
I think I found a bug here -- we reference this field when we start a job in master/internal/rm/kubernetesrm/jobs.go, but we never set it at the resource pool level.
In fact, we only set this field when we attempt to restore resources/reattach a job.
requested edits and clarification
lgtm
really nice tests
Ticket
RM-255
Description
Defining Determined resource pools on Kubernetes in terms of node selectors or affinities (using the resource pool's task_container_defaults.cpu_spec / gpu_spec) should be a supported feature; a rough sketch of such a pool follows below. To achieve this, in this ticket I have to import k8s.io/component-helpers v0.20.1 -- this explains the go.sum and go.mod changes.

Carolina's Open Questions:
When applying task_container_defaults, we use the GPUPodSpec in all cases where more than one slot is requested. However, when we get the node/resource pool mapping in the jobs service, and when we check node selectors/affinities, we use the GPUPodSpec only in cases where slotType == device.CUDA. Which condition should we use consistently? --> I chose to mirror the logic for taints/tolerations in getNodeResourcePoolMapping.
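As a minimal sketch of the shape this enables, assuming the pool-level task_container_defaults fields are cpu_pod_spec / gpu_pod_spec (the cpu_spec / gpu_spec shorthand above), and using a made-up pool name and node label:

```yaml
resource_pools:
  - pool_name: gpu-affinity-pool          # hypothetical pool name
    task_container_defaults:
      gpu_pod_spec:                       # a full Pod manifest; cpu_pod_spec works the same way
        apiVersion: v1
        kind: Pod
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: gpu-type     # hypothetical node label
                        operator: In
                        values:
                          - a100
```

With this change, tasks scheduled into gpu-affinity-pool should only land on nodes matching that affinity, and the node/resource pool mapping reported by the RM should reflect it.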
Test Plan
See attached tests.
To manually test (not necessary), you can spin up your own minikube cluster with multiple node pools and configure the Determined cluster to match the node labels of those pools.
Suppose you have three nodes in a k8s cluster, named carolina, carolina-m02, and carolina-m03. Add resource pools targeting those nodes to your devcluster config, and then, to specify that an experiment runs on a specific resource pool, set that pool in your experiment.yaml (both are sketched below).
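A minimal sketch of that setup, assuming the minikube nodes keep their default kubernetes.io/hostname labels (the pool names here are arbitrary placeholders, not the original snippet):

```yaml
# devcluster: resource_pools section of the master config,
# one pool pinned to each node via a nodeSelector.
resource_pools:
  - pool_name: carolina-pool
    task_container_defaults:
      cpu_pod_spec:
        apiVersion: v1
        kind: Pod
        spec:
          nodeSelector:
            kubernetes.io/hostname: carolina
  - pool_name: carolina-m02-pool
    task_container_defaults:
      cpu_pod_spec:
        apiVersion: v1
        kind: Pod
        spec:
          nodeSelector:
            kubernetes.io/hostname: carolina-m02
  # ...and likewise a carolina-m03-pool for the third node.
```

```yaml
# experiment.yaml: pin the experiment to one of the pools above.
resources:
  resource_pool: carolina-m02-pool
```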
Make sure that the experiment runs on the right node pool and that the WebUI and the GetAgents API call properly reflect this.
Checklist
Release note added under docs/release-notes/. See Release Note for details.