# Add support for Kubernetes arbitrary node selector #1365
## Comments
Hopefully closed with the upcoming CML release, v0.19.0. Thank you very much, @ludelafo! And... cough it would be awesome if you could test it on your side and tell us how it goes. 😅

Great news! Thanks! I'll test the new release in the following days and report back to you.

Seems to be working well from my testing, thank you again!

Awesome, thanks for your patience.
Hello,
We have an on-premises Kubernetes cluster with multiple nodes. Some of these nodes have dedicated GPUs.
We would like to use CML to train our models on those specific nodes. Kubernetes supports many ways to specify on which node a pod should be run:
## Examples

### nodeSelector
Assigning a pod to a node with `nodeSelector` is as simple as labeling a node with a certain tag and adding a `nodeSelector` to the pod configuration. Example inspired by the Assign Pods to Nodes [with `nodeSelector`] documentation:

```sh
# Apply the Pod configuration
kubectl apply -f pod-nginx.yaml
```
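For reference, a minimal `pod-nginx.yaml` in the spirit of the upstream example could look like the sketch below; the nginx image and the `disktype=ssd` label are purely illustrative, and the node must have been labeled beforehand (e.g. `kubectl label nodes <node-name> disktype=ssd`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      imagePullPolicy: IfNotPresent
  # Only nodes carrying this label are eligible to run the pod.
  nodeSelector:
    disktype: ssd
```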
The pod will be created on the node with the label `disktype=ssd`.

#### Bonus: Assign a pod to a node by its name
You can also assign a pod to a node using the node's name. Example inspired by the Assign Pods to Nodes [with `nodeSelector`] documentation, section "Create a pod that gets scheduled to specific node":
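A minimal sketch of such a manifest, assuming a node named `kube-01` exists in the cluster (the node name is purely illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
  # Bypass the scheduler and run the pod directly on this node.
  nodeName: kube-01
```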
### Affinity and anti-affinity
Another way to assign a pod to a node is to use the affinity and anti-affinity feature. Example inspired by the Assigning Pods to Nodes documentation, section "Node affinity":
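A sketch of such a pod spec, reusing the `disktype=ssd` label from above; the hard and the soft rule are combined in one manifest purely for illustration (in practice you would typically pick one):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only nodes labeled disktype=ssd are eligible.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
      # Soft preference: favor nodes labeled disktype=ssd, but fall back to any node.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
  containers:
    - name: nginx
      image: nginx
```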
The `requiredDuringSchedulingIgnoredDuringExecution` rule means that the pod will only be scheduled on a node that has the label `disktype=ssd`. If no node has the label, the pod will not be scheduled.

The `preferredDuringSchedulingIgnoredDuringExecution` rule allows specifying a preference for a node that has the label `disktype=ssd`. If no node has the label, the pod will still be scheduled on any node.

The affinity and anti-affinity feature is more powerful than `nodeSelector`: it allows specifying more complex rules. For example, you can specify that a pod should be scheduled on a node that has both the label `disktype=ssd` and the label `gputype=k80`.
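A sketch of such a combined rule (both label keys are only illustrative):

```yaml
# Fragment of spec.affinity.nodeAffinity: both labels must be present on the same node.
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
    - matchExpressions:
        - key: disktype
          operator: In
          values:
            - ssd
        - key: gputype
          operator: In
          values:
            - k80
```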
## CML and Kubernetes arbitrary node selector
As discussed with @0x2b3bfa0 on Discord, the only node selector supported by CML at the moment is the `accelerator` node selector (https://github.com/iterative/terraform-provider-iterative/blob/ce4f3bec2300b3a15d615f3456575794f829f72b/iterative/kubernetes/provider.go#L84).

From my understanding, the `accelerator` node selector is set by cloud providers when you use a GPU instance. We could set those `accelerator` labels on our own nodes so that CML could use them when the `--cloud-gpu` flag is passed. It could work, but it seems a bit hacky. Allowing a `--cloud-k8s-node-selector` flag would be more flexible, as any label could be set and used.

It seems to me that the implementation of the `nodeSelector` feature would be rather simple (I have not thought about the implementation of the `affinity` feature yet, as it seems more complex); see the sketch after this list:

- Set the `nodeSelector` field on the Kubernetes pod.
- Add a `--cloud-k8s-node-selector` flag and plumb it through:
  - `cml/src/terraform.js`, line 84 in 1be24ed
  - `cml/bin/cml/runner/launch.js`, line 113 in 1be24ed
  - `cml/bin/cml/runner/launch.js`, line 457 in 1be24ed
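To make the intent concrete, here is a rough sketch; the `--cloud-k8s-node-selector` flag and its `key=value` format are part of this proposal and do not exist in CML yet:

```yaml
# Hypothetical invocation (the flag is the one proposed above, not an existing CML option):
#   cml runner --cloud=kubernetes --cloud-k8s-node-selector="gputype=k80" ...
#
# The runner pod created by terraform-provider-iterative would then simply carry
# the corresponding field, just like a hand-written manifest:
spec:
  nodeSelector:
    gputype: k80
```

This would mirror what the provider already does for the `accelerator` label, only with a user-supplied key and value.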
I would like to know if you are interested in this and if you have any feedback on this proposal.
I would be willing to add support for this feature in CML. Any resources to start implementing it (your contribution guide, your usual workflows, etc.) would be helpful to get started. I'm available on Discord (same username) if you want to discuss this further.