feat: support node selectors & affinities for Kubernetes resource pools #9428

Merged
merged 2 commits into from
Jun 17, 2024

Conversation

carolinaecalderon
Contributor

@carolinaecalderon carolinaecalderon commented May 28, 2024

Ticket

RM-255

Description

Defining Determined resource pools on Kubernetes in terms of node selectors or affinities (using the resource pool’s task_container_defaults.cpu_pod_spec / gpu_pod_spec) should be a supported feature. To achieve this, this PR does the following:

  • In the UI, resource pool slot counts incorrectly reflected the state of the entire cluster because the GetAgents API was unaware of node selectors and affinities. Slot counts now respect node selectors, as well as affinities specified with requiredDuringSchedulingIgnoredDuringExecution. To test this manually, spin up a cluster whose resource pools are split by node selectors.
  • Update the docs to describe using node selectors/affinities to define multiple resource pools in one k8s cluster.
  • Add tests for the feature.
  • Give node selectors/affinities the same special consideration that taints/tolerations already receive, everywhere the latter are handled.
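To make the slot-count change concrete, here is a minimal sketch of how a per-pool slot count can respect node selectors and required node affinities. It is not the code from this PR: the types Requirement, Term, PoolSpec, and Node and the functions nodeMatchesPool and poolSlots are hypothetical stand-ins for the Kubernetes API objects, chosen so the example has no dependencies. It assumes the usual Kubernetes semantics (nodeSelector entries are ANDed, affinity terms are ORed, requirements inside a term are ANDed) and leaves out the Gt and Lt operators.

```go
package main

// Requirement is a hypothetical, simplified stand-in for one matchExpressions
// entry of a Kubernetes node selector term. Gt and Lt are omitted for brevity.
type Requirement struct {
	Key      string
	Operator string // "In", "NotIn", "Exists", or "DoesNotExist"
	Values   []string
}

// Term is one node selector term: all of its requirements must hold.
type Term []Requirement

// PoolSpec holds the scheduling constraints taken from a pool's pod spec.
// RequiredTerms models requiredDuringSchedulingIgnoredDuringExecution.
type PoolSpec struct {
	NodeSelector  map[string]string
	RequiredTerms []Term
}

// Node is a hypothetical view of a cluster node: its labels and slot count.
type Node struct {
	Name   string
	Labels map[string]string
	Slots  int
}

func contains(values []string, v string) bool {
	for _, candidate := range values {
		if candidate == v {
			return true
		}
	}
	return false
}

// matches reports whether the node labels satisfy a single requirement.
func (r Requirement) matches(labels map[string]string) bool {
	val, ok := labels[r.Key]
	switch r.Operator {
	case "In":
		return ok && contains(r.Values, val)
	case "NotIn":
		// A missing label also satisfies NotIn.
		return !ok || !contains(r.Values, val)
	case "Exists":
		return ok
	case "DoesNotExist":
		return !ok
	default:
		return false
	}
}

// matches reports whether every requirement in the term holds.
// An empty term matches nothing.
func (t Term) matches(labels map[string]string) bool {
	if len(t) == 0 {
		return false
	}
	for _, r := range t {
		if !r.matches(labels) {
			return false
		}
	}
	return true
}

// nodeMatchesPool applies the node selector (all pairs must match) and then
// the required affinity terms (at least one term must match, if any are set).
func nodeMatchesPool(n Node, p PoolSpec) bool {
	for k, want := range p.NodeSelector {
		if got, ok := n.Labels[k]; !ok || got != want {
			return false
		}
	}
	if len(p.RequiredTerms) == 0 {
		return true
	}
	for _, t := range p.RequiredTerms {
		if t.matches(n.Labels) {
			return true
		}
	}
	return false
}

// poolSlots sums the slots of only those nodes that the pool can schedule on.
func poolSlots(nodes []Node, p PoolSpec) int {
	total := 0
	for _, n := range nodes {
		if nodeMatchesPool(n, p) {
			total += n.Slots
		}
	}
	return total
}
```

A pool with no selector and no required affinity counts every node, which corresponds to the whole-cluster slot count described above. To try the sketch, call poolSlots from a main function or a test with a few Node values.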

This change imports k8s.io/component-helpers v0.20.1, which explains the go.mod and go.sum changes.

Carolina's Open Questions:

  • Can/should nodes be allowed to match to multiple resource pools? A many-to-many mapping? (Current tests allow a many-to-many mapping)
  • When setting a resource pool's task_container_defaults, we use the GPUPodSpec in all cases where more than one slot is requested. However, when we get the node/resource pool mapping in the jobs service, and when we check node selectors/affinities, we use the GPUPodSpec only when slotType == device.CUDA. Which condition should we use consistently? Resolution: I chose to mirror the logic for taints/tolerations in getNodeResourcePoolMapping.
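Both questions can be illustrated with a small sketch. It is not the PR's getNodeResourcePoolMapping: the types SlotType, PodSpec, TaskContainerDefaults, and PoolConfig and the functions podSpecFor and nodeToPools are hypothetical and simplified. The sketch assumes the pod spec is chosen by slot type alone (the GPU spec only for CUDA slots) and that a node may belong to several pools. Only node selectors are checked, to keep it short.

```go
package main

import "sort"

// SlotType is a hypothetical stand-in for the device type of a slot.
type SlotType string

const (
	CUDA SlotType = "cuda"
	CPU  SlotType = "cpu"
)

// PodSpec keeps only the field this sketch needs.
type PodSpec struct {
	NodeSelector map[string]string
}

// TaskContainerDefaults mirrors the cpu_pod_spec / gpu_pod_spec pair.
type TaskContainerDefaults struct {
	CPUPodSpec *PodSpec
	GPUPodSpec *PodSpec
}

// PoolConfig is a hypothetical, trimmed-down resource pool definition.
type PoolConfig struct {
	PoolName string
	Defaults TaskContainerDefaults
}

// podSpecFor applies one rule everywhere: the GPU pod spec is used only for
// CUDA slots, and the CPU pod spec otherwise.
func podSpecFor(d TaskContainerDefaults, st SlotType) *PodSpec {
	if st == CUDA {
		return d.GPUPodSpec
	}
	return d.CPUPodSpec
}

// selectorMatches reports whether every selector pair is present in labels.
func selectorMatches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// nodeToPools maps each node name to the sorted names of all pools whose
// selector matches it. A node can appear under several pools, and a pool can
// contain several nodes, so the mapping is many-to-many.
func nodeToPools(
	nodeLabels map[string]map[string]string, pools []PoolConfig, st SlotType,
) map[string][]string {
	out := make(map[string][]string, len(nodeLabels))
	for name, labels := range nodeLabels {
		matched := []string{}
		for _, p := range pools {
			spec := podSpecFor(p.Defaults, st)
			// A pool without a pod spec places no constraint on nodes.
			if spec == nil || selectorMatches(spec.NodeSelector, labels) {
				matched = append(matched, p.PoolName)
			}
		}
		sort.Strings(matched)
		out[name] = matched
	}
	return out
}
```

Routing every lookup through a single helper such as podSpecFor is one way to keep the spec-selection condition consistent between setting defaults and computing the mapping. Whether that matches the final implementation is not stated in this description.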

Test Plan

See attached tests.

To test manually (not necessary), you can spin up your own minikube cluster with multiple nodes and configure the Determined cluster to match the labels of those nodes.

Suppose you have three nodes in a k8s cluster, named carolina, carolina-m02, and carolina-m03. Add the following to your devcluster config:

resource_pools:
  - pool_name: m1
    task_container_defaults:
      gpu_pod_spec:
        apiVersion: v1
        kind: Pod
        spec:
          nodeSelector:
            kubernetes.io/hostname: carolina
  - pool_name: m2
    task_container_defaults:
      gpu_pod_spec:
        apiVersion: v1
        kind: Pod
        spec:
          nodeSelector:
            kubernetes.io/hostname: carolina-m02
  - pool_name: m3
    task_container_defaults:
      gpu_pod_spec:
        apiVersion: v1
        kind: Pod
        spec:
          nodeSelector:
            kubernetes.io/hostname: carolina-m03

Then, to specify that an experiment runs on a specific resource pool, add the following to your experiment.yaml:

resources:
  resource_pool: m3

Verify that the experiment runs on the expected node, and that both the WebUI and the GetAgents API response reflect this.

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot cla-bot bot added the cla-signed label May 28, 2024

codecov bot commented May 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.30%. Comparing base (f9ba7f4) to head (01c9b1d).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9428      +/-   ##
==========================================
+ Coverage   49.28%   49.30%   +0.01%     
==========================================
  Files        1242     1242              
  Lines      161444   161471      +27     
  Branches     2868     2868              
==========================================
+ Hits        79570    79606      +36     
+ Misses      81702    81693       -9     
  Partials      172      172              
Flag      Coverage Δ
backend   43.94% <100.00%> (+0.04%) ⬆️
harness   63.81% <ø> (ø)
web       44.87% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files                                              Coverage Δ
master/internal/rm/kubernetesrm/jobs.go            76.48% <100.00%> (+1.17%) ⬆️
master/internal/rm/kubernetesrm/resource_pool.go   73.63% <100.00%> (+0.06%) ⬆️

... and 5 files with indirect coverage changes

Base automatically changed from stoksc/feat/pods2jobs to stoksc/feat/kubernetesjobs May 29, 2024 00:52
Base automatically changed from stoksc/feat/kubernetesjobs to main May 31, 2024 11:38

netlify bot commented Jun 4, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 01c9b1d
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/666c7a3bd401500008701128

@determined-ci determined-ci requested a review from a team June 6, 2024 19:51
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Jun 6, 2024
@carolinaecalderon carolinaecalderon changed the title from "Carolinac/rm 255" to "feat: support node selectors for Kubernetes resource pools" Jun 7, 2024
@@ -677,6 +677,7 @@ func (p k8sJobResource) Start(
spec: spec,
slots: p.slots,
rank: rri.AgentRank,
resourcePool: p.req.ResourcePool,
Contributor Author


I think I found a bug here -- we reference this field when we start a job in master/internal/rm/kubernetesrm/jobs.go, but we never set it on the resource pool level.
In fact, we only set this field when we attempt to restore resources/reattach a job.

@carolinaecalderon carolinaecalderon marked this pull request as ready for review June 7, 2024 18:10
@carolinaecalderon carolinaecalderon requested review from a team as code owners June 7, 2024 18:10
@carolinaecalderon carolinaecalderon marked this pull request as draft June 7, 2024 19:08
@carolinaecalderon carolinaecalderon marked this pull request as ready for review June 10, 2024 21:22
Review thread on go.mod: outdated, resolved
Review thread on master/internal/rm/kubernetesrm/jobs.go: resolved
Contributor

@tara-det-ai tara-det-ai left a comment


requested edits and clarification

Contributor

@NicholasBlaskey NicholasBlaskey left a comment


lgtm

really nice tests

Review thread on master/internal/rm/kubernetesrm/jobs.go: outdated, resolved
@determined-ci determined-ci requested a review from a team June 14, 2024 15:44
@carolinaecalderon carolinaecalderon changed the title from "feat: support node selectors for Kubernetes resource pools" to "feat: support node selectors & affinities for Kubernetes resource pools" Jun 14, 2024
@carolinaecalderon carolinaecalderon merged commit 63a4163 into main Jun 17, 2024
85 of 98 checks passed
@carolinaecalderon carolinaecalderon deleted the carolinac/rm-255 branch June 17, 2024 16:47
Labels
cla-signed documentation Improvements or additions to documentation

4 participants