chore: initial k8s rocm support [CM-367] #9794
Conversation
✅ Deploy Preview for determined-ui ready!
Force-pushed from 35bc0df to 763a121.
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #9794 +/- ##
==========================================
- Coverage 54.38% 54.01% -0.38%
==========================================
Files 1261 1261
Lines 155770 155795 +25
Branches 3540 3539 -1
==========================================
- Hits 84711 84146 -565
- Misses 70921 71511 +590
Partials 138 138
Flags with carried forward coverage won't be shown.
@@ -403,14 +403,21 @@ resource pool ``max_slots_per_pod``.
 ``slot_type``
 -------------

-Resource type used for compute tasks. Defaults to ``cuda``.
+Resource type used for compute tasks. Valid options are ``gpu``, ``cuda``, ``cpu``, or ``rocm``.
+Defaults to ``cuda``.
Does it default to `cuda` or `gpu`? The helm config reference says `gpu`?
The master config defaults to `cuda`, helm defaults to `gpu`.
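To make the defaults concrete, here is a minimal master configuration sketch, assuming the usual `resource_manager` section of `master.yaml`; the nesting and neighboring keys are illustrative, not copied from the shipped schema.

```yaml
# Illustrative master.yaml fragment; key placement is an assumption.
resource_manager:
  type: kubernetes
  # Valid slot types per the docs change above: gpu, cuda, cpu, rocm.
  # The master config treats an omitted slot_type as cuda.
  slot_type: rocm
  max_slots_per_pod: 1
```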
docs/release-notes/rocm-k8s-gpu.rst (outdated)
**New Features**

- Kubernetes: Experimental support for AMD ROCM GPUs is now available for Kubernetes. To use, set
  ``slotType=rocm``. See :ref:`helm-config-reference` for more details.
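For illustration, a hedged Helm `values.yaml` sketch; the exact key path is an assumption based on the release note wording, so confirm it against the helm-config-reference page.

```yaml
# Illustrative Helm values fragment; confirm the exact key in helm-config-reference.
# Helm defaults slotType to gpu, which the master maps to cuda internally.
slotType: rocm
```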
nit: I think the docs style guideline prefers `visit` over `see`.
LGTM! I left a couple questions, but nothing blocking.
-case device.CPU, device.CUDA:
-case device.ROCM:
-	checkSlotType = errors.Errorf("rocm slot_type is not supported yet on k8s")
+case device.CPU, device.CUDA, device.ROCM:
Does the code detect when the user inputs `gpu` and switches it to `cuda`? This makes it seem like `gpu` isn't actually accepted for `slotType`?
Yeah, we switch it here:

k.SlotType = device.CUDA
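For context, a self-contained Go sketch of the normalization being described; the type and function names here are illustrative, not the actual Determined code, and only mirror the "gpu maps to cuda before validation" behavior discussed above.

```go
package main

import "fmt"

// Sketch only: a user-facing "gpu" slot type is normalized to "cuda" before
// the switch validates the value. Names are assumptions, not Determined's types.
type slotType string

const (
	slotCPU  slotType = "cpu"
	slotCUDA slotType = "cuda"
	slotROCM slotType = "rocm"
	slotGPU  slotType = "gpu" // accepted in config, treated as cuda internally
)

func normalizeSlotType(s slotType) (slotType, error) {
	if s == slotGPU {
		s = slotCUDA // corresponds to k.SlotType = device.CUDA in the real code path
	}
	switch s {
	case slotCPU, slotCUDA, slotROCM:
		return s, nil
	default:
		return "", fmt.Errorf("invalid slot_type %q", s)
	}
}

func main() {
	for _, in := range []slotType{"gpu", "rocm", "tpu"} {
		out, err := normalizeSlotType(in)
		fmt.Println(in, "->", out, err)
	}
}
```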
Thanks!!
Force-pushed from d00685d to 0a054f1.
Ticket
Description
Adds initial experimental support for AMD GPUs on k8s.
Test Plan
Unit tests cover this change.
Also manually verified on AMD hardware. Automated e2e tests were deemed out of scope due to hardware availability issues.
Blocked on determined-ai/environments#275. We just need the environment images to land in the same release as this. Some additional docs will be added in the environments PR detailing what our ROCm environment does and does not support.
Checklist
docs/release-notes/
See Release Note for details.