chore: initial k8s rocm support [CM-367] #9794
Conversation
✅ Deploy Preview for determined-ui ready!
Force-pushed from 35bc0df to 763a121.
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #9794 +/- ##
==========================================
- Coverage 54.38% 54.01% -0.38%
==========================================
Files 1261 1261
Lines 155770 155795 +25
Branches 3540 3539 -1
==========================================
- Hits 84711 84146 -565
- Misses 70921 71511 +590
Partials 138 138
Flags with carried forward coverage won't be shown.
@@ -403,14 +403,21 @@ resource pool ``max_slots_per_pod``.
 ``slot_type``
 -------------

-Resource type used for compute tasks. Defaults to ``cuda``.
+Resource type used for compute tasks. Valid options are ``gpu``, ``cuda``, ``cpu``, or ``rocm``.
+Defaults to ``cuda``.
Does it default to `cuda` or `gpu`? The helm config reference says `gpu`?
The master config defaults to `cuda`, helm defaults to `gpu`.
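To make the defaults concrete, here is a minimal master configuration sketch, assuming the usual `resource_manager` section of `master.yaml`; the nesting and neighboring keys are illustrative, not copied from the shipped schema.

```yaml
# Illustrative master.yaml fragment; key placement is an assumption.
resource_manager:
  type: kubernetes
  # Valid slot types per the docs change above: gpu, cuda, cpu, rocm.
  # The master config treats an omitted slot_type as cuda.
  slot_type: rocm
  max_slots_per_pod: 1
```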
docs/release-notes/rocm-k8s-gpu.rst (outdated)
**New Features**

- Kubernetes: Experimental support for AMD ROCM GPUs is now available for Kubernetes. To use, set
  ``slotType=rocm``. See :ref:`helm-config-reference` for more details.
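For illustration, a hedged Helm `values.yaml` sketch; the exact key path is an assumption based on the release note wording, so confirm it against the helm-config-reference page.

```yaml
# Illustrative Helm values fragment; confirm the exact key in helm-config-reference.
# Helm defaults slotType to gpu, which the master maps to cuda internally.
slotType: rocm
```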
nit: I think the docs style guideline prefers `visit` over `see`.
LGTM! I left a couple questions, but nothing blocking.
-case device.CPU, device.CUDA:
-case device.ROCM:
-	checkSlotType = errors.Errorf("rocm slot_type is not supported yet on k8s")
+case device.CPU, device.CUDA, device.ROCM:
Does the code detect when the user inputs `gpu` and switches it to `cuda`? This makes it seem like `gpu` isn't actually accepted for `slotType`?
Yeah, we switch it here:

k.SlotType = device.CUDA
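For context, a self-contained Go sketch of the normalization being described; the type and function names here are illustrative, not the actual Determined code, and only mirror the "gpu maps to cuda before validation" behavior discussed above.

```go
package main

import "fmt"

// Sketch only: a user-facing "gpu" slot type is normalized to "cuda" before
// the switch validates the value. Names are assumptions, not Determined's types.
type slotType string

const (
	slotCPU  slotType = "cpu"
	slotCUDA slotType = "cuda"
	slotROCM slotType = "rocm"
	slotGPU  slotType = "gpu" // accepted in config, treated as cuda internally
)

func normalizeSlotType(s slotType) (slotType, error) {
	if s == slotGPU {
		s = slotCUDA // corresponds to k.SlotType = device.CUDA in the real code path
	}
	switch s {
	case slotCPU, slotCUDA, slotROCM:
		return s, nil
	default:
		return "", fmt.Errorf("invalid slot_type %q", s)
	}
}

func main() {
	for _, in := range []slotType{"gpu", "rocm", "tpu"} {
		out, err := normalizeSlotType(in)
		fmt.Println(in, "->", out, err)
	}
}
```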
Thanks!!
Force-pushed from d00685d to 0a054f1.
Ticket
Description
Adds initial experimental support for AMD GPUs on k8s.
Test Plan
Unit tests cover this change.
Also manually verified on AMD hardware. Automated e2e tests were deemed out of scope due to hardware availability issues.
Blocked on determined-ai/environments#275. We just need the environment images to land in the same release as this. Some additional docs will be added in the environments PR detailing what our ROCm environment does and does not support.
Checklist
docs/release-notes/
See Release Note for details.