Describe the bug
I'm trying to run detAI in a K8s cluster of AMD MI250X GPUs. According to the master config reference, I must set `slot_type: rocm`. I noticed I can set this through `values.yaml` (see the master conf Helm template). But even after setting it there, I still can't see the change in the master pod.
The relevant variable in `values.yaml`:
```yaml
# This is the number of GPUs there are per machine. Determined uses this information when scheduling
# multi-GPU tasks. Each multi-GPU (distributed training) task will be scheduled as a set of
# `slotsPerTask / maxSlotsPerPod` separate pods, with each pod assigned up to `maxSlotsPerPod` GPUs.
# Distributed tasks with sizes that are not divisible by `maxSlotsPerPod` are never scheduled. If
# you have a cluster of different size nodes (e.g., 4 and 8 GPUs per node), set `maxSlotsPerPod` to
# the greatest common divisor of all the sizes (4, in that case).
maxSlotsPerPod: 2
slot_type: rocm
```
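For reference, here is a hedged sketch of what I would expect the rendered `resource_manager` section in the master pod's `/etc/determined/master.yaml` to contain once the value is honored. The `type` and `max_slots_per_pod` field names are assumptions based on the master config reference, not observed output:

```yaml
# Hypothetical expected rendering in /etc/determined/master.yaml
# (type and max_slots_per_pod are assumed field names, not observed output):
resource_manager:
  type: kubernetes
  slot_type: rocm
  max_slots_per_pod: 2
```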
Contents of master pod's /etc/determined/master.yaml:
I'm not sure if that's the reason, but when I start an experiment it is still looking for NVIDIA resources:
```shell
bardpeak014:~ # k describe pod det-69bc32d1-exp-8-trial-8-attempt-1-w27rg | grep -A 3 Event
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  3m26s (x2 over 8m52s)  default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: not eligible due to preemptionPolicy=Never..
```
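The event is consistent with the trial pod still requesting NVIDIA devices. As a hedged illustration (resource quantities are made up), with the default slot type the pod requests `nvidia.com/gpu`, while with `slot_type: rocm` it should request the AMD device-plugin resource instead:

```yaml
# Hypothetical trial-pod resource limits (quantities are illustrative).
# With the default slot type (cuda), pods request NVIDIA devices:
resources:
  limits:
    nvidia.com/gpu: 2
---
# With slot_type: rocm, they should request the AMD device-plugin resource:
resources:
  limits:
    amd.com/gpu: 2
```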
For some reason, `slot_type` in `values.yaml` is being ignored during `helm install`.
Reproduction Steps
1. Change `values.yaml`, adding `slot_type` in the top-level context.
2. Deploy the Helm chart.
3. Check `/etc/determined/master.yaml` in the master pod: `resource_manager.slot_type` is not there.
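One thing that may be worth ruling out (a guess on my part, not verified against the chart): the neighboring values in `values.yaml` are camelCase (e.g. `maxSlotsPerPod`), so if the master config template keys off a camelCase name, a snake_case `slot_type` would silently render nothing. A sketch of what such a template conditional might look like, with hypothetical key names:

```yaml
# Hypothetical Helm template fragment: if the template reads a camelCase
# key such as .Values.slotType, a snake_case slot_type value is silently dropped.
{{- if .Values.slotType }}
slot_type: {{ .Values.slotType }}
{{- end }}
```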
Expected Behavior
We should see `resource_manager.slot_type: rocm` in the master pod's `/etc/determined/master.yaml`.
Screenshot
Environment
Device or hardware: AMD MI250X
OS: K8s runs on top of SLES; detAI runs from the official registry images.
Browser: N/A
Version: 0.37.0
Additional Context
No response