
🐛[bug] Can't set slot_type through values.yaml - K8s deployment #10152

Closed
caio-davi opened this issue Oct 29, 2024 · 1 comment
Describe the bug

I'm trying to run detAI in a K8s cluster of AMD MI250X nodes. According to the master config reference, I must set slot_type: rocm.
I noticed I can set this through values.yaml (see the master conf Helm template). But even after setting it there, I still can't see the change in the master pod.

The referenced variable in values.yaml:


# This is the number of GPUs there are per machine. Determined uses this information when scheduling
# multi-GPU tasks. Each multi-GPU (distributed training) task will be scheduled as a set of
# `slotsPerTask / maxSlotsPerPod` separate pods, with each pod assigned up to `maxSlotsPerPod` GPUs.
# Distributed tasks with sizes that are not divisible by `maxSlotsPerPod` are never scheduled. If
# you have a cluster of different size nodes (e.g., 4 and 8 GPUs per node), set `maxSlotsPerPod` to
# the greatest common divisor of all the sizes (4, in that case).
maxSlotsPerPod: 2
slot_type: rocm

Contents of master pod's /etc/determined/master.yaml:

k exec -it determined-master-deployment-determined-5f4d9f5b79-p7n4j  cat /etc/determined/master.yaml 
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
log:
  level: "info"
  color: true

checkpoint_storage:
  type: "shared_fs"
  host_path: "/tmp/checkpoints"
  save_experiment_best: 0
  save_trial_best: 1
  save_trial_latest: 1

db:
  user: "postgres"
  password: "postgres"
  host: determined-db-service-determined
  port: 5432
  name: "determined"

security:
  initial_user_password: "Password_2024"
port: 8081

resource_manager:
  type: "kubernetes"
  default_namespace: "mlde-caio"
  max_slots_per_pod: 2
  master_service_name: determined-master-service-determined

  default_aux_resource_pool: 
  default_compute_resource_pool: 
resource_pools:
  - gpu_type: rocm
    max_slots: 2
    pool_name: default

task_container_defaults:
  gpu_pod_spec: {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{"k8s.v1.cni.cncf.io/networks":"[{ \"name\": \"hsn0-host\", \"interface\": \"hsn0\" },{ \"name\": \"hsn1-host\", \"interface\": \"hsn1\" }]"},"labels":{"customLabel":"gpu-label"}},"spec":{"containers":[{"env":[{"name":"HIP_VISIBLE_DEVICES","value":"0,1,2,3,4,5,6,7"}],"name":"determined-container","resources":{"limits":{"amd.com/gpu":8}},"volumeMounts":[{"mountPath":"/dev/cxi0","name":"dev-cxi0"},{"mountPath":"/dev/cxi1","name":"dev-cxi1"},{"mountPath":"/host/shared","name":"shared"}]}],"volumes":[{"hostPath":{"path":"/dev/cxi0"},"name":"dev-cxi0"},{"hostPath":{"path":"/dev/cxi1"},"name":"dev-cxi1"},{"hostPath":{"path":"/home/users/davica/workspace/detAI/shared"},"name":"shared"}]}}
  image:
     cpu: "determinedai/pytorch-ngc-dev:0736b6d"
     gpu: "determinedai/environments:rocm-5.6-pytorch-1.3-tf-2.10-rocm-mpich-0736b6d"
     rocm: "determinedai/environments:rocm-5.6-pytorch-1.3-tf-2.10-rocm-mpich-0736b6d"
telemetry:
  enabled: true

I'm not sure if that's the cause, but when I start an experiment, it still requests NVIDIA resources:

bardpeak014:~ # k describe pod det-69bc32d1-exp-8-trial-8-attempt-1-w27rg | grep -A 3 Event
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  3m26s (x2 over 8m52s)  default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: not eligible due to preemptionPolicy=Never..

For some reason, slot_type in values.yaml is being ignored during helm install.

Reproduction Steps

  1. Change values.yaml, adding slot_type at the top level.
  2. Deploy the Helm chart.
  3. Check /etc/determined/master.yaml in the master pod: resource_manager.slot_type is not there.

Expected Behavior

We should see resource_manager.slot_type: rocm in the master pod's /etc/determined/master.yaml.
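For reference, the expected fragment of the rendered master.yaml would look roughly like this. This is a sketch based on the master config reference and the pod's actual config shown above; only the slot_type line is the missing piece:

```yaml
resource_manager:
  type: "kubernetes"
  slot_type: rocm          # the key that is currently missing from the rendered config
  max_slots_per_pod: 2
  default_namespace: "mlde-caio"
```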

Screenshot

(screenshot omitted)

Environment

  • Device or hardware: AMD MI250X
  • OS: K8s runs on top of SLES; DetAI runs in the official registry images.
  • Browser: N/A
  • Version: 0.37.0

Additional Context

No response

@caio-davi caio-davi added the bug label Oct 29, 2024
caio-davi (Author):

My fault. I should have used slotType in the values.yaml, not slot_type.
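Per this follow-up, the chart's values.yaml expects camelCase keys. A minimal corrected sketch of the relevant section (the comment on the key mapping is an assumption based on this thread, not chart documentation):

```yaml
# Helm values for the Determined chart use camelCase keys;
# the chart template maps slotType into resource_manager slot
# configuration in the rendered master.yaml.
maxSlotsPerPod: 2
slotType: rocm
```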
