
🐛[bug] Can't set slot_type through values.yaml - K8s deployment #10152

Closed
caio-davi opened this issue Oct 29, 2024 · 1 comment
Describe the bug

I'm trying to run detAI in a K8s cluster of AMD MI250X nodes. According to the master config reference, I must set slot_type: rocm.
I noticed I can set this through values.yaml (see the master conf Helm template). But even after setting it there, I still can't see the change in the master pod.

The referenced variable in values.yaml:


# This is the number of GPUs there are per machine. Determined uses this information when scheduling
# multi-GPU tasks. Each multi-GPU (distributed training) task will be scheduled as a set of
# `slotsPerTask / maxSlotsPerPod` separate pods, with each pod assigned up to `maxSlotsPerPod` GPUs.
# Distributed tasks with sizes that are not divisible by `maxSlotsPerPod` are never scheduled. If
# you have a cluster of different size nodes (e.g., 4 and 8 GPUs per node), set `maxSlotsPerPod` to
# the greatest common divisor of all the sizes (4, in that case).
maxSlotsPerPod: 2
slot_type: rocm

Contents of master pod's /etc/determined/master.yaml:

k exec -it determined-master-deployment-determined-5f4d9f5b79-p7n4j  cat /etc/determined/master.yaml 
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
log:
  level: "info"
  color: true

checkpoint_storage:
  type: "shared_fs"
  host_path: "/tmp/checkpoints"
  save_experiment_best: 0
  save_trial_best: 1
  save_trial_latest: 1

db:
  user: "postgres"
  password: "postgres"
  host: determined-db-service-determined
  port: 5432
  name: "determined"

security:
  initial_user_password: "Password_2024"
port: 8081

resource_manager:
  type: "kubernetes"
  default_namespace: "mlde-caio"
  max_slots_per_pod: 2
  master_service_name: determined-master-service-determined

  default_aux_resource_pool: 
  default_compute_resource_pool: 
resource_pools:
  - gpu_type: rocm
    max_slots: 2
    pool_name: default

task_container_defaults:
  gpu_pod_spec: {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{"k8s.v1.cni.cncf.io/networks":"[{ \"name\": \"hsn0-host\", \"interface\": \"hsn0\" },{ \"name\": \"hsn1-host\", \"interface\": \"hsn1\" }]"},"labels":{"customLabel":"gpu-label"}},"spec":{"containers":[{"env":[{"name":"HIP_VISIBLE_DEVICES","value":"0,1,2,3,4,5,6,7"}],"name":"determined-container","resources":{"limits":{"amd.com/gpu":8}},"volumeMounts":[{"mountPath":"/dev/cxi0","name":"dev-cxi0"},{"mountPath":"/dev/cxi1","name":"dev-cxi1"},{"mountPath":"/host/shared","name":"shared"}]}],"volumes":[{"hostPath":{"path":"/dev/cxi0"},"name":"dev-cxi0"},{"hostPath":{"path":"/dev/cxi1"},"name":"dev-cxi1"},{"hostPath":{"path":"/home/users/davica/workspace/detAI/shared"},"name":"shared"}]}}
  image:
     cpu: "determinedai/pytorch-ngc-dev:0736b6d"
     gpu: "determinedai/environments:rocm-5.6-pytorch-1.3-tf-2.10-rocm-mpich-0736b6d"
     rocm: "determinedai/environments:rocm-5.6-pytorch-1.3-tf-2.10-rocm-mpich-0736b6d"
telemetry:
  enabled: true

I'm not sure if that's the cause, but when I start an experiment, it still requests NVIDIA resources:

bardpeak014:~ # k describe pod det-69bc32d1-exp-8-trial-8-attempt-1-w27rg | grep -A 3 Event
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  3m26s (x2 over 8m52s)  default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: not eligible due to preemptionPolicy=Never..

For some reason, slot_type in values.yaml is being ignored during helm install.

Reproduction Steps

  1. Change values.yaml, adding slot_type at the top level.
  2. Deploy the Helm chart.
  3. Check /etc/determined/master.yaml in the master pod: resource_manager.slot_type is not there.

Expected Behavior

We should see resource_manager.slot_type: rocm in the master pod's /etc/determined/master.yaml.
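For reference, the expected fragment of the rendered master.yaml would look roughly like this. This is a sketch based on the master config reference and the pod's actual config shown above; only the slot_type line is the missing piece:

```yaml
resource_manager:
  type: "kubernetes"
  slot_type: rocm          # the key that is currently missing from the rendered config
  max_slots_per_pod: 2
  default_namespace: "mlde-caio"
```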

Screenshot

(screenshot omitted)

Environment

  • Device or hardware: AMD MI250X
  • OS: K8s runs on top of SLES; DetAI runs in the official registry images.
  • Browser: N/A
  • Version: 0.37.0

Additional Context

No response

@caio-davi caio-davi added the bug label Oct 29, 2024
caio-davi (Author):

My fault. I should have used slotType in the values.yaml, not slot_type.
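Per this follow-up, the chart's values.yaml expects camelCase keys. A minimal corrected sketch of the relevant section (the comment on the key mapping is an assumption based on this thread, not chart documentation):

```yaml
# Helm values for the Determined chart use camelCase keys;
# the chart template maps slotType into resource_manager slot
# configuration in the rendered master.yaml.
maxSlotsPerPod: 2
slotType: rocm
```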
