
[k8s] Fix GKELabelFormatter for H100s #3627

Merged
merged 2 commits into from
Jun 4, 2024

romilbhardwaj
Collaborator

get_gke_accelerator_name returns the incorrect label for H100 (nvidia-tesla-h100 instead of nvidia-h100). This PR fixes it by removing the incorrect conditional check for H100.

Tested:

  • sky launch --gpus H100:1 on a GKE cluster with H100s.
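
For readers unfamiliar with the bug, here is a minimal sketch of the failure mode described above. It is not SkyPilot's actual code: `gke_accelerator_label` and `NO_TESLA_PREFIX` are hypothetical names, and the only label strings taken from this PR are `nvidia-tesla-h100` (incorrect) and `nvidia-h100` (expected). The T4 label is included as an assumed example of the `nvidia-tesla-` pattern.

```python
# Hypothetical sketch of the label-selection logic; not SkyPilot's implementation.
# Assumption: most GPUs map to "nvidia-tesla-<name>", while some (H100 here)
# map to "nvidia-<name>" with no "tesla" segment.
NO_TESLA_PREFIX = frozenset({'H100'})


def gke_accelerator_label(accelerator: str,
                          no_tesla_prefix: frozenset = NO_TESLA_PREFIX) -> str:
    """Return an assumed GKE accelerator label value for a GPU name."""
    name = accelerator.lower()
    if accelerator.upper() in no_tesla_prefix:
        return f'nvidia-{name}'
    return f'nvidia-tesla-{name}'


# With H100 treated as an exception, the expected label comes back.
print(gke_accelerator_label('H100'))
# If H100 falls through to the default branch, the incorrect label is produced.
print(gke_accelerator_label('H100', no_tesla_prefix=frozenset()))
```

The point of the sketch is only that a single misplaced conditional decides which prefix an accelerator gets, so a wrong check silently yields a label that matches no node.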

@romilbhardwaj
Collaborator Author

H100s are hard to get on GKE, so to test this PR I mocked one on my sky local up cluster with:

kubectl proxy

curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/nvidia.com~1gpu", "value": "8"}]' \
  http://localhost:8001/api/v1/nodes/skypilot-control-plane/status


kubectl label nodes skypilot-control-plane cloud.google.com/gke-accelerator=nvidia-h100-80gb

With that, sky launch --gpus H100:1 works as expected, and the GPU name is now reported as H100 (instead of H100-80GB).
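
A note on the `nvidia.com~1gpu` segment in the PATCH path above: JSON Pointer (RFC 6901) uses `/` as its separator, so a literal `/` inside a key has to be written as `~1` (and a literal `~` as `~0`). The sketch below rebuilds the same patch body in Python; the helper names `escape_json_pointer_token` and `gpu_capacity_patch` are made up for illustration and are not part of SkyPilot or kubectl.

```python
import json


def escape_json_pointer_token(token: str) -> str:
    """Escape one JSON Pointer reference token per RFC 6901."""
    # "~" must be escaped first so the "~1" produced for "/" is not re-escaped.
    return token.replace('~', '~0').replace('/', '~1')


def gpu_capacity_patch(resource: str, count: int) -> str:
    """Build a JSON Patch body that adds a capacity entry to a node status."""
    path = '/status/capacity/' + escape_json_pointer_token(resource)
    # The curl command in this thread passes the quantity as a string.
    return json.dumps([{'op': 'add', 'path': path, 'value': str(count)}])


print(gpu_capacity_patch('nvidia.com/gpu', 8))
```

The printed body can be passed as the `--data` argument of the curl command shown earlier.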

@romilbhardwaj romilbhardwaj merged commit 0ebc5fd into master Jun 4, 2024
20 checks passed
@romilbhardwaj romilbhardwaj deleted the gkeh100fix branch June 4, 2024 04:40
Michaelvll pushed a commit that referenced this pull request Aug 23, 2024
* H100-80gb does not exist, fix to H100

* Fix H100 support