[AWS/GCP] Upgrade image to support CUDA 12.1 #2694

Closed
Michaelvll opened this issue Oct 11, 2023 · 5 comments
Comments

@Michaelvll
Collaborator

Since PyTorch now defaults to CUDA 12.1, we should upgrade the images to support this CUDA version.

@Michaelvll Michaelvll added the P0 label Oct 11, 2023
@Michaelvll
Collaborator Author

Just tried the latest projects/deeplearning-platform-release/global/images/common-gpu-v20231105-debian-11-py310 image on GCP, but unfortunately it does not have CUDA 12.1 installed, so the vLLM installation issue (#2786) still occurs. We should figure out a workaround for this.

@Michaelvll
Collaborator Author

This has been fixed by #2788 and skypilot-org/skypilot-catalog#49. Closing.

@romilbhardwaj
Collaborator

Re-opening this - CUDA version mismatch across GCP and AWS makes it hard to make the same YAML work across clouds. We should update the GCP image (and k8s) to also use CUDA 12, for consistency with AWS.

@romilbhardwaj romilbhardwaj reopened this Feb 28, 2024
@romilbhardwaj romilbhardwaj removed the P0 label Feb 28, 2024
@Michaelvll
Collaborator Author

> Re-opening this - CUDA version mismatch across GCP and AWS makes it hard to make the same YAML work across clouds. We should update the GCP image (and k8s) to also use CUDA 12, for consistency with AWS.

Just tested on GCP, and the CUDA version there is indeed 12.2. Is there a way to reproduce the mismatch?
sky launch -c test-gpu --gpus l4 nvidia-smi
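
To compare clouds without eyeballing the output, the check could be scripted. The following is a minimal sketch, not part of SkyPilot: the helper names (`parse_cuda_version`, `local_cuda_version`, `at_least`) are hypothetical, and it assumes the `nvidia-smi` header contains a `CUDA Version: X.Y` field. Note that this field reports the highest CUDA version the driver supports, which may differ from the toolkit version installed on the image.

```python
import re
import subprocess
from typing import Optional


def parse_cuda_version(smi_output: str) -> Optional[str]:
    """Extract the 'CUDA Version: X.Y' value from nvidia-smi output, if present."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


def local_cuda_version() -> Optional[str]:
    """Run nvidia-smi on this machine; return None if it is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_cuda_version(result.stdout)


def at_least(version: str, minimum: str = "12.1") -> bool:
    """Compare dotted versions numerically (so '12.10' > '12.2')."""
    def as_tuple(v: str):
        return tuple(int(part) for part in v.split("."))
    return as_tuple(version) >= as_tuple(minimum)
```

Running `local_cuda_version()` on each cluster and passing the result to `at_least` would show whether both clouds meet the 12.1 requirement discussed in this issue.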

@romilbhardwaj
Collaborator

Ahh looks like my catalog was not being auto-updated. I fetched the latest one and now AWS and GCP are at parity.
