[Azure] Revert Azure images to address NCCL issues #4596
base: master
# TODO(romilb): Switch back to using our custom images after NCCL + Azure issues
# are resolved: https://github.com/skypilot-org/skypilot/issues/4448
_DEFAULT_CPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
_DEFAULT_GPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
_DEFAULT_V1_IMAGE_ID = 'skypilot:v1-ubuntu-2004'
Can we use the old Azure image as the base image for the packer file, so we don't get a performance regression on A10 GPU instances?
Yes, that's likely the fix for #4448. For A10, this PR still uses skypilot:custom-gpu-ubuntu-v2-grid to avoid the perf regression.
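To make the A10 exception concrete, here is a minimal sketch of the selection logic described above. The image tags are the ones quoted in this thread; the helper name `pick_default_image` and the exact-match check on the accelerator name are assumptions for illustration, not the actual SkyPilot implementation.

```python
from typing import Optional

# Image tags quoted in this PR thread.
DEFAULT_CPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
DEFAULT_GPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
A10_GRID_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v2-grid'


def pick_default_image(accelerator: Optional[str]) -> str:
    """Hypothetical helper: choose a default image tag for an accelerator.

    `accelerator` is a name such as 'A10' or 'A100-80GB', or None for
    CPU-only instances.
    """
    if accelerator is None:
        return DEFAULT_CPU_IMAGE_ID
    # Exact match, so that 'A100-80GB' is not mistaken for 'A10'.
    if accelerator.upper() == 'A10':
        # A10 keeps the custom GRID image to avoid the perf regression.
        return A10_GRID_IMAGE_ID
    return DEFAULT_GPU_IMAGE_ID
```

With this sketch, `pick_default_image('A100-80GB')` falls through to the reverted default image, while `pick_default_image('A10')` keeps the custom GRID image.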
/smoke-test azure
I am OK with this temporary solution, but we should figure out a good way to support it with our images; otherwise, this is a huge performance degradation on Azure.
_DEFAULT_GPU_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v2'
_DEFAULT_V1_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v1'
# TODO(romilb): Switch back to using our custom images after NCCL + Azure issues
# are resolved: https://github.com/skypilot-org/skypilot/issues/4448
# are resolved: https://github.com/skypilot-org/skypilot/issues/4448
I think a better fix for this is to update the catalog instead of our code. That way we can push an update to the catalog when this issue is resolved, without needing users to upgrade to the latest version/nightly.
This may be worth discussing. Directly updating the catalog can cause an implicit behavior change for VMs launched on Azure. This can lead to surprising behavior: for example, a cluster launched today may have a different setup than one launched yesterday (some specific packages disappear, etc.). I feel a more explicit tag change in this PR is a better way to do this.
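A rough sketch of the trade-off being discussed, under stated assumptions: the catalog is modeled as a plain dict from image tag to a concrete image reference, and every image reference below is a placeholder, not a real catalog entry. `resolve_image` is a hypothetical helper, not a SkyPilot API.

```python
from typing import Dict

# Hypothetical catalog: image tag -> concrete image reference (placeholders).
CATALOG: Dict[str, str] = {
    'skypilot:custom-gpu-ubuntu-v2': 'placeholder-custom-image',
    'skypilot:gpu-ubuntu-2204': 'placeholder-azure-default-image',
}


def resolve_image(tag: str, catalog: Dict[str, str]) -> str:
    """Hypothetical helper: look up the concrete image for a tag."""
    return catalog[tag]


# Option A (catalog update): the tag pinned in code stays the same, but the
# catalog entry is repointed. Already-installed clients silently get a
# different image on their next launch.
pinned_tag = 'skypilot:custom-gpu-ubuntu-v2'
yesterday = resolve_image(pinned_tag, CATALOG)
repointed = dict(CATALOG)
repointed[pinned_tag] = CATALOG['skypilot:gpu-ubuntu-2204']
today = resolve_image(pinned_tag, repointed)

# Option B (this PR): the tag itself changes in code. The change is visible
# in the diff and only takes effect once the user upgrades.
new_pinned_tag = 'skypilot:gpu-ubuntu-2204'
after_upgrade = resolve_image(new_pinned_tag, CATALOG)
```

In both options the launched VM ends up on the same image (`today == after_upgrade`); what differs is whether the switch happens implicitly through a catalog refresh or explicitly through a code change.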
Interim fix for #4448: revert to Azure's default images. We should fix our custom images to support NCCL + Azure accelerated networking before we start using them again.
Tested:
sky launch -c azure --cloud azure --gpus A100-80GB:1 -- nvidia-smi