[Azure] Revert Azure images to address NCCL issues #4596

Open
wants to merge 1 commit into
base: master

romilbhardwaj
Collaborator

Interim fix for #4448: revert to Azure's default images. We should fix our custom images to support NCCL with Azure accelerated networking before we start using them again.

Tested:

  • sky launch -c azure --cloud azure --gpus A100-80GB:1 -- nvidia-smi

Comment on lines +42 to +46
# TODO(romilb): Switch back to using our custom images after NCCL + Azure issues
# are resolved: https://github.com/skypilot-org/skypilot/issues/4448
_DEFAULT_CPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
_DEFAULT_GPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
_DEFAULT_V1_IMAGE_ID = 'skypilot:v1-ubuntu-2004'
Collaborator

Can we use the old Azure image as the base image for the packer file, so that we don't get a performance regression on A10 GPU instances?

Collaborator Author

Yes, that's likely the fix for #4448. For A10 this PR is still using skypilot:custom-gpu-ubuntu-v2-grid to avoid the perf regression.
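
To illustrate the selection described in this reply, here is a minimal sketch. It is not the code in this PR: `pick_default_image`, the constant `_A10_GRID_IMAGE_ID`, and the exact-match check on the accelerator name are all hypothetical. Only the image tags themselves are taken from this thread.

```python
from typing import Optional

# Image tags quoted in this thread.
_DEFAULT_CPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
_DEFAULT_GPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
# Hypothetical constant name; the tag is the one mentioned for A10.
_A10_GRID_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v2-grid'


def pick_default_image(accelerator: Optional[str]) -> str:
    """Hypothetical helper: choose a default image tag for an accelerator."""
    if accelerator is None:
        # No accelerator requested: use the CPU default.
        return _DEFAULT_CPU_IMAGE_ID
    if accelerator.upper() == 'A10':
        # A10 keeps the custom GRID image to avoid the perf regression.
        return _A10_GRID_IMAGE_ID
    # Every other GPU falls back to the reverted default image.
    return _DEFAULT_GPU_IMAGE_ID
```

With this sketch, `pick_default_image('A10')` returns the GRID image tag, while any other accelerator name gets the reverted default.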

@romilbhardwaj
Collaborator Author

/smoke-test azure

Collaborator

@Michaelvll Michaelvll left a comment

I am OK with this temporary solution, but we should figure out a good way to support NCCL with our own images. Otherwise, this is a huge performance degradation on Azure.

_DEFAULT_GPU_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v2'
_DEFAULT_V1_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v1'
# TODO(romilb): Switch back to using our custom images after NCCL + Azure issues
# are resolved: https://github.com/skypilot-org/skypilot/issues/4448
Collaborator

Suggested change
# are resolved: https://github.com/skypilot-org/skypilot/issues/4448
# are resolved: https://github.com/skypilot-org/skypilot/issues/4448

@romilbhardwaj
Collaborator Author

I think a better fix for this is to update the catalog instead of our code. That way we can push an update to the catalog when this issue is resolved, without requiring users to upgrade to the latest version/nightly.

@Michaelvll
Collaborator

I think a better fix for this is to update the catalog instead of our code. That way we can push an update to the catalog when this issue is resolved, without requiring users to upgrade to the latest version/nightly.

This may be worth discussing. Directly updating the catalog can cause an implicit behavior change for VMs launched on Azure. This can lead to surprising behavior: for example, a cluster launched today could have a different setup than one launched yesterday, with some specific packages missing, etc.

I feel a more explicit tag change in this PR is a better way to do this.
