-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core][BugFix] Fix GPU detach when using docker container as runtime env #3436
Conversation
separate from #3073 first as the original PR seems to take more time to merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for identifying this issue @cblmemo! This is a hard issue to solve. Excellent!
Could we run some more tests to make sure the newly added steps work as expected for launch
, exec
, stop and restart, etc? It would be nice to test for AWS and some other clouds as well, just to make sure those default images have jq
. : )
# Edit docker config first to avoid the know issue in | ||
# https://github.com/NVIDIA/nvidia-container-toolkit/issues/48 | ||
self.run( | ||
'sudo jq \'.["exec-opts"] = ["native.cgroupdriver=cgroupfs"]\' ' | ||
'/etc/docker/daemon.json > /tmp/daemon.json;' | ||
'sudo mv /tmp/daemon.json /etc/docker/daemon.json;' | ||
'sudo systemctl restart docker', | ||
run_env='host') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we run some test for Azure to test out this part of the code, to make sure it works nice for initial launch, exec, stop and restart, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! Will add some test commands in test_job_queue_with_docker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Michaelvll Relevant smoke test passed except for #3451. Please take another look 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the PR @cblmemo! LGTM.
examples/job_queue/job_docker.yaml
Outdated
@@ -18,7 +18,7 @@ setup: | | |||
run: | | |||
timestamp=$(date +%s) | |||
conda env list | |||
for i in {1..180}; do | |||
for i in {1..300}; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to increase the time here? Does this mean the newly added commands introduce significant overheads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC I think it is just Azure being too slow... Originally we does not test on Azure. Lemme test the original time w/ AWS/GCP again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is only for Azure. Now only setting to 300 for Azure
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Due to a known issue in NVIDIA/nvidia-container-toolkit#48 , the container will lose access to GPU after systemd is used to manage the cgroups of the container. This is used when we shutdown jupyter service running in our default image id:
skypilot/sky/provision/docker_utils.py
Lines 271 to 274 in d0f20ab
This PR fixes the bug using the workaround provided in the original issue. To reproduce, check the following (find the YAML at the end of the PR description):
Tested (run the relevant ones):
bash format.sh
sky launch
following yaml and works wellpytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
tests/test_smoke.py::test_job_queue_with_docker --[azure,gcp,aws]
except for [Core][Docker]docker:continuumio/miniconda3
is not working for Azure when used as a runtime environment #3451bash tests/backward_comaptibility_tests.sh