
[Core][BugFix] Fix GPU detach when using docker container as runtime env #3436

Merged: 12 commits merged into master on Apr 23, 2024

Conversation

@cblmemo (Collaborator) commented on Apr 11, 2024:

Due to a known issue (NVIDIA/nvidia-container-toolkit#48), a container loses access to its GPUs once systemd is used to manage the container's cgroups. systemd takes over cgroup management when we shut down the Jupyter services running in our default image:

sudo systemctl stop jupyter > /dev/null 2>&1 || true
sudo systemctl disable jupyter > /dev/null 2>&1 || true
sudo systemctl stop jupyterhub > /dev/null 2>&1 || true
sudo systemctl disable jupyterhub > /dev/null 2>&1 || true
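Mechanically, the workaround from that issue forces Docker's cgroup driver back to cgroupfs. A minimal Python sketch of the config edit, shown on an in-memory dict rather than the real /etc/docker/daemon.json (the starting content below is only an assumption about what the NVIDIA toolkit install writes):

```python
import json

# Sketch of the workaround for NVIDIA/nvidia-container-toolkit#48:
# force Docker's cgroup driver to cgroupfs so systemd taking over cgroup
# management does not detach the GPUs from the container.
# Equivalent of: jq '.["exec-opts"] = ["native.cgroupdriver=cgroupfs"]'
# The starting content is hypothetical; on a real host this file is
# /etc/docker/daemon.json and Docker must be restarted after the edit.
daemon_json = '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime"}}}'
config = json.loads(daemon_json)
config["exec-opts"] = ["native.cgroupdriver=cgroupfs"]
print(json.dumps(config, indent=2))
```

On the host, the PR performs this same edit with jq and then restarts Docker with systemctl.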

This PR fixes the bug using the workaround provided in the original issue. To reproduce, check the following (find the YAML at the end of the PR description):

$ sky launch @temp/bug.yaml -c bug
Task from YAML spec: @temp/bug.yaml
I 04-10 18:14:41 optimizer.py:693] == Optimizer ==
I 04-10 18:14:41 optimizer.py:704] Target: minimizing cost
I 04-10 18:14:41 optimizer.py:716] Estimated cost: $8.0 / hour
I 04-10 18:14:41 optimizer.py:716] 
I 04-10 18:14:41 optimizer.py:839] Considered resources (1 node):
I 04-10 18:14:41 optimizer.py:909] ---------------------------------------------------------------------------------------------
I 04-10 18:14:41 optimizer.py:909]  CLOUD   INSTANCE         vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 04-10 18:14:41 optimizer.py:909] ---------------------------------------------------------------------------------------------
I 04-10 18:14:41 optimizer.py:909]  GCP     g2-standard-96   96      384       L4:8           us-east4-a    7.98          ✔     
I 04-10 18:14:41 optimizer.py:909] ---------------------------------------------------------------------------------------------
I 04-10 18:14:41 optimizer.py:909] 
Launching a new cluster 'bug'. Proceed? [Y/n]: 
I 04-10 18:14:43 cloud_vm_ray_backend.py:4226] Creating a new cluster: 'bug' [1x GCP(g2-standard-96, {'L4': 8}, image_id={'us-east4': 'docker:cblmemo/mixtral-vllm:latest'}, ports=['8080'])].
I 04-10 18:14:43 cloud_vm_ray_backend.py:4226] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 04-10 18:14:46 cloud_vm_ray_backend.py:1363] To view detailed progress: tail -n100 -f /home/txia/sky_logs/sky-2024-04-10-18-14-39-012193/provision.log
I 04-10 18:14:48 provisioner.py:77] Launching on GCP us-east4 (us-east4-a)
I 04-10 18:18:49 provisioner.py:461] Successfully provisioned or found existing instance.
I 04-10 18:24:16 provisioner.py:563] Successfully provisioned cluster: bug
⠧ Launching - Opening new ports
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
I 04-10 18:24:31 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
Warning: Permanently added '35.245.41.151' (ECDSA) to the list of known hosts.
Warning: Permanently added '[localhost]:10022' (ECDSA) to the list of known hosts.
Requirement already satisfied: vllm==0.3.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (0.3.0)
Requirement already satisfied: transformers==4.37.2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (4.37.2)
Requirement already satisfied: ninja in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (1.11.1.1)
Requirement already satisfied: psutil in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (5.9.8)
Requirement already satisfied: ray>=2.9 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (2.10.0)
Requirement already satisfied: sentencepiece in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (0.2.0)
Requirement already satisfied: numpy in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (1.26.4)
Requirement already satisfied: torch==2.1.2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (2.1.2)
Requirement already satisfied: xformers==0.0.23.post1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (0.0.23.post1)
Requirement already satisfied: fastapi in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (0.110.0)
Requirement already satisfied: uvicorn[standard] in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (0.29.0)
Requirement already satisfied: pydantic>=2.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (2.6.4)
Requirement already satisfied: aioprometheus[starlette] in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (23.12.0)
Requirement already satisfied: pynvml==11.5.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (11.5.0)
Requirement already satisfied: filelock in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (3.13.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (0.21.4)
Requirement already satisfied: packaging>=20.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (24.0)
Requirement already satisfied: pyyaml>=5.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (6.0.1)
Requirement already satisfied: regex!=2019.12.17 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (2023.12.25)
Requirement already satisfied: requests in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (2.31.0)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (0.15.2)
Requirement already satisfied: safetensors>=0.4.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (0.4.2)
Requirement already satisfied: tqdm>=4.27 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (4.66.2)
Requirement already satisfied: typing-extensions in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (4.10.0)
Requirement already satisfied: sympy in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (1.12)
Requirement already satisfied: networkx in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (3.2.1)
Requirement already satisfied: jinja2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (3.1.3)
Requirement already satisfied: fsspec in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (2024.3.1)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.105)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.105)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.105)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (8.9.2.26)
Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.3.1)
Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (11.0.2.54)
Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (10.3.2.106)
Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (11.4.5.107)
Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.0.106)
Requirement already satisfied: nvidia-nccl-cu12==2.18.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (2.18.1)
Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.105)
Requirement already satisfied: triton==2.1.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (2.1.0)
Requirement already satisfied: nvidia-nvjitlink-cu12 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch==2.1.2->vllm==0.3.0) (12.4.99)
Requirement already satisfied: annotated-types>=0.4.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from pydantic>=2.0->vllm==0.3.0) (0.6.0)
Requirement already satisfied: pydantic-core==2.16.3 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from pydantic>=2.0->vllm==0.3.0) (2.16.3)
Requirement already satisfied: click>=7.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (8.1.7)
Requirement already satisfied: jsonschema in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (4.21.1)
Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (1.0.8)
Requirement already satisfied: protobuf!=3.19.5,>=3.15.3 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (5.26.0)
Requirement already satisfied: aiosignal in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (1.3.1)
Requirement already satisfied: frozenlist in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (1.4.1)
Requirement already satisfied: orjson in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from aioprometheus[starlette]->vllm==0.3.0) (3.9.15)
Requirement already satisfied: quantile-python>=1.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from aioprometheus[starlette]->vllm==0.3.0) (1.1)
Requirement already satisfied: starlette>=0.14.2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from aioprometheus[starlette]->vllm==0.3.0) (0.36.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from requests->transformers==4.37.2) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from requests->transformers==4.37.2) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from requests->transformers==4.37.2) (2.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from requests->transformers==4.37.2) (2024.2.2)
Requirement already satisfied: h11>=0.8 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (0.14.0)
Requirement already satisfied: httptools>=0.5.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (0.6.1)
Requirement already satisfied: python-dotenv>=0.13 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (1.0.1)
Requirement already satisfied: uvloop!=0.15.0,!=0.15.1,>=0.14.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (0.19.0)
Requirement already satisfied: watchfiles>=0.13 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (0.21.0)
Requirement already satisfied: websockets>=10.4 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (12.0)
Requirement already satisfied: anyio<5,>=3.4.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from starlette>=0.14.2->aioprometheus[starlette]->vllm==0.3.0) (4.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from jinja2->torch==2.1.2->vllm==0.3.0) (2.1.5)
Requirement already satisfied: attrs>=22.2.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from jsonschema->ray>=2.9->vllm==0.3.0) (23.2.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from jsonschema->ray>=2.9->vllm==0.3.0) (2023.12.1)
Requirement already satisfied: referencing>=0.28.4 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from jsonschema->ray>=2.9->vllm==0.3.0) (0.34.0)
Requirement already satisfied: rpds-py>=0.7.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from jsonschema->ray>=2.9->vllm==0.3.0) (0.18.0)
Requirement already satisfied: mpmath>=0.19 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from sympy->torch==2.1.2->vllm==0.3.0) (1.3.0)
Requirement already satisfied: sniffio>=1.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from anyio<5,>=3.4.0->starlette>=0.14.2->aioprometheus[starlette]->vllm==0.3.0) (1.3.1)
Requirement already satisfied: exceptiongroup>=1.0.2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from anyio<5,>=3.4.0->starlette>=0.14.2->aioprometheus[starlette]->vllm==0.3.0) (1.2.0)
I 04-10 18:24:46 cloud_vm_ray_backend.py:3089] Setup completed.
I 04-10 18:25:02 cloud_vm_ray_backend.py:3160] Job submitted with Job ID: 1
Warning: Permanently added '35.245.41.151' (ECDSA) to the list of known hosts.
Warning: Permanently added '[localhost]:10022' (ECDSA) to the list of known hosts.
I 04-11 01:25:08 log_lib.py:407] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.150.15.210']
(task, pid=3018) INFO 04-11 01:25:10 api_server.py:209] args: Namespace(host='0.0.0.0', port=8080, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
(task, pid=3018) 2024-04-11 01:25:12,533        INFO worker.py:1752 -- Started a local Ray instance.
(task, pid=3018) Traceback (most recent call last):
(task, pid=3018)   File "/home/ray/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(task, pid=3018)     return _run_code(code, main_globals, None,
(task, pid=3018)   File "/home/ray/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
(task, pid=3018)     exec(code, run_globals)
(task, pid=3018)   File "/home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 217, in <module>
(task, pid=3018)     engine = AsyncLLMEngine.from_engine_args(engine_args)
(task, pid=3018)   File "/home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 620, in from_engine_args
(task, pid=3018)     placement_group = initialize_cluster(parallel_config,
(task, pid=3018)   File "/home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/ray_utils.py", line 111, in initialize_cluster
(task, pid=3018)     raise ValueError(
(task, pid=3018) ValueError: The number of required GPUs exceeds the total number of available GPUs in the cluster.
ERROR: Job 1 failed with return code list: [1] 
INFO: Job finished (status: FAILED).
Connection to localhost closed.
I 04-10 18:25:17 cloud_vm_ray_backend.py:3194] Job ID: 1
I 04-10 18:25:17 cloud_vm_ray_backend.py:3194] To cancel the job:       sky cancel bug 1
I 04-10 18:25:17 cloud_vm_ray_backend.py:3194] To stream job logs:      sky logs bug 1
I 04-10 18:25:17 cloud_vm_ray_backend.py:3194] To view the job queue:   sky queue bug
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] 
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] Cluster name: bug
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] To log into the head VM: ssh bug
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] To submit a job:         sky exec bug yaml_file
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] To stop the cluster:     sky stop bug
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] To teardown the cluster: sky down bug

$ sky exec bug nvidia-smi                          
Task from command: nvidia-smi
Executing task on cluster bug...
I 04-10 18:36:19 cloud_vm_ray_backend.py:3160] Job submitted with Job ID: 2
Warning: Permanently added '35.245.41.151' (ECDSA) to the list of known hosts.
Warning: Permanently added '[localhost]:10022' (ECDSA) to the list of known hosts.
I 04-11 01:36:24 log_lib.py:407] Start streaming logs for job 2.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.150.15.210']
(sky-cmd, pid=3134) Failed to initialize NVML: Unknown Error
ERROR: Job 2 failed with return code list: [255] 
Connection to localhost closed.
I 04-10 18:36:25 cloud_vm_ray_backend.py:3194] Job ID: 2
I 04-10 18:36:25 cloud_vm_ray_backend.py:3194] To cancel the job:       sky cancel bug 2
I 04-10 18:36:25 cloud_vm_ray_backend.py:3194] To stream job logs:      sky logs bug 2
I 04-10 18:36:25 cloud_vm_ray_backend.py:3194] To view the job queue:   sky queue bug
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] 
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] Cluster name: bug
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] To log into the head VM: ssh bug
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] To submit a job:         sky exec bug yaml_file
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] To stop the cluster:     sky stop bug
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] To teardown the cluster: sky down bug

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch with the YAML below works as expected
# @temp/bug.yaml
resources:
  cloud: gcp
  image_id: docker:cblmemo/mixtral-vllm:latest  # a docker container w/ vllm dependencies
  ports: 8080
  accelerators: L4:8

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install vllm==0.3.0 transformers==4.37.2

run: |
  conda activate vllm
  export PATH=$PATH:/sbin
  python -m vllm.entrypoints.openai.api_server \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --host 0.0.0.0 --port 8080 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1

@cblmemo (Collaborator, Author) commented on Apr 11, 2024:

Separated from #3073, since the original PR seems likely to take more time to merge.

@Michaelvll (Collaborator) left a comment:

Thanks for identifying this issue @cblmemo! This is a hard issue to solve. Excellent!

Could we run some more tests to make sure the newly added steps work as expected for launch, exec, stop and restart, etc? It would be nice to test for AWS and some other clouds as well, just to make sure those default images have jq. : )

sky/clouds/service_catalog/common.py (review thread outdated, resolved)
sky/provision/docker_utils.py (review thread outdated, resolved)
Comment on lines 232 to 239:

    # Edit docker config first to avoid the known issue in
    # https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
    self.run(
        'sudo jq \'.["exec-opts"] = ["native.cgroupdriver=cgroupfs"]\' '
        '/etc/docker/daemon.json > /tmp/daemon.json;'
        'sudo mv /tmp/daemon.json /etc/docker/daemon.json;'
        'sudo systemctl restart docker',
        run_env='host')
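One way to sanity-check the result of that edit (a sketch: the sample below is an assumption of what /etc/docker/daemon.json should contain afterwards, since reading the real file requires root):

```python
import json

# Hypothetical sample of daemon.json after the edit; the "runtimes" entry
# is assumed to come from the NVIDIA container toolkit install.
sample = '''
{
  "runtimes": {"nvidia": {"path": "nvidia-container-runtime"}},
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
'''
config = json.loads(sample)
assert "native.cgroupdriver=cgroupfs" in config.get("exec-opts", [])
print("cgroupfs driver configured")
```

On a live host, `docker info` also reports the effective cgroup driver after the Docker restart.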
Collaborator:

Could we run some tests on Azure for this part of the code, to make sure it works well for the initial launch, exec, stop, and restart?

@cblmemo (Author):

Sounds good! Will add some test commands in test_job_queue_with_docker

@cblmemo (Author):

@Michaelvll The relevant smoke tests passed except for #3451. Please take another look 👀

@Michaelvll Michaelvll self-requested a review April 22, 2024 17:05
@Michaelvll (Collaborator) left a comment:

Thanks for updating the PR @cblmemo! LGTM.

tests/test_smoke.py (two review threads, resolved)
@@ -18,7 +18,7 @@ setup: |
 run: |
   timestamp=$(date +%s)
   conda env list
-  for i in {1..180}; do
+  for i in {1..300}; do
Collaborator:

Do we need to increase the time here? Does this mean the newly added commands introduce significant overhead?

@cblmemo (Author):

IIRC it is just Azure being too slow... Originally we did not test on Azure. Let me test the original timing with AWS/GCP again.

@cblmemo (Author):

Yes, it is only Azure. The timeout is now set to 300 only for Azure.

@cblmemo cblmemo merged commit 99408b3 into master Apr 23, 2024
20 checks passed
@cblmemo cblmemo deleted the fix-docker-gpu branch April 23, 2024 08:54
@Michaelvll Michaelvll mentioned this pull request Apr 25, 2024