
[Core][BugFix] Fix GPU detach when using docker container as runtime env #3436

Merged: 12 commits merged into master on Apr 23, 2024

Conversation

@cblmemo (Collaborator) commented on Apr 11, 2024:

Due to a known issue (NVIDIA/nvidia-container-toolkit#48), a container loses access to its GPUs once systemd is used to manage the container's cgroups. systemd takes over cgroup management when we shut down the Jupyter services running in our default image:

sudo systemctl stop jupyter > /dev/null 2>&1 || true
sudo systemctl disable jupyter > /dev/null 2>&1 || true
sudo systemctl stop jupyterhub > /dev/null 2>&1 || true
sudo systemctl disable jupyterhub > /dev/null 2>&1 || true
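Mechanically, the workaround from that issue forces Docker's cgroup driver back to cgroupfs. A minimal Python sketch of the config edit, shown on an in-memory dict rather than the real /etc/docker/daemon.json (the starting content below is only an assumption about what the NVIDIA toolkit install writes):

```python
import json

# Sketch of the workaround for NVIDIA/nvidia-container-toolkit#48:
# force Docker's cgroup driver to cgroupfs so systemd taking over cgroup
# management does not detach the GPUs from the container.
# Equivalent of: jq '.["exec-opts"] = ["native.cgroupdriver=cgroupfs"]'
# The starting content is hypothetical; on a real host this file is
# /etc/docker/daemon.json and Docker must be restarted after the edit.
daemon_json = '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime"}}}'
config = json.loads(daemon_json)
config["exec-opts"] = ["native.cgroupdriver=cgroupfs"]
print(json.dumps(config, indent=2))
```

On the host, the PR performs this same edit with jq and then restarts Docker with systemctl.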

This PR fixes the bug using the workaround provided in the original issue. To reproduce, check the following (find the YAML at the end of the PR description):

$ sky launch @temp/bug.yaml -c bug
Task from YAML spec: @temp/bug.yaml
I 04-10 18:14:41 optimizer.py:693] == Optimizer ==
I 04-10 18:14:41 optimizer.py:704] Target: minimizing cost
I 04-10 18:14:41 optimizer.py:716] Estimated cost: $8.0 / hour
I 04-10 18:14:41 optimizer.py:716] 
I 04-10 18:14:41 optimizer.py:839] Considered resources (1 node):
I 04-10 18:14:41 optimizer.py:909] ---------------------------------------------------------------------------------------------
I 04-10 18:14:41 optimizer.py:909]  CLOUD   INSTANCE         vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 04-10 18:14:41 optimizer.py:909] ---------------------------------------------------------------------------------------------
I 04-10 18:14:41 optimizer.py:909]  GCP     g2-standard-96   96      384       L4:8           us-east4-a    7.98          ✔     
I 04-10 18:14:41 optimizer.py:909] ---------------------------------------------------------------------------------------------
I 04-10 18:14:41 optimizer.py:909] 
Launching a new cluster 'bug'. Proceed? [Y/n]: 
I 04-10 18:14:43 cloud_vm_ray_backend.py:4226] Creating a new cluster: 'bug' [1x GCP(g2-standard-96, {'L4': 8}, image_id={'us-east4': 'docker:cblmemo/mixtral-vllm:latest'}, ports=['8080'])].
I 04-10 18:14:43 cloud_vm_ray_backend.py:4226] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 04-10 18:14:46 cloud_vm_ray_backend.py:1363] To view detailed progress: tail -n100 -f /home/txia/sky_logs/sky-2024-04-10-18-14-39-012193/provision.log
I 04-10 18:14:48 provisioner.py:77] Launching on GCP us-east4 (us-east4-a)
I 04-10 18:18:49 provisioner.py:461] Successfully provisioned or found existing instance.
I 04-10 18:24:16 provisioner.py:563] Successfully provisioned cluster: bug
⠧ Launching - Opening new ports
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
I 04-10 18:24:31 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
Warning: Permanently added '35.245.41.151' (ECDSA) to the list of known hosts.
Warning: Permanently added '[localhost]:10022' (ECDSA) to the list of known hosts.
Requirement already satisfied: vllm==0.3.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (0.3.0)
Requirement already satisfied: transformers==4.37.2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (4.37.2)
Requirement already satisfied: ninja in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (1.11.1.1)
Requirement already satisfied: psutil in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (5.9.8)
Requirement already satisfied: ray>=2.9 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (2.10.0)
Requirement already satisfied: sentencepiece in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (0.2.0)
Requirement already satisfied: numpy in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (1.26.4)
Requirement already satisfied: torch==2.1.2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (2.1.2)
Requirement already satisfied: xformers==0.0.23.post1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (0.0.23.post1)
Requirement already satisfied: fastapi in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (0.110.0)
Requirement already satisfied: uvicorn[standard] in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (0.29.0)
Requirement already satisfied: pydantic>=2.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (2.6.4)
Requirement already satisfied: aioprometheus[starlette] in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (23.12.0)
Requirement already satisfied: pynvml==11.5.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from vllm==0.3.0) (11.5.0)
Requirement already satisfied: filelock in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (3.13.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (0.21.4)
Requirement already satisfied: packaging>=20.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (24.0)
Requirement already satisfied: pyyaml>=5.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (6.0.1)
Requirement already satisfied: regex!=2019.12.17 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (2023.12.25)
Requirement already satisfied: requests in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (2.31.0)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (0.15.2)
Requirement already satisfied: safetensors>=0.4.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (0.4.2)
Requirement already satisfied: tqdm>=4.27 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from transformers==4.37.2) (4.66.2)
Requirement already satisfied: typing-extensions in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (4.10.0)
Requirement already satisfied: sympy in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (1.12)
Requirement already satisfied: networkx in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (3.2.1)
Requirement already satisfied: jinja2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (3.1.3)
Requirement already satisfied: fsspec in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (2024.3.1)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.105)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.105)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.105)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (8.9.2.26)
Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.3.1)
Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (11.0.2.54)
Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (10.3.2.106)
Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (11.4.5.107)
Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.0.106)
Requirement already satisfied: nvidia-nccl-cu12==2.18.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (2.18.1)
Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (12.1.105)
Requirement already satisfied: triton==2.1.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from torch==2.1.2->vllm==0.3.0) (2.1.0)
Requirement already satisfied: nvidia-nvjitlink-cu12 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch==2.1.2->vllm==0.3.0) (12.4.99)
Requirement already satisfied: annotated-types>=0.4.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from pydantic>=2.0->vllm==0.3.0) (0.6.0)
Requirement already satisfied: pydantic-core==2.16.3 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from pydantic>=2.0->vllm==0.3.0) (2.16.3)
Requirement already satisfied: click>=7.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (8.1.7)
Requirement already satisfied: jsonschema in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (4.21.1)
Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (1.0.8)
Requirement already satisfied: protobuf!=3.19.5,>=3.15.3 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (5.26.0)
Requirement already satisfied: aiosignal in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (1.3.1)
Requirement already satisfied: frozenlist in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from ray>=2.9->vllm==0.3.0) (1.4.1)
Requirement already satisfied: orjson in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from aioprometheus[starlette]->vllm==0.3.0) (3.9.15)
Requirement already satisfied: quantile-python>=1.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from aioprometheus[starlette]->vllm==0.3.0) (1.1)
Requirement already satisfied: starlette>=0.14.2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from aioprometheus[starlette]->vllm==0.3.0) (0.36.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from requests->transformers==4.37.2) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from requests->transformers==4.37.2) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from requests->transformers==4.37.2) (2.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from requests->transformers==4.37.2) (2024.2.2)
Requirement already satisfied: h11>=0.8 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (0.14.0)
Requirement already satisfied: httptools>=0.5.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (0.6.1)
Requirement already satisfied: python-dotenv>=0.13 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (1.0.1)
Requirement already satisfied: uvloop!=0.15.0,!=0.15.1,>=0.14.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (0.19.0)
Requirement already satisfied: watchfiles>=0.13 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (0.21.0)
Requirement already satisfied: websockets>=10.4 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from uvicorn[standard]->vllm==0.3.0) (12.0)
Requirement already satisfied: anyio<5,>=3.4.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from starlette>=0.14.2->aioprometheus[starlette]->vllm==0.3.0) (4.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from jinja2->torch==2.1.2->vllm==0.3.0) (2.1.5)
Requirement already satisfied: attrs>=22.2.0 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from jsonschema->ray>=2.9->vllm==0.3.0) (23.2.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from jsonschema->ray>=2.9->vllm==0.3.0) (2023.12.1)
Requirement already satisfied: referencing>=0.28.4 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from jsonschema->ray>=2.9->vllm==0.3.0) (0.34.0)
Requirement already satisfied: rpds-py>=0.7.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from jsonschema->ray>=2.9->vllm==0.3.0) (0.18.0)
Requirement already satisfied: mpmath>=0.19 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from sympy->torch==2.1.2->vllm==0.3.0) (1.3.0)
Requirement already satisfied: sniffio>=1.1 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from anyio<5,>=3.4.0->starlette>=0.14.2->aioprometheus[starlette]->vllm==0.3.0) (1.3.1)
Requirement already satisfied: exceptiongroup>=1.0.2 in /home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages (from anyio<5,>=3.4.0->starlette>=0.14.2->aioprometheus[starlette]->vllm==0.3.0) (1.2.0)
I 04-10 18:24:46 cloud_vm_ray_backend.py:3089] Setup completed.
I 04-10 18:25:02 cloud_vm_ray_backend.py:3160] Job submitted with Job ID: 1
Warning: Permanently added '35.245.41.151' (ECDSA) to the list of known hosts.
Warning: Permanently added '[localhost]:10022' (ECDSA) to the list of known hosts.
I 04-11 01:25:08 log_lib.py:407] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.150.15.210']
(task, pid=3018) INFO 04-11 01:25:10 api_server.py:209] args: Namespace(host='0.0.0.0', port=8080, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
(task, pid=3018) 2024-04-11 01:25:12,533        INFO worker.py:1752 -- Started a local Ray instance.
(task, pid=3018) Traceback (most recent call last):
(task, pid=3018)   File "/home/ray/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(task, pid=3018)     return _run_code(code, main_globals, None,
(task, pid=3018)   File "/home/ray/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
(task, pid=3018)     exec(code, run_globals)
(task, pid=3018)   File "/home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 217, in <module>
(task, pid=3018)     engine = AsyncLLMEngine.from_engine_args(engine_args)
(task, pid=3018)   File "/home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 620, in from_engine_args
(task, pid=3018)     placement_group = initialize_cluster(parallel_config,
(task, pid=3018)   File "/home/ray/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/ray_utils.py", line 111, in initialize_cluster
(task, pid=3018)     raise ValueError(
(task, pid=3018) ValueError: The number of required GPUs exceeds the total number of available GPUs in the cluster.
ERROR: Job 1 failed with return code list: [1] 
INFO: Job finished (status: FAILED).
Connection to localhost closed.
I 04-10 18:25:17 cloud_vm_ray_backend.py:3194] Job ID: 1
I 04-10 18:25:17 cloud_vm_ray_backend.py:3194] To cancel the job:       sky cancel bug 1
I 04-10 18:25:17 cloud_vm_ray_backend.py:3194] To stream job logs:      sky logs bug 1
I 04-10 18:25:17 cloud_vm_ray_backend.py:3194] To view the job queue:   sky queue bug
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] 
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] Cluster name: bug
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] To log into the head VM: ssh bug
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] To submit a job:         sky exec bug yaml_file
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] To stop the cluster:     sky stop bug
I 04-10 18:25:17 cloud_vm_ray_backend.py:3290] To teardown the cluster: sky down bug

$ sky exec bug nvidia-smi                          
Task from command: nvidia-smi
Executing task on cluster bug...
I 04-10 18:36:19 cloud_vm_ray_backend.py:3160] Job submitted with Job ID: 2
Warning: Permanently added '35.245.41.151' (ECDSA) to the list of known hosts.
Warning: Permanently added '[localhost]:10022' (ECDSA) to the list of known hosts.
I 04-11 01:36:24 log_lib.py:407] Start streaming logs for job 2.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.150.15.210']
(sky-cmd, pid=3134) Failed to initialize NVML: Unknown Error
ERROR: Job 2 failed with return code list: [255] 
Connection to localhost closed.
I 04-10 18:36:25 cloud_vm_ray_backend.py:3194] Job ID: 2
I 04-10 18:36:25 cloud_vm_ray_backend.py:3194] To cancel the job:       sky cancel bug 2
I 04-10 18:36:25 cloud_vm_ray_backend.py:3194] To stream job logs:      sky logs bug 2
I 04-10 18:36:25 cloud_vm_ray_backend.py:3194] To view the job queue:   sky queue bug
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] 
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] Cluster name: bug
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] To log into the head VM: ssh bug
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] To submit a job:         sky exec bug yaml_file
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] To stop the cluster:     sky stop bug
I 04-10 18:36:25 cloud_vm_ray_backend.py:3290] To teardown the cluster: sky down bug

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch with the YAML below works as expected
# @temp/bug.yaml
resources:
  cloud: gcp
  image_id: docker:cblmemo/mixtral-vllm:latest  # a docker container w/ vllm dependencies
  ports: 8080
  accelerators: L4:8

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install vllm==0.3.0 transformers==4.37.2

run: |
  conda activate vllm
  export PATH=$PATH:/sbin
  python -m vllm.entrypoints.openai.api_server \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --host 0.0.0.0 --port 8080 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1

@cblmemo (Collaborator, Author) commented on Apr 11, 2024:

Separated from #3073, since the original PR seems likely to take more time to merge.

@Michaelvll (Collaborator) left a comment:

Thanks for identifying this issue @cblmemo! This is a hard issue to solve. Excellent!

Could we run some more tests to make sure the newly added steps work as expected for launch, exec, stop and restart, etc? It would be nice to test for AWS and some other clouds as well, just to make sure those default images have jq. : )

sky/clouds/service_catalog/common.py (review thread outdated, resolved)
sky/provision/docker_utils.py (review thread outdated, resolved)
Comment on lines 232 to 239:

    # Edit docker config first to avoid the known issue in
    # https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
    self.run(
        'sudo jq \'.["exec-opts"] = ["native.cgroupdriver=cgroupfs"]\' '
        '/etc/docker/daemon.json > /tmp/daemon.json;'
        'sudo mv /tmp/daemon.json /etc/docker/daemon.json;'
        'sudo systemctl restart docker',
        run_env='host')
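One way to sanity-check the result of that edit (a sketch: the sample below is an assumption of what /etc/docker/daemon.json should contain afterwards, since reading the real file requires root):

```python
import json

# Hypothetical sample of daemon.json after the edit; the "runtimes" entry
# is assumed to come from the NVIDIA container toolkit install.
sample = '''
{
  "runtimes": {"nvidia": {"path": "nvidia-container-runtime"}},
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
'''
config = json.loads(sample)
assert "native.cgroupdriver=cgroupfs" in config.get("exec-opts", [])
print("cgroupfs driver configured")
```

On a live host, `docker info` also reports the effective cgroup driver after the Docker restart.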
Collaborator:

Could we run some tests on Azure for this part of the code, to make sure it works well for the initial launch, exec, stop, and restart?

@cblmemo (Author):

Sounds good! Will add some test commands in test_job_queue_with_docker

@cblmemo (Author):

@Michaelvll The relevant smoke tests passed except for #3451. Please take another look 👀

@Michaelvll Michaelvll self-requested a review April 22, 2024 17:05
@Michaelvll (Collaborator) left a comment:

Thanks for updating the PR @cblmemo! LGTM.

tests/test_smoke.py (two review threads, resolved)
@@ -18,7 +18,7 @@ setup: |
 run: |
   timestamp=$(date +%s)
   conda env list
-  for i in {1..180}; do
+  for i in {1..300}; do
Collaborator:

Do we need to increase the time here? Does this mean the newly added commands introduce significant overhead?

@cblmemo (Author):

IIRC it is just Azure being too slow... Originally we did not test on Azure. Let me test the original timing with AWS/GCP again.

@cblmemo (Author):

Yes, it is only Azure. The timeout is now set to 300 only for Azure.

@cblmemo cblmemo merged commit 99408b3 into master Apr 23, 2024
20 checks passed
@cblmemo cblmemo deleted the fix-docker-gpu branch April 23, 2024 08:54
@Michaelvll Michaelvll mentioned this pull request Apr 25, 2024