Your current environment
🐛 Describe the bug
Seems similar to #6192, but I confirmed that the proper version of FlashInfer is installed.
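For reference, the installed FlashInfer version can be re-checked from the same virtual environment (the package name flashinfer is assumed here; adjust if your wheel is published under a different name):

# Print the FlashInfer wheel version visible to this Python environment
python -m pip show flashinfer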
Command
export VLLM_ATTENTION_BACKEND=FLASHINFER
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-2-27b-it \
    --tensor-parallel-size 2
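Since the failure below occurs while capturing CUDA graphs (graph_runner.capture / torch.cuda.graph in the traceback), one diagnostic variant is to rerun the same command with graph capture disabled via vLLM's --enforce-eager flag. This is only a sketch for narrowing down the trigger, not a confirmed workaround:

# Same launch, but skip CUDA graph capture entirely
export VLLM_ATTENTION_BACKEND=FLASHINFER
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-2-27b-it \
    --tensor-parallel-size 2 \
    --enforce-eager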
Model fails to load with this error:
(VllmWorkerProcess pid=4091308) ERROR 07-12 15:20:49 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]: File "/cm/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/cm/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 216, in
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 360, in init
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 256, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 366, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
[rank0]: self._run_workers("initialize_cache",
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/worker/worker.py", line 214, in initialize_cache
[rank0]: self._warm_up_model()
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/worker/worker.py", line 230, in _warm_up_model
[rank0]: self.model_runner.capture_model(self.gpu_cache)
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1109, in capture_model
[rank0]: graph_runner.capture(**capture_inputs)
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1339, in capture
[rank0]: with torch.cuda.graph(self._graph, pool=memory_pool, stream=stream):
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/torch/cuda/graphs.py", line 184, in exit
[rank0]: self.cuda_graph.capture_end()
[rank0]: File "/home/daielloiir/deepspeed/.gemma/lib/python3.10/site-packages/torch/cuda/graphs.py", line 82, in capture_end
[rank0]: super().capture_end()
[rank0]: RuntimeError: CUDA error: out of memory
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ERROR 07-12 15:20:50 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 4091308 died, exit code: -15
INFO 07-12 15:20:50 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/cm/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '