I tried deploying qwen2-vl-7b with vLLM. Please forgive the complex parameter settings: it took numerous searches and attempts before the deployment succeeded, and I added many parameters along the way to make sure it works.
My device configuration is as follows:
System:
Ubuntu 20.04.4 LTS
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3283.879
CPU max MHz: 4000.0000
CPU min MHz: 1200.0000
BogoMIPS: 6000.00
Virtualization: VT-x
L1d cache: 1.5 MiB
L1i cache: 1.5 MiB
L2 cache: 48 MiB
L3 cache: 71.5 MiB
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
GPU:
NVIDIA RTX A6000 * 8 (using 4 of them)
The package list of my python virtual environment is as follows:
During inference, I found that if the prompt is short (or only a few images are passed in), the model usually runs normally. However, when the prompt is long (or many images are passed in), the vLLM server gets stuck or fails and outputs the following error message:
ERROR 11-07 23:24:51 async_llm_engine.py:865] Engine iteration timed out. This should never happen!
INFO 11-07 23:24:51 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%.
ERROR 11-07 23:24:51 async_llm_engine.py:64] Engine background task failed
ERROR 11-07 23:24:51 async_llm_engine.py:64] Traceback (most recent call last):
ERROR 11-07 23:24:51 async_llm_engine.py:64] File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 849, in run_engine_loop
ERROR 11-07 23:24:51 async_llm_engine.py:64] await asyncio.sleep(0)
ERROR 11-07 23:24:51 async_llm_engine.py:64] File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 596, in sleep
ERROR 11-07 23:24:51 async_llm_engine.py:64] await __sleep0()
ERROR 11-07 23:24:51 async_llm_engine.py:64] File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 590, in __sleep0
ERROR 11-07 23:24:51 async_llm_engine.py:64] yield
ERROR 11-07 23:24:51 async_llm_engine.py:64] asyncio.exceptions.CancelledError
ERROR 11-07 23:24:51 async_llm_engine.py:64]
ERROR 11-07 23:24:51 async_llm_engine.py:64] During handling of the above exception, another exception occurred:
ERROR 11-07 23:24:51 async_llm_engine.py:64]
ERROR 11-07 23:24:51 async_llm_engine.py:64] Traceback (most recent call last):
ERROR 11-07 23:24:51 async_llm_engine.py:64] File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 54, in _log_task_completion
ERROR 11-07 23:24:51 async_llm_engine.py:64] return_value = task.result()
ERROR 11-07 23:24:51 async_llm_engine.py:64] File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 844, in run_engine_loop
ERROR 11-07 23:24:51 async_llm_engine.py:64] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 11-07 23:24:51 async_llm_engine.py:64] File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 11-07 23:24:51 async_llm_engine.py:64] self._do_exit(exc_type)
ERROR 11-07 23:24:51 async_llm_engine.py:64] File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 11-07 23:24:51 async_llm_engine.py:64] raise asyncio.TimeoutError
ERROR 11-07 23:24:51 async_llm_engine.py:64] asyncio.exceptions.TimeoutError
Exception in callback functools.partial(<function _log_task_completion at 0x7fb78e7bf400>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fb78213a170>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7fb78e7bf400>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fb78213a170>>)>
Traceback (most recent call last):
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 849, in run_engine_loop
await asyncio.sleep(0)
File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 596, in sleep
await __sleep0()
File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 590, in __sleep0
yield
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 54, in _log_task_completion
return_value = task.result()
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 844, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
self._do_exit(exc_type)
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 66, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO: 127.0.0.1:55016 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 849, in run_engine_loop
await asyncio.sleep(0)
File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 596, in sleep
await __sleep0()
File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 590, in __sleep0
yield
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/.conda/envs/python/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/.conda/envs/python/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
File "/home/.conda/envs/python/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
raise exc
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
await self.app(scope, receive, _send)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
raise exc
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
await app(scope, receive, sender)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
raise exc
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
await app(scope, receive, sender)
File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
response = await f(request)
File "/home/.conda/envs/python/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
raw_response = await run_endpoint_function(
File "/home/.conda/envs/python/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
return await dependant.call(**values)
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 315, in create_chat_completion
generator = await chat(raw_request).create_chat_completion(
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
return await self.chat_completion_full_generator(
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 624, in chat_completion_full_generator
async for res in result_generator:
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/utils.py", line 458, in iterate_with_cancellation
item = await awaits[0]
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1029, in generate
async for output in await self.add_request(
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 112, in generator
raise result
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 54, in _log_task_completion
return_value = task.result()
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 844, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
self._do_exit(exc_type)
File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
CRITICAL 11-07 23:24:52 launcher.py:88] AsyncLLMEngine is already dead, terminating server process
INFO: 127.0.0.1:58202 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [787966]
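For context, the failing requests are ordinary OpenAI-style chat completions with several images attached. Below is a hedged sketch of that kind of request; the served model name, port, and image paths are placeholders rather than my exact client code:

import base64
from openai import OpenAI

# Point the OpenAI client at the local vLLM OpenAI-compatible server (placeholder port).
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    # Encode a local image as a base64 data URL, the format the chat API accepts.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Several high-resolution images make the multimodal prompt very long.
image_parts = [
    {"type": "image_url", "image_url": {"url": to_data_url(f"frame_{i}.jpg")}}
    for i in range(8)
]

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder served model name
    messages=[{
        "role": "user",
        "content": image_parts + [{"type": "text", "text": "Describe these frames."}],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)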
The vLLM subprocesses also do not terminate automatically and keep occupying the GPUs unless I kill them manually.
I have tried various measures, including but not limited to: changing the -tp parameter to -pp, adding --disable-custom-all-reduce, reducing --gpu-memory-utilization, and upgrading vLLM, but the situation has not improved.
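To make those knobs concrete, here is a hedged sketch of the equivalent offline-engine configuration (my real deployment uses the OpenAI-compatible API server with the corresponding CLI flags; the values below are illustrative, not my exact settings):

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder; my actual model path differs
    tensor_parallel_size=4,             # what I pass as -tp 4 (I also tried -pp instead)
    gpu_memory_utilization=0.9,         # I tried reducing this, without improvement
    disable_custom_all_reduce=True,     # I also tried adding this flag
    max_model_len=32768,                # full Qwen2-VL context, image tokens included
)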
All I hope is that someone can help me get qwen2-vl-7b to run inference stably across its full context length (32768 tokens, including image tokens) without the llm_engine crashing or hanging. 🐧
I think this is potentially caused by the long processing time, as documented in #9238. You can try preprocessing the images to be smaller before passing them to vLLM and/or set max_pixels via --mm-processor-kwargs.
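For example, a hedged sketch of both mitigations (the max_pixels cap and image sizes below are illustrative, not tuned recommendations):

from PIL import Image
from vllm import LLM

# Option 1: cap the image resolution in the Qwen2-VL processor at engine startup.
# This mirrors passing --mm-processor-kwargs '{"max_pixels": 1003520}' to the API server.
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    tensor_parallel_size=4,
    mm_processor_kwargs={"max_pixels": 1280 * 28 * 28},  # roughly 1280 visual tokens per image
)

# Option 2: shrink images client-side before they are ever sent to vLLM.
def shrink(path: str, max_side: int = 1024) -> Image.Image:
    # Downscale in place while preserving the aspect ratio.
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))
    return img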