
[Bug]: pt_main_thread processes are not killed after main process is killed in MP distributed executor backend #6766

Open
oandreeva-nv opened this issue Jul 25, 2024 · 4 comments
Labels: bug (Something isn't working)

oandreeva-nv commented Jul 25, 2024

Your current environment

PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.5.82
CUDA_MODULE_LOADING set to: LAZY
GPU models: A100s

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

🐛 Describe the bug

I am trying to understand vLLM's workflow for distributed serving via multiprocessing. The original setup deploys a model with tensor parallel size = 2 through Triton Inference Server with distributed_executor_backend: mp. Inference works fine, but when the server shuts down, 2 pt_main_thread processes are not killed and remain in State: S (sleeping).

The closest reproducer outside of Triton is this:

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid
import asyncio

SAMPLING_PARAMETERS = {"temperature": 0, "top_p": 1}

VLLM_ENGINE_CONFIG = {
    "model":"facebook/opt-125m",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5,
    "enforce_eager": "true",
    "tensor_parallel_size":2
}

PROMPTS = [
    "The most dangerous animal is",
    "The capital of France is",
    "The future of AI is",
]

async def generate_python_vllm_output(prompt, llm_engine):
    request_id = random_uuid()
    sampling_params = SamplingParams(**SAMPLING_PARAMETERS)
    python_vllm_output = None
    last_output = None

    async for vllm_output in llm_engine.generate(prompt, sampling_params, request_id):
        last_output = vllm_output

    if last_output:
        python_vllm_output = [
            (prompt + output.text).encode("utf-8") for output in last_output.outputs
        ]

    return python_vllm_output


llm_engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**VLLM_ENGINE_CONFIG))
python_vllm_output = []
# Cycle through the prompts repeatedly to keep the engine busy.
for i in range(len(PROMPTS) * 1000):
    python_vllm_output.extend(
        asyncio.run(generate_python_vllm_output(PROMPTS[i % len(PROMPTS)], llm_engine))
    )

And the workflow is the following:

# ps
    PID TTY          TIME CMD
      1 pts/0    00:00:00 bash
  21346 pts/0    00:00:00 top
  21927 pts/0    00:00:00 top
  22463 pts/0    00:00:00 ps
# python3 vllm_reproducer.py &
...
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  7.38it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  7.37it/s]

INFO 07-25 00:18:58 model_runner.py:692] Loading model weights took 0.1202 GB
(VllmWorkerProcess pid=22534) INFO 07-25 00:18:58 model_runner.py:692] Loading model weights took 0.1202 GB
INFO 07-25 00:18:58 distributed_gpu_executor.py:56] # GPU blocks: 68037, # CPU blocks: 14563

# pkill -9  python3
# ps
    PID TTY          TIME CMD
      1 pts/0    00:00:00 bash
  21346 pts/0    00:00:00 top
  21927 pts/0    00:00:00 top
  22465 pts/0    00:00:22 pt_main_thread
  22534 pts/0    00:00:14 pt_main_thread
  22576 pts/0    00:00:00 python3 <defunct>
  22745 pts/0    00:00:00 ps

Likewise, the two pt_main_thread processes above are in the sleeping state according to cat /proc/<PID>/status.

Any insights on vLLM's distributed serving with multiprocessing are greatly appreciated.
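
For context, a minimal sketch of the cleanup I would expect on a graceful shutdown (assuming psutil is installed; this is not a vLLM API, just an illustration) is to reap any child processes that are still alive when the parent exits:

import atexit
import psutil

def _kill_child_workers():
    # Best-effort cleanup: terminate any remaining child processes
    # (e.g. the pt_main_thread workers spawned by the mp executor).
    children = psutil.Process().children(recursive=True)
    for child in children:
        child.terminate()
    _, alive = psutil.wait_procs(children, timeout=5)
    for child in alive:
        child.kill()

atexit.register(_kill_child_workers)

Of course, kill -9 on the parent never reaches Python, so an atexit handler like this cannot help in the scenario above; it only illustrates the cleanup that seems to be missing on normal shutdown.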

@oandreeva-nv oandreeva-nv added the bug Something isn't working label Jul 25, 2024
KuntaiDu (Collaborator) commented:

I also observed a similar thing... My current workaround is to pkill -f pt_main_thread after terminating the vLLM server.

oandreeva-nv (Author) commented:

> pkill -f pt_main_thread after terminating the vLLM server.

Unfortunately, this is not a viable solution for me.

yums-gao commented:

Same issue here. pkill -f does not work in my case either.

j-klesen commented:

> pkill -f pt_main_thread after terminating the vLLM server.

This did not help in my case. I had to do:

top -b -n 1 | grep pt_main_thread | awk '{print $1}' | xargs kill -9
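
For reference, a rough psutil-based equivalent of that pipeline (assuming psutil is installed; the name pt_main_thread is matched literally, as in the command above):

import psutil

# Kill every process named pt_main_thread, roughly what the
# top | grep | awk | xargs pipeline does.
for proc in psutil.process_iter(["name"]):
    if proc.info["name"] == "pt_main_thread":
        try:
            proc.kill()
        except psutil.NoSuchProcess:
            pass  # process already exited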
