MPI Abort Error when using disaggServerBenchmark #2518
please pass
I have passed --dataset to disaggServerBenchmark, and gptManagerBenchmark works fine with both llama2-7b-tp1 and llama2-7b-tp2.
Could you
Thanks, I will try it. I installed tensorrt_llm from source in my own container; the environment info is as mentioned above.
Maybe you can try the Docker image built following the instructions at https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#building-a-tensorrt-llm-docker-image
I rebuilt tensorrt_llm, and now the error looks like this (there is an error about permissions):
[40a8c9673b05:1447853] Read -1, expected 33554432, errno = 14
[40a8c9673b05:1447850] *** Process received signal ***
[40a8c9673b05:1447850] Signal: Segmentation fault (11)
[40a8c9673b05:1447850] Signal code: Invalid permissions (2)
[40a8c9673b05:1447850] Failing at address: 0x9c5c12400
[40a8c9673b05:1447850] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f39b8619520]
[40a8c9673b05:1447850] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a67cd)[0x7f39b877d7cd]
[40a8c9673b05:1447850] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x3244)[0x7f39600ec244]
[40a8c9673b05:1447850] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1b6)[0x7f3960048556]
[40a8c9673b05:1447850] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x201)[0x7f3960046811]
[40a8c9673b05:1447850] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x95)[0x7f39600f0ae5]
[40a8c9673b05:1447850] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x7e24)[0x7f39600f0e24]
[40a8c9673b05:1447850] [ 7] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7f39b8a3f714]
[40a8c9673b05:1447850] [ 8] /lib/x86_64-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0x7f39b8a4c38d]
[40a8c9673b05:1447850] [ 9] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_mprobe+0x52d)[0x7f39600432fd]
[40a8c9673b05:1447850] [10] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Mprobe+0xd7)[0x7f39b8b440e7]
[40a8c9673b05:1447850] [11] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZNK12tensorrt_llm3mpi7MpiComm6mprobeEiiPP14ompi_message_tP20ompi_status_public_t+0x2a)[0x7f39be09be6a]
[40a8c9673b05:1447850] [12] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl19leaderRecvReqThreadEv+0x133)[0x7f39c03c4e23]
[40a8c9673b05:1447850] [13] /xxx/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930)[0x7f39bbee7930]
[40a8c9673b05:1447850] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f39b866bac3]
[40a8c9673b05:1447850] [15] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f39b86fd850]
[40a8c9673b05:1447850] *** End of error message ***
[40a8c9673b05:1447854] Read -1, expected 16777216, errno = 14
[40a8c9673b05:1447851] *** Process received signal ***
[40a8c9673b05:1447851] Signal: Segmentation fault (11)
[40a8c9673b05:1447851] Signal code: Invalid permissions (2)
[40a8c9673b05:1447851] Failing at address: 0x9a2d12600
[40a8c9673b05:1447851] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fb0f5419520]
[40a8c9673b05:1447851] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a67cd)[0x7fb0f557d7cd]
[40a8c9673b05:1447851] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x3244)[0x7fb09c30a244]
[40a8c9673b05:1447851] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1b6)[0x7fb09c165556]
[40a8c9673b05:1447851] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x201)[0x7fb09c163811]
[40a8c9673b05:1447851] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x95)[0x7fb09c30eae5]
[40a8c9673b05:1447851] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x7e24)[0x7fb09c30ee24]
[40a8c9673b05:1447851] [ 7] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7fb0f583f714]
[40a8c9673b05:1447851] [ 8] /lib/x86_64-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0x7fb0f584c38d]
[40a8c9673b05:1447851] [ 9] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_mprobe+0x52d)[0x7fb09c1602fd]
[40a8c9673b05:1447851] [10] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Mprobe+0xd7)[0x7fb0f59440e7]
[40a8c9673b05:1447851] [11] [40a8c9673b05:1447855] Read -1, expected 16777216, errno = 14
[40a8c9673b05:1447852] *** Process received signal ***
[40a8c9673b05:1447852] Signal: Segmentation fault (11)
[40a8c9673b05:1447852] Signal code: Invalid permissions (2)
[40a8c9673b05:1447852] Failing at address: 0x9a2d12600
[40a8c9673b05:1447852] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc1fee19520]
[40a8c9673b05:1447852] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a67cd)[0x7fc1fef7d7cd]
[40a8c9673b05:1447852] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x3244)[0x7fc1a598e244]
[40a8c9673b05:1447852] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1b6)[0x7fc1a53e8556]
[40a8c9673b05:1447852] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x201)[0x7fc1a53e6811]
[40a8c9673b05:1447852] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x95)[0x7fc1a5992ae5]
[40a8c9673b05:1447852] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x7db1)[0x7fc1a5992db1]
[40a8c9673b05:1447852] [ 7] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZNK12tensorrt_llm3mpi7MpiComm6mprobeEiiPP14ompi_message_tP20ompi_status_public_t+0x2a)[0x7fb0fae9be6a]
[40a8c9673b05:1447851] [12] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7fc1ff23f714]
[40a8c9673b05:1447852] [ 8] /lib/x86_64-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0x7fc1ff24c38d]
[40a8c9673b05:1447852] [ 9] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x24b)[0x7fc1ff3192db]
[40a8c9673b05:1447852] [10] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x5ea)[0x7fc1ff36d40a]
[40a8c9673b05:1447852] [11] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7fc1ff36e6c1]
[40a8c9673b05:1447852] [12] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7fc1a534b640]
[40a8c9673b05:1447852] [13] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x121)[0x7fc1ff32d881]
[40a8c9673b05:1447852] [14] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl19leaderRecvReqThreadEv+0x133)[0x7fb0fd1c4e23]
[40a8c9673b05:1447851] [13] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZNK12tensorrt_llm3mpi7MpiComm5bcastEPvmNS0_7MpiTypeEi+0x47)[0x7fc20489d7b7]
[40a8c9673b05:1447852] [15] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl16getNewReqWithIdsEiSt8optionalIfE+0x68b)[0x7fc206bb787b]
[40a8c9673b05:1447852] [16] /xxx/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930)[0x7fb0f8ce7930]
[40a8c9673b05:1447851] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7fb0f546bac3]
[40a8c9673b05:1447851] [15] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl16fetchNewRequestsEiSt8optionalIfE+0x59)[0x7fc206bc5949]
[40a8c9673b05:1447852] [17] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7fb0f54fd850]
[40a8c9673b05:1447851] *** End of error message ***
/xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl13executionLoopEv+0x3bd)[0x7fc206bc7f5d]
[40a8c9673b05:1447852] [18] /xxx/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930)[0x7fc2026e7930]
[40a8c9673b05:1447852] [19] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7fc1fee6bac3]
[40a8c9673b05:1447852] [20] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7fc1feefd850]
[40a8c9673b05:1447852] *** End of error message ***
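The repeated "Read -1, expected N, errno = 14" lines are EFAULT returns from process_vm_readv, which Open MPI's vader shared-memory BTL uses for CMA single-copy transfers; inside a container that lacks ptrace capability, CMA commonly fails this way. The following workaround is a sketch based on that assumption (it is not confirmed anywhere in this thread): either disable the single-copy mechanism via an MCA parameter, or grant the container the ptrace capability.

```shell
# Option 1: tell the vader BTL not to use CMA single-copy transfers
# (can also be passed as: mpirun --mca btl_vader_single_copy_mechanism none ...)
export OMPI_MCA_btl_vader_single_copy_mechanism=none
mpirun -n 7 disaggServerBenchmark ...  # same arguments as in the reproduction command

# Option 2: start the container with ptrace capability so CMA is permitted
# docker run --cap-add=SYS_PTRACE ...
```

Option 1 trades some shared-memory bandwidth for portability; Option 2 keeps single-copy performance but requires control over how the container is launched.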
Maybe you can't start the
Thanks, I have executed executorExampleAdvanced successfully:
./build/executorExampleAdvanced --engine_dir /data/models/llm/trtllm_0.16.0.dev2024112600/llama2-7b-tp1 --input_tokens_csv_file ./inputTokens.csv --use_orchestrator_mode --worker_executable_path ../../../cpp/build/tensorrt_llm/executor_worker/executorWorker
The output log:
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024112600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024112600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 12869 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1112.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12853 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 346.17 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.16 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.11 GiB, available: 26.40 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 761
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 23.78 GiB for max tokens in paged KV cache (48704).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][INFO] Reading input tokens from ./inputTokens.csv
[TensorRT-LLM][INFO] Number of requests: 3
[TensorRT-LLM][INFO] Creating request with 6 input tokens
[TensorRT-LLM][INFO] Creating request with 4 input tokens
[TensorRT-LLM][INFO] Creating request with 10 input tokens
[TensorRT-LLM][INFO] Got 20 tokens for seqIdx 0 for requestId 3
[TensorRT-LLM][INFO] Request id 3 is completed.
[TensorRT-LLM][INFO] Got 14 tokens for seqIdx 0 for requestId 2
[TensorRT-LLM][INFO] Request id 2 is completed.
[TensorRT-LLM][INFO] Got 16 tokens for seqIdx 0 for requestId 1
[TensorRT-LLM][INFO] Request id 1 is completed.
[TensorRT-LLM][INFO] Writing output tokens to outputTokens.csv
[TensorRT-LLM][INFO] Exiting.
[TensorRT-LLM][INFO] Orchestrator sendReq thread exiting
[TensorRT-LLM][INFO] Orchestrator recv thread exiting
[TensorRT-LLM][INFO] Leader recvReq thread exiting
[TensorRT-LLM][INFO] Leader sendThread exiting
[TensorRT-LLM][INFO] Refreshed the MPI local session
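As a sanity check, the KV-cache numbers in the log above are internally consistent. Assuming llama2-7b in fp16 (32 layers, hidden size 4096, 2 bytes per element — model parameters not stated in the log, so these are assumptions), the per-token KV footprint multiplied by the reported block count reproduces both the "(48704)" token count and the 23.78 GiB allocation:

```python
# KV-cache sizing check against the log above (llama2-7b fp16 assumed)
num_layers, hidden_size, bytes_per_elem = 32, 4096, 2
blocks, tokens_per_block = 761, 64  # "Number of blocks": 761, "tokens per block": 64

max_tokens = blocks * tokens_per_block
print(max_tokens)  # → 48704, matching "(48704)" in the log

# per token: one K and one V vector of size hidden_size, per layer
bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem
total_gib = max_tokens * bytes_per_token / 2**30
print(f"{total_gib:.2f} GiB")  # → 23.78 GiB, matching the allocation line
```

This kind of back-of-the-envelope check is useful when deciding whether the "available: 26.40 GiB" headroom in the log leaves room for a larger KV cache.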
Having trouble using nvcr.io/nvidia/pytorch:24.10-py3-based containers?
I will try it later; it seems my current env can use orchestrator mode.
System Info
Who can help?
@ncomly-nvidia
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
python scripts/build_wheel.py --trt_root=/usr/local/tensorrt --clean --cuda_architectures='90-real' --benchmarks
mpirun -n 7 disaggServerBenchmark --context_engine_dirs /data/models/llm/trtllm_0.16.0.dev2024112600/llama2-7b-tp1,/data/models/llm/trtllm_0.16.0.dev2024112600/llama2-7b-tp2 --generation_engine_dirs /data/models/llm/trtllm_0.16.0.dev2024112600/llama2-7b-tp1,/data/models/llm/trtllm_0.16.0.dev2024112600/llama2-7b-tp2
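The -n 7 in the reproduction command matches one rank per TP shard of every context and generation engine plus one orchestrator/leader rank — this breakdown is my reading of the command, not something stated in the thread:

```python
def ranks_needed(context_tp_sizes, generation_tp_sizes):
    """World size: one rank per TP shard of each engine, plus one extra rank
    for the orchestrator/leader (assumed layout, not from the docs)."""
    return sum(context_tp_sizes) + sum(generation_tp_sizes) + 1

# context engines: llama2-7b-tp1 and llama2-7b-tp2 -> TP sizes [1, 2]
# generation engines: same two engines          -> TP sizes [1, 2]
print(ranks_needed([1, 2], [1, 2]))  # → 7, matching `mpirun -n 7`
```

If the rank count and the engines' combined TP sizes disagree, MPI jobs of this shape typically abort at startup, so this arithmetic is worth re-checking whenever the engine list changes.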
Expected behavior
success
Actual behavior
Additional notes
Thanks for your attention!