MPI Abort Error when using disaggServerBenchmark #2518
please pass
I have passed --dataset to disaggServerBenchmark, and gptManagerBenchmark works fine with both llama2-7b-tp1 and llama2-7b-tp2.
Could you
Thanks, I will try it. I installed tensorrt_llm from source in my own container; the environment info is as mentioned above.
Maybe you can try the Docker image built following the instructions at https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#building-a-tensorrt-llm-docker-image
I rebuilt tensorrt_llm, and now the error looks like this (there is an error about permissions):
[40a8c9673b05:1447853] Read -1, expected 33554432, errno = 14
[40a8c9673b05:1447850] *** Process received signal ***
[40a8c9673b05:1447850] Signal: Segmentation fault (11)
[40a8c9673b05:1447850] Signal code: Invalid permissions (2)
[40a8c9673b05:1447850] Failing at address: 0x9c5c12400
[40a8c9673b05:1447850] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f39b8619520]
[40a8c9673b05:1447850] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a67cd)[0x7f39b877d7cd]
[40a8c9673b05:1447850] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x3244)[0x7f39600ec244]
[40a8c9673b05:1447850] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1b6)[0x7f3960048556]
[40a8c9673b05:1447850] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x201)[0x7f3960046811]
[40a8c9673b05:1447850] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x95)[0x7f39600f0ae5]
[40a8c9673b05:1447850] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x7e24)[0x7f39600f0e24]
[40a8c9673b05:1447850] [ 7] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7f39b8a3f714]
[40a8c9673b05:1447850] [ 8] /lib/x86_64-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0x7f39b8a4c38d]
[40a8c9673b05:1447850] [ 9] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_mprobe+0x52d)[0x7f39600432fd]
[40a8c9673b05:1447850] [10] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Mprobe+0xd7)[0x7f39b8b440e7]
[40a8c9673b05:1447850] [11] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZNK12tensorrt_llm3mpi7MpiComm6mprobeEiiPP14ompi_message_tP20ompi_status_public_t+0x2a)[0x7f39be09be6a]
[40a8c9673b05:1447850] [12] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl19leaderRecvReqThreadEv+0x133)[0x7f39c03c4e23]
[40a8c9673b05:1447850] [13] /xxx/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930)[0x7f39bbee7930]
[40a8c9673b05:1447850] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f39b866bac3]
[40a8c9673b05:1447850] [15] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f39b86fd850]
[40a8c9673b05:1447850] *** End of error message ***
[40a8c9673b05:1447854] Read -1, expected 16777216, errno = 14
[40a8c9673b05:1447851] *** Process received signal ***
[40a8c9673b05:1447851] Signal: Segmentation fault (11)
[40a8c9673b05:1447851] Signal code: Invalid permissions (2)
[40a8c9673b05:1447851] Failing at address: 0x9a2d12600
[40a8c9673b05:1447851] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fb0f5419520]
[40a8c9673b05:1447851] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a67cd)[0x7fb0f557d7cd]
[40a8c9673b05:1447851] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x3244)[0x7fb09c30a244]
[40a8c9673b05:1447851] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1b6)[0x7fb09c165556]
[40a8c9673b05:1447851] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x201)[0x7fb09c163811]
[40a8c9673b05:1447851] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x95)[0x7fb09c30eae5]
[40a8c9673b05:1447851] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x7e24)[0x7fb09c30ee24]
[40a8c9673b05:1447851] [ 7] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7fb0f583f714]
[40a8c9673b05:1447851] [ 8] /lib/x86_64-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0x7fb0f584c38d]
[40a8c9673b05:1447851] [ 9] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_mprobe+0x52d)[0x7fb09c1602fd]
[40a8c9673b05:1447851] [10] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Mprobe+0xd7)[0x7fb0f59440e7]
[40a8c9673b05:1447851] [11] [40a8c9673b05:1447855] Read -1, expected 16777216, errno = 14
[40a8c9673b05:1447852] *** Process received signal ***
[40a8c9673b05:1447852] Signal: Segmentation fault (11)
[40a8c9673b05:1447852] Signal code: Invalid permissions (2)
[40a8c9673b05:1447852] Failing at address: 0x9a2d12600
[40a8c9673b05:1447852] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc1fee19520]
[40a8c9673b05:1447852] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a67cd)[0x7fc1fef7d7cd]
[40a8c9673b05:1447852] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x3244)[0x7fc1a598e244]
[40a8c9673b05:1447852] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1b6)[0x7fc1a53e8556]
[40a8c9673b05:1447852] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x201)[0x7fc1a53e6811]
[40a8c9673b05:1447852] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x95)[0x7fc1a5992ae5]
[40a8c9673b05:1447852] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x7db1)[0x7fc1a5992db1]
[40a8c9673b05:1447852] [ 7] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZNK12tensorrt_llm3mpi7MpiComm6mprobeEiiPP14ompi_message_tP20ompi_status_public_t+0x2a)[0x7fb0fae9be6a]
[40a8c9673b05:1447851] [12] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7fc1ff23f714]
[40a8c9673b05:1447852] [ 8] /lib/x86_64-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0x7fc1ff24c38d]
[40a8c9673b05:1447852] [ 9] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x24b)[0x7fc1ff3192db]
[40a8c9673b05:1447852] [10] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x5ea)[0x7fc1ff36d40a]
[40a8c9673b05:1447852] [11] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7fc1ff36e6c1]
[40a8c9673b05:1447852] [12] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7fc1a534b640]
[40a8c9673b05:1447852] [13] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x121)[0x7fc1ff32d881]
[40a8c9673b05:1447852] [14] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl19leaderRecvReqThreadEv+0x133)[0x7fb0fd1c4e23]
[40a8c9673b05:1447851] [13] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZNK12tensorrt_llm3mpi7MpiComm5bcastEPvmNS0_7MpiTypeEi+0x47)[0x7fc20489d7b7]
[40a8c9673b05:1447852] [15] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl16getNewReqWithIdsEiSt8optionalIfE+0x68b)[0x7fc206bb787b]
[40a8c9673b05:1447852] [16] /xxx/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930)[0x7fb0f8ce7930]
[40a8c9673b05:1447851] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7fb0f546bac3]
[40a8c9673b05:1447851] [15] /xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl16fetchNewRequestsEiSt8optionalIfE+0x59)[0x7fc206bc5949]
[40a8c9673b05:1447852] [17] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7fb0f54fd850]
[40a8c9673b05:1447851] *** End of error message ***
/xxx/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl13executionLoopEv+0x3bd)[0x7fc206bc7f5d]
[40a8c9673b05:1447852] [18] /xxx/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930)[0x7fc2026e7930]
[40a8c9673b05:1447852] [19] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7fc1fee6bac3]
[40a8c9673b05:1447852] [20] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7fc1feefd850]
[40a8c9673b05:1447852] *** End of error message ***
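The repeated "Read -1, expected N, errno = 14" lines are EFAULT returns from process_vm_readv, which Open MPI's vader shared-memory BTL uses for CMA single-copy transfers; inside a container that lacks ptrace capability, CMA commonly fails this way. The following workaround is a sketch based on that assumption (it is not confirmed anywhere in this thread): either disable the single-copy mechanism via an MCA parameter, or grant the container the ptrace capability.

```shell
# Option 1: tell the vader BTL not to use CMA single-copy transfers
# (can also be passed as: mpirun --mca btl_vader_single_copy_mechanism none ...)
export OMPI_MCA_btl_vader_single_copy_mechanism=none
mpirun -n 7 disaggServerBenchmark ...  # same arguments as in the reproduction command

# Option 2: start the container with ptrace capability so CMA is permitted
# docker run --cap-add=SYS_PTRACE ...
```

Option 1 trades some shared-memory bandwidth for portability; Option 2 keeps single-copy performance but requires control over how the container is launched.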
Maybe you can't start the
Thanks, I have executed executorExampleAdvanced successfully:
./build/executorExampleAdvanced --engine_dir /data/models/llm/trtllm_0.16.0.dev2024112600/llama2-7b-tp1 --input_tokens_csv_file ./inputTokens.csv --use_orchestrator_mode --worker_executable_path ../../../cpp/build/tensorrt_llm/executor_worker/executorWorker
The output log:
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024112600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024112600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 12869 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1112.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12853 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 346.17 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.16 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.11 GiB, available: 26.40 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 761
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 23.78 GiB for max tokens in paged KV cache (48704).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[TensorRT-LLM][INFO] Executor instance created by worker
[TensorRT-LLM][INFO] Reading input tokens from ./inputTokens.csv
[TensorRT-LLM][INFO] Number of requests: 3
[TensorRT-LLM][INFO] Creating request with 6 input tokens
[TensorRT-LLM][INFO] Creating request with 4 input tokens
[TensorRT-LLM][INFO] Creating request with 10 input tokens
[TensorRT-LLM][INFO] Got 20 tokens for seqIdx 0 for requestId 3
[TensorRT-LLM][INFO] Request id 3 is completed.
[TensorRT-LLM][INFO] Got 14 tokens for seqIdx 0 for requestId 2
[TensorRT-LLM][INFO] Request id 2 is completed.
[TensorRT-LLM][INFO] Got 16 tokens for seqIdx 0 for requestId 1
[TensorRT-LLM][INFO] Request id 1 is completed.
[TensorRT-LLM][INFO] Writing output tokens to outputTokens.csv
[TensorRT-LLM][INFO] Exiting.
[TensorRT-LLM][INFO] Orchestrator sendReq thread exiting
[TensorRT-LLM][INFO] Orchestrator recv thread exiting
[TensorRT-LLM][INFO] Leader recvReq thread exiting
[TensorRT-LLM][INFO] Leader sendThread exiting
[TensorRT-LLM][INFO] Refreshed the MPI local session
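As a sanity check, the KV-cache numbers in the log above are internally consistent. Assuming llama2-7b in fp16 (32 layers, hidden size 4096, 2 bytes per element — model parameters not stated in the log, so these are assumptions), the per-token KV footprint multiplied by the reported block count reproduces both the "(48704)" token count and the 23.78 GiB allocation:

```python
# KV-cache sizing check against the log above (llama2-7b fp16 assumed)
num_layers, hidden_size, bytes_per_elem = 32, 4096, 2
blocks, tokens_per_block = 761, 64  # "Number of blocks": 761, "tokens per block": 64

max_tokens = blocks * tokens_per_block
print(max_tokens)  # → 48704, matching "(48704)" in the log

# per token: one K and one V vector of size hidden_size, per layer
bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem
total_gib = max_tokens * bytes_per_token / 2**30
print(f"{total_gib:.2f} GiB")  # → 23.78 GiB, matching the allocation line
```

This kind of back-of-the-envelope check is useful when deciding whether the "available: 26.40 GiB" headroom in the log leaves room for a larger KV cache.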
Having trouble using nvcr.io/nvidia/pytorch:24.10-py3-based containers?
I will try it later; it seems my current env can use orchestrator mode.
System Info
Who can help?
@ncomly-nvidia
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
python scripts/build_wheel.py --trt_root=/usr/local/tensorrt --clean --cuda_architectures='90-real' --benchmarks
mpirun -n 7 disaggServerBenchmark --context_engine_dirs /data/models/llm/trtllm_0.16.0.dev2024112600/llama2-7b-tp1,/data/models/llm/trtllm_0.16.0.dev2024112600/llama2-7b-tp2 --generation_engine_dirs /data/models/llm/trtllm_0.16.0.dev2024112600/llama2-7b-tp1,/data/models/llm/trtllm_0.16.0.dev2024112600/llama2-7b-tp2
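The -n 7 in the reproduction command matches one rank per TP shard of every context and generation engine plus one orchestrator/leader rank — this breakdown is my reading of the command, not something stated in the thread:

```python
def ranks_needed(context_tp_sizes, generation_tp_sizes):
    """World size: one rank per TP shard of each engine, plus one extra rank
    for the orchestrator/leader (assumed layout, not from the docs)."""
    return sum(context_tp_sizes) + sum(generation_tp_sizes) + 1

# context engines: llama2-7b-tp1 and llama2-7b-tp2 -> TP sizes [1, 2]
# generation engines: same two engines          -> TP sizes [1, 2]
print(ranks_needed([1, 2], [1, 2]))  # → 7, matching `mpirun -n 7`
```

If the rank count and the engines' combined TP sizes disagree, MPI jobs of this shape typically abort at startup, so this arithmetic is worth re-checking whenever the engine list changes.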
Expected behavior
success
Actual behavior
Additional notes
Thanks for your attention!