Mpool Failure on H100 DGX node #2649

Closed
christian-ci opened this issue Jan 2, 2025 · 2 comments
Labels: bug (Something isn't working)

Comments

@christian-ci

System Info

  • CPU architecture: x86_64
  • Host memory: 2.0Ti
  • GPU: NVIDIA H100 80GB HBM3
  • tensorrtllm_backend commit: 7a56e091a788ccf042760cf2c63ea957efc398db
  • NVIDIA driver version: 550.127.05
  • OS: Ubuntu 22.04

Who can help?

@byshiue Hi, it's been a while. I've been trying to upgrade from Llama 3.1 to Llama 3.3 and am running into the same error on both model versions:

trt-llm-build-1  | error: creating server: Internal - failed to load all models
trt-llm-build-1  | [1735845475.460529] [dev-station:66   :0]           mpool.c:54   UCX  WARN  object 0x72eda4e5ee80 was not returned to mpool tl_ucp_req_mp
trt-llm-build-1  | [1735845475.474010] [dev-station:69   :0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x71ba40e77980 was not matched
trt-llm-build-1  | [1735845475.474543] [dev-station:68   :0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x7418bf5e0cc0 was not matched
trt-llm-build-1  | [1735845475.475317] [dev-station:67   :0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x7527b8e85240 was not matched
trt-llm-build-1  | [1735845475.475422] [dev-station:66   :0]           mpool.c:54   UCX  WARN  object 0x72eda4e68e00 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1  | [1735845475.475434] [dev-station:66   :0]           mpool.c:54   UCX  WARN  object 0x72eda4e68f40 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1  | [1735845475.475438] [dev-station:66   :0]           mpool.c:54   UCX  WARN  object 0x72eda4e69080 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1  | [1735845475.477120] [dev-station:66   :0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy

Also, why is the input context length ignored when building the engine with --use_paged_context_fmha enabled? Setting the context length to fit is essential.

Best,
christianci

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

On a DGX H100 node, quantize the model, build the engine, and try to run it with the latest commit.

Expected behavior

Runs

actual behavior

Fails:

trt-llm-build-1 | error: creating server: Internal - failed to load all models
trt-llm-build-1 | [1735845475.460529] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e5ee80 was not returned to mpool tl_ucp_req_mp
trt-llm-build-1 | [1735845475.474010] [dev-station:69 :0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x71ba40e77980 was not matched
trt-llm-build-1 | [1735845475.474543] [dev-station:68 :0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x7418bf5e0cc0 was not matched
trt-llm-build-1 | [1735845475.475317] [dev-station:67 :0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x7527b8e85240 was not matched
trt-llm-build-1 | [1735845475.475422] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e68e00 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1 | [1735845475.475434] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e68f40 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1 | [1735845475.475438] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e69080 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1 | [1735845475.477120] [dev-station:66 :0] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy

additional notes

None

christian-ci added the bug label Jan 2, 2025
@nv-guomingz
Collaborator

Hi @byshiue, would you please take a look at this?

@christian-ci
Author

@nv-guomingz and @byshiue I discovered that the issue was caused by a conflict with an older backend commit that had not been updated on our end. With that fixed, I was able to run Llama 3.3 on the DGX node. Thanks.
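
For anyone hitting the same symptom, one way to catch a stale backend checkout early is to compare the commit actually checked out against the commit you intended to build from. The following is only a minimal sketch, not part of tensorrtllm_backend: the helper names `head_commit` and `commits_match` are hypothetical, and it assumes `git` is on the PATH and that the backend lives in a local git checkout.

```python
import subprocess


def head_commit(repo_dir: str) -> str:
    """Return the full hash of the commit checked out in repo_dir.

    Assumes repo_dir is a git working tree and `git` is installed.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def commits_match(expected: str, actual: str) -> bool:
    """Compare two commit hashes, allowing either one to be abbreviated."""
    expected = expected.strip().lower()
    actual = actual.strip().lower()
    if not expected or not actual:
        return False
    # An abbreviated hash is a prefix of the full hash.
    return actual.startswith(expected) or expected.startswith(actual)


# Example usage (the checkout path below is a placeholder, not from this issue):
#   pinned = "7a56e091a788ccf042760cf2c63ea957efc398db"
#   if not commits_match(pinned, head_commit("/path/to/tensorrtllm_backend")):
#       raise SystemExit("backend checkout does not match the pinned commit")
```

Running a check like this before building the engine would flag a mismatch between the pinned commit and the local checkout, rather than leaving it to surface as a model load failure at server start.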
