@byshiue Hi, it's been a while. I've been trying to upgrade from Llama 3.1 to Llama 3.3 and am running into the same error on both model versions:
trt-llm-build-1 | error: creating server: Internal - failed to load all models
trt-llm-build-1 | [1735845475.460529] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e5ee80 was not returned to mpool tl_ucp_req_mp
trt-llm-build-1 | [1735845475.474010] [dev-station:69 :0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x71ba40e77980 was not matched
trt-llm-build-1 | [1735845475.474543] [dev-station:68 :0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x7418bf5e0cc0 was not matched
trt-llm-build-1 | [1735845475.475317] [dev-station:67 :0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x7527b8e85240 was not matched
trt-llm-build-1 | [1735845475.475422] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e68e00 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1 | [1735845475.475434] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e68f40 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1 | [1735845475.475438] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e69080 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1 | [1735845475.477120] [dev-station:66 :0] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
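Only the first line of this log is reported as an error; every other line is a UCX warning. A minimal sketch for separating the two when triaging a longer log follows. The function name `triage` and the prefix pattern are illustrative, not part of any TensorRT-LLM or Triton tooling, and treating the UCX warnings as secondary to the load failure is an assumption that the log itself does not confirm.

```python
import re

# Strips the leading "<container-name> | " prefix seen in the log above.
PREFIX = re.compile(r"^\S+\s+\|\s+")


def triage(lines):
    """Split log lines into (errors, ucx_warnings), dropping everything else."""
    errors, ucx_warnings = [], []
    for raw in lines:
        line = PREFIX.sub("", raw.strip())
        if " UCX " in line and "WARN" in line:
            ucx_warnings.append(line)
        elif line.lower().startswith("error:"):
            errors.append(line)
    return errors, ucx_warnings


# Three lines copied from the log above.
sample = [
    "trt-llm-build-1 | error: creating server: Internal - failed to load all models",
    "trt-llm-build-1 | [1735845475.460529] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e5ee80 was not returned to mpool tl_ucp_req_mp",
    "trt-llm-build-1 | [1735845475.477120] [dev-station:66 :0] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy",
]

errors, warnings = triage(sample)
print(errors)
print(len(warnings))
```

Running this over the full server log should leave a short list of `error:` lines to investigate first, with the UCX warnings set aside for later.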
Also, why is the input context length ignored when building the engine with --use_paged_context_fmha enabled? It's essential to be able to set the context length to fit.
Best,
christianci
Information
The official example scripts
My own modified scripts
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
Use a DGX H100 node: quantize the model, build the engine, and try to run it with the latest commit.
Expected behavior
Runs
actual behavior
Fails:
trt-llm-build-1 | error: creating server: Internal - failed to load all models
trt-llm-build-1 | [1735845475.460529] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e5ee80 was not returned to mpool tl_ucp_req_mp
trt-llm-build-1 | [1735845475.474010] [dev-station:69 :0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x71ba40e77980 was not matched
trt-llm-build-1 | [1735845475.474543] [dev-station:68 :0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x7418bf5e0cc0 was not matched
trt-llm-build-1 | [1735845475.475317] [dev-station:67 :0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x7527b8e85240 was not matched
trt-llm-build-1 | [1735845475.475422] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e68e00 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1 | [1735845475.475434] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e68f40 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1 | [1735845475.475438] [dev-station:66 :0] mpool.c:54 UCX WARN object 0x72eda4e69080 {{cb|rcv_tag} recv length 4 host memory} was not returned to mpool ucp_requests
trt-llm-build-1 | [1735845475.477120] [dev-station:66 :0] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
additional notes
None
@nv-guomingz and @byshiue I discovered that the issue was a conflict with an older backend commit that hadn't been updated on our end. After fixing that, I was able to run Llama 3.3 on the DGX node. Thanks.
System Info
tensorrtllm_backend
Backend commit: 7a56e091a788ccf042760cf2c63ea957efc398db
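Since the failure was traced to a backend commit mismatch, it can help to confirm which commit a checkout is actually on before building. The sketch below is one way to do that with plain `git`; the helper names and the repository path in the usage note are hypothetical, and the hash is used only as a sample value taken from this report, not as a known-good commit.

```python
import subprocess


def checked_out_commit(repo_dir):
    """Return the full hash of HEAD for the git checkout at repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def same_commit(expected, actual):
    """Compare two hashes; either may be an abbreviated prefix of at least 7 characters."""
    a, b = expected.strip().lower(), actual.strip().lower()
    if min(len(a), len(b)) < 7:
        return False
    return a.startswith(b) or b.startswith(a)


# Sample value only: the backend commit listed in this report.
REPORTED = "7a56e091a788ccf042760cf2c63ea957efc398db"
print(same_commit(REPORTED, "7a56e09"))
```

In practice the comparison would look like `same_commit(pinned_hash, checked_out_commit("path/to/tensorrtllm_backend"))`, where `pinned_hash` is whatever commit the deployment is meant to use.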