zephyr-7b-beta fp16 engine outputs "\u68a6\u68a6\u68a6..." for long input ~7000 tokens #1529
Comments
Update: we have identified that the issue is caused by the parameter "--multi_block_mode", which was recommended to us to improve computing efficiency for long sequences. Removing it resolves our issue for the moment, but please look into it.
Update: this issue happens when multi_block_mode is on and the number of input tokens is > ~5000. Please look into it.
Thanks. I will take a look at it.
@PerkzZheng Is there any progress on this issue?
Could you provide the prompt (over 7K tokens) you are using?
It does not require any specific input; in fact, any input longer than 5k tokens will trigger this error. You can use any sufficiently long text, such as CNN news articles, to replicate the error.
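Since any sufficiently long input triggers the bug, a reproduction prompt can be generated from filler text; a minimal sketch (the 0.75 words-per-token ratio is a rough English-text heuristic, not a tokenizer measurement):

```python
# Build a filler prompt long enough to trigger the bug (> ~5000 tokens).
# Per the report, any text works; a sentence is repeated until a rough
# word-count proxy for the token count exceeds the target.

TARGET_TOKENS = 7000
FILLER = "The quick brown fox jumps over the lazy dog near the river bank. "

def make_long_prompt(target_tokens: int = TARGET_TOKENS) -> str:
    # Heuristic: roughly 0.75 words per token for English prose.
    target_words = int(target_tokens * 0.75)
    words_per_chunk = len(FILLER.split())
    repeats = target_words // words_per_chunk + 1
    return FILLER * repeats

prompt = make_long_prompt()
print(len(prompt.split()))  # 5252 words, roughly 7000 tokens
```

For an exact count, the same loop could be driven by the model's own tokenizer instead of the word-count heuristic.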
@Hao-YunDeng The fix will be included in next week's update (Tuesday). Feel free to give it a try.
System Info
GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3
versions:
Model: zephyr-7b-beta
Who can help?
@kaiyux @byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
step 1:
step 2:
--output_dir zephyr-7b-beta-trt-engine \
--workers 1 \
--remove_input_padding enable \
--context_fmha enable \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--paged_kv_cache enable \
--max_num_tokens 65536 \
--max_batch_size 32 \
--max_input_len 16384 \
--multi_block_mode enable \
--strongly_typed
step 3: tensorrtllm_backend parameters:
step 4:
Expected behavior
}
actual behavior
}
additional notes
The issue also persists for
The issue disappears for
The issue persists for actual input concurrency = 1, 2, or 3, and may disappear when the concurrency is >= 4 (for both --max_batch_size 8 and --max_batch_size 32)
The issue does not occur for the smoothquant zephyr-7b-beta model with any of the above reported parameter sets
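The failure signature (a long run of one repeated character, here U+68A6 "梦") can be flagged automatically when sweeping concurrency and batch-size settings; a minimal sketch, with an arbitrary 0.9 dominance threshold chosen for illustration:

```python
from collections import Counter

def looks_degenerate(text: str, threshold: float = 0.9) -> bool:
    """Return True when a single character dominates the output,
    as in the reported "\u68a6\u68a6\u68a6..." failure mode."""
    if not text:
        return False
    top_count = Counter(text).most_common(1)[0][1]
    return top_count / len(text) >= threshold

print(looks_degenerate("\u68a6" * 200))        # True
print(looks_degenerate("a normal sentence."))  # False
```

A token-level check against the model's tokenizer output would be stricter, but a character-level scan is enough to catch this particular repetition pattern.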