zephyr-7b-beta fp16 engine outputs "\u68a6\u68a6\u68a6..." for long input ~7000 tokens #1529
Comments
Update: we have identified that the issue is caused by the parameter "--multi_block_mode", which was recommended to us to improve computing efficiency for long sequences. Removing it resolves our issue for the moment, but please look into it.
Update: this issue happens when multi_block_mode is on and the number of input tokens is > ~5000. Please look into it.
Thanks. I will take a look at it.
@PerkzZheng Is there any progress on this issue?
Could you provide the prompt (over 7K tokens) you are using?
It does not require any specific input; in fact, any input longer than 5k tokens will trigger this error. You can use any sufficiently long text, such as CNN news articles, to replicate the error.
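Since any sufficiently long input triggers the bug, a reproduction prompt can be generated from filler text; a minimal sketch (the 0.75 words-per-token ratio is a rough English-text heuristic, not a tokenizer measurement):

```python
# Build a filler prompt long enough to trigger the bug (> ~5000 tokens).
# Per the report, any text works; a sentence is repeated until a rough
# word-count proxy for the token count exceeds the target.

TARGET_TOKENS = 7000
FILLER = "The quick brown fox jumps over the lazy dog near the river bank. "

def make_long_prompt(target_tokens: int = TARGET_TOKENS) -> str:
    # Heuristic: roughly 0.75 words per token for English prose.
    target_words = int(target_tokens * 0.75)
    words_per_chunk = len(FILLER.split())
    repeats = target_words // words_per_chunk + 1
    return FILLER * repeats

prompt = make_long_prompt()
print(len(prompt.split()))  # 5252 words, roughly 7000 tokens
```

For an exact count, the same loop could be driven by the model's own tokenizer instead of the word-count heuristic.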
@Hao-YunDeng The fix will be included in next week's update (Tuesday). Feel free to give it a try.
System Info
GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3
versions:
Model: zephyr-7b-beta
Who can help?
@kaiyux @byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
step 1:
step 2:
--output_dir zephyr-7b-beta-trt-engine \
--workers 1 \
--remove_input_padding enable \
--context_fmha enable \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--paged_kv_cache enable \
--max_num_tokens 65536 \
--max_batch_size 32 \
--max_input_len 16384 \
--multi_block_mode enable \
--strongly_typed
step 3: tensorrtllm_backend parameters:
step 4:
Expected behavior
}
actual behavior
}
additional notes
The issue also persists for
The issue disappears for
The issue persists for actual input concurrency = 1, 2, or 3, and may disappear when the concurrency is >= 4 (for both --max_batch_size 8 and --max_batch_size 32)
The issue does not occur for the smoothquant zephyr-7b-beta model with any of the above reported parameter sets
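The failure signature (a long run of one repeated character, here U+68A6 "梦") can be flagged automatically when sweeping concurrency and batch-size settings; a minimal sketch, with an arbitrary 0.9 dominance threshold chosen for illustration:

```python
from collections import Counter

def looks_degenerate(text: str, threshold: float = 0.9) -> bool:
    """Return True when a single character dominates the output,
    as in the reported "\u68a6\u68a6\u68a6..." failure mode."""
    if not text:
        return False
    top_count = Counter(text).most_common(1)[0][1]
    return top_count / len(text) >= threshold

print(looks_degenerate("\u68a6" * 200))        # True
print(looks_degenerate("a normal sentence."))  # False
```

A token-level check against the model's tokenizer output would be stricter, but a character-level scan is enough to catch this particular repetition pattern.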