
zephyr-7b-beta fp16 engine outputs "\u68a6\u68a6\u68a6..." for long input ~7000 tokens #1529

Closed
2 of 4 tasks
Hao-YunDeng opened this issue Apr 30, 2024 · 7 comments · Fixed by #1688
Labels: bug (Something isn't working)

Hao-YunDeng commented Apr 30, 2024

System Info

GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3
versions:

Model: zephyr-7b-beta

Who can help?

@kaiyux @byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Step 1:

  • python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir zephyr-7b-beta --output_dir zephyr-7b-beta-converted --dtype float16

Step 2:

  • trtllm-build --checkpoint_dir zephyr-7b-beta-converted \
    --output_dir zephyr-7b-beta-trt-engine \
    --workers 1 \
    --remove_input_padding enable \
    --context_fmha enable \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 16384 \
    --multi_block_mode enable \
    --strongly_typed
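
For reference, the limits and plugin flags that actually got baked into the engine can be double-checked after the build. A minimal sketch, assuming trtllm-build writes a config.json into the engine directory (exact key names may vary between TensorRT-LLM versions):

    import json
    from pathlib import Path

    # Inspect the build settings recorded alongside the engine (assumed layout).
    engine_dir = Path("zephyr-7b-beta-trt-engine")
    config = json.loads((engine_dir / "config.json").read_text())

    build_cfg = config.get("build_config", {})
    plugin_cfg = build_cfg.get("plugin_config", {})
    print("max_input_len   :", build_cfg.get("max_input_len"))
    print("max_num_tokens  :", build_cfg.get("max_num_tokens"))
    print("max_batch_size  :", build_cfg.get("max_batch_size"))
    print("multi_block_mode:", plugin_cfg.get("multi_block_mode"))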

Step 3 (tensorrtllm_backend parameters):

  • MODEL_PATH=zephyr-7b-beta
  • MODEL_PIPELINE_NAME=triton_model_repo
  • MAX_BATCH_SIZE=32
  • ENGINE_PATH=zephyr-7b-beta-trt-engine
  • MAX_ATTENTION_WINDOW_SIZE=4096
  • KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
  • batch_scheduler_policy=guaranteed_no_evict
  • python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:zephyr-7b-beta/,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
  • python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/postprocessing/config.pbtxt tokenizer_dir:${MODEL_PATH},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
  • python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
  • python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
  • python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,batch_scheduler_policy:${batch_scheduler_policy}
  • python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/code/tensorrtllm_backend/${MODEL_PIPELINE_NAME} --http_port=8081 --log --log-file ${MODEL_PIPELINE_NAME}_triton_log.txt

Step 4:

  • Run inference on a long input text of ~7000 tokens (a request sketch is shown below).
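
A minimal sketch of such a request, assuming the Triton generate endpoint exposed by the ensemble model on the HTTP port configured above (8081) and a hypothetical long_input.txt file holding the ~7000-token prompt:

    import requests

    # Hypothetical long input; any text of roughly 7000 tokens reproduces the issue.
    long_prompt = open("long_input.txt", encoding="utf-8").read()

    payload = {
        "text_input": long_prompt,
        "max_tokens": 256,
        "bad_words": "",
        "stop_words": "",
    }
    resp = requests.post(
        "http://localhost:8081/v2/models/ensemble/generate",
        json=payload,
        timeout=300,
    )
    print(resp.status_code)
    print(resp.json().get("text_output"))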

Expected behavior

"response": {
    "context_logits": 0.0,
    "cum_log_probs": 0.0,
    "generation_logits": 0.0,
    "model_name": "ensemble",
    "model_version": "1",
    "output_log_probs": [
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0
    ],
    "sequence_end": false,
    "sequence_id": 0,
    "sequence_start": false,
    "text_output": "1. <NAME> will set up an account and a control tower for the POC.\n2. <NAME> will check if the \"raise hand\" feature is available with the Realwear device.\n3. <NAME> will start the POC next week."
},
"status_code": 200,
"request_time": 2.1334147453308105

}

Actual behavior

"response": {
    "context_logits": 0.0,
    "cum_log_probs": 0.0,
    "generation_logits": 0.0,
    "model_name": "ensemble",
    "model_version": "1",
    "output_log_probs": [
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0
    ],
    "sequence_end": false,
    "sequence_id": 0,
    "sequence_start": false,
    "text_output": "\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6\u68a6"
},
"status_code": 200,
"request_time": 10.642920732498169

}

Additional notes

The issue also persists for

   ... 
    --max_batch_size 8 \
    --max_input_len 16384 \
   ...

The issue disappears for

   ... 
    --max_batch_size 1 \
    --max_input_len 16384 \
   ...

The issue persists when the actual input concurrency is 1, 2, or 3, and may disappear when the concurrency is >= 4 (with --max_batch_size 8 or --max_batch_size 32); see the concurrency sketch below.

The issue does not occur with the SmoothQuant zephyr-7b-beta model for any of the parameter sets reported above.
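
For completeness, the concurrency experiments above were run along these lines; a minimal sketch, assuming the same generate endpoint and the same long prompt as in step 4:

    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "http://localhost:8081/v2/models/ensemble/generate"

    def send_request(prompt):
        # One generate call against the ensemble model (same payload as in step 4).
        payload = {"text_input": prompt, "max_tokens": 256, "bad_words": "", "stop_words": ""}
        return requests.post(URL, json=payload, timeout=300).json().get("text_output", "")

    def run_concurrent(prompt, concurrency):
        # Fire `concurrency` identical long-input requests at the same time.
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            return list(pool.map(send_request, [prompt] * concurrency))

    # Garbled "\u68a6..." output was observed at concurrency 1-3; at >= 4 it may not appear.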

Hao-YunDeng added the bug label on Apr 30, 2024
Hao-YunDeng (Author) commented:

Update: we have identified that the issue is caused by the --multi_block_mode parameter, which was recommended to improve computing efficiency for long sequences. Removing it resolves our issue for the moment, but please look into it.

Hao-YunDeng (Author) commented:

Update: this issue happens when multi_block_mode is on and the number of input tokens exceeds ~5000.

Please look into it.
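
A quick way to check whether an input crosses that threshold, using the same tokenizer directory as in the Triton preprocessing config above (and the hypothetical long_input.txt from step 4):

    from transformers import AutoTokenizer

    # Count tokens the same way the preprocessing model would (tokenizer_dir from step 3).
    tokenizer = AutoTokenizer.from_pretrained("zephyr-7b-beta")

    prompt = open("long_input.txt", encoding="utf-8").read()
    num_tokens = len(tokenizer.encode(prompt))
    print(num_tokens)  # the garbled output appears once this exceeds ~5000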

PerkzZheng (Collaborator) commented:

Thanks, I will take a look at it. It looks like there are some issues when multi_block_mode and sliding-window attention work together.

fan-niu commented May 20, 2024

> Thanks, I will take a look at it. It looks like there are some issues when multi_block_mode and sliding-window attention work together.

@PerkzZheng Is there any progress on this issue?

PerkzZheng (Collaborator) commented:

> > Thanks, I will take a look at it. It looks like there are some issues when multi_block_mode and sliding-window attention work together.
>
> @PerkzZheng Is there any progress on this issue?

Could you provide the prompt (over 7K tokens) you are using?

Hao-YunDeng commented May 20, 2024

> > > Thanks, I will take a look at it. It looks like there are some issues when multi_block_mode and sliding-window attention work together.
> >
> > @PerkzZheng Is there any progress on this issue?
>
> Could you provide the prompt (over 7K tokens) you are using?

It does not require any specific input; any input longer than ~5k tokens will trigger this error. You can use any sufficiently long text, such as a CNN news article, to replicate it.

PerkzZheng (Collaborator) commented:

@Hao-YunDeng the fix will be included in next week's update (Tuesday). Feel free to give it a try.
