Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:182) #1552

Closed
sleepwalker2017 opened this issue May 7, 2024 · 2 comments
Labels: bug (Something isn't working), stale, triaged (Issue has been triaged by maintainers)

Comments


sleepwalker2017 commented May 7, 2024

System Info

GPU: 2x A30; TRT-LLM branch: main; commit id: 66ef1df

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

MODEL_CHECKPOINT=/data/models/vicuna-7b-v1.5/
CONVERTED_CHECKPOINT=Llama-7b-hf-ckpt

DTYPE=float16
TP=2

echo "step 1: convert checkpoint"
# Convert the HF checkpoint to TRT-LLM format
python convert_checkpoint.py --model_dir ${MODEL_CHECKPOINT} \
                              --output_dir ${CONVERTED_CHECKPOINT} \
                              --dtype ${DTYPE} \
                              --tp_size ${TP} \
                              --pp_size 1

SOURCE_LORA=/data/Llama2-Chinese-7b-Chat-LoRA/
#SOURCE_LORA=/data/llama2-7b-lora.tar.gz
CPP_LORA=chinese-llama-2-lora-7b-cpp

EG_DIR=/tmp/lora-eg

PP=1
MAX_LEN=1024
MAX_BATCH=16
TOKENIZER=/data/models/vicuna-7b-v1.5/
LORA_ENGINE=Llama-2-7b-hf-engine
NUM_LORAS=(8)
NUM_REQUESTS=200

echo "step 2: trtllm-build"
trtllm-build \
    --checkpoint_dir ${CONVERTED_CHECKPOINT} \
    --output_dir ${LORA_ENGINE} \
    --max_batch_size ${MAX_BATCH} \
    --max_input_len $MAX_LEN \
    --max_output_len $MAX_LEN \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --gemm_plugin float16 \
    --lora_plugin float16 \
    --use_paged_context_fmha enable \
    --use_custom_all_reduce disable \
    --lora_target_modules attn_qkv attn_dense mlp_h_to_4h mlp_gate mlp_4h_to_h
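
Aside (not in the original script): the cache sizing in the benchmark step below assumes a maximum adapter rank of 8 (MAX_LORA_RANK=8), while the engine build here leaves the engine's maximum LoRA rank at its default. If your trtllm-build revision exposes a --max_lora_rank option (verify with trtllm-build --help; this is an assumption about the flag set available on commit 66ef1df), pinning it to the adapter rank keeps the two consistent:

# Hedged sketch: add to the trtllm-build invocation above, if the flag exists on this commit.
#     --max_lora_rank 8 \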
echo "step 3: Convert LoRA to cpp format"
# Convert LoRA to cpp format
python ../hf_lora_convert.py \
    -i $SOURCE_LORA \
    --storage-type $DTYPE \
    -o $CPP_LORA
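
A quick sanity check after step 3 (a sketch; the file names below are from memory of what hf_lora_convert.py emits and may differ by version):

ls ${CPP_LORA}
# expected: the cpp-format tensors, e.g. model.lora_config.npy and model.lora_weights.npy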

echo "step 4: prepare dataset for non-lora requests"
mkdir -p $EG_DIR/data
python ../../benchmarks/cpp/prepare_dataset.py \
    --output ${EG_DIR}/data/token-norm-dist.json \
    --request-rate -1 \
    --time-delay-dist constant \
    --tokenizer $TOKENIZER \
    token-norm-dist \
    --num-requests $NUM_REQUESTS \
    --input-mean 256 --input-stdev 16 --output-mean 128 --output-stdev 24

echo "step 5: prepare dataset for lora requests"
for nloras in ${NUM_LORAS[@]}; do
    python ../../benchmarks/cpp/prepare_dataset.py \
        --output "${EG_DIR}/data/token-norm-dist-lora-${nloras}.json" \
        --request-rate -1 \
        --time-delay-dist constant \
        --rand-task-id 0 $(( $nloras - 1 )) \
        --tokenizer $TOKENIZER \
        token-norm-dist \
        --num-requests $NUM_REQUESTS \
        --input-mean 256 --input-stdev 16 --output-mean 128 --output-stdev 24
done
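
Note that --rand-task-id 0 $(( $nloras - 1 )) tags each request with a LoRA task ID in [0, nloras-1], and at runtime every task ID must resolve to adapter weights that are either already in the PEFT cache or attached to the request. The repo's LoRA benchmark example covers this by cloning the converted adapter into one randomized copy per task ID; a sketch, assuming benchmarks/cpp/utils/generate_rand_loras.py is present on this commit:

echo "step 5b: generate one adapter directory per task id (0..7)"
python ../../benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 8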

mkdir -p ${EG_DIR}/log-base-lora

NUM_LAYERS=32
NUM_LORA_MODS=8
MAX_LORA_RANK=8
EOS_ID=-1
mpirun -n ${TP} --allow-run-as-root --output-filename ${EG_DIR}/log-base-lora \
    ../../cpp/build/benchmarks/gptManagerBenchmark \
    --engine_dir $LORA_ENGINE \
    --type IFB \
    --dataset "${EG_DIR}/data/token-norm-dist-lora-8.json" \
    --lora_host_cache_bytes 8589934592 \
    --lora_num_device_mod_layers $(( 8 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
    --kv_cache_free_gpu_mem_fraction 0.80 \
    --log_level info \
    --eos_id ${EOS_ID}
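
This last step is the likely source of the assertion: step 3 converts the adapter, but nothing ever passes it to gptManagerBenchmark, so requests tagged with task IDs 0-7 reach the PEFT cache with no weights to load. A sketch of the corrected run, assuming the --lora_dir option used in the repo's LoRA benchmark example (confirm against gptManagerBenchmark --help on this commit) and the per-task directories generated in step 5b above:

mpirun -n ${TP} --allow-run-as-root --output-filename ${EG_DIR}/log-base-lora \
    ../../cpp/build/benchmarks/gptManagerBenchmark \
    --engine_dir $LORA_ENGINE \
    --type IFB \
    --dataset "${EG_DIR}/data/token-norm-dist-lora-8.json" \
    --lora_host_cache_bytes 8589934592 \
    --lora_num_device_mod_layers $(( 8 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
    --kv_cache_free_gpu_mem_fraction 0.80 \
    --log_level info \
    --eos_id ${EOS_ID} \
    --lora_dir ${EG_DIR}/loras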

Expected behavior

The gptManagerBenchmark run with LoRA requests should complete successfully.

Actual behavior

[TensorRT-LLM][ERROR] Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:182)
1       0x5572c6dedde9 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f56c6cd5378 /data/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x69c378) [0x7f56c6cd5378]
3       0x7f56c8c3f03f tensorrt_llm::batch_manager::TrtGptModelInflightBatching::updatePeftCache(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 127
4       0x7f56c8c03078 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 1464
5       0x7f56c8c0342a tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 170
6       0x7f56c64dd253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f56c64dd253]
7       0x7f56c624cac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f56c624cac3]
8       0x7f56c62de850 /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f56c62de850]

Additional notes

None

@VincentJing commented

For the LoRA cpp format, you can refer to this link; for the benchmark script, you can refer to this.

byshiue added the triaged label May 29, 2024
@nv-guomingz commented

Hi @sleepwalker2017, do you have any update on this ticket? If not, we'll close it soon.
