
Error when quantizing Qwen2.5-14B-Instruct with SmoothQuant #2319

Open
liu21yd opened this issue Oct 11, 2024 · 3 comments
Labels: bug (Something isn't working), triaged (Issue has been triaged by maintainers)

Comments


liu21yd commented Oct 11, 2024

TensorRT-LLM version: v0.13.0
GPU: A100

Convert script:

python3 convert_checkpoint.py \
                        --model_dir /model/Qwen2.5-14B-Instruct \
                        --output_dir /model/trt_engines/Qwen2.5-14B-Instruct \
                        --dtype float16 \
                        --smoothquant 0.5 \
                        --per_channel \
                        --per_token \
                        --tp_size 2 \
                        --pp_size 1 \
                        --calib_dataset /app/datasets/cnn_dailymail/train

ERROR:
[screenshot of the error traceback attached to the issue]

I changed lines 300 and 301 in tensorrt_llm/models/qwen/convert.py to:

        # Split the K and V weights along the last dimension into tp_size shards
        k_split = torch.split(k, k.shape[-1] // tp_size, dim=-1)
        v_split = torch.split(v, v.shape[-1] // tp_size, dim=-1)
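
For context, this change shards the K and V projection weights along their last dimension, one equal slice per tensor-parallel rank. A minimal, self-contained sketch of that splitting logic (the shapes below are illustrative placeholders, not necessarily the real Qwen2.5-14B dimensions):

import torch

# Placeholder shapes for illustration only.
hidden_size = 5120
num_kv_heads = 8
head_dim = 128
tp_size = 2

# K/V projection weights laid out as [hidden_size, num_kv_heads * head_dim]
k = torch.randn(hidden_size, num_kv_heads * head_dim)
v = torch.randn(hidden_size, num_kv_heads * head_dim)

# Shard along the last (output) dimension so each tensor-parallel rank
# receives an equal slice, mirroring the fix above.
k_split = torch.split(k, k.shape[-1] // tp_size, dim=-1)
v_split = torch.split(v, v.shape[-1] // tp_size, dim=-1)

for rank, (k_rank, v_rank) in enumerate(zip(k_split, v_split)):
    print(f"rank {rank}: k {tuple(k_rank.shape)}, v {tuple(v_rank.shape)}")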

With that change I can convert the checkpoint successfully, but I get a new error when I try to build the engine with trtllm-build.
[screenshot of the trtllm-build error attached to the issue]
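
The exact build command is not shown in the thread; a typical trtllm-build invocation for this converted checkpoint would look roughly like the following (the output path and plugin setting here are assumptions):

trtllm-build \
    --checkpoint_dir /model/trt_engines/Qwen2.5-14B-Instruct \
    --output_dir /model/trt_engines/Qwen2.5-14B-Instruct/engine \
    --gemm_plugin float16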

Who can help me?

Superjomn added the bug and triaged labels on Oct 16, 2024
jershi425 (Collaborator) commented

Hi @liu21yd, thank you for your feedback. This is indeed a bug. We will fix this in the next release. Before that, you can try this hot fix: #2370.

a2382625920 commented

Were you able to run inference for the Qwen2.5-14B-Instruct model in TensorRT-LLM with the expected acceleration?


Wonder-donbury commented Dec 2, 2024

> Were you able to run inference for the Qwen2.5-14B-Instruct model in TensorRT-LLM with the expected acceleration?

In my case, throughput went from 16 tokens/s with llama.cpp to about 65 tokens/s with TensorRT-LLM, roughly a 4x speedup.
