Repeated outputs for long input tasks on Llama 3 70B compared to vLLM and HF's transformers #1788

DreamGenX · 2024-06-16T18:31:21Z

System Info

This was tested o na tp=4 4xH100 SXM setup
I tested these 2 releases: Update TensorRT-LLM #1763 and Update TensorRT-LLM #1725

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

I built TensorRT-LLM engine in several different ways, outlined below, and compared the output quality on domain specific task that involves long inputs (typically >>2000 input tokens and >500 output tokens).

The outputs from TensorRT-LLM (obtained through running the run.py script, as well as through running the GptManager in all different modes: V1, InflightBatching, InflightFusedBatching) exhibit repepetition in the outputs ~20% of the time (sample outputs below).

When running the same with vLLM, using the same sampling params (namely temperature, presencePenalty and frequencyPenalty), the outputs do not exhibit these repetitive patterns.

Here some some of the ways I tried to build the TensorRT-LLM engine:

I tried all of float16, bfloat16 and also fp8 quantization
I tried context_fmha enable/disable and and also context_fmha_fp32_acc enable/disable
I tried use_custom_all_reduce enable/disable
I tried gemm_plugin auto/disable
I tried various values for presencePenalty and frequencyPenalty (unset, 0.05, 0.1, 0.3), bust most tests were with 0.1 for both

One concrete example:

python convert_checkpoint.py \
--model_dir /workspace/llama3-70b \
--output_dir /workspace/llama3-70b-bf16-tp4 \
--dtype bfloat16 \
--tp_size 4

trtllm-build \
--checkpoint_dir /workspace/llama3-70b-bf16-tp4 \
--output_dir /workspace/llama3-70b-bf16-tp4-engine \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--use_custom_all_reduce disable \
--max_num_tokens 16384 \
--max_batch_size 24 \
--max_input_len 8192 \
--max_output_len 4096

I also tried running sequentially without batching, and even building the engine with max_batch_size 1 to eliminate the possibility of batching related bugs (I saw there were a few before). I also once tried building with max_input_len 7424 and max_output_len 768 to eliminate the possibility of somehow messing up the RoPE (not sure if max_input_len and max_output_len actually affect that or not).

Expected behavior

The outputs should not loop that frequently, there's likely some inference inaccuracy / mismatch.

actual behavior

The input would usually be some part of a story + instruction to continue the story. This is an example of an output.

 She looks up when she hears me set down her drink.

“Martini,” I say with a smile.

She smiles back at me with her eyes this time.

“Thank you,” she says.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

I don’t know what it is about her voice that makes me feel like she’s saying something else entirely.

The repetition is usually at a sentence level like this, but sometimes also several sentences repeat.

additional notes

I am wondering if anyone else experienced similar issues, and whether someone did a recent analysis comparing Tensort-LLM to other inference stacks. I saw that most tests are restricted to short inputs and outputs like MMLU, which might not exhibit these issues.

The text was updated successfully, but these errors were encountered:

nv-guomingz · 2024-06-17T02:41:52Z

what's your input of above output tokens?

MagicRUBICK · 2024-06-24T07:55:11Z

add arg for converting checkpoint step: '--rotary_base 500000.0' for llama3.
mmlu score for llama3-70b: 0.788

DreamGenX · 2024-06-25T06:07:45Z

@MagicRUBICK I believe that should be inferred from the HF config when converting checkpoint. Was that not your experience? Could you share your config.json? Here's mine when I did not use --rotary_base: #1780 (comment) you can see it has pretrained_config > rotary_base: 500000.0 anyway.

ZihanLiao · 2024-07-02T02:16:32Z

Facing the same issue.

netanel-haber · 2024-07-09T10:37:42Z

@ZihanLiao, can you elaborate please?

netanel-haber · 2024-07-11T14:45:52Z

Hello @DreamGenX - I've been working on replicating your issue - namely, by first running generation on HF, as you mentioned in the title.

Note: I assumed you were using llama3-70b - not the llama3-70b-instruct. Please correct me if I'm wrong.

I wrote a small script that replicates your environment, based on what I inferred from your report, see below.
Currently, I see that running almost the default generation config, with a very large max_new_tokens=4096 and a relatively low temperature=0.1 commonly produces looping text similar to your output above [Passing a high repetition_penalty does rectify this] - so I'm currently under the assumption that this behavior is due to the model, not trtllm per se.

Maybe you can elaborate as to the exact replication needed to where HF and trtllm differ significantly.

Replication Script

nvidia-docker run -v <Your Path>:/workspace/model:rw -it nvcr.io/nvidia/pytorch:24.06-py3 bash
cd ~
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/
git checkout db4edea1e1359bcfcac7bbb87c1b639b5611c721 # The commit SHA from the later release you linked to above: https://github.com/NVIDIA/TensorRT-LLM/pull/1763 
cd ..
python -m pip install virtualenv
python -m virtualenv .venv
source .venv/bin/activate
python -m pip install -r TensorRT-LLM/examples/llama/requirements.txt
python script.py

script.py:

import transformers
import torch
from pathlib import Path

REPETITION_PENALTY = 1.0 # This is the default repetition_penalty in transformers, and it means no penalty - see image below
model_path = Path("/workspace/llama3-70b")

pipeline = transformers.pipeline("text-generation", model=Path("/app/llama/hf"), model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto")

result = pipeline("Hey how are you doing today?", temperature=0.1, max_new_tokens=4096, repetition_penalty=REPETITION_PENALTY)
print(result)

Default repetition_penalty=1

DreamGenX · 2024-07-11T15:01:39Z

Hi @netanel-haber thank you for looking into this. I am using a custom fine tune of the Llama 3 70B instruct model. The favorable results were with vLLM, using:

temperature 1.0
frequency penalty 0.1
presence penalty 0.1

To make the comparison fair I did not use samplers like min P which TensorRT-LLM does not support.

netanel-haber · 2024-07-11T15:05:50Z

I see - is there public access to said finetune?

DreamGenX · 2024-07-11T15:39:14Z

@netanel-haber I can share access if you provide your HF username.

netanel-haber · 2024-07-11T15:51:54Z

Sure:

[email protected]

Nvidia-NetanelHaber

Thank you.

DreamGenX · 2024-07-11T17:01:30Z

Awesome, shared together with some example inputs (not all will trigger repetition, with with TRT roughly 15-30% should).

netanel-haber · 2024-07-11T17:54:56Z

Received, thanks!

netanel-haber · 2024-07-18T15:24:03Z

Hey - sorry for the delay.

I hope I'm not missing something trivial/critical here - I suspect the discrepancy may be due to greedy sampling when running TRTLLM.

TRTLLM

When using run.py, the default is top_k=1, i.e. greedy sampling:

parser = add_common_args(parser) -> parser.add_argument('--top_k', type=int, default=1).

vLLM

vLLM, on the other hand, defaults to top_k=-1, which considers all tokens.

I ran vLLM with your fine-tune, on the sample inputs you provided, with temperature=0.0 for greedy sampling, and the penalties you provided.

The result showed numerous occurrences of looping similar to what you provided above. (I can provide outputs privately or publicly, if you prefer).

One hesitation is because you mentioned:

...as well as through running the GptManager in all different modes: V1, InflightBatching, InflightFusedBatching)...

Could you provide more context as to how you ran the GptManager (i.e., provide an actual snippet, if you don't mind)? Since there are many possible entry points for generation, I had trouble establishing for a fact that all of your TensorRT-LLM generations used top_k=1.

Let me know if this makes sense to you!

DreamGenX · 2024-07-18T15:51:26Z

@netanel-haber No worries.

I did several runs, one was top_k=50; top_p=0.9 and it also had that issue. But it's possible that something has changed in the meantime. Could you please share which commit / version of TRT-LLM you used and how you built your engine?

netanel-haber · 2024-07-21T15:36:15Z

Hey - I wanted to double-check before getting back to you.
Altogether, I ran 5 generations on the entire dataset of 100 sample inputs you provided, all with **max_output_len=4096**, frequency_penalty=0.1, presence_penalty=0.1:

vLLM

top_k=50, top_p=0.9:
1. temperature=1.0, tp_size=4
2. temperature=1.0, tp_size=2
3. (For good measure: temperature=0.1, tp_size=2)
The greedy vllm run [temperature=0.0] mentioned above

TRTLLM

Identical to vLLM.1.1: top_k=50, top_p=0.9, temperature=1.0, tp_size=4

I'll upload all 5 to your private HF repo [Every output is delimited by "&"*36]

Conclusion

Discounting the greedy generation, and just looking at the runs with similar params (temp=1.0, top_k=50, top_p=0.9), I currently find it difficult to definitively determine which runtime/artifact generates text that tends to loop "more" or "worse". Both have a not-insignificant amount of looping, as outputs tended to grow longer.

As the favorable framework isn't blatantly obvious to me in this case (also given the non-deterministic generation for an arbitrary trio of sensible top_k/top_p/temperature), I think the only way forward if you feel dissatisfied with my analysis would be if you provided a more rigorous, quantitative comparison result - via standard benchmark tools such as mmlu scores, etc.

How I built the TRTLLM engine, since you asked:

Checked out db4edea - the later of the two releases you mentioned building with.
Converted and built using your exact snippets above, in the original question.

Thanks for the patience.

DreamGenX · 2024-07-22T10:08:48Z

@netanel-haber Thank you for sharing your results. I will try to redo the experiments on my side -- since you can't reproduce the discrepancy, it could be that I missed some other variable between the setups.

Thanks again for your time.

Naveassaf · 2024-08-11T15:57:07Z

Closing due to inactivity. @DreamGenX , feel free to reopen/create a separate issue if the problem persist with the changes @netanel-haber suggested.

DreamGenX added the bug Something isn't working label Jun 16, 2024

nv-guomingz added the waiting for feedback label Jun 17, 2024

DreamGenX mentioned this issue Jun 30, 2024

vLLM results are better than trt with the same request #1870

Closed

4 tasks

netanel-haber self-assigned this Jul 4, 2024

Naveassaf closed this as completed Aug 11, 2024

Oseltamivir mentioned this issue Oct 30, 2024

Tokens per sample upper limit for GPTJ mlcommons/inference#1728

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repeated outputs for long input tasks on Llama 3 70B compared to vLLM and HF's transformers #1788

Repeated outputs for long input tasks on Llama 3 70B compared to vLLM and HF's transformers #1788

DreamGenX commented Jun 16, 2024 •

edited

Loading

nv-guomingz commented Jun 17, 2024

MagicRUBICK commented Jun 24, 2024

DreamGenX commented Jun 25, 2024

ZihanLiao commented Jul 2, 2024

netanel-haber commented Jul 9, 2024

netanel-haber commented Jul 11, 2024 •

edited

Loading

DreamGenX commented Jul 11, 2024

netanel-haber commented Jul 11, 2024

DreamGenX commented Jul 11, 2024 •

edited

Loading

netanel-haber commented Jul 11, 2024

DreamGenX commented Jul 11, 2024

netanel-haber commented Jul 11, 2024

netanel-haber commented Jul 18, 2024

DreamGenX commented Jul 18, 2024

netanel-haber commented Jul 21, 2024 •

edited

Loading

DreamGenX commented Jul 22, 2024

Naveassaf commented Aug 11, 2024

Repeated outputs for long input tasks on Llama 3 70B compared to vLLM and HF's transformers #1788

Repeated outputs for long input tasks on Llama 3 70B compared to vLLM and HF's transformers #1788

Comments

DreamGenX commented Jun 16, 2024 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

actual behavior

additional notes

nv-guomingz commented Jun 17, 2024

MagicRUBICK commented Jun 24, 2024

DreamGenX commented Jun 25, 2024

ZihanLiao commented Jul 2, 2024

netanel-haber commented Jul 9, 2024

netanel-haber commented Jul 11, 2024 • edited Loading

Replication Script

DreamGenX commented Jul 11, 2024

netanel-haber commented Jul 11, 2024

DreamGenX commented Jul 11, 2024 • edited Loading

netanel-haber commented Jul 11, 2024

DreamGenX commented Jul 11, 2024

netanel-haber commented Jul 11, 2024

netanel-haber commented Jul 18, 2024

TRTLLM

vLLM

DreamGenX commented Jul 18, 2024

netanel-haber commented Jul 21, 2024 • edited Loading

vLLM

TRTLLM

Conclusion

DreamGenX commented Jul 22, 2024

Naveassaf commented Aug 11, 2024

DreamGenX commented Jun 16, 2024 •

edited

Loading

netanel-haber commented Jul 11, 2024 •

edited

Loading

DreamGenX commented Jul 11, 2024 •

edited

Loading

netanel-haber commented Jul 21, 2024 •

edited

Loading