Repeated outputs for long input tasks on Llama 3 70B compared to vLLM and HF's transformers #1788
Comments
What was your input for the output tokens above? |
Add the arg '--rotary_base 500000.0' to the checkpoint conversion step for Llama 3. |
@MagicRUBICK I believe that should be inferred from the HF config when converting the checkpoint. Was that not your experience? Could you share your config.json? Here's mine when I did not use --rotary_base: |
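For anyone comparing configs: the field that carries the rotary base in Llama HF configs is rope_theta (500000.0 for Llama 3), so a quick check along these lines shows what the converter would infer; the path below is a placeholder.
import json
from pathlib import Path

# Placeholder path; point this at the HF checkpoint being converted.
config_path = Path("/workspace/llama3-70b/config.json")
with config_path.open() as f:
    config = json.load(f)
# Llama 3 checkpoints ship with rope_theta = 500000.0; per the comment above,
# this is what checkpoint conversion should pick up when --rotary_base is not passed.
print(config.get("rope_theta"))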
Facing the same issue. |
@ZihanLiao, can you elaborate please? |
Hello @DreamGenX - I've been working on replicating your issue - namely, by first running generation on HF, as you mentioned in the title.
I wrote a small script that replicates your environment, based on what I inferred from your report; see below. Maybe you can elaborate as to the exact replication needed.
Replication Script:
nvidia-docker run -v <Your Path>:/workspace/model:rw -it nvcr.io/nvidia/pytorch:24.06-py3 bash
cd ~
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/
git checkout db4edea1e1359bcfcac7bbb87c1b639b5611c721 # The commit SHA from the later release you linked to above: https://github.com/NVIDIA/TensorRT-LLM/pull/1763
cd ..
python -m pip install virtualenv
python -m virtualenv .venv
source .venv/bin/activate
python -m pip install -r TensorRT-LLM/examples/llama/requirements.txt
python script.py
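# Contents of script.py (create this file before running the command above):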
import transformers
import torch
from pathlib import Path
REPETITION_PENALTY = 1.0 # This is the default repetition_penalty in transformers, and it means no penalty
model_path = Path("/workspace/llama3-70b")
pipeline = transformers.pipeline("text-generation", model=model_path, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto")
result = pipeline("Hey how are you doing today?", temperature=0.1, max_new_tokens=4096, repetition_penalty=REPETITION_PENALTY)
print(result) |
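One subtlety in the script above: in transformers, temperature only takes effect when sampling is enabled (do_sample=True, either explicitly or via the model's generation_config); otherwise decoding is greedy. A minimal variant that makes the decoding mode explicit (prompt and values are illustrative, not taken from the report):
# Sampled decoding: temperature applies because do_sample is set explicitly.
result_sampled = pipeline("Hey how are you doing today?", do_sample=True, temperature=0.1, max_new_tokens=4096, repetition_penalty=REPETITION_PENALTY)
# Greedy decoding: deterministic argmax at every step, no temperature.
result_greedy = pipeline("Hey how are you doing today?", do_sample=False, max_new_tokens=4096)
print(result_sampled, result_greedy)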
Hi @netanel-haber, thank you for looking into this. I am using a custom fine-tune of the Llama 3 70B Instruct model. The favorable results were with vLLM, using:
To make the comparison fair, I did not use samplers like min-p, which TensorRT-LLM does not support. |
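For context, an explicit vLLM setup along these lines (without min-p) looks roughly like the sketch below; the values, model path, and TP size are placeholders, not the settings from the report.
from vllm import LLM, SamplingParams

# Placeholder values; the actual settings from the comparison are not reproduced here.
sampling = SamplingParams(temperature=1.0, top_p=1.0, presence_penalty=0.1, frequency_penalty=0.1, max_tokens=768)
llm = LLM(model="/workspace/llama3-70b", tensor_parallel_size=4)  # placeholder path and TP size
outputs = llm.generate(["<long story prompt + instruction to continue the story>"], sampling)
print(outputs[0].outputs[0].text)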
I see - is there public access to said finetune? |
@netanel-haber I can share access if you provide your HF username. |
Sure: Nvidia-NetanelHaber. Thank you. |
Awesome, shared together with some example inputs (not all will trigger repetition; with TRT roughly 15-30% should). |
Received, thanks! |
Hey - sorry for the delay. I hope I'm not missing something trivial/critical here - I suspect the discrepancy may be due to greedy sampling when running TRTLLM.
TRTLLM: when using the default sampling settings, generation is greedy.
vLLM: vLLM, on the other hand, defaults to sampling. I ran vLLM with your fine-tune on the sample inputs you provided. The result showed numerous occurrences of looping similar to what you provided above. (I can provide outputs privately or publicly, if you prefer.) One hesitation is because of something you mentioned earlier.
Could you provide more context as to how you ran the GptManager (i.e., provide an actual snippet, if you don't mind)? Since there are many possible entry points for generation, I had trouble establishing for a fact that all of your TensorRT-LLM generations used non-greedy sampling. Let me know if this makes sense to you! |
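For reference, one common entry point from that era is the Python ModelRunner used by examples/run.py, where sampling parameters are passed explicitly to generate(). The sketch below is an assumed setup with placeholder paths and values, not the snippet requested above; a tensor-parallel engine would additionally be launched under mpirun.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("/workspace/llama3-70b")  # placeholder path
runner = ModelRunner.from_dir(engine_dir="/workspace/trt_engines/llama3-70b")  # placeholder path

input_ids = [torch.tensor(tokenizer.encode("<long story prompt>"), dtype=torch.int32)]
# Passing temperature/top_k/top_p explicitly avoids silently falling back to greedy defaults.
output_ids = runner.generate(input_ids, max_new_tokens=768, end_id=tokenizer.eos_token_id, pad_id=tokenizer.eos_token_id, temperature=1.0, top_k=0, top_p=1.0, presence_penalty=0.1, frequency_penalty=0.1)
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))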
@netanel-haber No worries. I did several runs; one of them was greedy. |
Hey - I wanted to double-check before getting back to you. I did runs with both vLLM and TRTLLM, and I'll upload all 5 to your private HF repo (every output is delimited by a separator).
Conclusion: discounting the greedy generation, and just looking at the runs with similar params, the favorable framework isn't blatantly obvious to me in this case (also given the non-deterministic generation for an arbitrary trio of sensible top_k/top_p/temperature). I think the only way forward, if you feel dissatisfied with my analysis, would be for you to provide a more rigorous, quantitative comparison result - via standard benchmark tools such as MMLU scores, etc.
How I built the TRTLLM engine, since you asked:
Thanks for the patience. |
@netanel-haber Thank you for sharing your results. I will try to redo the experiments on my side -- since you can't reproduce the discrepancy, it could be that I missed some other variable between the setups. Thanks again for your time. |
Closing due to inactivity. @DreamGenX, feel free to reopen/create a separate issue if the problem persists with the changes @netanel-haber suggested.
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I built the TensorRT-LLM engine in several different ways, outlined below, and compared the output quality on a domain-specific task that involves long inputs (typically >>2000 input tokens and >500 output tokens).
The outputs from TensorRT-LLM (obtained through running the run.py script, as well as through running the GptManager in all the different modes: V1, InflightBatching, InflightFusedBatching) exhibit repetition in the outputs ~20% of the time (sample outputs below). When running the same with vLLM, using the same sampling params (namely temperature, presencePenalty and frequencyPenalty), the outputs do not exhibit these repetitive patterns.
Here are some of the ways I tried to build the TensorRT-LLM engine:
- context_fmha enable/disable, and also context_fmha_fp32_acc enable/disable
- use_custom_all_reduce enable/disable
- gemm_plugin auto/disable
- presencePenalty and frequencyPenalty (unset, 0.05, 0.1, 0.3), but most tests were with 0.1 for both
One concrete example:
I also tried running sequentially without batching, and even building the engine with max_batch_size 1 to eliminate the possibility of batching-related bugs (I saw there were a few before). I also once tried building with max_input_len 7424 and max_output_len 768 to eliminate the possibility of somehow messing up the RoPE (not sure if max_input_len and max_output_len actually affect that or not).
Expected behavior
The outputs should not loop that frequently; there is likely some inference inaccuracy or mismatch.
actual behavior
The input would usually be some part of a story + instruction to continue the story. This is an example of an output.
The repetition is usually at a sentence level like this, but sometimes also several sentences repeat.
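As a rough illustration of how this kind of looping could be quantified across the two stacks (a hypothetical helper, not something from the original report), a simple sentence-level repetition rate can be computed like this:
import re
from collections import Counter

def repetition_rate(text: str) -> float:
    """Fraction of sentences that are exact repeats of an earlier sentence."""
    sentences = [s.strip() for s in re.split(r"[.!?]\s+", text) if s.strip()]
    if not sentences:
        return 0.0
    counts = Counter(sentences)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return repeated / len(sentences)

# Example: flag outputs where a large share of sentences are repeats.
sample = "She opened the door. She opened the door. She opened the door. He waited."
print(repetition_rate(sample))  # 0.5 -> two of the four sentences are repeats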
additional notes
I am wondering if anyone else has experienced similar issues, and whether someone has done a recent analysis comparing TensorRT-LLM to other inference stacks. I saw that most tests are restricted to short inputs and outputs, like MMLU, which might not exhibit these issues.