Top-P sampling occasionally produces invalid tokens #1590
Comments
Thank you. I can reproduce the issue. I made a small change to the basic example to speed up the reproduction:

import argparse
import torch
import random

import tensorrt_llm.bindings.executor as trtllm

# This example shows how to use the python bindings to create an executor,
# enqueue a request, and get the generated tokens.

# First, follow the steps in README.md to generate the engines.

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Executor Bindings Example")
    parser.add_argument("--model_path",
                        type=str,
                        required=True,
                        help="Directory containing model engine")
    args = parser.parse_args()

    # Create the executor.
    executor = trtllm.Executor(args.model_path, trtllm.ModelType.DECODER_ONLY,
                               trtllm.ExecutorConfig(1))

    random.seed(1234)

    if executor.can_enqueue_requests():
        ite_count = 0
        while True:
            # Create a batch of 16 requests with random prompts and
            # high-temperature top-p sampling.
            requests = []
            ite_count += 16
            for _ in range(16):
                input_token_ids = [random.randint(100, 10000) for _ in range(200)]
                requests.append(trtllm.Request(
                    input_token_ids=input_token_ids,
                    max_new_tokens=105,
                    sampling_config=trtllm.SamplingConfig(top_p=0.5,
                                                          top_k=None,
                                                          temperature=20.0)))
            # Skip enqueueing the early batches: with seed 1234 this fast-forwards
            # the RNG to a batch that triggers the failure.
            if ite_count < 6616:
                continue

            # Enqueue the requests.
            request_ids = executor.enqueue_requests(requests)

            # Wait for the new tokens.
            responses = executor.await_responses(request_ids)
            for idx, re in enumerate(responses):
                output_tokens = re[0].result.output_token_ids[0]
                valid_output = all(el >= 0 and el < 200000 for el in output_tokens)
                if not valid_output:
                    print(f"Invalid output produced for request {request_ids[idx]}.")
                    # The first 200 entries are the prompt; print only the generated tokens.
                    print(f"Output tokens : {output_tokens[200:]}")
                    exit(-1)
                else:
                    print(f"Valid output produced for request {request_ids[idx]}.")

We are still investigating the reason.
Hi Alessio,
Hi @AlessioNetti, do you still have any further issues or questions? If not, we'll close this issue soon.
Hi - the bug has been fixed a few versions back, so we can close this.
System Info

Who can help?

@byshiue

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
We noticed that TensorRT-LLM occasionally (~0.01% of requests) generates invalid tokens. The issue can be reproduced using a generic Falcon 7B model via the following:
The examples/bindings/executor/example_basic.py script was modified to issue random top-P requests (in batches of 16) until an invalid token is detected in the output. The changes are as follows:

Expected behavior
Requests should always generate valid tokens that are in the [0, vocabulary_size) range.

Actual behavior
Occasionally, requests will produce invalid tokens that are outside of the model's vocabulary size. Below is an example of the issue under our custom example_basic.py script:

As can be seen, one of the tokens is 2147483647. In other instances we have also observed negative tokens, but always in the billions range - this would suggest an integer overflow somewhere in the top-P sampling logic.
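For reference: 2147483647 is exactly 2**31 - 1, the maximum value of a signed 32-bit integer, which supports the integer overflow suspicion above. A quick check:

import torch

# The observed "token" is exactly INT32_MAX, so it looks like the result of an
# int32 overflow or an uninitialized value rather than a real vocabulary id.
observed_token = 2147483647
assert observed_token == torch.iinfo(torch.int32).max == 2**31 - 1
print("Observed token equals INT32_MAX")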
Additional notes

The issue appears starting from version 0.10.0.dev2024041600, and it is present up until 0.10.0.dev2024050700.
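In case it helps with bisecting, a hypothetical way to pin one of the affected dev builds (assuming the dev wheels are available from NVIDIA's PyPI index, which is the documented install path for TensorRT-LLM; adjust the version as needed):

pip3 install tensorrt_llm==0.10.0.dev2024050700 --pre --extra-index-url https://pypi.nvidia.com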