
Generated results are different between generating with padding and single batch, with QWEN #29936

Closed
GennVa opened this issue Mar 28, 2024 · 8 comments

@GennVa

GennVa commented Mar 28, 2024

Hi, this issue is probably a duplicate, but I'm using a Qwen model.
When I run single-sequence inference, I have no problem: the outputs end correctly with token ID 151645 (the EOS token).
When I use batched inference, the outputs beyond the first in the batch appear to be "truncated": their tokens match the single-sequence outputs up to a point, but the final part is cut off.
Here is my script. max_new_tokens is high enough for my outputs (I also tried increasing it further).
The stopping_criteria checks for token 151645.
I'm using Transformers 4.38.2.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=True)
tokenizer.padding_side = 'left'
tokenizer.truncation_side = 'left'

# Qwen model
model = AutoModelForCausalLM.from_pretrained(path)
model = model.to(device)
model.eval()

tokenized_inputs = tokenizer(prompts, padding=True, return_tensors="pt")

inputs = {
    "input_ids": tokenized_inputs["input_ids"].to(device, dtype=torch.long),
    "attention_mask": tokenized_inputs["attention_mask"].to(device, dtype=torch.long)
}

outputs = model.generate(**inputs, max_new_tokens=2048, stopping_criteria=stopping_criteria, temperature=0.0)

outputs = tokenizer.batch_decode(outputs, spaces_between_special_tokens=False, skip_special_tokens=True)
[...]
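For context, the stopping_criteria passed above is not shown in the snippet; the following is purely an assumed sketch of a typical single-sequence implementation (class and variable names are illustrative). In Transformers 4.38.2, a single True return from such a criterion stops generation for the whole batch, which would produce exactly the kind of truncation shown below.

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class EosStoppingCriteria(StoppingCriteria):
    """Hypothetical criterion: fires as soon as ANY row emits the stop token."""

    def __init__(self, stop_token_id: int):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # True as soon as any sequence's last token is the stop token;
        # in 4.38.2 this halts the entire batch, truncating unfinished rows.
        return bool((input_ids[:, -1] == self.stop_token_id).any())

stopping_criteria = StoppingCriteriaList([EosStoppingCriteria(stop_token_id=151645)])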

That's the second output of a batch of 2. 151643 is the pad token.

[151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, ..., 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, 151643,  33363,  25,  4089,    76,    327, ........,   330,   307,    788,    330,     15,
 497,    330,   2870,    788,  61753]

With single-sequence inference (where there are no pad tokens), there are additional tokens that complete the output:
[..., 497, 330, 2870, 788, 61753, 5212, 13473, ..., 13989, 151645]

How could I solve it?
Thanks

@GennVa
Author

GennVa commented Mar 28, 2024

The problem was in the stopping criteria, as shown in another issue. I changed my stopping criteria to return True only once all sequences in input_ids contain the stop_token_id.
It seems to work; I don't know whether there is another solution with better performance.
I won't close this yet, to leave room for further comments on the subject.
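A minimal sketch of that workaround, assuming Transformers 4.38.2, where a stopping criterion returns a single bool for the whole batch; the class name is illustrative and 151645 is the EOS token id mentioned above.

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class BatchEosStoppingCriteria(StoppingCriteria):
    """Sketch of the workaround: stop only when EVERY row has produced the stop token."""

    def __init__(self, stop_token_id: int):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # A row is "finished" once the stop token appears anywhere in it;
        # only return True when all rows are finished, so no row is cut short.
        finished = (input_ids == self.stop_token_id).any(dim=-1)
        return bool(finished.all())

stopping_criteria = StoppingCriteriaList([BatchEosStoppingCriteria(stop_token_id=151645)])

Note that with this approach, finished rows keep generating until the last row ends; the extra tokens after their EOS can be trimmed when decoding.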

@ArthurZucker
Collaborator

cc @gante and @zucchini-nlp for generation!

@zucchini-nlp
Member

@GennVa hey! If I understand correctly, you are trying to implement a stopping criterion for the EOS token. I am not sure how you implemented the custom stopping criteria in your code snippet, but if 151645 is the model's EOS token from the config, you do not have to pass a stopping criteria for that. generate handles it internally; the following should be enough 🤗
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.0)

The second option is for when your EOS is not in the config and you need custom stopping criteria for each element in the batch. In that case I would recommend installing transformers from main with !pip install --upgrade git+https://github.com/huggingface/transformers.git. StoppingCriteria in version 4.38.2 stops generation for the whole batch as soon as at least one element in the batch satisfies the return-True condition.

In the version from main we have changed StoppingCriteria to return a boolean True/False for each element in the batch, so generation continues for the unfinished rows. We also added an EosTokenCriteria where you can pass in the EOS token id at which generation should stop (151645). It can be used like:
stopping_criteria = StoppingCriteriaList([EosTokenCriteria(eos_token_id=151645)])
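In full, with imports, this might look like the following sketch; it assumes a transformers build from main that ships EosTokenCriteria (the import path below is the module where the class is defined and may change between versions):

from transformers import StoppingCriteriaList
from transformers.generation.stopping_criteria import EosTokenCriteria

# Stop each row independently once it produces token 151645.
stopping_criteria = StoppingCriteriaList([EosTokenCriteria(eos_token_id=151645)])
outputs = model.generate(**inputs, max_new_tokens=2048, stopping_criteria=stopping_criteria)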

Let me know if this helps :)

@GennVa
Author

GennVa commented Apr 3, 2024

@zucchini-nlp thanks for your answer.
It seems that with the stopping criteria, generation for the whole batch stops as soon as the function returns True (so the EOS token appears in only one output, and the other outputs are left without an EOS token, incomplete).
My only change to the stopping_criteria was to check for the presence of the EOS token in all outputs before returning True.

Is there a better method? (Can the EosTokenCriteria be used for this?)
Thanks.

@zucchini-nlp
Member

Hey @GennVa!

I've checked the Qwen model config, and it seems that the workaround you were considering might not be needed. The token 151645 is already included as an end-of-sequence marker in the config.

That means that when you use the generate function, it will handle everything automatically; you won't need any extra stopping criteria. This code should work fine, stopping generation at the end of each sequence without any issues:

outputs = model.generate(**inputs, max_new_tokens=2048)

As for the other option I mentioned earlier, it's better suited to cases where you need custom stopping conditions for each sequence. For this case, simply using generate should be fine. Let me know if you need any more help! 😊

@GennVa
Author

GennVa commented Apr 3, 2024

Hey @zucchini-nlp, thanks for everything.
I tried this morning without the stopping criteria and it's working fine. In my case, my model doesn't have the EOS token 151645 in its config.
So I did this:
model.generation_config.eos_token_id = eos_token_id

And in generation:

generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=pad_token_id)

It's not working when I also add model.generation_config.pad_token_id = pad_token_id (it returns an error); I need to set pad_token_id in generate() to get a good result.
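Putting the above together, a minimal end-to-end sketch might look like this. It assumes a Qwen checkpoint whose config does not list 151645 as EOS; the path and prompts are placeholders, and the 151645 / 151643 token ids are the ones discussed in this thread.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/qwen-checkpoint"             # placeholder
prompts = ["first prompt", "second prompt"]  # placeholder batch
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=True)
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(path).to(device).eval()

eos_token_id = 151645  # Qwen end-of-sequence token discussed above
pad_token_id = 151643  # Qwen pad token discussed above

# Register the EOS token on the generation config so generate() stops each row on its own,
model.generation_config.eos_token_id = eos_token_id

inputs = tokenizer(prompts, padding=True, return_tensors="pt").to(device)

# ...and pass pad_token_id directly to generate(), as described in the comment above.
outputs = model.generate(**inputs, max_new_tokens=2048, pad_token_id=pad_token_id)
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)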

@zucchini-nlp
Member

@GennVa Can you show what error you are getting for "pad token" and share a runnable minimal script?


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed May 6, 2024