
Generated results are different between generating with padding and single batch, with QWEN #29936

Closed
GennVa opened this issue Mar 28, 2024 · 8 comments

@GennVa

GennVa commented Mar 28, 2024

Hi, this issue is probably a duplicate, but I'm using a Qwen model.
When I run single-sequence inference, I have no problem: the outputs end correctly with token ID 151645 (the EOS token).
When I use batched inference, the outputs beyond the first in the batch appear to be "truncated": their tokens match the single-sequence outputs up to a point, but the final part is cut off.
Here is my script. max_new_tokens is high enough for my outputs (I also tried increasing it further).
The stopping_criteria checks for token 151645.
I'm using Transformers 4.38.2.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=True)
tokenizer.padding_side = 'left'
tokenizer.truncation_side = 'left'

# Qwen model
model = AutoModelForCausalLM.from_pretrained(path)
model = model.to(device)
model.eval()

tokenized_inputs = tokenizer(prompts, padding=True, return_tensors="pt")

inputs = {
    "input_ids": tokenized_inputs["input_ids"].to(device, dtype=torch.long),
    "attention_mask": tokenized_inputs["attention_mask"].to(device, dtype=torch.long)
}

outputs = model.generate(**inputs, max_new_tokens=2048, stopping_criteria=stopping_criteria, temperature=0.0)

outputs = tokenizer.batch_decode(outputs, spaces_between_special_tokens=False, skip_special_tokens=True)
[...]
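For context, the stopping_criteria passed above is not shown in the snippet; the following is purely an assumed sketch of a typical single-sequence implementation (class and variable names are illustrative). In Transformers 4.38.2, a single True return from such a criterion stops generation for the whole batch, which would produce exactly the kind of truncation shown below.

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class EosStoppingCriteria(StoppingCriteria):
    """Hypothetical criterion: fires as soon as ANY row emits the stop token."""

    def __init__(self, stop_token_id: int):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # True as soon as any sequence's last token is the stop token;
        # in 4.38.2 this halts the entire batch, truncating unfinished rows.
        return bool((input_ids[:, -1] == self.stop_token_id).any())

stopping_criteria = StoppingCriteriaList([EosStoppingCriteria(stop_token_id=151645)])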

That's the second output of a batch of 2. 151643 is the pad token.

[151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, ..., 151643, 151643, 151643, 151643, 151643, 151643,
 151643, 151643, 151643, 151643,  33363,  25,  4089,    76,    327, ........,   330,   307,    788,    330,     15,
 497,    330,   2870,    788,  61753]

With single-sequence inference (where there are no pad tokens), there are additional tokens that complete the output:
[..., 497, 330, 2870, 788, 61753, 5212, 13473, ..., 13989, 151645]

How could I solve it?
Thanks

@GennVa
Author

GennVa commented Mar 28, 2024

The problem was in the stopping criteria, as shown in another issue. I changed my stopping criteria to return True only once all sequences in input_ids contain the stop_token_id.
It seems to work; I don't know whether there is another solution with better performance.
I won't close this yet, to leave room for further comments on the subject.
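A minimal sketch of that workaround, assuming Transformers 4.38.2, where a stopping criterion returns a single bool for the whole batch; the class name is illustrative and 151645 is the EOS token id mentioned above.

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class BatchEosStoppingCriteria(StoppingCriteria):
    """Sketch of the workaround: stop only when EVERY row has produced the stop token."""

    def __init__(self, stop_token_id: int):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # A row is "finished" once the stop token appears anywhere in it;
        # only return True when all rows are finished, so no row is cut short.
        finished = (input_ids == self.stop_token_id).any(dim=-1)
        return bool(finished.all())

stopping_criteria = StoppingCriteriaList([BatchEosStoppingCriteria(stop_token_id=151645)])

Note that with this approach, finished rows keep generating until the last row ends; the extra tokens after their EOS can be trimmed when decoding.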

@ArthurZucker
Collaborator

cc @gante and @zucchini-nlp for generation!

@zucchini-nlp
Member

@GennVa hey! If I understand correctly, you are trying to implement a stopping criterion for the EOS token. I am not sure how you implemented the custom stopping criteria in your code snippet, but if 151645 is the model's EOS token from the config, you do not have to pass a stopping criteria for that. generate handles it internally; the following should be enough 🤗
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.0)

The second option is for when your EOS is not in the config and you need custom stopping criteria for each element in the batch. In that case I would recommend installing transformers from main with !pip install --upgrade git+https://github.com/huggingface/transformers.git. StoppingCriteria in version 4.38.2 stops generation for the whole batch as soon as at least one element in the batch satisfies the return-True condition.

In the version from main we have changed StoppingCriteria to return a boolean True/False for each element in the batch, so generation continues for the unfinished rows. We also added an EosTokenCriteria where you can pass in the EOS token id at which generation should stop (151645). It can be used like:
stopping_criteria = StoppingCriteriaList([EosTokenCriteria(eos_token_id=151645)])
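In full, with imports, this might look like the following sketch; it assumes a transformers build from main that ships EosTokenCriteria (the import path below is the module where the class is defined and may change between versions):

from transformers import StoppingCriteriaList
from transformers.generation.stopping_criteria import EosTokenCriteria

# Stop each row independently once it produces token 151645.
stopping_criteria = StoppingCriteriaList([EosTokenCriteria(eos_token_id=151645)])
outputs = model.generate(**inputs, max_new_tokens=2048, stopping_criteria=stopping_criteria)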

Let me know if this helps :)

@GennVa
Author

GennVa commented Apr 3, 2024

@zucchini-nlp thanks for your answer.
It seems that with the stopping criteria, generation for the whole batch stops as soon as the function returns True (so the EOS token appears in only one output, and the other outputs are left without an EOS token, incomplete).
My only change to the stopping_criteria was to check for the presence of the EOS token in all outputs before returning True.

Is there a better method? (Can the EosTokenCriteria be used for this?)
Thanks.

@zucchini-nlp
Member

Hey @GennVa!

I've checked the Qwen model config, and it seems that the workaround you were considering might not be needed. The token 151645 is already included as an end-of-sequence marker in the config.

That means that when you use the generate function, it will handle everything automatically; you won't need any extra stopping criteria. This code should work fine, stopping generation at the end of each sequence without any issues:

outputs = model.generate(**inputs, max_new_tokens=2048)

As for the other option I mentioned earlier, it's better suited to cases where you need custom stopping conditions for each sequence. For this case, simply using generate should be fine. Let me know if you need any more help! 😊

@GennVa
Author

GennVa commented Apr 3, 2024

Hey @zucchini-nlp, thanks for everything.
I tried this morning without the stopping criteria and it's working fine. In my case, my model doesn't have the EOS token 151645 in its config.
So I did this:
model.generation_config.eos_token_id = eos_token_id

And in generation:

generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=pad_token_id)

It's not working when I also add model.generation_config.pad_token_id = pad_token_id (it returns an error); I need to set pad_token_id in generate() to get a good result.
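Putting the above together, a minimal end-to-end sketch might look like this. It assumes a Qwen checkpoint whose config does not list 151645 as EOS; the path and prompts are placeholders, and the 151645 / 151643 token ids are the ones discussed in this thread.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/qwen-checkpoint"             # placeholder
prompts = ["first prompt", "second prompt"]  # placeholder batch
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=True)
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(path).to(device).eval()

eos_token_id = 151645  # Qwen end-of-sequence token discussed above
pad_token_id = 151643  # Qwen pad token discussed above

# Register the EOS token on the generation config so generate() stops each row on its own,
model.generation_config.eos_token_id = eos_token_id

inputs = tokenizer(prompts, padding=True, return_tensors="pt").to(device)

# ...and pass pad_token_id directly to generate(), as described in the comment above.
outputs = model.generate(**inputs, max_new_tokens=2048, pad_token_id=pad_token_id)
texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)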

@zucchini-nlp
Member

@GennVa Can you show what error you are getting for "pad token" and share a runnable minimal script?


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed May 6, 2024