
Prompt_ids vs. decoder_input_ids in Whisper #28228

Closed

vymao opened this issue Dec 24, 2023 · 6 comments

Comments

@vymao

vymao commented Dec 24, 2023

Feature request

I am trying to understand the difference between adding prior text via prompt_ids vs. decoder_input_ids when generating text with Whisper. The documentation is not very clear on how these differ implementation-wise; AFAIK, using prompt_ids leads to forced_input_ids being modified here, but I'm not sure how exactly using decoder_input_ids differs from this.

Motivation

To add context to the Whisper transcription. For example, if the model previously transcribed "I have a" in a streaming fashion, I would like to feed this back into the model as "context" to help it predict the next words. I believe the original OpenAI Whisper implementation has a feature called "prefix" that does this.

Your contribution

Will try.

vymao changed the title from "Clarification on prompt_ids vs. decoder_input_ids in Whisper" to "Prompt_ids vs. decoder_input_ids in Whisper" on Dec 24, 2023
@bL34cHig0

bL34cHig0 commented Dec 25, 2023

Hey @vymao, prompt_ids basically refers to the token IDs provided to the model before it generates text; they serve as the initial context from which the model begins generating.

On the other hand, decoder_input_ids are mainly used in sequence-to-sequence models, i.e. models with a decoder part, such as Transformer architectures with an encoder-decoder structure. They are the inputs provided to the decoder of such a model and help guide the generation of subsequent tokens in the sequence.

When it comes to generating text with Whisper, both prompt_ids and decoder_input_ids can be used to provide context that guides the model's text generation. The prefix feature in Whisper mainly uses either prompt_ids or decoder_input_ids, or a combination of both, to provide context to the model.

The implementation difference is that prompt_ids usually provide the initial context, while decoder_input_ids guide the decoding or generation process, mostly in encoder-decoder architectures.
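
As a rough illustration of that last point, here is a minimal sketch of where decoder_input_ids enter a Whisper forward pass: the encoder consumes the audio features, while the decoder consumes decoder_input_ids and predicts the next tokens from them. The "openai/whisper-tiny" checkpoint and the all-zeros feature tensor are just placeholders for illustration.

```python
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Placeholder log-mel spectrogram batch: (batch_size, num_mel_bins, num_frames).
input_features = torch.zeros(1, 80, 3000)
# Start the decoder from its start-of-transcript token.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

outputs = model(input_features=input_features, decoder_input_ids=decoder_input_ids)
print(outputs.logits.shape)  # (batch_size, decoder_sequence_length, vocab_size)
```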

@vymao
Author

vymao commented Dec 26, 2023

Thanks. I'm still slightly confused: when you say prompt_ids are used to provide initial context, isn't that still on the decoder side before the actual generated text? How is this different from using decoder_input_ids?

@LysandreJik
Member

Maybe for @sanchit-gandhi or @ylacombe

@ylacombe
Contributor

ylacombe commented Jan 1, 2024

Hey @vymao, I'm not a Whisper expert yet, but as I understand it, and as the documentation suggests, prompt_ids are created using the tokenizer's or the processor's get_prompt_ids.

```python
def get_prompt_ids(self, text: str, return_tensors="np"):
    """Converts prompt text to IDs that can be passed to [`~WhisperForConditionalGeneration.generate`]."""
    batch_encoding = self("<|startofprev|>", " " + text.strip(), add_special_tokens=False)
    # Check for special tokens
    prompt_text_ids = batch_encoding["input_ids"][1:]
    special_token_id = next((x for x in prompt_text_ids if x >= self.all_special_ids[0]), None)
    if special_token_id is not None:
        token = self.convert_ids_to_tokens(special_token_id)
        raise ValueError(f"Encountered text in the prompt corresponding to disallowed special token: {token}.")
```

As you can see from the code, get_prompt_ids handles the input text so you don't have to worry about special tokens that need to be inserted to tell the model that this text is context and not the start of the transcription.
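
For a quick sanity check (a sketch; the "openai/whisper-tiny" checkpoint and the prompt text are arbitrary choices), you can decode the output of get_prompt_ids and see that it is just the <|startofprev|> token followed by the prompt's tokens:

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

prompt_ids = processor.get_prompt_ids("I have a")  # numpy array of token IDs by default
print(prompt_ids)
print(processor.tokenizer.decode(prompt_ids))      # expected: "<|startofprev|> I have a"
```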

Then the code processes the prompt_ids in place of the decoder_input_ids.

In other words, you can use prompt_ids obtained from get_prompt_ids if you want to pass context to Whisper. decoder_input_ids is much more flexible: you could use it to reproduce what get_prompt_ids gives you, or for more advanced uses of Whisper.

I hope that it helps!

cc @sanchit-gandhi or @ArthurZucker if you want to correct me or give a more advanced explanation

@patrickvonplaten
Contributor

Hey @vymao,

That's a very good question! In a nutshell, decoder_input_ids and prompt_ids are the same thing. They allow you to prompt Whisper with a specific prefix, just like it's explained here: https://platform.openai.com/docs/guides/speech-to-text/prompting

Please use prompt_ids for the moment and don't use decoder_input_ids. I'm working on improving the docs and usability of Whisper at the moment with this PR: #27658
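
For reference, a minimal usage sketch of the recommended prompt_ids path. The checkpoint name, the placeholder audio array, and the 16 kHz sampling rate are illustrative assumptions, not part of the thread:

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Placeholder: 1 s of silence; in practice this would be the next audio chunk of a stream.
speech = np.zeros(16000, dtype=np.float32)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

# Previously transcribed text fed back as context, like OpenAI's "prompt"/"prefix" option.
prompt_ids = processor.get_prompt_ids("I have a", return_tensors="pt")

generated_ids = model.generate(input_features=inputs.input_features, prompt_ids=prompt_ids)
# Depending on the transformers version, you may need to strip the prompt tokens
# from the decoded output yourself.
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```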


github-actions bot commented Feb 2, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
