
Prompt_ids vs. decoder_input_ids in Whisper #28228

Closed

vymao opened this issue Dec 24, 2023 · 6 comments

Comments

@vymao

vymao commented Dec 24, 2023

Feature request

I am trying to understand the difference between adding prior text via prompt_ids vs. decoder_input_ids when generating text with Whisper. The documentation is not very clear on how these differ implementation-wise; AFAIK, using prompt_ids leads to forced_input_ids being modified here, but I'm not sure how exactly using decoder_input_ids differs from this.

Motivation

To add context to the Whisper transcription. For example, if the model previously transcribed "I have a" in a streaming fashion, I would like to feed this back into the model as "context" to help it predict the next words. I believe the original OpenAI Whisper implementation has a feature called "prefix" that does this.

Your contribution

Will try.

vymao changed the title from "Clarification on prompt_ids vs. decoder_input_ids in Whisper" to "Prompt_ids vs. decoder_input_ids in Whisper" on Dec 24, 2023
@bL34cHig0

bL34cHig0 commented Dec 25, 2023

Hey @vymao, prompt_ids basically refers to the token IDs provided to the model before it generates text; they serve as the initial context from which the model begins generating.

On the other hand, decoder_input_ids are mainly used in sequence-to-sequence models, i.e. models with a decoder part, such as Transformer architectures with an encoder-decoder structure. They are the inputs provided to the decoder of such a model and help guide the generation of subsequent tokens in the sequence.

When it comes to generating text with Whisper, both prompt_ids and decoder_input_ids can be used to provide context that guides the model's text generation. The prefix feature in Whisper mainly uses either prompt_ids or decoder_input_ids, or a combination of both, to provide context to the model.

The implementation difference is that prompt_ids usually provide the initial context, while decoder_input_ids guide the decoding or generation process, mostly in encoder-decoder architectures.
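
As a rough illustration of that last point, here is a minimal sketch of where decoder_input_ids enter a Whisper forward pass: the encoder consumes the audio features, while the decoder consumes decoder_input_ids and predicts the next tokens from them. The "openai/whisper-tiny" checkpoint and the all-zeros feature tensor are just placeholders for illustration.

```python
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Placeholder log-mel spectrogram batch: (batch_size, num_mel_bins, num_frames).
input_features = torch.zeros(1, 80, 3000)
# Start the decoder from its start-of-transcript token.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

outputs = model(input_features=input_features, decoder_input_ids=decoder_input_ids)
print(outputs.logits.shape)  # (batch_size, decoder_sequence_length, vocab_size)
```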

@vymao
Author

vymao commented Dec 26, 2023

Thanks. I'm still slightly confused: when you say prompt_ids are used to provide initial context, isn't that still on the decoder side before the actual generated text? How is this different from using decoder_input_ids?

@LysandreJik
Member

Maybe for @sanchit-gandhi or @ylacombe

@ylacombe
Contributor

ylacombe commented Jan 1, 2024

Hey @vymao, I'm not a Whisper expert yet, but as I understand it, and as the documentation suggests, prompt_ids are created using the tokenizer's or the processor's get_prompt_ids.

```python
def get_prompt_ids(self, text: str, return_tensors="np"):
    """Converts prompt text to IDs that can be passed to [`~WhisperForConditionalGeneration.generate`]."""
    batch_encoding = self("<|startofprev|>", " " + text.strip(), add_special_tokens=False)
    # Check for special tokens
    prompt_text_ids = batch_encoding["input_ids"][1:]
    special_token_id = next((x for x in prompt_text_ids if x >= self.all_special_ids[0]), None)
    if special_token_id is not None:
        token = self.convert_ids_to_tokens(special_token_id)
        raise ValueError(f"Encountered text in the prompt corresponding to disallowed special token: {token}.")
```

As you can see from the code, get_prompt_ids handles the input text so you don't have to worry about special tokens that need to be inserted to tell the model that this text is context and not the start of the transcription.
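
For a quick sanity check (a sketch; the "openai/whisper-tiny" checkpoint and the prompt text are arbitrary choices), you can decode the output of get_prompt_ids and see that it is just the <|startofprev|> token followed by the prompt's tokens:

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

prompt_ids = processor.get_prompt_ids("I have a")  # numpy array of token IDs by default
print(prompt_ids)
print(processor.tokenizer.decode(prompt_ids))      # expected: "<|startofprev|> I have a"
```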

Then the code processes the prompt_ids in place of the decoder_input_ids.

In other words, you can use prompt_ids obtained from get_prompt_ids if you want to pass context to Whisper. decoder_input_ids is much more flexible: you could use it to reproduce what get_prompt_ids gives you, or for more advanced uses of Whisper.

I hope that it helps!

cc @sanchit-gandhi or @ArthurZucker if you want to correct me or give a more advanced explanation

@patrickvonplaten
Contributor

Hey @vymao,

That's a very good question! In a nutshell, decoder_input_ids and prompt_ids are the same thing. They allow you to prompt Whisper with a specific prefix, just like it's explained here: https://platform.openai.com/docs/guides/speech-to-text/prompting

Please use prompt_ids for the moment and don't use decoder_input_ids. I'm working on improving the docs and usability of Whisper at the moment with this PR: #27658
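
For reference, a minimal usage sketch of the recommended prompt_ids path. The checkpoint name, the placeholder audio array, and the 16 kHz sampling rate are illustrative assumptions, not part of the thread:

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Placeholder: 1 s of silence; in practice this would be the next audio chunk of a stream.
speech = np.zeros(16000, dtype=np.float32)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

# Previously transcribed text fed back as context, like OpenAI's "prompt"/"prefix" option.
prompt_ids = processor.get_prompt_ids("I have a", return_tensors="pt")

generated_ids = model.generate(input_features=inputs.input_features, prompt_ids=prompt_ids)
# Depending on the transformers version, you may need to strip the prompt tokens
# from the decoded output yourself.
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```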


github-actions bot commented Feb 2, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
