Prompt_ids feature causing repetitions and hallucinations #35603

Open
4 tasks
vchagari opened this issue Jan 10, 2025 · 4 comments

@vchagari

System Info

Hi @sanchit-gandhi and @gante

Using the prompt feature as described in #22395 causes the model output to contain many repetitions and hallucinations.

I recorded an audio clip and passed it to the Whisper ASR model with the prompt shown below.

More details:
Transformers Commit: 1c7e5e2

Test case: steps to reproduce the issue.
Audio contents: "The full name of Donald is Donald J. Trump Jr"
prompt = "Donald Duck"

model = WhisperForConditionalGeneration.from_pretrained(model_dir).to("cuda")
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)
processor = WhisperProcessor.from_pretrained(model_dir)
prompt_ids = processor.get_prompt_ids(prompt)
input_features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to("cuda"), prompt_ids=prompt_ids, num_beams=4)
text = [processor.decode(predicted_id, skip_special_tokens=True) for predicted_id in predicted_ids]
transcript = text[0]

Output: The full name of Donald is Donald J. Trump Jr. Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donald Duck Donal Donald Duck Donald Duck Donald Duck Donald Duck Donald Duck Donal Donald Duck

Link to the audio: https://drive.google.com/file/d/1ud-B0uepD8Sk6ArkvJdqPmFWYpCmAooi/view?usp=drive_link

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Same test case as given under System Info above (identical audio, prompt, code, and output).

Expected behavior

The output should be either "The full name of Donald is Donald J. Trump" or "The full name of Donald is Donald Duck", not an endless repetition of the prompt keywords.
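
A rough way to tell the two kinds of output apart automatically is the compression ratio of the transcript: text stuck in a repetition loop compresses far better than ordinary speech. The sketch below is an illustration only and is not part of the reproducer. The helper names `compression_ratio` and `looks_repetitive` are hypothetical, the looped string is a synthetic stand-in rather than the exact output reported above, and the 2.4 cutoff is an assumption borrowed from the default in OpenAI's Whisper reference code.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; higher means more repetitive."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """Flag a transcript whose compression ratio exceeds the (assumed) threshold."""
    return compression_ratio(text) > threshold


clean = "The full name of Donald is Donald J. Trump"
# Synthetic stand-in for a repetition loop, not the exact output from the report.
looped = clean + " Jr." + " Donald Duck" * 40

for name, text in (("clean", clean), ("looped", looped)):
    print(name, round(compression_ratio(text), 2), looks_repetitive(text))
```

A check like this only detects the loop after the fact; it does not explain or prevent it.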

vchagari added the bug label Jan 10, 2025
@Rocketknight1
Member

cc @eustlb

@eustlb
Contributor

eustlb commented Jan 10, 2025

Hey @vchagari 🤗

Thanks for providing the audio!

I'm unsure why you're reporting an issue for a commit dating from 2023. Using the latest version of transformers:

pip install -U transformers 

it returns "The full name of Donald is Donald J. Trump", as expected.
Moreover, your reproducer is incomplete; a complete one would be:

from transformers import WhisperForConditionalGeneration, WhisperFeatureExtractor, WhisperProcessor
import librosa

model_dir = "openai/whisper-large-v3"
prompt = "Donald Duck"

audio, sr = librosa.load("prompt_ids_test.wav", sr=16000)

model = WhisperForConditionalGeneration.from_pretrained(model_dir).to("cuda")
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_dir)
processor = WhisperProcessor.from_pretrained(model_dir)
prompt_ids = processor.get_prompt_ids(prompt, return_tensors="pt").to("cuda")
input_features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to("cuda"), prompt_ids=prompt_ids)
text = [processor.decode(predicted_id, skip_special_tokens=True) for predicted_id in predicted_ids]
transcript = text[0]

@vchagari
Author

@eustlb: Thanks for your response. That is the commit I used to develop my application. Upgrading to the latest transformers causes compatibility issues on my end, so I froze my transformers version.

@vchagari
Author

vchagari commented Jan 27, 2025

I am in the process of upgrading and will share more info as soon as possible, thank you. However, I still see repetitions and hallucinations with the newer version (4.43.1) when I use the prompt feature. My use case is long-form transcription.
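
For long-form transcription, one thing that may be worth trying is the set of fallback arguments that Whisper's `generate` accepts in recent transformers releases. The sketch below only assembles the keyword arguments. `long_form_kwargs` is a hypothetical helper, the threshold values are recalled from the transformers documentation example rather than tuned for this audio, and whether they remove the repetitions reported here is untested.

```python
from typing import Any, Dict


def long_form_kwargs(prompt_ids: Any, condition_on_prev_tokens: bool = False) -> Dict[str, Any]:
    """Assemble keyword arguments for Whisper's generate (hypothetical helper)."""
    return {
        "prompt_ids": prompt_ids,
        # Apply the prompt to the first segment only; "all-segments"
        # additionally requires condition_on_prev_tokens=True.
        "prompt_condition_type": "first-segment",
        # Conditioning on previously decoded text can carry a loop into later segments.
        "condition_on_prev_tokens": condition_on_prev_tokens,
        # Temperature fallback: a segment is retried at the next temperature
        # when one of the threshold checks below fails.
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        # A segment whose compression ratio exceeds this triggers the fallback.
        "compression_ratio_threshold": 1.35,
        # A segment whose average log-probability falls below this also triggers it.
        "logprob_threshold": -1.0,
        # Used to decide whether a low-confidence segment is silence.
        "no_speech_threshold": 0.6,
        # Long-form generation relies on timestamp tokens.
        "return_timestamps": True,
    }


print(sorted(long_form_kwargs(prompt_ids=None)))

# Intended use (assumes `model`, `input_features`, and `prompt_ids` are built as in
# the reproducer earlier in this thread, with features extracted without truncation
# for audio longer than 30 s):
# predicted_ids = model.generate(input_features.to("cuda"), **long_form_kwargs(prompt_ids))
```

Keeping `condition_on_prev_tokens` off by default is a guess about what limits loop propagation; passing `True` restores conditioning for comparison.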
