System Info

transformers version: 4.41.0.dev0

Who can help?

@sanchit-gandhi

Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
Following issues #29869 and #30407, I've tried to reproduce the errors mentioned and identified two problems:
First, we get errors if the main and assistant models don't share the same encoder (for example, whisper-large-v2 and whisper-tiny) and we only load the decoder part of the assistant with AutoModelForCausalLM.
In this case we could throw an error and suggest that the user load the assistant with AutoModelForSpeechSeq2Seq instead, so that both the encoder and the decoder are loaded.
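Here's a minimal sketch of that failing case (a sketch reusing the same checkpoints as the reproduction below; the exact traceback may vary across versions):
# Sketch of the failing case: loading only the assistant's decoder with
# AutoModelForCausalLM leaves it without an encoder matching the main model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2")
# decoder-only assistant: whisper-tiny has d_model=384, vs 1280 for large-v2
assistant_decoder_only = AutoModelForCausalLM.from_pretrained("openai/whisper-tiny")

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:1]")
sample = dataset[0]
inputs = processor(sample["audio"]["array"], sampling_rate=sample["audio"]["sampling_rate"], return_tensors="pt")

# Errors out: the assistant has no encoder of its own, and the main encoder's
# hidden states don't fit whisper-tiny's decoder. Loading the assistant with
# AutoModelForSpeechSeq2Seq instead avoids this.
model.generate(**inputs, assistant_model=assistant_decoder_only, language="en")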
Second, only the pipeline seems to be broken when using different-sized Whisper models. Here's code to reproduce the error:
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, AutomaticSpeechRecognitionPipeline, AutoModelForCausalLM
# load data to test
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:1]")
sample = dataset[0]
# load base model
model_id = "openai/whisper-large-v2"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
# load tiny version of model from same origin (openai)
assistant_tiny_model_id = "openai/whisper-tiny"
assistant_direct_tiny_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_tiny_model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
inputs = processor(sample["audio"]["array"], sampling_rate=sample["audio"]["sampling_rate"], return_tensors="pt")
output = model.generate(**inputs, assistant_model=assistant_direct_tiny_model, language="en")
print(processor.batch_decode(output, skip_special_tokens=True, normalize=True)[0])
# load pipeline for base model
pipe = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"language": "en"},
)
inputs = {
    "sampling_rate": sample["audio"]["sampling_rate"],
    "raw": np.array(sample["audio"]["array"]),
}
output = pipe(inputs=inputs, generate_kwargs={"assistant_model": assistant_direct_tiny_model})["text"]
print(processor.tokenizer.normalize(output))
It works with model.generate, but fails when using the pipeline. I think the problem comes from the way the inputs are passed to the generate method by the pipeline. I'll open a PR to fix this.
Expected behavior
ValueError: Whisper expects the mel input features to be of length 3000, but found 1500. Make sure to pad the input mel features to 3000.
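For context on the two lengths in the message (my reading, not stated in the issue): Whisper's feature extractor always pads/truncates audio to 3000 mel frames (30 s at a 160-sample hop over 16 kHz audio), and the encoder's stride-2 convolution halves that to 1500 hidden states. So the 1500 here looks like an encoder output length being passed where mel features are expected. A quick shape check, reusing processor, model, and sample from the snippet above:
# 3000 mel frames in, 1500 encoder states out (stride-2 conv halves the length)
feats = processor(
    sample["audio"]["array"],
    sampling_rate=sample["audio"]["sampling_rate"],
    return_tensors="pt",
).input_features
print(feats.shape)  # torch.Size([1, 80, 3000])

encoder_out = model.get_encoder()(feats)
print(encoder_out.last_hidden_state.shape)  # torch.Size([1, 1500, 1280])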