[Bug] whisper transcription quality dropped between 2.0.1 and 2.2.0 release #156
I have a feeling this is due to the removed mel filters in preprocessor_config.json. What you could do is download the old version of the config that still contains them and test with that.
Hi @kungfooman, thanks for your quick reply. Is there a way to host/provide only a local preprocessor_config.json while the other assets are loaded from HF? @kungfooman, @xenova, how safe is it to keep a local preprocessor_config.json considering future transformers.js releases? Or would it be possible to have transformers.js load the config with mel filters under some runtime/pipeline flag?
IMO the important question right now is "why"... Is there a problem with just cloning Xenova/whisper-base.en and testing with it locally? You can e.g. use:

```js
// Clear any previously cached model files so the local copy is actually used.
async function clearTransformersCache() {
  const tc = await caches.open("transformers-cache");
  const keys = await tc.keys();
  keys.forEach(key => tc.delete(key));
}
await clearTransformersCache();

// git clone https://huggingface.co/Xenova/whisper-tiny/
env.localModelPath = "http://127.0.0.1/transformer/models/";
const pipe = await pipeline("automatic-speech-recognition", "whisper-tiny or whatever model you want to test with");
```
Hm... both 2.0.1 and 2.2.0 link to https://huggingface.co/Xenova/whisper-base.en/resolve/main/preprocessor_config.json, which has no mel_filters in it.
Mel filters are defined by the rest of the values in preprocessor_config.json, so they can be derived even from a config that doesn't include them explicitly.
If both 2.0.1 and 2.2.0 use the same preprocessor_config.json, both without mel filters, yet the results are different, why would mel filters be the thing to look into? What am I missing?
They share the same values, but come in two variants: with or without mel_filters. With "same" I referred only to the values. If the values were different, you wouldn't be able to copy the mel_filters from one config to the other. So basically this is the entire issue: when they removed the mel_filters from the config, the filters have to be computed some other way, and that is where something may have changed.

So just fetch a JSON with mel_filters in it (but otherwise the SAME values, or your mel filters are different 😅) and see if the quality improves again. As far as I know, the release of the models isn't aligned with transformers.js releases either; they should work independently of each other (and are released whenever it fits on HuggingFace).
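If no hosted config with mel_filters can be found, one illustrative way to build such a JSON is to derive the filters from the values already in the config and inject them. This sketch is not from the thread: it assumes librosa is acceptable for generating the filter bank (whisper's reference filters are librosa-style), that n_fft defaults to 400 if the key is absent, and the array may still need transposing to match the layout older configs used.

```python
import json
import urllib.request

import librosa  # assumption: librosa generates the filter bank here

URL = "https://huggingface.co/Xenova/whisper-base.en/resolve/main/preprocessor_config.json"
with urllib.request.urlopen(URL) as response:
    config = json.load(response)

# The filter bank is fully determined by values already present in the config.
mel = librosa.filters.mel(
    sr=config["sampling_rate"],       # 16000 for whisper
    n_fft=config.get("n_fft", 400),   # assumption: fall back to 400 if absent
    n_mels=config["feature_size"],    # 80 mel bins
)

# Keep every other value identical and only add the filters.
config["mel_filters"] = mel.tolist()
with open("preprocessor_config.json", "w") as f:
    json.dump(config, f)
```

The resulting file could then be served locally, for example via the env.localModelPath approach shown earlier in the thread.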
I am having a really hard time explaining that there is just no such config with mel filters for whisper-base.en. If I am doing something wrong on my side, please send me a link to a whisper-base.en config with mel filters. I am not reporting the issue for whisper-tiny! Considering 2.0.1 and 2.2.0 use the very same preprocessor_config.json, I don't see how the mel filters can be the cause.
I see, the other big change was this PR: #133. You can just pick commits back in history and see where it degraded. You wouldn't even need to waste time rebuilding /dist/ for every picked commit if you import it via ES6 + import maps.
Just to confirm, you are using the same model/onnx files for each test, right?
If you have a look at the attached scripts, you will notice there is no explicit model file or revision pinned in them; both versions simply request Xenova/whisper-base.en from the Hub.
If you are running in a browser environment, can you just confirm by clearing any browser cache (or trying in incognito mode)? I'll run the experiments on my side too now.
I will also double check that the correct "forced_decoder_ids" are chosen.
Running in incognito mode, I can see both 2.0.1 and 2.2.0 produce the same lower-quality response. So the difference was something that had been cached, but in the meantime I managed to also delete that cache, so it's hard to get back into the earlier state. When importing 2.0.1 now I also had trouble getting the .wasm files to load. Not sure where to investigate further from here.
Just managed to get proper .wasm loading using env.backends.onnx.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.0.1/dist/'; however, I am still getting the lower-quality transcriptions.
The models' onnx files were updated around 3 weeks ago (https://huggingface.co/Xenova/whisper-tiny.en/tree/main/onnx) to be in line with the conversion process used by the rest of the models (which resulted in performance improvements for other seq2seq models). Specifically, this is due to the different quantization parameters used.
Bingo! When enforcing model revision https://huggingface.co/Xenova/whisper-base.en/commit/95502fc2ffd132c6859cf58a66f4977c3c6abac2 I am getting the good results for both 2.0.1 and 2.2.0:

```js
import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.0.1/dist/transformers.min.js";

env.allowLocalModels = false;
env.backends.onnx.wasm.wasmPaths = "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.0.1/dist/";

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const revision = "95502fc2ffd132c6859cf58a66f4977c3c6abac2";

// Raw PCM float32 audio, fed straight into the pipeline.
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model, { revision });
const result = await pipe(buffer, {
    chunk_length_s: 30,
    stride_length_s: 5,
    return_timestamps: true,
});

let content = "2.0.1 result\n";
for (let { text, timestamp } of result.chunks)
    content += `${timestamp[0]} -> ${timestamp[1]} ${text}\n`;
console.log(content);
```
So I wonder if the new onnx files / conversion process can somehow be tuned to get good speed and good transcriptions at the same time?
Great! Thanks so much for helping to investigate. This is the commit ec00d4f which changed the default quantization parameters in the conversion script. This change was necessary for many text-only models, which showed significantly worse performance without these settings (both in JS and in Python). This issue was first noticed a few weeks ago when GH actions started failing. Here's some example code and output to show the problem:

```python
from optimum.onnxruntime import ORTModelForTokenClassification
import onnxruntime
from transformers import pipeline, AutoTokenizer

model_path = './models/Davlan/distilbert-base-multilingual-cased-ner-hrl'
session_options = onnxruntime.SessionOptions()

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model_ort = ORTModelForTokenClassification.from_pretrained(
    model_path + '/onnx',
    file_name='model.onnx',
    use_io_binding=True,
    session_options=session_options
)

ort_pipeline = pipeline(
    task="token-classification",
    model=model_ort,
    tokenizer=tokenizer,
)

ARTICLE = "The Golden State Warriors are an American professional basketball team based in San Francisco."
out = ort_pipeline(ARTICLE)
print(f'{out=}')
```

With reduce_range=True, per_channel=True the output is correct; with reduce_range=False, per_channel=False it is not.
This is due to "saturation issues" when using int8 for weights: https://docs.openvino.ai/2022.3/pot_saturation_issue.html

However, as you point out, it looks like this is not necessary for whisper models. I will do some more testing, but it might make sense to revert the whisper models to use reduce_range=False, per_channel=False.
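For context, reduce_range and per_channel are the switches exposed by onnxruntime's dynamic quantization, which conversion scripts of this kind are typically built on. Below is a minimal sketch of where they enter; the file names are placeholders, not the project's actual conversion script.

```python
# Sketch only: shows where reduce_range / per_channel enter ONNX dynamic
# quantization. The file names are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="decoder_model.onnx",             # placeholder fp32 export
    model_output="decoder_model_quantized.onnx",  # placeholder output path
    weight_type=QuantType.QInt8,  # 8-bit signed weights, where saturation can bite
    per_channel=False,            # one scale per tensor (the setting reverted to for whisper)
    reduce_range=False,           # use the full 8-bit range instead of 7 bits
)
```

reduce_range=True restricts weights to a 7-bit range to sidestep the saturation issue on some CPUs, which is why it helps the text models above but is apparently unnecessary for whisper.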
Did additional testing and can confirm the different quantization options significantly impacted the performance! Tested on whisper-web (https://huggingface.co/spaces/Xenova/whisper-web), comparing reduce_range=True, per_channel=True against reduce_range=False, per_channel=False.

I've updated all whisper models: whisper-tiny, whisper-tiny.en, whisper-base, whisper-base.en, whisper-small, whisper-small.en, whisper-medium, whisper-medium.en, whisper-large, whisper-large-v2, and they are now live on the hub. I've also added a "quant_config.json" file which will help track this in the future: https://huggingface.co/Xenova/whisper-tiny/blob/main/quant_config.json

Can you please refresh the cache and try again on your side? It should download these fixed models now.
All's working from my side - so I'll close the issue for now. But feel free to reopen if needed.
Thanks for the quick fix @xenova, I can confirm models from the latest commit https://huggingface.co/Xenova/whisper-base.en/commit/86134155f8ad5593996868d4544b3d49ea0b1163 provide solid transcriptions.
I have observed relatively significant degradation in transcription quality between versions 2.0.1 and 2.2.0 using `automatic-speech-recognition` (whisper), although using the very same configuration. Here is an example output for the same input:

These are the observed issues:

- the timestamps for "You're a jerk, Tom." changed from 23 -> 27.08 to 12 -> 27.36
- "chased by these giant robotic claws" turned into "chased ... but he's dying robotic class."
- extra "[" and "]" characters (although these can be filtered in postprocessing)

I am wondering if these two versions use differently trained models? Or is there any extra configuration I could pass into the 2.2.0 pipeline/pipe to get results at least matching 2.0.1 without a drop in performance? (I am aware of `num_beams`, but that decreases performance heavily.)

I am attaching minimal repro testing scripts for evaluation; keep them running for 1-2 minutes and the output appears via `console.log`. The worker is as simple as: