[Bug] whisper transcription quality dropped between 2.0.1 and 2.2.0 release #156
I have a feeling this is due to the removed mel filters in preprocessor_config.json. What you could do is download the old version of the config that still contains them and test with that.
Hi @kungfooman, thanks for your quick reply. Is there a way to host/provide only a local preprocessor_config.json while the other assets are loaded from HF? @kungfooman, @xenova, how safe is it to keep a local preprocessor_config.json considering future transformers.js releases? Or would it be possible to have transformers.js load the config with mel filters under some runtime/pipeline flag?
IMO the important question right now is "why"... Is there a problem with just cloning Xenova/whisper-base.en and testing with it locally? You can e.g. use:

```js
// Clear any previously cached model files so the local copy is actually used.
async function clearTransformersCache() {
  const tc = await caches.open("transformers-cache");
  const keys = await tc.keys();
  keys.forEach(key => tc.delete(key));
}
await clearTransformersCache();

// git clone https://huggingface.co/Xenova/whisper-tiny/
env.localModelPath = "http://127.0.0.1/transformer/models/";
const pipe = await pipeline("automatic-speech-recognition", "whisper-tiny or whatever model you want to test with");
```
Hm... both 2.0.1 and 2.2.0 link to https://huggingface.co/Xenova/whisper-base.en/resolve/main/preprocessor_config.json, which has no mel_filters in it.
Mel filters are defined by the rest of the values in preprocessor_config.json, so they can be derived even from a config that doesn't include them explicitly.
If both 2.0.1 and 2.2.0 use the same preprocessor_config.json, both without mel filters, yet the results are different, why would mel filters be the thing to look into? What am I missing?
They share the same values, but come in two variants: with or without mel_filters. With "same" I referred only to the values. If the values were different, you wouldn't be able to copy the mel_filters from one config to the other. So basically this is the entire issue: when they removed the mel_filters from the config, the filters have to be computed some other way, and that is where something may have changed.

So just fetch a JSON with mel_filters in it (but otherwise the SAME values, or your mel filters are different 😅) and see if the quality improves again. As far as I know, the release of the models isn't aligned with transformers.js releases either; they should work independently of each other (and are released whenever it fits on HuggingFace).
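If no hosted config with mel_filters can be found, one illustrative way to build such a JSON is to derive the filters from the values already in the config and inject them. This sketch is not from the thread: it assumes librosa is acceptable for generating the filter bank (whisper's reference filters are librosa-style), that n_fft defaults to 400 if the key is absent, and the array may still need transposing to match the layout older configs used.

```python
import json
import urllib.request

import librosa  # assumption: librosa generates the filter bank here

URL = "https://huggingface.co/Xenova/whisper-base.en/resolve/main/preprocessor_config.json"
with urllib.request.urlopen(URL) as response:
    config = json.load(response)

# The filter bank is fully determined by values already present in the config.
mel = librosa.filters.mel(
    sr=config["sampling_rate"],       # 16000 for whisper
    n_fft=config.get("n_fft", 400),   # assumption: fall back to 400 if absent
    n_mels=config["feature_size"],    # 80 mel bins
)

# Keep every other value identical and only add the filters.
config["mel_filters"] = mel.tolist()
with open("preprocessor_config.json", "w") as f:
    json.dump(config, f)
```

The resulting file could then be served locally, for example via the env.localModelPath approach shown earlier in the thread.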
I am having a really hard time explaining that there is just no such config with mel filters for whisper-base.en. If I am doing something wrong on my side, please send me a link to a whisper-base.en config with mel filters. I am not reporting the issue for whisper-tiny! Considering 2.0.1 and 2.2.0 use the very same preprocessor_config.json, I don't see how the mel filters can be the cause.
I see, the other big change was this PR: #133. You can just pick commits back in history and see where it degraded. You wouldn't even need to waste time rebuilding /dist/ for every picked commit if you import it via ES6 + import maps.
Just to confirm, you are using the same model/onnx files for each test, right?
If you have a look at the attached scripts, you will notice there is no explicit model file or revision pinned in them; both versions simply request Xenova/whisper-base.en from the Hub.
If you are running in a browser environment, can you just confirm by clearing any browser cache (or trying in incognito mode)? I'll run the experiments on my side too now.
I will also double check that the correct "forced_decoder_ids" are chosen.
Running in incognito mode, I can see both 2.0.1 and 2.2.0 produce the same lower-quality response. So the difference was something that had been cached, but in the meantime I managed to also delete that cache, so it's hard to get back into the earlier state. When importing 2.0.1 now I also had trouble getting the .wasm files to load. Not sure where to investigate further from here.
Just managed to get proper .wasm loading using env.backends.onnx.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.0.1/dist/'; however, I am still getting the lower-quality transcriptions.
The models' onnx files were updated around 3 weeks ago (https://huggingface.co/Xenova/whisper-tiny.en/tree/main/onnx) to be in line with the conversion process used by the rest of the models (which resulted in performance improvements for other seq2seq models). Specifically, this is due to the different quantization parameters used.
Bingo! When enforcing model revision https://huggingface.co/Xenova/whisper-base.en/commit/95502fc2ffd132c6859cf58a66f4977c3c6abac2 I am getting the good results for both 2.0.1 and 2.2.0:

```js
import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.0.1/dist/transformers.min.js";

env.allowLocalModels = false;
env.backends.onnx.wasm.wasmPaths = "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.0.1/dist/";

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const revision = "95502fc2ffd132c6859cf58a66f4977c3c6abac2";

// Raw PCM float32 audio, fed straight into the pipeline.
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model, { revision });
const result = await pipe(buffer, {
    chunk_length_s: 30,
    stride_length_s: 5,
    return_timestamps: true,
});

let content = "2.0.1 result\n";
for (let { text, timestamp } of result.chunks)
    content += `${timestamp[0]} -> ${timestamp[1]} ${text}\n`;
console.log(content);
```
So I wonder if the new onnx files / conversion process can somehow be tuned to get good speed and good transcriptions at the same time?
Great! Thanks so much for helping to investigate. This is the commit ec00d4f which changed the default quantization parameters in the conversion script. This change was necessary for many text-only models, which showed significantly worse performance without these settings (both in JS and in Python). This issue was first noticed a few weeks ago when GH actions started failing. Here's some example code and output to show the problem:

```python
from optimum.onnxruntime import ORTModelForTokenClassification
import onnxruntime
from transformers import pipeline, AutoTokenizer

model_path = './models/Davlan/distilbert-base-multilingual-cased-ner-hrl'
session_options = onnxruntime.SessionOptions()

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model_ort = ORTModelForTokenClassification.from_pretrained(
    model_path + '/onnx',
    file_name='model.onnx',
    use_io_binding=True,
    session_options=session_options
)

ort_pipeline = pipeline(
    task="token-classification",
    model=model_ort,
    tokenizer=tokenizer,
)

ARTICLE = "The Golden State Warriors are an American professional basketball team based in San Francisco."
out = ort_pipeline(ARTICLE)
print(f'{out=}')
```

With reduce_range=True, per_channel=True the output is correct; with reduce_range=False, per_channel=False it is not.
This is due to "saturation issues" when using int8 for weights: https://docs.openvino.ai/2022.3/pot_saturation_issue.html

However, as you point out, it looks like this is not necessary for whisper models. I will do some more testing, but it might make sense to revert the whisper models to use reduce_range=False, per_channel=False.
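For context, reduce_range and per_channel are the switches exposed by onnxruntime's dynamic quantization, which conversion scripts of this kind are typically built on. Below is a minimal sketch of where they enter; the file names are placeholders, not the project's actual conversion script.

```python
# Sketch only: shows where reduce_range / per_channel enter ONNX dynamic
# quantization. The file names are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="decoder_model.onnx",             # placeholder fp32 export
    model_output="decoder_model_quantized.onnx",  # placeholder output path
    weight_type=QuantType.QInt8,  # 8-bit signed weights, where saturation can bite
    per_channel=False,            # one scale per tensor (the setting reverted to for whisper)
    reduce_range=False,           # use the full 8-bit range instead of 7 bits
)
```

reduce_range=True restricts weights to a 7-bit range to sidestep the saturation issue on some CPUs, which is why it helps the text models above but is apparently unnecessary for whisper.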
Did additional testing and can confirm the different quantization options significantly impacted the performance! Tested on whisper-web (https://huggingface.co/spaces/Xenova/whisper-web), comparing reduce_range=True, per_channel=True against reduce_range=False, per_channel=False.

I've updated all whisper models: whisper-tiny, whisper-tiny.en, whisper-base, whisper-base.en, whisper-small, whisper-small.en, whisper-medium, whisper-medium.en, whisper-large, whisper-large-v2, and they are now live on the hub. I've also added a "quant_config.json" file which will help track this in the future: https://huggingface.co/Xenova/whisper-tiny/blob/main/quant_config.json

Can you please refresh the cache and try again on your side? It should download these fixed models now.
All's working from my side - so I'll close the issue for now. But feel free to reopen if needed.
Thanks for the quick fix @xenova, I can confirm models from the latest commit https://huggingface.co/Xenova/whisper-base.en/commit/86134155f8ad5593996868d4544b3d49ea0b1163 provide solid transcriptions.
I have observed relatively significant degradation in transcription quality between versions 2.0.1 and 2.2.0 using `automatic-speech-recognition` (whisper), although using the very same configuration. Here is an example output for the same input:

These are the observed issues:

- the timestamps for "You're a jerk, Tom." changed from 23 -> 27.08 to 12 -> 27.36
- "chased by these giant robotic claws" turned into "chased ... but he's dying robotic class."
- extra "[" and "]" characters (although these can be filtered in postprocessing)

I am wondering if these two versions use differently trained models? Or is there any extra configuration I could pass into the 2.2.0 pipeline/pipe to get results at least matching 2.0.1 without a drop in performance? (I am aware of `num_beams`, but that decreases performance heavily.)

I am attaching minimal repro testing scripts for evaluation; keep them running for 1-2 minutes and the output appears via `console.log`. The worker is as simple as: