[Bug] whisper transcription quality dropped between 2.0.1 and 2.2.0 release #156

Closed
jozefchutka opened this issue Jun 21, 2023 · 20 comments
Labels: bug (Something isn't working)

@jozefchutka commented Jun 21, 2023

I have observed a relatively significant degradation in transcription quality between versions 2.0.1 and 2.2.0 using automatic-speech-recognition (whisper), even though I am using the very same configuration.

Here is an example output for the same input:

2.0.1 result
0 -> 7  We have main engine start.
7 -> 10  4, 3, 2, 1.
10 -> 13  Yeah, whoa!
23 -> 27.08  You're a jerk, Tom. Look, Celia, we have to follow our passions.
27.08 -> 30.240000000000002  You have your robotics, and I just wanna be awesome in space.
30.24 -> 34.519999999999996  Why don't you just admit that you're freaked out by my robot hand?
34.52 -> 37.46  I'm not freaked out, but it's...
37.46 -> 39.36  All right, fine. I'm freaked out.
39.36 -> 42.36  I'm having nightmares that I'm being chased by these giant robotic claws.
42.36 -> 45.18  Oh, what in our tongue? We're done.
50.16 -> 53.78  - Robots memory, synced, and locked.
2.2.0 result
0 -> 3  [ [ ]
3 -> 4  [
4 -> 5  ]
5 -> 7  We have main engine start
7 -> 10  4, 3, 2, 1
10 -> 11  [ ]
11 -> 10  [
10 -> 12  ]
12 -> 11  [
11 -> 12  ]
12 -> 27.36  [ ] [ (thunder rumbling) - You're a jerk, Tom. - Look, Celia, we have to follow our passions.
27.36 -> 30.72  You have your robotics and I just wanna be awesome in space.
30.72 -> 32.68  - Why don't you just admit that you're
32.68 -> 34.68  freaked out by my robot hand?
34.68 -> 36.2  - I'm not freaked out, but it's...
37.68 -> 38.68  All right, fine.
38.68 -> 39.519999999999996  I'm freaked out.
39.52 -> 40.92  I'm having nightmares that I'm being chased
40.92 -> 42.44  but he's dying robotic class.
42.44 -> 44.36  - Oh, what in our dumb?
44.36 -> 45.18  We're done.
50.18 -> 53.68  - Robots memory, synced, and locked.

These are the observed issues:

  1. Lowered timestamp precision; for example, "You're a jerk, Tom." moved from 23 -> 27.08 to 12 -> 27.36.
  2. Words are recognized less accurately; for example, "chased by these giant robotic claws" became "chased ... but he's dying robotic class".
  3. Random appearance of [ and ] characters (although these can be filtered out in postprocessing).

I am wondering whether these two versions use differently trained models.

Or is there any extra configuration I could pass into the 2.2.0 pipeline/pipe call to get results at least matching 2.0.1 without a drop in performance? (I am aware of num_beams, but that decreases performance heavily.)

I am attaching minimal repro testing scripts for evaluation; keep them running for 1-2 minutes and the output appears in the console log.

The worker is as simple as:

import { pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/transformers.min.js";

const file = "tos.pcm"; // raw float32 PCM samples
const model = "Xenova/whisper-base.en";

// Fetch the audio and run the ASR pipeline with chunked decoding and timestamps.
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);
const result = await pipe(buffer, {
	chunk_length_s: 30,
	stride_length_s: 5,
	return_timestamps: true});

let content = "2.2.0 result\n";
for(let {text, timestamp} of result.chunks)
	content += `${timestamp[0]} -> ${timestamp[1]} ${text}\n`;
console.log(content);
jozefchutka added the bug label on Jun 21, 2023
@kungfooman (Contributor) commented

I have a feeling this is due to the mel filters being removed from preprocessor_config.json:

https://huggingface.co/Xenova/whisper-tiny.en/commit/587f87336f46d17831cc9a2539389bef5f082537.diff?file=preprocessor_config.json

What you could do is download the old version with mel_filters and instruct transformers.js to load your local model - if it has mel filters, transformers.js will not attempt to recalculate them. The recalculation differs from the Python implementation only by a minuscule amount, and yet I saw differences as well.

@jozefchutka (Author) commented

Hi @kungfooman, thanks for your quick reply.

Is there a way I can host/provide only a local preprocessor_config.json while the other assets are loaded from HF?

@kungfooman, @xenova, how safe is it to keep a local preprocessor_config.json considering future transformers.js releases? Or would it be possible to have transformers.js load a config with mel filters under some runtime/pipeline flag?

@kungfooman (Contributor) commented

IMO the important question right now is "why"... is there a problem with just cloning Xenova/whisper-base.en and testing with it locally?

You can e.g. use:

// Assumes `env` and `pipeline` are imported from @xenova/transformers as above.
async function clearTransformersCache() {
  // Delete any previously cached model files so the local copy is actually fetched.
  const tc = await caches.open("transformers-cache");
  const keys = await tc.keys();
  await Promise.all(keys.map(key => tc.delete(key)));
}
await clearTransformersCache();
// git clone https://huggingface.co/Xenova/whisper-tiny/
env.localModelPath = "http://127.0.0.1/transformer/models/";
const pipe = await pipeline("automatic-speech-recognition", "whisper-tiny or whatever model you want to test with");

@jozefchutka (Author) commented

Hm... both 2.0.1 and 2.2.0 link to https://huggingface.co/Xenova/whisper-base.en/resolve/main/preprocessor_config.json, which has no mel_filters, and there is no config with mel filters in the repo history.

@kungfooman (Contributor) commented

The mel filters are determined by the rest of the settings in preprocessor_config.json, and since those settings are the same, you can just copy the mel filters from another model's config (i.e. just copy over the entire file).

@jozefchutka (Author) commented

If both 2.0.1 and 2.2.0 use the same preprocessor_config.json, both without mel filters, yet the results are different, why are mel filters the thing to look into? What am I missing?

@kungfooman (Contributor) commented

They share the same values, but come in two variants: with or without mel_filters

With "same" I refered only to the values. If the values were different, you wouldn't be able to copy the mel_filters from another file with "same" values.

So basically this is the entire issue: when the mel_filters were removed from the preprocessor_config.json, the argument was that they just look ugly and aren't required anyway, because you get the "same" values by recalculating them. But this recalculation in JS is maybe 99.999999% aligned with the Python calculation, and yet it differed in my test.

So just fetch a JSON with mel_filters in it (but otherwise SAME values, or your mel filters are different 😅) and see if the quality improves again.

As far as I know, the release of the models isn't aligned with transformers.js releases either; they should work independently of each other (and are released whenever it fits on HuggingFace).

@jozefchutka (Author) commented

mel_filters were not removed from whisper-base.en/preprocessor_config.json, because they were never there in the first place. Or please show me the commit where it happened for whisper-base, which is the model I am reporting this bug for: https://huggingface.co/Xenova/whisper-base.en/commits/main/preprocessor_config.json

I am having a really hard time explaining that there is simply no config with mel filters for whisper-base.en. If I am doing something wrong on my side, please send me a link to whisper-base.en with mel filters. I am not reporting the issue for whisper-tiny!

Considering that 2.0.1 and 2.2.0 use the very same preprocessor_config.json file (which never contained mel_filters), and most likely the very same revision of whisper-base.en, what else could be causing the issue in my bug report?

@kungfooman (Contributor) commented

I see, the other big change was this PR: #133

You can just pick commits back in history and see where it degraded.

You wouldn't even need to waste time rebuilding /dist/ for every picked commit if you import it via ES6 + import maps.

@xenova (Collaborator) commented Jun 21, 2023

Just to confirm, you are using the same model/onnx files for each test, right?

@jozefchutka (Author) commented

If you have a look at the attached scripts, you will notice there is no revision specified for the pipeline. That makes me believe the very same models are in use.

@xenova (Collaborator) commented Jun 21, 2023

If you have a look at the attached scripts, you will notice there is no revision specified for the pipeline. That makes me believe the very same models are in use.

If you are running in a browser environment, can you just confirm by clearing any browser cache (or trying in incognito mode)? I'll run the experiments on my side too now.

I see, the other big change was this PR: #133

I will also double check that the correct "forced_decoder_ids" are chosen.

@jozefchutka (Author) commented

Running in incognito mode, I can see both 2.0.1 and 2.2.0 produce the same lower quality response. So the difference is something that was cached, but in the meantime I also managed to delete the cache, so it's hard to get back into that state.

When importing import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/transformers.min.js";, the loaded .wasm (https://cdn.jsdelivr.net/npm/@xenova/transformers/dist/ort-wasm-simd.wasm) does not respect the 2.0.1 version.

Not sure where to investigate further from here.

@jozefchutka (Author) commented

I just managed to get the proper .wasm loading using

env.backends.onnx.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/';

However, I am still getting the lower quality transcriptions.

@xenova (Collaborator) commented Jun 21, 2023

The models' onnx files were updated around 3 weeks ago (https://huggingface.co/Xenova/whisper-tiny.en/tree/main/onnx) to be in line with the conversion process used by the rest of the models (which resulted in performance improvements for other seq2seq models). Specifically, this is due to the different quantization parameters used.

@jozefchutka (Author) commented

Bingo! When enforcing the model revision https://huggingface.co/Xenova/whisper-base.en/commit/95502fc2ffd132c6859cf58a66f4977c3c6abac2, I am getting the good results for both 2.0.1 and 2.2.0:

import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/transformers.min.js";

env.allowLocalModels = false;
env.backends.onnx.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/';

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
// Pin an older model revision (the onnx files prior to the recent update).
const revision = "95502fc2ffd132c6859cf58a66f4977c3c6abac2";

const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model, {revision});
const result = await pipe(buffer, {
	chunk_length_s: 30,
	stride_length_s: 5,
	return_timestamps: true});

let content = "2.0.1 result\n";
for(let {text, timestamp} of result.chunks)
	content += `${timestamp[0]} -> ${timestamp[1]} ${text}\n`;
console.log(content);
0 -> 7  We have main engine start.
7 -> 10  4, 3, 2, 1.
10 -> 13  Yeah, whoa!
23 -> 27.08  You're a jerk, Tom. Look, Celia, we have to follow our passions.
27.08 -> 30.240000000000002  You have your robotics, and I just wanna be awesome in space.
30.24 -> 34.519999999999996  Why don't you just admit that you're freaked out by my robot hand?
34.52 -> 37.46  I'm not freaked out, but it's...
37.46 -> 39.36  All right, fine. I'm freaked out.
39.36 -> 42.36  I'm having nightmares that I'm being chased by these giant robotic claws.
42.36 -> 45.18  Oh, what in our tongue? We're done.
50.16 -> 53.78  - Robots memory, synced, and locked.

So I wonder if the new onnx files / conversion process can somehow be tuned to get good speed and good transcriptions at the same time?

@xenova (Collaborator) commented Jun 21, 2023

Great! Thanks so much for helping to investigate. Commit ec00d4f is the one that changed the default quantization parameters in the conversion script.

This change was necessary for many text-only models, which showed significantly worse performance without these settings (both in JS and in Python). This issue was first noticed a few weeks ago when GitHub Actions started failing. Here's some example code and output to show the problem:

from optimum.onnxruntime import ORTModelForTokenClassification
import onnxruntime
from transformers import pipeline, AutoTokenizer

model_path = './models/Davlan/distilbert-base-multilingual-cased-ner-hrl'

session_options = onnxruntime.SessionOptions()

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

model_ort = ORTModelForTokenClassification.from_pretrained(
    model_path + '/onnx',
    file_name='model.onnx',
    use_io_binding=True,
    session_options=session_options
)

ort_pipeline = pipeline(
    task="token-classification",
    model=model_ort,
    tokenizer=tokenizer,
)

ARTICLE = "The Golden State Warriors are an American professional basketball team based in San Francisco."

out = ort_pipeline(ARTICLE)

print(f'{out=}')

With reduce_range=True, per_channel=True (correct):

[{'entity': 'B-ORG', 'score': 0.9998536, 'index': 2, 'word': 'Golden', 'start': 4, 'end': 10}, {'entity': 'I-ORG', 'score': 0.99986124, 'index': 3, 'word': 
'State', 'start': 11, 'end': 16}, {'entity': 'I-ORG', 'score': 0.9998661, 'index': 4, 'word': 'Warriors', 'start': 17, 'end': 25}, {'entity': 'B-LOC', 'score': 
0.9997049, 'index': 13, 'word': 'San', 'start': 80, 'end': 83}, {'entity': 'I-LOC', 'score': 0.9987282, 'index': 14, 'word': 'Francisco', 'start': 84, 'end': 93}]

With reduce_range=False, per_channel=False (incorrect):

[{'entity': 'B-ORG', 'score': 0.99973637, 'index': 2, 'word': 'Golden', 'start': 4, 'end': 10}, {'entity': 'I-ORG', 'score': 0.99928397, 'index': 3, 'word': 'State', 'start': 11, 'end': 16}]

This is due to "saturation issues" when using int8 for weights: https://docs.openvino.ai/2022.3/pot_saturation_issue.html

However, as you point out, it looks like this is not necessary for whisper models. I will do some more testing, but it might make sense to revert the model to use reduce_range=False, per_channel=False.
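
For context, here is a minimal sketch of how these two options are passed to ONNX Runtime's dynamic quantizer. This is not necessarily how the conversion script itself invokes it; the file paths and weight type below are placeholders:

from onnxruntime.quantization import QuantType, quantize_dynamic

# Placeholder paths; the real conversion script operates on the exported ONNX files.
model_input = "./onnx/decoder_model_merged.onnx"
model_output = "./onnx/decoder_model_merged_quantized.onnx"

quantize_dynamic(
    model_input,
    model_output,
    per_channel=True,   # one scale/zero-point per output channel instead of per tensor
    reduce_range=True,  # quantize weights with 7 bits to avoid the int8 saturation issue
    weight_type=QuantType.QInt8,  # assumed weight type for this sketch
)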

xenova added a commit that referenced this issue Jun 21, 2023
@xenova (Collaborator) commented Jun 21, 2023

I did additional testing and can confirm that the different quantization options significantly impacted performance! Tested on whisper-web: https://huggingface.co/spaces/Xenova/whisper-web

reduce_range=True, per_channel=True:
(screenshot: whisper-web transcription output)

reduce_range=False, per_channel=False:
(screenshot: whisper-web transcription output)

I've updated all whisper models: whisper-tiny, whisper-tiny.en, whisper-base, whisper-base.en, whisper-small, whisper-small.en, whisper-medium, whisper-medium.en, whisper-large, whisper-large-v2, and they are now live on the hub. I've also added a "quant_config.json" file which will help track this in the future: https://huggingface.co/Xenova/whisper-tiny/blob/main/quant_config.json

Can you please refresh the cache and try again on your side? It should download the fixed models now.

@xenova (Collaborator) commented Jun 21, 2023

All is working on my side, so I'll close the issue for now. But feel free to reopen if needed.

xenova closed this as completed Jun 21, 2023
xenova added a commit that referenced this issue Jun 21, 2023:
Allow user to set `per_channel` and `reduce_range` quantization parameters (#156) (#157)

* Allow user to set `per_channel` and `reduce_range` quantization parameters (#156)
  Also save quantization options
* Get operators of graph and subgraphs

@jozefchutka (Author) commented

Thanks for the quick fix @xenova, I can confirm that the models from the latest commit https://huggingface.co/Xenova/whisper-base.en/commit/86134155f8ad5593996868d4544b3d49ea0b1163 provide solid transcriptions.
