Is my Scorer biased ? #2204

NightmareJC · 2022-04-26T09:05:22Z

Hello everyone,

I’m using Coqui-STT for a specific domain with constrained vocabulary.
Therefore I’m trying to build my own scorer as shown in https://stt.readthedocs.io/en/latest/playbook/SCORER.html
On this page we can read:

"Preparing the text file
This is straightforward. In this example, we will use a file called indonesian-sentences.txt. This file should contain phrases that you wish to prioritize recognising. For example, you may want to recognise place names, digits or medical phrases - and you will include these phrases in the .txt file.
These phrases should not be copied from test.tsv, train.tsv or validated.tsv as you will bias the resultant model."

I don’t understand why giving the scorer phrases from the train_set or val_set will bias the model (I think I get why phrases from the test_set will introduce bias), and I can’t find a clear explanation on the internet or in the Coqui documentation. Can you help me?

Best regards

wasertech · 2022-05-24T19:14:04Z

wasertech
May 24, 2022
Collaborator

Because it's cheating... if the language model and the acoustic model have access to the exact same sentences, they will overfit on your data and generalize badly in the real world.

Ask yourself:

What is left for the model to learn if all the answers are given from the get-go?

Separating the data for both models is the only way to make sure they cannot cheat like that. This makes the results of the tests more reliable.

If you mix your data, nobody can trust the results you are getting with your models, because they have learned all of that data but will not generalize well to new data.

In essence, your models will be really good on your data because they know all of it, but the moment you try them on data they have never seen, they will fail to transcribe it accurately.

3 replies

NightmareJC Jun 7, 2022
Author

Hello wasertech and thank you very much for your answer.

I'm not sure I understand it correctly:
To my mind, since the acoustic model and the language model are not learning the same information from the samples (phonemes vs. sequences of words), there is no bias in giving the LM sentences from the trainSet and valSet of the AM.

Let's say I split my data (audio with corresponding text files) into 4 sets A, B, C and D (with no data leakage between sets). Suppose my data architecture is:
1st) the AM is trained on A, validated on B and tested on C
2nd) the LM is built as a scorer made from the sentences of A, B and D, and tested on C
3rd) when I'm happy with my WER results (after tuning hyperparameters), I create a production engine trained and validated on A, B, C and D
Am I doing it wrong?
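
For reference, the split I describe can be sketched roughly as below. This is only an illustration under my own assumptions: `normalize`, `split_disjoint` and the fractions are hypothetical names and values, not part of Coqui STT, and the split is done on the transcript text so that the same sentence cannot land in two sets.

```python
import random


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so near-identical transcripts compare equal."""
    return " ".join(sentence.lower().split())


def split_disjoint(sentences, fractions=(0.6, 0.1, 0.1, 0.2), seed=0):
    """Split unique sentences into sets A, B, C and D that share no sentence."""
    # Deduplicate first: a sentence that appears twice must not end up in two sets.
    unique = sorted({normalize(s) for s in sentences})
    random.Random(seed).shuffle(unique)

    names = ["A", "B", "C", "D"]
    splits, start = {}, 0
    for i, (name, frac) in enumerate(zip(names, fractions)):
        # The last set takes whatever is left, so no sentence is lost to rounding.
        end = len(unique) if i == len(names) - 1 else start + round(frac * len(unique))
        splits[name] = unique[start:end]
        start = end
    return splits


# Hypothetical toy data standing in for the real transcripts.
toy = [f"sentence number {i}" for i in range(20)]
for name, items in split_disjoint(toy).items():
    print(name, len(items))
```

The audio files would then follow their transcripts into the same set.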

Best regards and thank you again for your availability.

Jérémi

wasertech Jun 7, 2022
Collaborator

"Am I doing it wrong?"

Yes, but we have all made this mistake.

Let's say we do it as you propose: train an AM on set A, validate on set B and test on set C. (That's all good.)
The problem arises from the fact that you are giving the training, evaluation and testing answers to the scorer. Add hyper-parameter optimization on top of that and you get a scorer that is heavily biased toward your data sets. Show it new data and it won't generalize well.

"To my mind, since the acoustic model and the language model are not learning the same information from the samples (phonemes vs. sequences of words), there is no bias in giving the LM sentences from the trainSet and valSet of the AM."

Yes, but what does it all represent in the end? Sentences. If the AM and the LM have access to the exact same sentences in their respective data sets, the results you are getting are lying to you. In the real world (with real data) the models will never achieve the advertised metrics.

The network will "think" that every sentence it will ever see is already inside the scorer (because during training, evaluation and testing it is, if you mixed the data), so it just has to "copy the sentence" instead of learning the real distribution of the language. It will take a shortcut instead of learning. We call this effect overfitting.

The point of separating data is to have a reliable test to share and compare models by, not to see who is overfitting their data the most.
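
A toy illustration of that shortcut, not Coqui STT code: the sketch below measures how many word pairs (bigrams) of the test sentences a language model has already seen in its own text. The function names and the sentences are made up for the example, and bigram coverage is only a rough stand-in for what a real scorer does, but it shows how the number is inflated as soon as test sentences leak into the LM text.

```python
def bigrams(sentence):
    """Return the list of adjacent word pairs in a sentence."""
    words = sentence.lower().split()
    return list(zip(words, words[1:]))


def bigram_coverage(lm_sentences, test_sentences):
    """Fraction of test-set bigrams that also occur in the LM text."""
    known = {bg for s in lm_sentences for bg in bigrams(s)}
    test = [bg for s in test_sentences for bg in bigrams(s)]
    if not test:
        return 0.0
    return sum(bg in known for bg in test) / len(test)


# Hypothetical toy data.
independent_text = ["turn the light on", "open the front door"]
test_text = ["turn the heating on", "close the front door"]

clean = bigram_coverage(independent_text, test_text)
leaky = bigram_coverage(independent_text + test_text, test_text)

print(f"LM built without the test sentences: {clean:.2f}")
print(f"LM built with the test sentences:    {leaky:.2f}")
```

With the test sentences included, coverage is perfect by construction, which says nothing about how the model will cope with sentences it has never seen.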

NightmareJC Jun 9, 2022
Author

Thanks again for your answer.

To be perfectly clear, this is what I have:

A_audio + A_text
B_audio + B_text
C_audio + C_text
D_text (only)

I start by training my acoustic model with A (text & audio), validating with B (text & audio) and testing on C (text & audio). During this process I provide it with a scorer containing only D_text (I use the argument --scorer +scorer_path). If I understand correctly, there is no bias at this step, as the language model is not trained on text learned by the acoustic model.

Then, to evaluate our fine-tuned model in more depth, we run a new inference on the test dataset (C_audio + C_text), this time using a new language model trained on A_text, B_text and D_text. As there is no intersection between the evaluation content and what the language model knows, do you think our final results are still biased?
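
To make the "no intersection" claim checkable, here is a minimal sketch of how the text for the second language model could be assembled. It is only an illustration: `build_scorer_corpus` and the toy lists are hypothetical, and in practice each list would be read from the corresponding text file before the result is handed to the scorer tooling.

```python
def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so near-identical sentences compare equal."""
    return " ".join(sentence.lower().split())


def build_scorer_corpus(sources, held_out):
    """Merge LM text sources, dropping duplicates and any sentence found in the held-out test text.

    Returns (corpus, dropped), where dropped lists the sentences removed
    because they also appear in the held-out set.
    """
    held = {normalize(s) for s in held_out}
    seen, corpus, dropped = set(), [], []
    for source in sources:
        for sentence in source:
            key = normalize(sentence)
            if not key or key in seen:
                continue  # skip empty lines and repeats
            seen.add(key)
            if key in held:
                dropped.append(key)  # would leak test answers into the LM
            else:
                corpus.append(key)
    return corpus, dropped


# Hypothetical toy data standing in for A_text, B_text, D_text and C_text.
a_text = ["turn the light on"]
b_text = ["open the front door"]
d_text = ["close the window", "Turn the light on"]
c_text = ["close the window"]

corpus, dropped = build_scorer_corpus([a_text, b_text, d_text], held_out=c_text)
print("sentences kept for the LM:", corpus)
print("sentences dropped (found in C_text):", dropped)
```

If `dropped` is empty on the real data, the evaluation on C does not overlap with the language model text at the sentence level.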

Best regards
