Is my Scorer biased? #2204
Because it's cheating: if the language model and the acoustic model both have access to the exact same sentences, they will overfit to your data and generalize poorly in the real world. Ask yourself:
Keeping the data for the two models separate is the only way to make sure they cannot cheat like that, and it makes your test results more reliable. If you mix the data, nobody can trust the results you report: the models will score very well on your data because they have effectively memorized it, but the moment you give them data they have never seen, they will fail to transcribe it accurately.
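One way to enforce that separation in practice is to filter the scorer text before building the scorer. The following is only a minimal sketch, not part of the Coqui STT tooling: the helper names (`normalize`, `read_tsv_sentences`, `remove_overlap`) are hypothetical, and it assumes your TSV files have a header row with a `sentence` column, as Common Voice style files typically do.

```python
import csv
import io


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())


def read_tsv_sentences(handle, column="sentence"):
    """Yield the text column of a tab-separated file with a header row.

    The column name is an assumption; adjust it to match your own files.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        text = row.get(column)
        if text:
            yield text


def remove_overlap(scorer_lines, acoustic_sentences):
    """Keep only scorer lines that do not appear in the acoustic-model data."""
    seen = {normalize(s) for s in acoustic_sentences}
    return [line for line in scorer_lines if normalize(line) not in seen]


# Tiny in-memory stand-in for a file such as test.tsv.
demo_tsv = io.StringIO("path\tsentence\nclip1.mp3\tTurn left at the station\n")
demo_scorer_lines = ["turn left at the station", "Take the second exit"]

filtered = remove_overlap(demo_scorer_lines, read_tsv_sentences(demo_tsv))
print(filtered)
```

With real data you would open each TSV file with `open(path, encoding="utf-8", newline="")`, pass the handle to `read_tsv_sentences`, and write the filtered lines to the text file you feed to the scorer build. The comparison here is exact after normalization, so near-duplicates with different punctuation would still slip through.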
Hello everyone,
I’m using Coqui-STT for a specific domain with constrained vocabulary.
Therefore I’m trying to build my own scorer as shown in https://stt.readthedocs.io/en/latest/playbook/SCORER.html
On this page we can read:
"Preparing the text file
This is straightforward. In this example, we will use a file called indonesian-sentences.txt. This file should contain phrases that you wish to prioritize recognising. For example, you may want to recognise place names, digits or medical phrases - and you will include these phrases in the .txt file.
These phrases should not be copied from test.tsv, train.tsv or validated.tsv as you will bias the resultant model."
... I don’t understand why giving the scorer phrases from the train set or validation set will bias the model (I think I get why phrases from the test set will introduce bias), and I can’t find any clear explanation on the internet or in the Coqui documentation. Can you help me?
Best regards