As far as I can see, reported.tsv is not taken into account anywhere in the workflow:

- If a sentence gets reported, it is logged in reported.tsv.
- The sentence continues to be shown to users and keeps collecting new recordings.
- These recordings keep being shown in Listen/validation, where they can be validated or invalidated.
- CorporaCreator does not take reported.tsv into account: when the datasets are created, recordings with enough votes go into validated.tsv, so they may end up in the train/dev/test splits, and thus in training.
I examined 240 reported records in the v7.0 Turkish dataset and found the following: 22% were actually OK, 40% could be corrected, and only 38% were rightly rejected. Report reasons like "offensive language/slang" or "political" can be very subjective. When you analyze the rightfully reported ones, they are indeed cases of wrong grammar, OCR mistakes, or heavy use of foreign names.
Given the size of the whole dataset, I think losing a few wrongly reported sentences along the way is acceptable; letting incorrect sentences (grammar & spelling) into training is the worse outcome.
I would suggest the following:

- Add a command-line parameter -x to exclude reported.tsv. Exclusion should be the default, but one could disable it to keep the reported sentences.
- Remove all recordings of sentences listed in reported.tsv from the validated/train/dev/test sets.
- One might go further and add a "verified" field to reported.tsv, so that a dataset engineer reviews the reports manually and only the "verified" ones get removed.
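A minimal sketch of the proposed exclusion step, assuming both validated.tsv and reported.tsv carry a "sentence" column as in the per-locale Common Voice TSVs (the helper name `drop_reported` and the inline demo data are my own, purely illustrative):

```python
import csv  # the real TSVs would be read with csv.DictReader(f, delimiter="\t")

def drop_reported(validated_rows, reported_rows):
    """Filter out recordings whose sentence appears among the reported ones.

    Both arguments are iterables of dicts (one per TSV row); both are
    assumed to have a "sentence" column.
    """
    bad = {row["sentence"] for row in reported_rows}
    return [row for row in validated_rows if row["sentence"] not in bad]

# Hypothetical demo data standing in for the real TSV contents.
validated = [
    {"sentence": "Merhaba, nasılsınız?"},
    {"sentence": "Badly OCRed sentnce."},
]
reported = [{"sentence": "Badly OCRed sentnce."}]

kept = drop_reported(validated, reported)
# Only the unreported sentence survives into the split-building step.
```

The -x flag would simply gate this call: skip `drop_reported` when the user explicitly asks to keep reported sentences, apply it otherwise.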