Filter monolingual data based on fluency scores #789

gregtatum · 2024-08-07T17:02:25Z

The HPLT dataset includes a fluency score. We should look at filtering our own monolingual data by this fluency metric and see whether it improves translation quality.

https://hplt-project.org/datasets/v1.2

I assume this would be useful for synthesizing back translations, and less useful for synthesizing distillation data.

https://aclanthology.org/2024.lrec-main.100.pdf

fluency score, computed with a 7-gram modified Kneser-Ney character language model

eu9ene · 2024-08-07T17:27:23Z

After manually inspecting the data, I filtered HPLT data with fluency score thresholds of 0.8 and 0.9. I used 0.8 for distillation to keep more data, and 0.9 for back-translations, on the assumption that target sentences for back-translations should not include any noise, so that the models don't learn to reproduce it.

It would be interesting to try this model for other monolingual data.
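
For illustration, the threshold filtering described above could look roughly like the sketch below. This is not the pipeline's actual code. It assumes each record is a dict with a `text` field of newline-separated paragraphs and a `scores` field with one fluency score per paragraph; those field names, the `THRESHOLDS` mapping, and the `filter_by_fluency` helper are hypothetical, and treating the threshold as inclusive is also an assumption.

```python
from typing import Iterable, Iterator

# Thresholds mentioned in this thread; the mapping itself is hypothetical.
THRESHOLDS = {"distillation": 0.8, "back-translation": 0.9}


def filter_by_fluency(docs: Iterable[dict], threshold: float) -> Iterator[str]:
    """Yield paragraphs whose fluency score is at least `threshold`."""
    for doc in docs:
        # Assumed layout: "text" holds newline-separated paragraphs and
        # "scores" holds one fluency score per paragraph, in the same order.
        paragraphs = doc["text"].split("\n")
        for paragraph, score in zip(paragraphs, doc["scores"]):
            if score >= threshold:  # inclusive comparison is an assumption
                yield paragraph


if __name__ == "__main__":
    # Made-up sample record, only to show the call shape.
    sample = [{"text": "A clean sentence.\nsom3 n0isy txt", "scores": [0.93, 0.41]}]
    for task, threshold in THRESHOLDS.items():
        kept = list(filter_by_fluency(sample, threshold))
        print(task, len(kept))
```

Keeping the threshold as a per-task parameter makes it easy to use the looser cutoff for distillation and the stricter one for back-translation targets.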

gregtatum added the quality Improving robustness and translation quality label Aug 7, 2024

gregtatum mentioned this issue Aug 7, 2024

[meta] General translation quality improvements #216

Open
