After manually inspecting the data, I filtered the HPLT data at score thresholds of 0.8 and 0.9. I used 0.8 for distillation to keep more data, and 0.9 for back-translations, on the assumption that target sentences for back-translation should not include any noise, so that the models don't learn to reproduce it.
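For illustration, the two-threshold filtering could look roughly like the sketch below. This is not the actual pipeline code: the record layout (`text` and `score` keys), the helper name `filter_by_score`, and the sample sentences are all hypothetical. Only the 0.8 and 0.9 thresholds come from the comment above.

```python
from typing import Dict, Iterable, Iterator

# Thresholds taken from the comment above.
DISTILLATION_THRESHOLD = 0.8      # looser, keeps more data
BACK_TRANSLATION_THRESHOLD = 0.9  # stricter, target side should be clean


def filter_by_score(
    records: Iterable[Dict],
    threshold: float,
    score_key: str = "score",  # hypothetical field name
) -> Iterator[Dict]:
    """Yield only the records whose score is at or above the threshold."""
    for record in records:
        score = record.get(score_key)
        # Records without a score are dropped rather than assumed to be clean.
        if score is not None and score >= threshold:
            yield record


# Made-up sample records, for demonstration only.
sample = [
    {"text": "A clean, fluent sentence.", "score": 0.95},
    {"text": "somewhat noisy sentence", "score": 0.85},
    {"text": "buy now!!! click click", "score": 0.40},
]

distillation_data = list(filter_by_score(sample, DISTILLATION_THRESHOLD))
back_translation_data = list(filter_by_score(sample, BACK_TRANSLATION_THRESHOLD))
```

With this sample, the looser threshold keeps the first two records and the stricter one keeps only the first.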
It would be interesting to try this model on other monolingual data.
The HPLT dataset includes a fluency score. We should look at filtering our own data by this fluency metric and see whether that improves results.
https://hplt-project.org/datasets/v1.2
I assume this would be useful for synthesizing back translations, and less useful for synthesizing distillation data.
https://aclanthology.org/2024.lrec-main.100.pdf