Filter monolingual data based on fluency scores #789

Open
Tracked by #216
gregtatum opened this issue Aug 7, 2024 · 1 comment
Labels
quality Improving robustness and translation quality

Comments

@gregtatum
Member

The HPLT dataset includes a fluency score. We should look at filtering our own monolingual data by this fluency metric and see whether it improves translation quality.

https://hplt-project.org/datasets/v1.2

I assume this would be useful for synthesizing back translations, and less useful for synthesizing distillation data.

https://aclanthology.org/2024.lrec-main.100.pdf

fluency score, computed with a 7-gram modified Kneser-Ney character language model
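
To make "7-gram character language model" concrete, here is a minimal sketch of how a sentence breaks down into overlapping character 7-grams, the units such a model assigns probabilities to. The function name `char_ngrams` is hypothetical, and this is not the HPLT scoring code; it only illustrates the windowing, not the Kneser-Ney smoothing or how the final score is normalized.

```python
def char_ngrams(text: str, n: int = 7) -> list[str]:
    """Return all overlapping character n-grams of `text`.

    Illustrative helper only (hypothetical name); a real character
    language model would score each n-gram rather than just list them.
    """
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = char_ngrams("fluency score")
print(grams[0], "|", grams[-1])
```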

gregtatum added the quality label Aug 7, 2024
@eu9ene
Collaborator

eu9ene commented Aug 7, 2024

I filtered HPLT data at fluency score thresholds of 0.8 and 0.9, chosen after manual inspection of the data. I used 0.8 for distillation to keep more data, and 0.9 for back-translations, on the assumption that target sentences for back-translations should not include any noise so that the models don't learn to reproduce it.
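
A minimal sketch of that kind of threshold filtering, assuming each sentence already comes paired with a fluency score and that a higher score means more fluent text. The names `THRESHOLDS` and `filter_by_fluency` and the sample sentences are hypothetical and are not taken from the actual pipeline.

```python
from typing import Iterable, Iterator, Tuple

# Thresholds mentioned in this comment; the keys are hypothetical names.
THRESHOLDS = {"distillation": 0.8, "back_translation": 0.9}


def filter_by_fluency(
    scored: Iterable[Tuple[str, float]], use_case: str
) -> Iterator[str]:
    """Yield sentences whose fluency score meets the use-case threshold.

    Assumes higher scores mean more fluent text and that the
    threshold is inclusive; both are assumptions of this sketch.
    """
    threshold = THRESHOLDS[use_case]
    for sentence, score in scored:
        if score >= threshold:
            yield sentence


# Invented sample data, for illustration only.
sample = [("A clean sentence.", 0.95), ("Mostly fine text", 0.85), ("asdf qwer zxcv", 0.30)]
distillation_data = list(filter_by_fluency(sample, "distillation"))
back_translation_data = list(filter_by_fluency(sample, "back_translation"))
```

With the stricter threshold, the back-translation set is a subset of the distillation set, which matches the trade-off described: more data for distillation, cleaner targets for back-translation.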

It would be interesting to try this model for other monolingual data.
