Poor performance when translating ALL CAPS sentences #73

Closed
Tracked by #238
XapaJIaMnu opened this issue Feb 16, 2022 · 3 comments
Labels
quality Improving robustness and translation quality

Comments

@XapaJIaMnu
Contributor

Our models are trained mostly on data with proper capitalisation, but in the wild, people and websites sometimes write in ALL CAPS. Since our models have rarely seen all-caps words during training, they mostly copy them to the target instead of translating them. We could probably fix this with the --all-caps-every option:
https://github.com/marian-nmt/marian-dev/blob/601c9ac9807b5ffcbed298952435d9a17d954575/src/common/config_parser.cpp#L909

We should investigate what would be good values for that. Every 100? Every 75? Every 50?
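
To make the trade-off concrete, here is a minimal offline sketch of the idea behind the option: upper-case every N-th sentence pair so the model sees some all-caps text during training. This is an illustration only, not Marian's implementation. Marian applies the transformation on the fly while forming minibatches; the function name `all_caps_every`, the 1-based counting, and the choice to upper-case both source and target are assumptions made for this example.

```python
from typing import Iterable, Iterator, Tuple

SentencePair = Tuple[str, str]


def all_caps_every(pairs: Iterable[SentencePair], n: int) -> Iterator[SentencePair]:
    """Yield sentence pairs, upper-casing every n-th pair (hypothetical helper).

    Both sides are upper-cased so the model learns to map an all-caps
    source to an all-caps target. Counting starts at 1 (an assumption).
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    for index, (source, target) in enumerate(pairs, start=1):
        if index % n == 0:
            yield source.upper(), target.upper()
        else:
            yield source, target


if __name__ == "__main__":
    corpus = [
        ("hello", "hallo"),
        ("good day", "guten tag"),
        ("thanks", "danke"),
        ("bye", "ciao"),
    ]
    for source, target in all_caps_every(corpus, 2):
        print(f"{source}\t{target}")
```

Under this scheme, N = 100, 75 and 50 turn roughly 1%, 1.3% and 2% of the training lines into all caps, so the question is how much of the data can be spent on this before quality on normally cased text starts to suffer.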

@marco-c
Collaborator

marco-c commented Mar 29, 2024

@eu9ene should we close this now?

@eu9ene
Collaborator

eu9ene commented Apr 1, 2024

Let's wait until we ship some models in Nightly.

@eu9ene
Collaborator

eu9ene commented May 9, 2024

OK, it might take a while: we're focusing on general cleaning now, and we'll start a full retraining only once that's done. I think we should close some of these robustness issues, as we clearly saw performance improvements on the evaluation datasets.

@eu9ene eu9ene closed this as completed May 9, 2024