This repository has been archived by the owner on Sep 4, 2023. It is now read-only.

Please support for Catalan language #602

Closed
jordimas opened this issue Nov 30, 2022 · 10 comments
Labels
enhancement New feature or request language request

Comments

@jordimas

Hello

Please consider adding support for the Catalan language.

This repository has a large collection of open-source aligned parallel corpora that you can use to train your system:

https://github.com/Softcatala/parallel-catalan-corpus

If you need more help finding datasets, please let us know and we can help out.

@andrenatal
Contributor

Thanks @jordimas, this was helpful. I'll definitely reach out about it.

@andrenatal andrenatal added the enhancement New feature or request label Dec 1, 2022
@andrenatal
Contributor

Hello @jordimas, do you have cleaned monolingual datasets in Catalan that we could use?

@andrenatal
Contributor

I suppose we can use Common Voice's? There are 1161772 sentences there.

@jordimas
Author

My understanding is that you want monolingual datasets for back-translation. In that case, ideally the texts should not be part of the parallel corpus. In addition to Common Voice, which you mention, these corpora can also be helpful:

Let me know if you need more help
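
To illustrate the point about keeping back-translation data separate from the parallel corpus, here is a minimal sketch that drops monolingual sentences already present on the Catalan side of the parallel data. The function names `normalize` and `filter_monolingual` and the normalization rule (case folding plus whitespace collapsing) are assumptions made for this example, not part of any Mozilla pipeline.

```python
def normalize(sentence: str) -> str:
    """Case-fold and collapse whitespace so near-identical lines compare equal."""
    return " ".join(sentence.casefold().split())


def filter_monolingual(monolingual, parallel_side):
    """Return monolingual sentences that do not occur in the parallel corpus.

    `monolingual` and `parallel_side` are iterables of sentences in the same
    language. Empty lines and duplicates within `monolingual` are dropped too.
    """
    seen = {normalize(s) for s in parallel_side}
    kept = []
    for sentence in monolingual:
        key = normalize(sentence)
        if key and key not in seen:
            seen.add(key)  # also deduplicates the monolingual data itself
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    parallel_ca = ["BON DIA"]
    mono_ca = ["Bon dia", "bon  dia", "Adéu", ""]
    print(filter_monolingual(mono_ca, parallel_ca))
```

Exact matching after normalization is the simplest possible choice. A real pipeline might prefer fuzzy or hash-based deduplication for large corpora.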

@andrenatal
Contributor

andrenatal commented Feb 15, 2023 via email

@andrenatal
Contributor

andrenatal commented Apr 21, 2023

Hi @jordimas, we've trained Catalan to English using your corpora and merged support for it. The Nightly version of the extension containing it will be available to test tomorrow morning.

The model was also incorporated in the translations website: https://mozilla.github.io/translate/

Please test it and let us know what you think when you can.

Gràcies!

@jordimas
Author

jordimas commented Apr 21, 2023

Thanks for your work @andrenatal

The nightly link provided at https://github.com/mozilla/firefox-translations#nightly-builds under the text "Then install the extension by clicking here Firefox Translations - Install Nightly" gives a 404. However, I was able to test the web version.

It works reasonably well. My suggestion is that you compute BLEU scores against Flores200 or another reference corpus.

The biggest issue I found is that it does not translate upper-case sentences.

It's very easy to reproduce. An example:

LA FUNDACIÓ MOZILLA ÉS UNA ORGANITZACIÓ SENSE ÀNIM DE LUCRE (ca) ->
The FOUNDATION FOUNDATION IS A ORGANIZATION OF LUCRE (en)
(which is a broken translation)

The same text in lower case is properly translated:

La Fundació Mozilla és una organització sense ànim de lucre (ca) ->
The Mozilla Foundation is a non-profit organization

A wild guess: either you are not applying upper-case corpus augmentation during training, or your truecasing mechanism is not working properly.
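
As a rough illustration of what upper-case augmentation could look like, the sketch below appends an ALL CAPS copy of a random fraction of sentence pairs to the training data. The function name `augment_with_upper_case` and the `ratio` and `seed` parameters are hypothetical, chosen for this example only. It does not describe how the Firefox Translations training pipeline actually works.

```python
import random


def augment_with_upper_case(pairs, ratio=0.1, seed=0):
    """Return the original pairs plus upper-cased copies of a random subset.

    `pairs` is a list of (source, target) sentence tuples. Each pair is
    selected for augmentation with probability `ratio`.
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    augmented = list(pairs)
    for source, target in pairs:
        if rng.random() < ratio:
            # Upper-case both sides so the model sees consistent casing.
            augmented.append((source.upper(), target.upper()))
    return augmented


if __name__ == "__main__":
    corpus = [
        (
            "La Fundació Mozilla és una organització sense ànim de lucre",
            "The Mozilla Foundation is a non-profit organization",
        )
    ]
    for source, target in augment_with_upper_case(corpus, ratio=1.0):
        print(source, "->", target)
```

Upper-casing both sides teaches the model to produce ALL CAPS output for ALL CAPS input. Whether that is the desired behaviour is a design decision for the maintainers.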

Thanks again

@andrenatal
Contributor

andrenatal commented Apr 21, 2023 via email

@andrenatal
Contributor

andrenatal commented Apr 21, 2023 via email

@marco-c
Contributor

marco-c commented Jun 9, 2023

We have an open issue for the ALL CAPS problem: mozilla/translations#73.
I'll close this one since Catalan is now supported.

@marco-c marco-c closed this as completed Jun 9, 2023

4 participants