Best practice to train diacritics? #405

Open
IMerlin1009I opened this issue Nov 17, 2024 · 3 comments

Comments

@IMerlin1009I

I have been trying to fine-tune the German model (deu.traineddata) for diacritics because I have to OCR documents containing names with diacritics. The problem is that German only has umlauts (ä, ö and ü), and no matter what I try, the model won't learn other diacritics such as á or â.

Things that I tried:

  • First, I trained on examples taken from the documents themselves, but there are not many of them, so I assumed the data was simply insufficient.
  • Second, I took a name database and wrote a script that generates as many example names with diacritics as I like, but that does not seem to work either.
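The thread does not show the generator script, so the following is only a minimal sketch of what the second approach could look like. It embeds names in German carrier sentences so that umlauts and ß stay represented next to the new diacritics, and it normalizes every line to NFC so each accented letter is a single code point. The `NAMES` and `TEMPLATES` lists are hypothetical sample data, and whether this kind of mixing actually helps the fine-tuning is an assumption, not something confirmed in this issue.

```python
import random
import unicodedata

# Hypothetical sample names; replace with entries from your own name database.
NAMES = [
    "András Szabó",
    "Renée Lefèvre",
    "João Gonçalves",
    "Zoë Brontë",
    "Jürgen Groß",
    "Håkon Sæther",
]

# Hypothetical German carrier sentences, so names appear in realistic context.
TEMPLATES = [
    "Herr {name} wohnt in Köln.",
    "Die Rechnung geht an {name}.",
    "{name} wurde am 3. März geboren.",
    "Mit freundlichen Grüßen, {name}",
]


def make_lines(names, templates, count, seed=0):
    """Return `count` text lines, each a template filled with a random name."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    lines = []
    for _ in range(count):
        line = rng.choice(templates).format(name=rng.choice(names))
        # NFC composes base letter + combining mark into one code point.
        lines.append(unicodedata.normalize("NFC", line))
    return lines


def charset(lines):
    """Return the set of distinct characters used across all lines."""
    return set("".join(lines))


if __name__ == "__main__":
    sample = make_lines(NAMES, TEMPLATES, count=1000)
    for line in sample[:3]:
        print(line)
    # List the non-ASCII characters the training text actually covers.
    print("".join(sorted(c for c in charset(sample) if ord(c) > 127)))
```

Printing the covered non-ASCII characters is a quick way to check that every diacritic you care about really occurs in the generated text before rendering it into training images.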

So my question is: am I doing something wrong, and how should I approach fine-tuning German for diacritics?

Thanks in advance for any response, highly appreciated :)

@stweil
Collaborator

stweil commented Nov 18, 2024

Did you try the model script/Latin? It contains diacritics, so maybe there is no need for additional training.

@IMerlin1009I
Author

Sorry for the late response, I have been a little busy lately.

I tried Latin, but then there are other issues: it does not support 'ß' or 'ä', for example. I then tried combining German and Latin, which also did not yield the wanted results.

This leaves me with my initial question: is it possible to fine-tune German for diacritics, and if so, what would be the best practice?

@stweil
Collaborator

stweil commented Nov 26, 2024

script/Latin supports ß and umlauts. Maybe you confused it with the model for the Latin language.
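To make the distinction concrete: the script model is selected as `script/Latin`, while the Latin language model is selected as `lat`. Below is a minimal sketch of calling the Tesseract command line from Python with the script model. It assumes that the `tesseract` executable is on the PATH and that `Latin.traineddata` sits in a `script/` subfolder of the tessdata directory; the file name `scan.png` in the usage note is hypothetical.

```python
import shutil
import subprocess


def build_command(image_path, model="script/Latin"):
    """Build the tesseract call; "stdout" prints the text instead of writing a file."""
    return ["tesseract", image_path, "stdout", "-l", model]


def run_ocr(image_path, model="script/Latin"):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image_path, model),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Comparing `run_ocr("scan.png")` with `run_ocr("scan.png", model="lat")` on the same page should show whether the earlier test used the language model rather than the script model.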
