Best practice to train diacritics? #405

IMerlin1009I · 2024-11-17T23:20:47Z

I have been trying to fine-tune the German model (deu.traineddata) for diacritics because I have to OCR documents containing names with diacritics. The problem is that German only has umlauts (ä, ö and ü), and no matter what I try, the model won't learn other diacritics such as á or â.

Things that I tried:

First, I trained with examples taken from the documents themselves, but there are not many of them, so I assumed the data was simply insufficient.
Second, I used a name database and wrote a script that generates as many example names with diacritics as I like, but that does not seem to work either.
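
For illustration, the generation step could look roughly like the following. This is a minimal sketch, not the actual script: the `names` list, the function names and the line counts are all hypothetical, and the real name database would replace the sample entries. It also counts how often each accented character appears in the generated text, which is one way to check that the target diacritics are represented at all.

```python
import random
import unicodedata
from collections import Counter


def make_training_lines(names, n_lines, names_per_line=4, seed=0):
    """Build text lines by sampling names; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [
        " ".join(rng.choice(names) for _ in range(names_per_line))
        for _ in range(n_lines)
    ]


def diacritic_counts(lines):
    """Count characters that decompose into a base letter plus combining mark(s)."""
    counts = Counter()
    for line in lines:
        # Normalize to NFC first so each accented letter is a single code point.
        for ch in unicodedata.normalize("NFC", line):
            decomposed = unicodedata.normalize("NFD", ch)
            if any(unicodedata.combining(c) for c in decomposed):
                counts[ch] += 1
    return counts


# Hypothetical sample names; replace with entries from the real name database.
names = ["André", "Zoë", "Müller", "Strauß", "Gáspár", "Côté", "Schmidt"]
lines = make_training_lines(names, n_lines=200)
for ch, n in diacritic_counts(lines).most_common():
    print(ch, n)
```

Note that 'ß' is not reported, because it has no combining mark; the count only covers letters with accents or umlauts.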

So my question is: am I doing something wrong, or how should I approach fine-tuning German for diacritics?

Thanks in advance for any response, highly appreciated :)

stweil · 2024-11-18T06:20:26Z

Did you try the model script/Latin? It contains diacritics, so maybe there is no need for additional training.

IMerlin1009I · 2024-11-25T23:41:18Z

Sorry for the late response, I have been a little busy lately.

I tried Latin, but then there are other issues: it does not support 'ß' or 'ä', for example. I then tried combining German and Latin, which also did not yield the desired results.

That leaves me with my initial question: is it possible to fine-tune German for diacritics, and if so, what would be the best practice?

stweil · 2024-11-26T06:33:36Z

script/Latin supports ß and umlauts. Maybe you confused it with the Latin language.
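
A quick way to test this is to run Tesseract with `-l script/Latin` on one of the documents. The snippet below is only a sketch under a few assumptions: the `tesseract` binary is on the PATH, the model is installed as `script/Latin.traineddata` inside the tessdata directory, and `page.png` is a placeholder file name. The helper names are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image_path, lang="script/Latin"):
    """Return the tesseract CLI call that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def recognize(image_path, lang="script/Latin"):
    """Run tesseract and return its text output, or None if it cannot run here."""
    if shutil.which("tesseract") is None or not Path(image_path).is_file():
        return None
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# "page.png" is a placeholder; point this at a real scan.
text = recognize("page.png")
print(text if text is not None else "tesseract or the input image is not available")
```

Passing `lang="lat"` instead would select the Latin language model, which is a different model from script/Latin.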
