Some characters missing from spa.training_text make Tesseract fail to recognize them #137

Open: diegodlh wants to merge 1 commit into base: main

diegodlh

When I run unicharset_extractor on the Spanish langdata, it warns that capital "Ñ", capital "É" and "«" are absent from the training text (while their counterparts, "ñ", "é" and "»", are present). As a result, Tesseract fails to recognize these characters with --oem 0 (for example, it recognizes "Ñ" as "NN" and "É" as "EI").
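The kind of gap described above can be spotted before training with a small script. The following is only a minimal sketch: it does not reproduce how unicharset_extractor produces its warnings, and the function name, the punctuation pairs and the file path in the usage comment are assumptions made for illustration.

```python
import unicodedata

# Hypothetical list of punctuation marks that normally come in pairs.
PUNCTUATION_PAIRS = [("«", "»"), ("“", "”"), ("¿", "?"), ("¡", "!")]


def missing_counterparts(text):
    """Return characters whose case or punctuation counterpart is in text but which are absent themselves."""
    # Normalize so that "É" is one code point rather than "E" plus a combining accent.
    present = set(unicodedata.normalize("NFC", text))
    missing = set()
    for ch in present:
        for other in (ch.upper(), ch.lower()):
            # Skip multi-character case mappings such as "ß" -> "SS".
            if len(other) == 1 and other != ch and other not in present:
                missing.add(other)
    for opening, closing in PUNCTUATION_PAIRS:
        if opening in present and closing not in present:
            missing.add(closing)
        if closing in present and opening not in present:
            missing.add(opening)
    return missing


# Usage (the path is an assumption about the local checkout):
# with open("spa/spa.training_text", encoding="utf-8") as f:
#     print(sorted(missing_counterparts(f.read())))
```

On a real training text this will probably also list rarely capitalized letters, so the output is a list of candidates to review rather than a list of errors.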
I'm a beginner at Tesseract training and I'm not sure how these training_text files are generated. They seem to be a more or less random set of words and short phrases, so it occurred to me that I could simply make some replacements to cover the missing characters: España -> ESPAÑA, años -> AÑOS, también -> TAMBIÉN, México -> MÉXICO. I also replaced half of the occurrences of "»" with "«".
If my assumption that this file is mostly random is correct, please consider pulling this commit into master. Thank you.
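For illustration, the replacements listed above could be scripted roughly as follows. This is a sketch under assumptions, not the content of the actual commit: it swaps every other occurrence of each word (so the lowercase forms also stay in the text) and every other "»", and the helper names are hypothetical.

```python
import re

# Word replacements taken from the description above.
WORD_REPLACEMENTS = {
    "España": "ESPAÑA",
    "años": "AÑOS",
    "también": "TAMBIÉN",
    "México": "MÉXICO",
}


def replace_every_other(text, old, new):
    """Replace the 1st, 3rd, 5th, ... occurrence of old with new."""
    seen = 0

    def swap(match):
        nonlocal seen
        seen += 1
        # Odd-numbered matches are replaced; even-numbered ones are kept as they are.
        return new if seen % 2 == 1 else match.group(0)

    return re.sub(re.escape(old), swap, text)


def cover_missing_characters(text):
    """Apply the word replacements and turn half of the "»" marks into "«"."""
    for old, new in WORD_REPLACEMENTS.items():
        text = replace_every_other(text, old, new)
    return replace_every_other(text, "»", "«")
```

Replacing only every other occurrence is a design choice made here so that "ñ", "é" and "»" remain covered even if they only appear in the replaced words.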

@Shreeshrii
Contributor

Thank you. This training text file is suitable for tesseract 3.0x (base tesseract). For 4.0 and lstm training please see the langdata_lstm repo.

@diegodlh
Author

Indeed, I retried tesstrain.sh with langdata_lstm, and its training_text file is so long that this time unicharset_extractor did not complain about missing characters. Still, since users may be using langdata to train their Tesseract 3.0x engine (or Tesseract 4.0 with --oem 0, as I understand it), I think it would be useful to merge my commit into plain langdata's master branch. Thanks!
