Some characters missing in spa.training_text makes Tesseract fail recognizing them #137

diegodlh · 2019-01-22T21:29:55Z

When running unicharset_extractor on the Spanish langdata, it warns that capital "Ñ", capital "É" and "«" are absent from the training text (while their counterparts, "ñ", "é" and "»", are present). As a result, Tesseract fails to recognize these characters with --oem 0 (for example, it recognizes "Ñ" as "NN" and "É" as "EI").
I'm a beginner at Tesseract training and I'm not sure how these training_text files are generated. They seem to me to be a more or less random set of words and short phrases. It occurred to me that I could simply make some replacements to cover the missing characters: España -> ESPAÑA, años -> AÑOS, también -> TAMBIÉN, México -> MÉXICO. I also replaced half of the occurrences of "»" with "«".
If my assumption that this file is mostly random is correct, please consider pulling this commit into master. Thank you.
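
For anyone who wants to check a training text for this kind of gap before running the training tools, here is a minimal sketch in Python. It is independent of unicharset_extractor and does not reproduce its logic. The function names `missing_characters` and `missing_case_counterparts` are hypothetical helpers made up for illustration, and the sample string is only a stand-in for the real spa.training_text content. The text is normalized to NFC on the assumption that accented letters should be compared as single precomposed code points.

```python
import unicodedata


def missing_characters(text, required):
    """Return the characters of `required` that never occur in `text`.

    Both inputs are normalized to NFC first, so that a letter written as
    base character plus combining accent still matches its precomposed form.
    """
    present = set(unicodedata.normalize("NFC", text))
    wanted = unicodedata.normalize("NFC", required)
    return [ch for ch in wanted if ch not in present]


def missing_case_counterparts(text):
    """Return single characters whose other-case form occurs in `text`
    while they themselves are absent (for example "É" when only "é" occurs).
    """
    present = set(unicodedata.normalize("NFC", text))
    missing = set()
    for ch in present:
        for other in (ch.upper(), ch.lower()):
            # Skip multi-character case mappings such as "ß" -> "SS".
            if len(other) == 1 and other != ch and other not in present:
                missing.add(other)
    return sorted(missing)


# Stand-in sample, not the actual training text.
sample = "España años también México »"
print(missing_characters(sample, "ÑÉ«"))
```

To run the same check against a real file, read it with `open(path, encoding="utf-8").read()` and pass the result as `text`. The case check will not catch "«", since quotation marks have no case mapping, so paired punctuation has to be listed explicitly in `required`.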

Shreeshrii · 2019-01-23T04:25:22Z

Thank you. This training text file is suitable for tesseract 3.0x (base tesseract). For 4.0 and LSTM training, please see the langdata_lstm repo.

diegodlh · 2019-01-23T19:30:24Z

Indeed, I retried tesstrain.sh with langdata_lstm, and its training_text file is so long that this time unicharset_extractor did not complain about missing characters. Still, since users may keep using langdata to train their tesseract 3.0x engine (or tesseract 4.0 with --oem 0, as I understand it), I think it would be useful to merge my commit into plain langdata's master branch. Thanks!

Made some replacements to include missing characters: capital "Ñ", ca…

f69e748

…pital "É" and "«"
