Tesseract fails to detect letters Å and å in Finnish language. #31
Comments
See the list of known characters (unicharset). The data for I am moving the issue to langdata_lstm.
Yes, what should I do to make it happen? Should I collect some data and box it with some tool? Where can I get the current data? I cannot see any images at https://github.com/tesseract-ocr/langdata_lstm/tree/master/fin. I guess training is done on synthetic text generated from those files? How many examples of å and Å should there be? Does anything else need to be modified, or just the desired characters in training_text and singles_text? (Are there any rules for how exactly?)
Also, the letters Q and q are missing from the data. At a minimum, all of the letters abcdefghijklmnopqrstuvwxyzåäö should be there. I checked through the characters, and only Åå and Qq are missing. Is it enough to modify fin.training_text so that it contains N instances of the missing letters, or do I need to modify something else?
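For anyone who wants to repeat the check above, here is a minimal sketch of how it could be automated. It is not part of the Tesseract tooling: the function names and the `min_count` parameter are made up for illustration, and the text is assumed to be a plain UTF-8 string (for example, the contents of fin.training_text read from a local checkout). The text is normalized to NFC first, since å may also be stored as "a" plus a combining ring.

```python
import unicodedata
from collections import Counter

# Letters the comment above expects to find in the Finnish data.
EXPECTED = "abcdefghijklmnopqrstuvwxyzåäö"


def letter_coverage(text, alphabet=EXPECTED):
    """Count occurrences of each lower- and upper-case letter of `alphabet`."""
    # NFC turns "a" + combining ring (U+030A) into the single code point "å".
    normalized = unicodedata.normalize("NFC", text)
    counts = Counter(normalized)
    letters = alphabet + alphabet.upper()
    return {ch: counts[ch] for ch in letters}


def missing_letters(text, alphabet=EXPECTED, min_count=1):
    """Return the letters that occur fewer than `min_count` times (hypothetical threshold)."""
    coverage = letter_coverage(text, alphabet)
    return [ch for ch, n in coverage.items() if n < min_count]


if __name__ == "__main__":
    # Stand-in sample; in practice, pass in the training text read from disk.
    sample = "abcdefghijklmnoprstuvwxyzäö ABCDEFGHIJKLMNOPRSTUVWXYZÄÖ"
    print(missing_letters(sample))
```

Raising `min_count` gives a rough way to spot letters that are present but rare. How many examples are actually needed for training is not answered in this thread.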
I'd add all desired characters to To fix the problem, we still have to run new training ...
I am testing Tesseract on Finnish texts containing the "Swedish o" (å). It does not seem to detect Å and å correctly. I have also tried the fin+swe model, but the fin model's version of the text is usually the one selected.
Are the previous training files available somewhere? The training data probably does not contain enough Åå cases, or the letter is not included at all even though it is an official letter.