Tesseract fails to detect letters Å and å in Finnish language. #31

Open
jmokoistinen opened this issue Nov 13, 2019 · 4 comments

@jmokoistinen

I am testing Tesseract on Finnish texts containing the "Swedish o" (å). It does not seem to recognize Å and å correctly. I have also tried the combined fin+swe model, but the fin model's version of the text is usually the one selected.

Are the previous training files available somewhere? The training data probably does not contain enough Å/å cases, or the letter is not included at all, even though it is an official letter.

@stweil
Member

stweil commented Dec 17, 2019

See the list of known characters (unicharset). The data for fin in langdata_lstm needs to be fixed. Do you want to send a fix (pull request)?

I am moving this issue to langdata_lstm.

@stweil stweil transferred this issue from tesseract-ocr/tessdata_best Dec 17, 2019
@jmokoistinen
Author

jmokoistinen commented Feb 12, 2020

Yes, what should I do to make it happen? Should I collect some data and box it with some tool? Where can I get the current data? I cannot see any images here: https://github.com/tesseract-ocr/langdata_lstm/tree/master/fin

I guess the training is done with synthetic text generated from those files? How many examples of å and Å should there be? Does anything else need to be modified, or just the training_text, singles_text, and desired characters? (Are there any rules on how exactly?)
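
One way to get a feel for how well å and Å are represented is to count them in the training text. The following is only a rough sketch: the `count_characters` helper and the sample string are hypothetical, and in practice the text would be read from fin.training_text.

```python
from collections import Counter


def count_characters(text: str, targets: str) -> dict:
    """Return how many times each character in `targets` occurs in `text`."""
    counts = Counter(text)
    # Counter returns 0 for characters that never occur.
    return {ch: counts[ch] for ch in targets}


# Hypothetical sample; a real check would read fin.training_text instead.
sample = "Åbo on Turun ruotsinkielinen nimi. Ångström, måndag, quiz."
print(count_characters(sample, "ÅåQq"))
```

A character with a count of zero, or a very low count, is a candidate for extra training lines.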

@jmokoistinen
Author

jmokoistinen commented Mar 2, 2020

The letters Q and q also seem to be missing from the data. At a minimum, all of these characters should be present:
abcdefghijklmnopqrstuvwxyzåäö
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
1234567890
How can these be fixed?

I checked through the characters; only Åå and Qq are missing. Is it enough to modify fin.training_text so that it contains N instances of each missing letter, or do I need to modify something else?
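
A check like the one described above could be sketched roughly as follows. It assumes the unicharset file has the entry count on its first line and that every later line starts with the symbol followed by whitespace-separated properties. The sample lines are made up for illustration and are not the real fin.unicharset, and the helper names are hypothetical.

```python
EXPECTED = (
    "abcdefghijklmnopqrstuvwxyzåäö"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"
    "1234567890"
)


def unicharset_symbols(lines):
    """Collect the first whitespace-separated field of each entry line.

    Assumption: line 0 is the entry count, and each later line begins
    with the symbol itself.
    """
    symbols = set()
    for line in lines[1:]:
        fields = line.split()
        if fields:
            symbols.add(fields[0])
    return symbols


def missing_characters(lines, expected=EXPECTED):
    """Return the expected characters that have no unicharset entry."""
    present = unicharset_symbols(lines)
    return [ch for ch in expected if ch not in present]


# Made-up excerpt for illustration only; real entries carry more fields.
sample = [
    "4",
    "NULL 0 Common 0",
    "a 3 Latin 1",
    "ä 3 Latin 2",
    "1 8 Common 3",
]
print(missing_characters(sample, expected="aåä1q"))
```

With a real file, the list passed in would come from something like `open(path, encoding="utf-8").read().splitlines()`.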

@stweil
Member

stweil commented Mar 2, 2020

I'd add all desired characters to desired_characters, ideally sorted with LANG=C.UTF-8 sort. Then we at least have a list of those characters and can try to find training texts which include them sufficiently often.

To fix the problem, we still have to run new training ...
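
The desired_characters update suggested above could be sketched like this. It assumes the file holds one character per line, which is an assumption about the format rather than something confirmed here. Python's `sorted` orders strings by code point, which gives the same order as byte-wise sorting of UTF-8 text under `LANG=C.UTF-8 sort`. The `merge_desired_characters` helper is hypothetical.

```python
def merge_desired_characters(existing_lines, new_chars):
    """Return existing entries plus new characters, sorted by code point.

    Assumption: the file lists one character per line. Blank lines are
    dropped and duplicates are removed.
    """
    entries = {line.rstrip("\n") for line in existing_lines if line.strip()}
    entries.update(new_chars)
    return sorted(entries)


# Made-up existing content, not the real desired_characters file.
existing = ["a\n", "ä\n", "Z\n"]
merged = merge_desired_characters(existing, "ÅåQq")
print("\n".join(merged))
```

This only maintains the character list. As noted above, the model still has to be retrained before the new characters are recognized.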
