Missing many special characters in desired_characters file (Swedish) #4
Comments
From tesseract-ocr/tesseract#2075:
Only symbols included in swe.unicharset will be detected during OCR. If a symbol is missing, it can be added by fine-tuning. Adding symbols to the |
The |
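To see which symbols a model cannot currently produce, one could compare a list of wanted characters against the unicharset. The snippet below is only a minimal sketch: it assumes the unicharset has already been extracted as plain text, that its first line holds the entry count, and that every following line begins with the symbol followed by whitespace-separated properties. The function names and the sample data are hypothetical, not part of Tesseract.

```python
def unicharset_symbols(unicharset_text):
    """Return the set of symbols listed in the text of a unicharset file."""
    symbols = set()
    # Assumption: line 1 is the number of entries, so it is skipped.
    for line in unicharset_text.splitlines()[1:]:
        fields = line.split()
        if fields:
            # Assumption: the first field of each entry is the symbol itself.
            symbols.add(fields[0])
    return symbols


def missing_symbols(unicharset_text, wanted):
    """Return the wanted characters that have no entry in the unicharset."""
    present = unicharset_symbols(unicharset_text)
    return [ch for ch in wanted if ch not in present]


# Made-up sample data, not taken from the real swe.unicharset.
sample = "4\nNULL 0 Common 0\na 3 Latin 1\nö 3 Latin 2\n. 10 Common 3\n"
print(missing_symbols(sample, "aö@§"))  # → ['@', '§']
```

Any character reported here would need to be added through fine-tuning or retraining before OCR can output it.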
@amitdo should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ? Is there any easier way? A training GUI for tesseract 4? |
That is supposed to be the way...
I don't know. |
The current Danish traineddata has the same issue. Really, Danish should be exactly the same as Swedish except for ö->ø and ä->æ (I see that '@' specifically was added to desired_characters recently, but no new training data was generated). |
@poizan42, I suggest creating a pull request that adds the missing characters to the list of desired characters. You can try the script/Latin model, which should already support all Danish characters, or you could enhance the existing dan.traineddata, either by fine-tuning (see link above) or by using tesstrain. I prefer tesstrain because I found it easier to use. |
I merged that PR now, thanks. Please note that we cannot expect new training done by Google, so it is up to the Open Source community (= you, me, ...) to use the fixed information and train new models. |
The desired_characters file does not contain many of the important special characters, such as "@".
All special characters used in English are also important for Swedish.
Legal documents contain the section sign (§). Please add this as well.