Missing many special characters in desired_characters file (Swedish) #4

aslamy · 2018-11-24T09:38:54Z

The desired_characters file does not contain many important special characters, such as "@".
All special characters used in English are also important for Swedish.
Legal documents contain the section sign character (§). Please add this as well.

stweil · 2018-11-26T05:34:13Z

From tesseract-ocr/tesseract#2075:

It's also possible to use script/Latin for Swedish. That should contain all characters.
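As a rough illustration of that suggestion, here is a minimal Python sketch that builds a `tesseract` command line selecting the `script/Latin` model through the `-l` option. It assumes the `tesseract` executable is on the PATH and that `script/Latin.traineddata` is installed under the tessdata directory; the helper names and the file names `page.png` and `page` are hypothetical.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="script/Latin"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="script/Latin"):
    """Run tesseract; raises CalledProcessError if the run fails."""
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
```

Calling `run_ocr("page.png", "page")` would write the recognized text to `page.txt`, provided the model is installed.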

stweil · 2018-11-26T05:40:23Z

Only symbols included in swe.unicharset will be detected during OCR. If a symbol is missing, it can be added by fine-tuning the model.
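To check which symbols a model can actually output, one can compare a list of wanted characters against the unicharset. The sketch below is only an illustration under stated assumptions: it assumes the unicharset has already been extracted as text (for example with `combine_tessdata -u swe.traineddata swe.`), and that the file layout is a count on the first line followed by one entry per line, where the symbol is the first space-separated field. The function names are hypothetical.

```python
def unicharset_symbols(unicharset_text):
    """Return the set of symbols listed in unicharset-formatted text.

    Assumed layout: the first line holds the entry count; every later
    line starts with the symbol, followed by space-separated properties.
    """
    symbols = set()
    for line in unicharset_text.splitlines()[1:]:
        if line.strip():
            symbols.add(line.split(" ", 1)[0])
    return symbols


def missing_symbols(unicharset_text, wanted):
    """Return the wanted characters that are absent from the unicharset."""
    present = unicharset_symbols(unicharset_text)
    return [ch for ch in wanted if ch not in present]
```

Any character reported by `missing_symbols` (for example "@" or "§") would need to be added through fine-tuning before the model can recognize it.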

Adding symbols to the desired_characters files helps future training runs, so those symbols won't be missed then, but it does not change existing models.

amitdo · 2018-11-26T12:45:12Z

The desired_characters file is used for the training done by Google. The Tesseract training tools available at https://github.com/tesseract-ocr/tesseract do not use it.

Kalle12345 · 2018-11-27T15:25:05Z

@amitdo should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ? Is there any easier way? A training GUI for tesseract 4?

amitdo · 2018-11-28T12:34:07Z

> should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ?

That is supposed to be the way, but it's not so easy.

> Is there any easier way? A training GUI for tesseract 4?

I don't know.

poizan42 · 2020-01-10T12:02:52Z

The current Danish traineddata has the same issue. Really, Danish should be exactly the same as Swedish except for ö->ø and ä->æ. (I see that '@' specifically was added to desired_characters recently, but no new training data has been generated.)

stweil · 2020-01-10T12:16:25Z

@poizan42, I suggest creating a pull request that adds the missing characters to the list of desired characters.

You can try the script/Latin model which should already support all Danish characters, or you could enhance the existing dan.traineddata, either by fine-tuning (see link above) or by using tesstrain. I prefer tesstrain because I found it easier to use.

poizan42 · 2020-01-12T11:49:48Z

@stweil, I have created a PR in #34

stweil · 2020-01-12T12:15:37Z

I merged that PR now, thanks. Please note that we cannot expect Google to do any new training, so it is up to the Open Source community (= you, me, ...) to use the fixed information and train new models.

stweil mentioned this issue Nov 26, 2018

Missing special characters in desired_characters file (Swedish) tesseract-ocr/tesseract#2075

Closed
