Recognize only complete dictionary words only #297

dong77 · 2016-04-12T07:57:39Z

A related stackoverflow question is here: http://stackoverflow.com/questions/20599768/tesseract-ocr-recognize-complete-dictionary-words-only.

Basically what I want to achieve is to ask Tesseract to recognize only complete words included in my custom dictionary (lang: chi_sim), or to find the best match.

Following the instructions in https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages, I applied a config file with the following content:

load_system_dawg     F
load_freq_dawg       F
user_words_file      /path/to/my/dictonary.user-words

But this doesn't seem to work. When I ask Tesseract to recognize the word in the attached image (not reproduced here) with

$ tesseract /path/to/the/above/image.jpg stdout -l chi_sim /path/to/my/config_file

it gives me 硝酸嘛庸喹瓢膏 which is not in the dictionary at all. The best match is supposed to be 硝酸咪康唑乳膏 which is included in the dictionary.

I searched around and couldn't find a solution. Please help me out. Thank you.


amitdo · 2016-04-12T10:41:52Z

> Basically what I want to achieve is to ask Tesseract to recognize only complete words included in my custom dictionary (lang: chi_sim)

Tesseract can't do this. The dictionaries are just a hint for Tesseract.

tfmorris · 2016-04-12T16:40:35Z

You could try playing with some of the dictionary-related parameters to see if you can achieve the results that you want:

$ tesseract --print-parameters | grep dic

In particular, these two look like they might have promise:

language_model_penalty_non_freq_dict_word   0.1 Penalty for words not in the frequent word dictionary
language_model_penalty_non_dict_word    0.15    Penalty for non-dictionary words
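For anyone who wants to experiment with these, here is a minimal Python sketch that builds a command line overriding both parameters through Tesseract's `-c configvar=value` option. The helper name `build_tesseract_command` and the value 0.5 are illustrative assumptions, not recommended settings, and a higher penalty is not guaranteed to change the result for a given image.

```python
import shlex

def build_tesseract_command(image_path, lang, overrides):
    """Build a tesseract argv list with one '-c name=value' pair per override."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    for name, value in overrides.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd

# Illustrative values only: raise both penalties above their defaults (0.1 and 0.15).
cmd = build_tesseract_command(
    "/path/to/the/above/image.jpg",
    "chi_sim",
    {
        "language_model_penalty_non_freq_dict_word": 0.5,
        "language_model_penalty_non_dict_word": 0.5,
    },
)

# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where Tesseract and the chi_sim data are installed.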

dong77 · 2016-04-13T05:09:20Z

Thank you, @amitdo and @tfmorris. I tried both language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word but had no luck.

zdenop · 2016-04-17T16:31:50Z

AFAIK it is not possible within Tesseract.
To some extent you can implement it yourself. For example, in a first stage you run OCR with Tesseract, and in a second stage you correct the recognized text with another tool (e.g. a spellchecker with a custom dictionary).
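A minimal sketch of such a second stage, assuming the custom dictionary is available as a Python list and using the standard library's `difflib` as a stand-in for a real spellchecker. The function name `snap_to_dictionary`, the two extra dictionary entries, and the cutoff of 0.3 are assumptions made for illustration; only the OCR output and the expected match come from the report above.

```python
import difflib

def snap_to_dictionary(ocr_text, dictionary, cutoff=0.3):
    """Return the dictionary entry most similar to ocr_text, or None if none reaches cutoff."""
    matches = difflib.get_close_matches(ocr_text, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical dictionary; only the first entry appears in the original report.
dictionary = ["硝酸咪康唑乳膏", "阿莫西林胶囊", "布洛芬缓释胶囊"]

# Raw Tesseract output quoted in the report.
ocr_output = "硝酸嘛庸喹瓢膏"

print(snap_to_dictionary(ocr_output, dictionary))
```

The cutoff is lower than `difflib`'s default of 0.6 because only three of the seven characters in the OCR output match the intended entry. With a large dictionary or many near-identical entries, a similarity measure that accounts for visually confusable characters would likely work better than plain sequence matching.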

jeremycurrygit · 2018-05-23T07:09:16Z

@dong77 Did you find any way to solve this problem?

amitdo added the question label May 27, 2016

amitdo closed this as completed May 27, 2016

wosiu mentioned this issue May 30, 2017

user pattern/dict does not work at all #960

Closed

Shreeshrii mentioned this issue Mar 21, 2019

trying to add user words/patterns again: #2328

Merged
