Recognize only complete dictionary words only #297

Closed
dong77 opened this issue Apr 12, 2016 · 5 comments

@dong77

dong77 commented Apr 12, 2016

A related Stack Overflow question is here: http://stackoverflow.com/questions/20599768/tesseract-ocr-recognize-complete-dictionary-words-only.

Basically what I want to achieve is to ask Tesseract to recognize only complete words included in my custom dictionary (lang: chi_sim), or to find the best match.

Following the instructions in https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages, I applied a config file with the following content:

load_system_dawg     F
load_freq_dawg       F
user_words_file      /path/to/my/dictonary.user-words

But this doesn't seem to work: when I ask Tesseract to recognize the word in this image

[image not preserved]

$ tesseract /path/to/the/above/image.jpg stdout -l chi_sim /path/to/my/config_file

it gives me 硝酸嘛庸喹瓢膏, which is not in the dictionary at all. The best match is supposed to be 硝酸咪康唑乳膏, which is included in the dictionary.

I searched around and couldn't find a solution. Please help me out. Thank you.

@amitdo
Collaborator

amitdo commented Apr 12, 2016

> Basically what I want to achieve is to ask Tesseract to recognize only complete words included in my custom dictionary (lang: chi_sim)

Tesseract can't do this. The dictionaries are just a hint for Tesseract.

@tfmorris
Contributor

You could try playing with some of the dictionary related parameters to see if you can achieve the results that you want:

$ tesseract --print-parameters | grep dic

In particular, these two look like they might have promise:

language_model_penalty_non_freq_dict_word   0.1     Penalty for words not in the frequent word dictionary
language_model_penalty_non_dict_word        0.15    Penalty for non-dictionary words

@dong77
Author

dong77 commented Apr 13, 2016

Thank you, @amitdo and @tfmorris. I tried both language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word but had no luck.

@zdenop
Contributor

zdenop commented Apr 17, 2016

AFAIK this is not possible within Tesseract.
To some extent you can implement it yourself: first run OCR with Tesseract, then correct the recognized text with another tool (e.g. a spellchecker with a custom dictionary).
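
The second stage suggested here could look roughly like the following. This is only a minimal sketch, not something Tesseract provides: it uses Python's standard `difflib` module to pick the dictionary entry closest to the OCR output. The function name `best_dictionary_match`, the `cutoff` value of 0.3, and the two extra dictionary entries are assumptions made for illustration; a real dictionary would be loaded from the user-words file.

```python
import difflib


def best_dictionary_match(ocr_text, dictionary, cutoff=0.3):
    """Return the dictionary entry most similar to ocr_text, or None.

    Similarity is difflib's SequenceMatcher ratio (0.0 to 1.0).
    The cutoff of 0.3 is an illustrative guess, not a tuned value.
    """
    matches = difflib.get_close_matches(ocr_text, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical stand-in for the contents of a custom user-words file.
dictionary = ["硝酸咪康唑乳膏", "硝酸甘油片", "阿莫西林胶囊"]

# The raw OCR output reported earlier in this thread.
print(best_dictionary_match("硝酸嘛庸喹瓢膏", dictionary))
```

With this small dictionary the sketch selects 硝酸咪康唑乳膏, since it shares the most characters in order with the OCR output. The cutoff matters: `difflib`'s default of 0.6 would reject that match, while a very low cutoff may map unrelated text onto a dictionary word, so the value would need tuning on real data.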

@jeremycurrygit

@dong77, did you find any way to solve this problem?
