Change Tesseract output with words coming from an external dictionary #2391

Closed
davideromano opened this issue Apr 16, 2019 · 21 comments

Comments

@davideromano

davideromano commented Apr 16, 2019

Environment

Current Behavior:

I am working with the following image, which shows the word "rosanna":

pic

I downloaded the Italian and English tessdata available in the GitHub repository (Link).
In the user-words file I wrote the expected word rosanna; this file is passed to Tesseract as an external dictionary.
The goal is to read the image and force Tesseract to correct its output by choosing only from a limited set of words (the external dictionary given as input).

I tried the following experiments:

  1. Tesseract with Italian language, without external dictionary: OCR read 7o0s0ana (0% confidence)
    Used command: tesseract pic.png out.txt -l ita --psm 8 --oem 1 tsv
  2. Tesseract with Italian language, with external dictionary: OCR read 7o0s0ana (0% confidence)
    Used command: tesseract pic.png out.txt -l ita --psm 8 --oem 1 --user-words /usr/local/share/tessdata/ita.user-words tsv
  3. Tesseract with English language, without external dictionary: OCR read rosoana (0% confidence)
    Used command: tesseract pic.png out.txt -l eng --psm 8 --oem 1 tsv
  4. Tesseract with English language, with external dictionary: OCR read rosoana (0% confidence)
    Used command: tesseract pic.png out.txt -l eng --psm 8 --oem 1 --user-words /usr/local/share/tessdata/eng.user-words tsv

As you can see, Tesseract did not correct what it read. I then also added the word rosonna to the external dictionary. Here are the results:

  1. Tesseract with Italian language, with external dictionary: OCR read 7o0s0ana (0% confidence)
  2. Tesseract with English language, with external dictionary: OCR read rosonna (41% confidence)

Then I also added the word rosoana to the external dictionary. Here are the results:

  1. Tesseract with Italian language, with external dictionary: OCR read 7o0s0ana (0% confidence)
  2. Tesseract with English language, with external dictionary: OCR read rosoana (46% confidence)

It seems that Tesseract uses the external dictionary only as a hint when reading the image. In the first round it read “rosoana” with a confidence of 0%, and it probably did not produce “rosanna” because that word is too far from what it read. After the words “rosonna” and “rosoana” were added, the confidence gradually increased.

Questions:

1) Is there a way to force Tesseract to correct the word it read by choosing only the nearest word from the external dictionary?
2) Is it possible to have Tesseract correct words only with dictionary words that differ by a single letter (that is, words at a distance of one letter from what Tesseract read)?

@Shreeshrii
Collaborator

@davideromano
Author

The commit I am using was published 6 days ago.
Anyway, I will try newer commits.

What about my questions? Can you help me?

@Shreeshrii
Collaborator

@bertsky implemented the feature about a month back, so the commit from 6 days ago should be ok.

More changes are needed for the functionality that you need.

@davideromano
Author

Is there any plan to implement those functionalities?

@bertsky
Contributor

bertsky commented Apr 30, 2019

@DaDoLuX partially, yes there is. You are right in your observation that user words/patterns are currently only a hint. (If we made them exclusive now, the results would usually be empty.)

To make this feature more useful, we first have to find a way to widen the beam during beam search. (The LSTM engine is still optimised only for the 1-best path.) We are already working on that.

Next, we could add an option to make user words/patterns exclusive, or to port the old tessedit_enable_dict_correction functionality. Until then, I doubt there is much you can do as a user.

As to your questions:

  1. Currently not, but there will likely be a switch to make user words/patterns exclusive in the future (see above).
  2. This would be post-correction (or its special case spelling correction). You will have to do that externally, based either on the normal Tesseract output or using ChoiceIterator with the API (preferably tesserocr in Python). There are many different approaches to this (from simple heuristics like edit distance or string hashing to more elaborate systems including statistical language modelling and statistical error modelling with automata or neural networks, which can be either supervised on some training data or even document-adaptive). If you are looking for ready-to-use tools, ispell/aspell/hunspell might already be satisfactory, otherwise check out Ochre or even PICCL.
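As a rough illustration of the "string hashing" heuristic mentioned in point 2, here is a minimal Python sketch that restricts correction to dictionary words at most one edit away, which is what the second question asks for. It does not use Tesseract at all: it works on the plain OCR output. All function names (`deletions`, `build_index`, `at_most_one_edit`, `correct`) are made up for this example, and the deletion-variant index is only one of several possible hashing schemes.

```python
from collections import defaultdict


def deletions(word):
    """Return the word itself plus every string obtained by deleting one character."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def build_index(dictionary):
    """Hash table: deletion variant -> dictionary words that produce it."""
    index = defaultdict(set)
    for word in dictionary:
        for key in deletions(word):
            index[key].add(word)
    return index


def at_most_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted character
    return a[i:] == b[i + 1:]  # b has exactly one extra character


def correct(word, index):
    """Return the dictionary words within one edit of `word`, sorted; [] if none."""
    candidates = set()
    for key in deletions(word):
        candidates |= index.get(key, set())
    # The hash lookup can also return words two edits away, so verify each one.
    return sorted(c for c in candidates if at_most_one_edit(word, c))


if __name__ == "__main__":
    index = build_index(["rosanna", "roberta"])
    for ocr_output in ("rosonna", "rosoana", "7o0s0ana"):
        print(ocr_output, "->", correct(ocr_output, index))
```

With the readings reported in this issue, only rosonna is one substitution away from rosanna; rosoana is two edits away and 7o0s0ana further still, so a strict one-letter rule would leave both of those uncorrected.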

@jtlz2

jtlz2 commented Sep 11, 2019

@DaDoLuX Is this still an issue for you / did you ever find a workaround?

@jtlz2

jtlz2 commented Sep 11, 2019

  1. Currently not, but there will likely be a switch to make user words/patterns exclusive in the future (see above).

@bertsky Any update on this / has the future arrived yet?

@bertsky
Contributor

bertsky commented Sep 11, 2019

  1. Currently not, but there will likely be a switch to make user words/patterns exclusive in the future (see above).

@bertsky Any update on this / has the future arrived yet?

@jtlz2, I am afraid not. But we do have a better prospect by now, because the beam can deliver deep alternative paths, so exclusiveness does not come at the price of completely losing characters anymore. I will have a look at this next week.

@jtlz2

jtlz2 commented Sep 12, 2019

@bertsky That would be absolutely amazing - much appreciated - huge thanks in advance for any updates!

@PavelKovalets

PavelKovalets commented Oct 9, 2019

Hi everyone! I'm looking for exactly the same functionality (being able to strongly bias recognition towards matching words or even phrases from an external dictionary). Is there any update on how this could be done?

E.g. could I fine-tune some parameters like language_model_penalty_non_freq_dict_word to make dictionary words more significant? Or somehow get top X most probable words and analyse those myself?

@bertsky
Contributor

bertsky commented Oct 9, 2019

  1. Currently not, but there will likely be a switch to make user words/patterns exclusive in the future (see above).

@bertsky Any update on this / has the future arrived yet?

@jtlz2, I am afraid not. But we do have a better prospect by now, because the beam can deliver deep alternative paths, so exclusiveness does not come at the price of completely losing characters anymore. I will have a look at this next week.

Sad to say, I don't think deep alternative paths (at least the way they are implemented now) can help us avoid empty hypotheses when exclusiveness is enforced. At least it is not as easy as it seemed. I tried several approaches (skipping the non-dawg paths in RecodeBeamSearch::ExtractBestPath(), skipping all non-dawg BeamIndex calls to RecodeBeamSearch::DecodeStep()) but with no success whatsoever. Anyone got better ideas?

@bertsky
Contributor

bertsky commented Oct 9, 2019

E.g. could I fine-tune some parameters like language_model_penalty_non_freq_dict_word to make dictionary words more significant?

No, unfortunately not. Even params like language_model_penalty_non_dict_word are not in the LSTM call chain, and the constants that could influence this are not exposed as parameters (and don't really work). Our best shot would be to introduce LM-only behaviour in lstm/recodebeam.cpp (as outlined above), and then add a new option to activate this. But for now I must give up, being too occupied – sorry!

Or somehow get top X most probable words and analyse those myself?

You can do that (on the character level) via ChoiceIterator in the API – see above.
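For anyone wanting to try the ChoiceIterator route from Python, the following is a minimal sketch, assuming tesserocr is installed. `symbol_choices` follows the symbol-level iteration pattern from the tesserocr README; how many alternatives the LSTM engine actually returns depends on the Tesseract version and settings, so treat that part as untested against this image. `best_dictionary_word` is a hypothetical helper that scores each same-length dictionary word by the mean confidence of its characters among the alternatives; it is one simple heuristic, not something Tesseract provides, and it cannot repair segmentation errors (a wrong number of symbols).

```python
def symbol_choices(image_path, lang="eng"):
    """Collect, for every recognised symbol, a dict {alternative: confidence}.

    tesserocr is imported lazily so that the scoring helper below can be
    used (and tested) without it.
    """
    from tesserocr import PyTessBaseAPI, PSM, RIL, iterate_level

    positions = []
    with PyTessBaseAPI(lang=lang, psm=PSM.SINGLE_WORD) as api:
        api.SetImageFile(image_path)
        api.SetVariable("save_blob_choices", "T")
        api.Recognize()
        for symbol in iterate_level(api.GetIterator(), RIL.SYMBOL):
            choices = {}
            for choice in symbol.GetChoiceIterator():
                text = choice.GetUTF8Text()
                if text:
                    # Keep the highest confidence seen for each alternative.
                    choices[text] = max(choices.get(text, 0.0), choice.Confidence())
            positions.append(choices)
    return positions


def best_dictionary_word(positions, dictionary):
    """Return (word, score) for the best same-length dictionary word, or (None, 0.0)."""
    best_word, best_score = None, 0.0
    for word in dictionary:
        if not word or len(word) != len(positions):
            continue  # this heuristic only compares words with one char per symbol
        total = sum(choices.get(ch, 0.0) for ch, choices in zip(word, positions))
        score = total / len(word)
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


if __name__ == "__main__":
    # Hypothetical alternatives, NOT real Tesseract output for the image above.
    fake_positions = [
        {"r": 90.0, "7": 5.0}, {"o": 95.0}, {"s": 92.0},
        {"o": 50.0, "a": 40.0}, {"a": 60.0, "n": 30.0}, {"n": 93.0}, {"a": 96.0},
    ]
    print(best_dictionary_word(fake_positions, ["rosanna", "rosetta", "anna"]))
```

On real input you would call `best_dictionary_word(symbol_choices("pic.png"), words)`. If the correct character never appears among the alternatives for a position, this approach cannot recover it.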

@PavelKovalets

PavelKovalets commented Oct 10, 2019

You can do that (on the character level) via ChoiceIterator in the API – see above.

Thanks a lot for your help, will try to use this API as a workaround.

@astrung

astrung commented Dec 12, 2019

Hi. I cannot use the user_words option with the bazaar config. I posted a question at the link below. Can anyone check it and answer it for me?
https://stackoverflow.com/questions/59307205/tesseract-5-0-bazaar-user-words-config-doesnt-work

@astrung

astrung commented Dec 16, 2019

Does anyone have any idea? Please help.

@grumd

grumd commented Feb 3, 2020

@astrung You can take Tesseract's output, iterate through every word, calculate its Levenshtein distance to each word in your dictionary, and replace it with the closest match if you want to. I don't think it's Tesseract's job to limit the set of possible words.
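A minimal sketch of that post-processing in plain Python, with no Tesseract dependency: it takes the recognised text as a string and replaces each word with the nearest dictionary word. The function names and the `max_distance` threshold are assumptions made for this example; words further than the threshold from every dictionary entry are left unchanged.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def nearest_word(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or `word` itself if none is close enough."""
    # Ties are broken alphabetically so the result is deterministic.
    best = min(dictionary, key=lambda w: (levenshtein(word, w), w), default=None)
    if best is None or levenshtein(word, best) > max_distance:
        return word
    return best


def correct_text(text, dictionary, max_distance=2):
    """Apply nearest_word to every whitespace-separated word of the OCR output."""
    return " ".join(nearest_word(w, dictionary, max_distance) for w in text.split())


if __name__ == "__main__":
    words = ["rosanna", "roberta"]
    print(correct_text("rosoana", words))   # two edits away from rosanna
    print(correct_text("7o0s0ana", words))  # too far with max_distance=2, unchanged
```

The threshold matters: set too high, every OCR result gets mapped onto some dictionary word even when the reading is unrelated; set too low, noisy readings like the Italian-model output from the original post stay uncorrected.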

@SergeyMalyshevsky

SergeyMalyshevsky commented Apr 23, 2020

Hi everyone!
I have the same problem with using an external dictionary in Tesseract. I tried the --user-words parameter, and I tried changing language_model_penalty_non_freq_dict_word, language_model_penalty_non_dict_word, etc., but none of these methods worked. I have several questions about this:

  1. At the moment there are still no updates that would let me choose only the nearest words from an external dictionary, are there?
  2. I tried to find a library that can spell-check text after Tesseract processing, but I only found ones for English. I need a library with Russian language support. Does anybody know a good multi-language spell checker and autocorrector, or a utility where I can create a custom dictionary?

@astrung

astrung commented May 23, 2020

@grumd Sorry, but how can I get all possible outputs for each word (or character) from Tesseract? I tried to read them from Python, but it looks like that API works only in C++.

@NikSimeo04

@davideromano (or anyone else) I would appreciate it if you could add me on Discord (Nikolai Adaktilof#7566) to talk, because I have a quick question that is interesting and that no one on the whole internet can answer hahaha

@Paulie-Aditya

Is there any way to detect Spanish-only words?
I would like to have control over the vocabulary as well.

@GoulartNogueira

Is there any way to detect Spanish-only words?

You must install a language pack

https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_lang_list.md
