Change Tesseract output with words coming from an external dictionary #2391

davideromano · 2019-04-16T15:31:04Z

Environment

Tesseract Version: tesseract 4.1.0-rc1-102-g297d7d
Commit Number: eda953c
Platform: Linux c693483cbd58 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

I am working with the following image, which represents the word "rosanna":

I downloaded the Italian and English tessdata available in the GitHub repository (Link).
In the user-words file I wrote the expected word rosanna; this file is given to Tesseract as an external dictionary.
The goal is to read the image and force Tesseract to correct what it reads by choosing only from a limited subset of words (an external dictionary given as input).

The following experiments were tried:

Tesseract with Italian language, without external dictionary: OCR read 7o0s0ana (0% confidence)
Used command: tesseract pic.png out.txt -l ita --psm 8 --oem 1 tsv
Tesseract with Italian language, with external dictionary: OCR read 7o0s0ana (0% confidence)
Used command: tesseract pic.png out.txt -l ita --psm 8 --oem 1 --user-words /usr/local/share/tessdata/ita.user-words tsv
Tesseract with English language, without external dictionary: OCR read rosoana (0% confidence)
Used command: tesseract pic.png out.txt -l eng --psm 8 --oem 1 tsv
Tesseract with English language, with external dictionary: OCR read rosoana (0% confidence)
Used command: tesseract pic.png out.txt -l eng --psm 8 --oem 1 --user-words /usr/local/share/tessdata/eng.user-words tsv
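For anyone who wants to repeat these four runs, here is a minimal Python sketch that assembles the same command lines and reads the word-level confidence from the TSV output. It assumes the tesseract binary is on the PATH and writes to stdout instead of out.txt; the function names build_command, parse_tsv and run are made up for this illustration.

```python
import csv
import io
import subprocess


def build_command(image, lang, user_words=None):
    """Assemble the command line from the experiments (output goes to stdout)."""
    cmd = ["tesseract", image, "stdout", "-l", lang, "--psm", "8", "--oem", "1"]
    if user_words:
        cmd += ["--user-words", user_words]
    cmd.append("tsv")  # request TSV output, as in the original commands
    return cmd


def parse_tsv(tsv_text):
    """Return (text, confidence) pairs for the word-level rows (level 5)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [(row["text"], float(row["conf"]))
            for row in reader if row["level"] == "5"]


def run(image, lang, user_words=None):
    """Run one experiment; requires a local Tesseract installation."""
    result = subprocess.run(build_command(image, lang, user_words),
                            capture_output=True, text=True, check=True)
    return parse_tsv(result.stdout)
```

Calling run("pic.png", "eng") and run("pic.png", "eng", "/usr/local/share/tessdata/eng.user-words") side by side makes it easy to compare the recognized word and its confidence with and without the user-words file.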

Notice that Tesseract didn’t correct what it read. I then also added the word rosonna to the external dictionary. Here are the results:

Tesseract with Italian language, with external dictionary: OCR read 7o0s0ana (0% confidence)
Tesseract with English language, with external dictionary: OCR read rosonna (41% confidence)

I then also added the word rosoana to the external dictionary. Here are the results:

Tesseract with Italian language, with external dictionary: OCR read 7o0s0ana (0% confidence)
Tesseract with English language, with external dictionary: OCR read rosoana (46% confidence)

It seems that Tesseract reads the image using the external dictionary only as a hint. In the first round, it read “rosoana” with a confidence of 0%, and it probably did not correct this to “rosanna” because that word is too distant from what it read. After the words “rosonna” and “rosoana” were added, the confidence gradually increased.

Questions:

1) Is there a way to force Tesseract to correct the word it read by choosing only the nearest word from the external dictionary?
2) Is it possible to make Tesseract correct words only with dictionary words that differ by a single letter (words that are one letter away from what Tesseract read)?

Shreeshrii · 2019-04-16T15:41:14Z

Please also try with traineddata files from

davideromano · 2019-04-16T16:07:06Z

The commit I am using was published 6 days ago.
Anyway, I will try newer commits.

What about my questions? Can you help me?

Shreeshrii · 2019-04-16T18:18:11Z

@bertsky implemented the feature about a month back, so the commit from 6 days ago should be ok.

More changes are needed for the functionality that you need.

davideromano · 2019-04-30T09:19:04Z

Is there any plan to implement those functionalities?

bertsky · 2019-04-30T11:38:38Z

@DaDoLuX partially, yes there is. You are right in your observation that user words/patterns are currently only a hint. (If we made them exclusive now, there would usually be empty results.)

To make this feature more useful, we first have to find a way to widen the beam during beam search. (The LSTM engine is still optimised only for the 1-best path.) We are already working on that.

Next, we could add an option to make user words/patterns exclusive, or to port the old tessedit_enable_dict_correction functionality. Until then, I doubt there is much you can do as a user.

As to your questions:

1) Currently not, but there will likely be a switch to make user words/patterns exclusive in the future (see above).
2) This would be post-correction (or its special case spelling correction). You will have to do that externally, based either on the normal Tesseract output or using ChoiceIterator with the API (preferably tesserocr in Python). There are many different approaches to this (from simple heuristics like edit distance or string hashing to more elaborate systems including statistical language modelling and statistical error modelling with automata or neural networks, which can be either supervised on some training data or even document-adaptive). If you are looking for ready-to-use tools, ispell/aspell/hunspell might already be satisfactory, otherwise check out Ochre or even PICCL.
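As a minimal sketch of the "simple heuristics" end of external post-correction, the snippet below snaps an OCR result to the most similar entry of a word list using Python's standard difflib module. Note that difflib ranks candidates by a similarity ratio, not by edit distance. The function name correct_word, the cutoff value and the sample word list are assumptions for illustration and are not part of Tesseract.

```python
import difflib


def correct_word(ocr_word, dictionary, cutoff=0.6):
    """Return the most similar dictionary word, or the input if none is close enough."""
    matches = difflib.get_close_matches(ocr_word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word


if __name__ == "__main__":
    words = ["rosanna", "roberto"]  # hypothetical user dictionary
    print(correct_word("rosoana", words))
```

Raising the cutoff makes the correction more conservative: words without a sufficiently similar dictionary entry are returned unchanged rather than forced onto the list.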

jtlz2 · 2019-09-11T10:02:59Z

@DaDoLuX Is this still an issue for you / did you ever find a workaround?

jtlz2 · 2019-09-11T10:52:46Z

> Currently not, but there will likely be a switch to make user words/patterns exclusive in the future (see above).

@bertsky Any update on this / has the future arrived yet?

bertsky · 2019-09-11T18:47:20Z

> > Currently not, but there will likely be a switch to make user words/patterns exclusive in the future (see above).

> @bertsky Any update on this / has the future arrived yet?

@jtlz2, I am afraid not. But we do have a better prospect by now, because the beam can deliver deep alternative paths, so exclusiveness does not come at the price of completely losing characters anymore. I will have a look at this next week.

jtlz2 · 2019-09-12T06:43:59Z

@bertsky That would be absolutely amazing - much appreciated - huge thanks in advance for any updates!

PavelKovalets · 2019-10-09T07:14:54Z

Hi everyone! I'm looking for exactly the same functionality (being able to give a very strong bias towards matching words or even phrases from an external dictionary). Is there any update on how this could be done?

E.g. could I fine-tune some parameters like language_model_penalty_non_freq_dict_word to make dictionary words more significant? Or somehow get top X most probable words and analyse those myself?

bertsky · 2019-10-09T22:13:13Z

> > > Currently not, but there will likely be a switch to make user words/patterns exclusive in the future (see above).

> > @bertsky Any update on this / has the future arrived yet?

> @jtlz2, I am afraid not. But we do have a better prospect by now, because the beam can deliver deep alternative paths, so exclusiveness does not come at the price of completely losing characters anymore. I will have a look at this next week.

Sad to say that I don't think deep alternative paths (at least the way they are implemented now) can help us avoid empty hypotheses when exclusiveness is enforced. At least this is not as easy as it seemed. I tried several avenues (skipping the non-dawg paths in RecodeBeamSearch::ExtractBestPath(), skipping all non-dawg BeamIndex calls to RecodeBeamSearch::DecodeStep()) but with no success whatsoever. Anyone got better ideas?

bertsky · 2019-10-09T22:18:06Z

> E.g. could I fine-tune some parameters like language_model_penalty_non_freq_dict_word to make dictionary words more significant?

No, unfortunately not. Even params like language_model_penalty_non_dict_word are not in the LSTM call chain, and the constants that could influence this are not exposed as parameters (and don't really work). Our best shot would be to introduce LM-only behaviour in lstm/recodebeam.cpp (as outlined above), and then add a new option to activate this. But for now I must give up, being too occupied – sorry!

> Or somehow get top X most probable words and analyse those myself?

You can do that (on the character level) via ChoiceIterator in the API – see above.
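To illustrate what "analysing those myself" could look like once per-character alternatives are available, here is a rough sketch that scores each dictionary word against the alternatives at every position and keeps the best one. It assumes the alternatives have already been collected (for example through ChoiceIterator) as one list of (character, confidence) pairs per position, and that segmentation produced exactly one symbol per letter. The function names and the sample numbers in the usage part are invented for illustration and are not real Tesseract output.

```python
def score_word(word, choices_per_position, floor=0.0):
    """Sum the confidence that each letter of `word` has at its position.

    Letters missing from the alternatives at a position contribute `floor`.
    Returns None when the lengths differ (this sketch only handles
    same-length candidates).
    """
    if len(word) != len(choices_per_position):
        return None
    total = 0.0
    for letter, choices in zip(word, choices_per_position):
        total += dict(choices).get(letter, floor)
    return total


def best_dictionary_word(dictionary, choices_per_position):
    """Return the dictionary word with the highest score, or None if none fits."""
    scored = [(score_word(w, choices_per_position), w) for w in dictionary]
    scored = [(s, w) for s, w in scored if s is not None]
    return max(scored)[1] if scored else None


if __name__ == "__main__":
    # Invented alternatives for a seven-symbol word (illustrative only).
    choices = [
        [("r", 95.0), ("7", 3.0)], [("o", 90.0)], [("s", 88.0)],
        [("o", 50.0), ("a", 40.0)], [("a", 60.0), ("n", 35.0)],
        [("n", 92.0)], [("a", 97.0)],
    ]
    print(best_dictionary_word(["rosanna", "rosetta", "rose"], choices))
```

This makes the dictionary exclusive after the fact: the result is always a listed word, or None when no listed word has the right length.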

PavelKovalets · 2019-10-10T06:18:19Z

> You can do that (on the character level) via ChoiceIterator in the API – see above.

Thanks a lot for your help, will try to use this API as a workaround.

astrung · 2019-12-12T15:05:06Z

Hi. I cannot use the user_words option with the bazaar config. I posted a question at the link below. Can anyone check and answer it for me?
https://stackoverflow.com/questions/59307205/tesseract-5-0-bazaar-user-words-config-doesnt-work

astrung · 2019-12-16T01:22:27Z

Does anyone have any idea? Please help.

grumd · 2020-02-03T16:28:47Z

@astrung You can take tesseract's output, iterate through every word, calculate Levenshtein distance for your dictionary, and correct every word if you want to. I think it's not Tesseract's job to limit number of possible words.
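A minimal sketch of this approach in plain Python, assuming whitespace-separated OCR output and a small word list. The names levenshtein and correct_text and the max_distance parameter are illustrative; setting max_distance=1 restricts corrections to words that are one edit away, as asked in the original question 2.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_text(text, dictionary, max_distance=None):
    """Replace every token by its nearest dictionary word.

    If max_distance is given, tokens farther away than that are kept as-is.
    """
    corrected = []
    for token in text.split():
        nearest = min(dictionary, key=lambda word: levenshtein(token, word))
        distance = levenshtein(token, nearest)
        if max_distance is not None and distance > max_distance:
            corrected.append(token)
        else:
            corrected.append(nearest)
    return " ".join(corrected)


if __name__ == "__main__":
    print(correct_text("rosonna", ["rosanna"], max_distance=1))
```

Without max_distance every token is forced onto the word list, so the threshold is what decides between "always pick the nearest word" and "only fix near misses".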

SergeyMalyshevsky · 2020-04-23T19:59:24Z

Hi everyone!
I have the same problem using an external dictionary in Tesseract. I tried the --user-words parameter, and I tried changing language_model_penalty_non_freq_dict_word, language_model_penalty_non_dict_word, etc., but none of these worked. I have several questions about this:

Are there still no updates that would let me choose only the nearest words from an external dictionary?
I tried to find a library that can spell-check text after Tesseract processing, but I only found ones for English. I need a library that supports Russian. Does anybody know a good multilanguage spell checker and autocorrector, or utilities that let me create a custom dictionary?

astrung · 2020-05-23T03:20:17Z

@grumd Sorry, but how can I get all possible outputs for each word (or character) from Tesseract? I tried to read them from Python, but it looks like that API only works in C++.
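On the Python side, tesserocr (mentioned earlier in the thread) wraps the C++ API, including ChoiceIterator. The sketch below follows the iteration pattern shown in the tesserocr README. It assumes tesserocr and matching traineddata are installed, has not been verified against the image from this issue, and the function names symbol_choices and top_n are made up for illustration. How many alternatives the LSTM engine actually returns per symbol may be limited.

```python
def symbol_choices(image_path, lang="eng"):
    """Collect, for each recognized symbol, a list of (text, confidence) alternatives."""
    # Imported lazily so that top_n() below works without tesserocr installed.
    from tesserocr import PyTessBaseAPI, PSM, RIL, iterate_level

    results = []
    # PSM.SINGLE_WORD corresponds to --psm 8 on the command line.
    with PyTessBaseAPI(lang=lang, psm=PSM.SINGLE_WORD) as api:
        api.SetImageFile(image_path)
        api.Recognize()
        level = RIL.SYMBOL
        for symbol in iterate_level(api.GetIterator(), level):
            alternatives = [(choice.GetUTF8Text(), choice.Confidence())
                            for choice in symbol.GetChoiceIterator()]
            results.append(alternatives)
    return results


def top_n(alternatives, n=3):
    """Keep the n most confident alternatives for one symbol."""
    return sorted(alternatives, key=lambda pair: pair[1], reverse=True)[:n]
```

A call such as [top_n(alts) for alts in symbol_choices("pic.png")] would then give a short candidate list per character position that can be matched against a word list.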

NikSimeo04 · 2021-04-19T15:04:27Z

@davideromano (or anyone else) I would appreciate it if you could add me on Discord (Nikolai Adaktilof#7566) to talk, because I have a quick question that is interesting and that no one on the whole internet seems able to answer, hahaha.

Paulie-Aditya · 2024-03-16T09:28:11Z

Is there any way to detect Spanish only words?
I would like to have control over the vocabulary as well.

GoulartNogueira · 2024-12-06T19:19:31Z

> Is there any way to detect Spanish only words?

You must install a language pack

https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_lang_list.md

bertsky mentioned this issue Dec 10, 2019

Training parts lists tesseract-ocr/tesstrain#131

Closed

amitdo added the feature request label May 12, 2020

davideromano closed this as completed Aug 28, 2020
