Change Tesseract output with words coming from an external dictionary #2391
Comments
Please also try with traineddata files from |
The commit I am using was published 6 days ago. What about my questions? Can you help me? |
@bertsky implemented the feature about a month back, so the commit from 6 days ago should be ok. More changes are needed for the functionality that you need. |
Is there any plan to implement those functionalities? |
@DaDoLuX partially, yes there is. You are right in your observation that user words/patterns are currently only a hint. (If we made them exclusive now, the results would usually be empty.) To make this feature more useful, we first have to find a way to widen the beam during beam search. (The LSTM engine is still optimised only for the 1-best path.) We are already working on that. Next, we could add an option to make user words/patterns exclusive, or to port the old As to your questions:
|
@DaDoLuX Is this still an issue for you / did you ever find a workaround? |
@bertsky Any update on this / has the future arrived yet? |
@jtlz2, I am afraid not. But we do have a better prospect by now, because the beam can deliver deep alternative paths, so exclusiveness no longer comes at the price of completely losing characters. I will have a look at this next week. |
@bertsky That would be absolutely amazing - much appreciated - huge thanks in advance for any updates! |
Hi everyone! I'm looking for exactly the same functionality (being able to give a very strong bias towards matching words or even phrases from an external dictionary). Is there any update on how this could be done? E.g. could I fine-tune some parameters like |
Sad to say, I don't think deep alternative paths (at least the way they are implemented now) can help us avoid empty hypotheses when exclusiveness is enforced. At least this is not as easy as it seemed. I tried several avenues (skipping the non-dawg paths in |
No, unfortunately not. Even params like
You can do that (on the character level) via |
Thanks a lot for your help, will try to use this API as a workaround. |
Hi. I cannot use the user_words option with the bazaar config. I posted a question in this link. Can anyone check and answer it for me? |
does anyone have any idea? please help |
@astrung You can take Tesseract's output, iterate through every word, calculate its Levenshtein distance to each entry in your dictionary, and correct every word if you want to. I think it's not Tesseract's job to limit the number of possible words. |
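The post-correction suggested above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not anything Tesseract provides: the helper names `levenshtein` and `correct_word` and the `max_distance` parameter are hypothetical, and the dictionary is assumed to be a small in-memory list of words (for example, the lines of a user-words file).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or keep it)
            ))
        previous = current
    return previous[-1]


def correct_word(word, dictionary, max_distance=None):
    """Return the dictionary entry nearest to `word` (hypothetical helper).

    If `max_distance` is given and even the nearest entry is farther away,
    the recognised word is returned unchanged.
    """
    if not dictionary:
        return word
    # On ties, min() keeps the entry that comes first in the dictionary.
    best = min(dictionary, key=lambda entry: levenshtein(word, entry))
    if max_distance is not None and levenshtein(word, best) > max_distance:
        return word
    return best
```

With the words from this issue, `correct_word("rosoana", ["rosanna"])` picks "rosanna" (two substitutions away), while `correct_word("rosoana", ["rosanna"], max_distance=1)` leaves the word untouched, which corresponds to the "only one different letter" variant asked about in the original post. For a large dictionary a linear scan like this may be too slow, and a dedicated library or index would be a more sensible choice.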
Hi everyone!
|
@grumd sorry, but how can I get all possible outputs for each word (or character) from Tesseract? I tried to read them from Python, but it looks like that API works only in C++ |
@davideromano (or anyone else) I would appreciate it if you could add me on Discord (Nikolai Adaktilof#7566) to talk, because I have a quick question that is interesting and that no one on the whole internet can answer hahaha |
Is there any way to detect Spanish-only words? |
You must install a language pack https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_lang_list.md |
Environment
Current Behavior:
I am working with the following image, representing the word "rosanna":
I downloaded the italian and english tessdata available in the GitHub repository (Link). In the user-words file I wrote down the expected word rosanna; this will be given to Tesseract as an external dictionary.
The goal is to read the image and to force Tesseract to correct what it read by choosing only from a limited subset of words (an external dictionary given as input).
The following experiments were tried:
Used command:
tesseract pic.png out.txt -l ita --psm 8 --oem 1 tsv
Used command:
tesseract pic.png out.txt -l ita --psm 8 --oem 1 --user-words /usr/local/share/tessdata/ita.user-words tsv
Used command:
tesseract pic.png out.txt -l eng --psm 8 --oem 1 tsv
Used command:
tesseract pic.png out.txt -l eng --psm 8 --oem 1 --user-words /usr/local/share/tessdata/eng.user-words tsv
As you can see, Tesseract did not correct what it read. I then also added the word rosonna to the external dictionary. Here are the results:
Then I also added the word rosoana to the external dictionary. Here are the results:
It seems that Tesseract reads the image using the external dictionary only as a hint. In the first round it read the name “rosoana” with a confidence of 0%, and it probably did not read “rosanna” correctly because that word is too distant from what it read. After the words “rosonna” and “rosoana” were added, the confidence gradually increased.
Questions:
1) Is there a way to force Tesseract to correct the word it read by choosing only the nearest word from the external dictionary?
2) Is it possible to have Tesseract correct the words it read only with words that differ by a single letter (that is, words one letter away from what Tesseract read)?