user pattern/dict does not work at all #960
Comments
Same problem with user dictionary: |
Tested the user-words option with 3.05.01 on Windows (using binaries by @stweil). It works OK. See the attached test image.
bazaar config file as used (uses system dictionary + user words)
eng.user-words as used
image used for recognition
Output without user-words - Notice
Output with user-words -
So |
I tested user-patterns just now with versions 3.02 and 3.05.01, both on Windows so that I didn't have to worry about correct versions of leptonica. The test image is attached. The user-patterns option makes no difference to the output in either version. So if this feature ever worked, it must have been before 3.02. However, just by resizing the image to 200%, the dates are correctly recognized.
No, sorry, I never used that option. Nevertheless I also have a scenario where working user patterns would help. |
Same answer. |
@stweil Interesting project :-)
http://code.google.com/p/tesseract-ocr/source/browse/tags/release-3.01/dict/trie.h So it broke somewhere between 3.01 and 3.02... |
I did not use it either. |
User patterns are documented in |
With 4.0 the problem might be that the Dict class is instantiated twice
and then here
and both initialise
The real problem is that variables are set between these calls, so the LSTM dict does not get the values from the user-defined variables.
Does this issue only happen with the command-line executable? I mean, can I work around it by writing a C++ source file that calls the API directly? Thanks.
@asmwarrior Answering your question: Both command line and API are affected. |
Please also see comment by Ray at #403 (comment) Don't think it has been addressed yet. @stweil Is this something you can fix? |
@vidiecan you mentioned earlier that 'With 4.0 the problem might be that the Dict class is instantiated twice'. Do you have a suggested patch to fix this issue? |
Any update to this issue?
I am sure it is finding this file, because if I change the name of 'bazaar' it throws a warning saying the file is not found. The contents of the bazaar file are the standard -
I populate the eng.user-patterns file in the tessdata folder with the standard default values and also add my own to account for values I need to capture correctly from a page -
However, I do not see any change in the results. I know it is supposed to influence the results rather than force them, but the text looks so clearly incorrect that there must be an issue. The last time I did a build from source was around a month ago. Any help is greatly appreciated.
We tried using |
So does this work when Tesseract 4 is used with |
Closed as duplicate of #403
User patterns do not work for me. I have tried versions 3.05.00 and 4.00.00alpha.
My file date.user-pattern contains one line:
2014-\d\d-\d\d
The picture is a single line containing a date, like: 2014-03-19
I run: tesseract img.jpg stdout --user-patterns date.user-patterns -psm 8
and the output is "mum-w", which obviously does not match the pattern.
Character whitelisting helps a bit, but the format from the pattern is not preserved and accuracy is poor.
I also tried some other examples; they do not work either.
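To make the mismatch concrete, here is a minimal Python sketch that checks OCR output against the pattern from this report after the fact. It is only an illustration, not part of Tesseract: the helper names `pattern_to_regex` and `matches_pattern` are made up for this example, and the translation assumes that `\d` stands for a single digit and that every other character in the pattern is a literal.

```python
import re


def pattern_to_regex(user_pattern):
    """Translate a small subset of a user pattern into a compiled regex.

    Assumption for this sketch: only the two-character sequence \\d is
    special (one digit); every other character is matched literally.
    """
    parts = []
    i = 0
    while i < len(user_pattern):
        if user_pattern.startswith(r"\d", i):
            parts.append("[0-9]")
            i += 2
        else:
            parts.append(re.escape(user_pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches_pattern(text, user_pattern):
    """Return True if the whole (stripped) text fits the pattern."""
    return pattern_to_regex(user_pattern).fullmatch(text.strip()) is not None


# Pattern and sample strings taken from the report above.
DATE_PATTERN = r"2014-\d\d-\d\d"

if __name__ == "__main__":
    for candidate in ("2014-03-19", "mum-w"):
        print(candidate, matches_pattern(candidate, DATE_PATTERN))
```

A check like this does not make recognition any better; it only lets a script flag results that cannot possibly fit the expected format so they can be retried or reviewed.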
Many people have the same problem; links are aggregated under this question:
https://stackoverflow.com/questions/34560697/tesseract-ocr-user-patterns
also #403
Should we assume that this feature does not work at all? Is there any official comment on this?