Trying to use user-words with Tesseract and bazaar config file has no effect #721
Do you have a working example in C++?
No I do not – I have just looked at relevant docs and code samples.
I think this line has no effect: api.ReadConfigFile("bazaar");
I just did a bit more looking around and found a comment here: tesseract-ocr/tesseract#960
Necklaces (https://github.com/Necklaces) commented on Aug 14, 2018 in tesseract-ocr/tesseract#960:
We tried using strace on tesseract 4.0.0-beta.4-26-gfd49 and it seems that the user-patterns and user-words files only get opened in legacy mode (using --oem 0).
Maybe this functionality is not working?
import org.bytedeco.javacpp.*;
import org.bytedeco.leptonica.*;
import org.bytedeco.tesseract.*;
import static org.bytedeco.leptonica.global.lept.*;
import static org.bytedeco.tesseract.global.tesseract.*;

public class BasicExample {
    public static void main(String[] args) {
        BytePointer outText;
        TessBaseAPI api = new TessBaseAPI();
        if (api.Init(null, "eng") != 0) {
            System.err.println("Could not initialize tesseract.");
            System.exit(1);
        }
        api.ReadConfigFile("bazaar");
        // Open input image with leptonica library
        PIX image = pixRead(args.length > 0 ? args[0] : "/usr/src/tesseract/testing/phototest.tif");
        api.SetImage(image);
        api.SetSourceResolution(300);
        // Get OCR result
        outText = api.GetUTF8Text();
        System.out.println("OCR output:\n" + outText.getString());
        // Destroy used object and release memory
        api.End();
        outText.deallocate();
        pixDestroy(image);
    }
}
Regards
Graeme Wellington
Pro-Time Building Solutions Pty Ltd
178 Martin Road WALL FLAT SA 5254
Mobile: 0419 808 473
Email: [email protected]<mailto:[email protected]>
From: Samuel Audet <[email protected]>
Sent: Wednesday, April 24, 2019 10:22 AM
To: bytedeco/javacpp-presets <[email protected]>
Cc: Graeme Wellington <[email protected]>; Author <[email protected]>
Subject: Re: [bytedeco/javacpp-presets] Trying to use user-words with Tesseract and bazaar config file has no effect (#721)
Do you have a working example in C++?
Instead of PSM_AUTO_OSD in your code above, try OEM_TESSERACT_ONLY, that should put it in "legacy mode".
OK thanks – good progress…
I went back to the basic tesseract command line info and found this gem while googling:
https://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for
To use whitelist in a config file or using the -c tessedit_char_whitelist=... command-line switch, in the newest 4.0 version you will have to set OCR Engine mode to the "Original Tesseract only". This is because the new "Neural nets LSTM" mode doesn't respect the whitelist setting. Example of proper command-line for 4.0 version:
tesseract input_file output_file --oem 0 -c tessedit_char_whitelist=abc123
UPDATE: In newer versions (4.0) there's corrupted eng.traineddata file installed by default by Windows and some Linux installers. Temporary solution is to replace tessdata\eng.traineddata file with one from older version. This file should be about 30MB. Otherwise you'll get Error: "Tesseract couldn't load any languages!" or similar.
(Answer by Bartłomiej Uliasz, https://stackoverflow.com/users/7926219/bart%c5%82omiej-uliasz, answered Feb 28 '18 at 13:39, edited Nov 13 '18 at 0:34: https://stackoverflow.com/a/49030935)
So the first thing I noticed was that my eng.traineddata was 4077 KB (about 4 MB), where it should be in the vicinity of 30 MB. I downloaded the first replacement I could find, which is about 22 MB.
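For anyone who wants to check this from code, here is a rough sketch of the same size comparison. The 15 MB cut-off is purely an assumption picked to sit between the two sizes mentioned in this thread (about 4 MB and about 22 MB); file size is only a hint, not a reliable way to tell what a traineddata file contains. The class name, method name and default path are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TrainedDataCheck {
    // Assumed cut-off (not an official value): halfway-ish between the ~4 MB file
    // that gave trouble in this thread and the ~22 MB replacement that worked.
    public static final long ASSUMED_MIN_BYTES = 15L * 1024 * 1024;

    /** Returns true if the size is at or above the assumed cut-off. */
    public static boolean looksLargeEnough(long sizeInBytes) {
        return sizeInBytes >= ASSUMED_MIN_BYTES;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default location; pass the real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "tessdata/eng.traineddata");
        if (!Files.exists(file)) {
            System.err.println("Not found: " + file);
            return;
        }
        long size = Files.size(file);
        System.out.println(file + ": " + (size / 1024) + " KB");
        if (!looksLargeEnough(size)) {
            System.out.println("Smaller than expected; legacy mode (--oem 0) may not work with this file.");
        }
    }
}
```

Run it with the path to eng.traineddata as the only argument; without an argument it looks in a tessdata directory under the current directory.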
I then tried the following command line test and I got close to the results I expected!:
"tesseract" "test.png" "BT-TessOCR-XXXX-bazaar" -l eng --psm 3 --dpi 300 --oem OEM_TESSERACT_ONLY "bazaar"
I also confirmed this by adding a whitelist in the bazaar config file [tessedit_char_whitelist 0123456789] and the returned string contained only those numbers.
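If the command-line tool is driven from Java instead of going through the presets, the invocation used above could be assembled roughly like this. It is only a sketch: the class and method names are hypothetical, `--oem 0` is the numeric form of the legacy engine mode shown in the Stack Overflow answer, and the actual process launch is left commented out because it assumes a `tesseract` binary on the PATH.

```java
import java.util.ArrayList;
import java.util.List;

public class LegacyTesseractCommand {
    /**
     * Builds the argument list for a legacy-engine run.
     * configName is the name of a config file such as "bazaar"; pass null to omit it.
     */
    public static List<String> build(String image, String outputBase, String lang, String configName) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(lang);
        cmd.add("--psm");
        cmd.add("3");
        cmd.add("--dpi");
        cmd.add("300");
        cmd.add("--oem");
        cmd.add("0"); // 0 = original (legacy) Tesseract engine only
        if (configName != null) {
            cmd.add(configName); // config file names go last on the command line
        }
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = build("test.png", "out", "eng", "bazaar");
        System.out.println(String.join(" ", cmd));
        // To actually run it (assumes tesseract is installed and on the PATH):
        // int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```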
So going forward, is there any prospect of getting the new "Neural nets LSTM" mode to respect the whitelist setting?
Anyway a good result.
Thanks again saudet for your quick response and accurate analysis!
Regards
Graeme Wellington
From: Samuel Audet <[email protected]>
Sent: Wednesday, April 24, 2019 11:56 AM
To: bytedeco/javacpp-presets <[email protected]>
Cc: Graeme Wellington <[email protected]>; Author <[email protected]>
Subject: Re: [bytedeco/javacpp-presets] Trying to use user-words with Tesseract and bazaar config file has no effect (#721)
Instead of PSM_AUTO_OSD in your code above, try OEM_TESSERACT_ONLY, that should put it in "legacy mode".
Happy to hear it's working! Some settings don't work with LSTM models, and that seems to be a limitation of Tesseract that is not going to get fixed, but if this is important to you, make sure to report upstream.
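If the LSTM engine has to stay, one possible workaround is to filter the recognized text after the fact. This is not a Tesseract feature and it is not equivalent to tessedit_char_whitelist: the engine still recognizes every character, and the disallowed ones are simply dropped from the returned string. Treat the following as a rough sketch; `WhitelistFilter` and `filter` are hypothetical names.

```java
public class WhitelistFilter {
    /**
     * Keeps only characters that appear in the whitelist.
     * Whitespace is always kept so that line and word breaks survive.
     */
    public static String filter(String text, String whitelist) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || whitelist.indexOf(c) >= 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In real use the input would be the string returned by GetUTF8Text().
        String ocrOutput = "Room 42, floor 7";
        System.out.println(filter(ocrOutput, "0123456789"));
    }
}
```

Keeping whitespace is a design choice made here so that digits from different words do not run together; drop that check if only the whitelisted characters themselves are wanted.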
I am working with the BasicExample.java in the Tesseract sample directory which I am modifying to try and get my user-words file implemented.
All files are present in tessdata [eng.traineddata, eng.user-words, eng.user-patterns, configs/bazaar].
I have the bazaar config set up as follows:
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
The BasicExample (code below) compiles OK, but when I run it, all text in the scanned image is returned, whereas I was expecting only the words in the user-words file.
Any help appreciated.
BasicExample.java