Trying to use user-words with Tesseract and bazaar config file has no effect #721

GraemeWellington · 2019-04-24T00:45:08Z

I am working with the BasicExample.java in the Tesseract sample directory which I am modifying to try and get my user-words file implemented.
All files are present in tessdata [eng.traineddata, eng.user-words, eng.user-patterns, configs/bazaar].
I have the bazaar config set up as follows:
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
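
Each line of a Tesseract config file is a parameter name followed by whitespace and its value. As a sanity check on the file itself, independent of Tesseract, the lines can be read back into name/value pairs. This is only a minimal sketch: `TessConfigSketch` and `parse` are hypothetical names, not part of Tesseract or the JavaCPP presets, and the skipping of blank and `#` lines is my assumption about how such files are read.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tesseract or the JavaCPP presets.
final class TessConfigSketch {
    // Parses "name value" lines into an ordered map.
    // Assumption: blank lines and lines starting with '#' carry no setting.
    static Map<String, String> parse(String text) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // Split on the first run of whitespace only, so values may contain spaces.
            String[] parts = line.split("\\s+", 2);
            params.put(parts[0], parts.length > 1 ? parts[1] : "");
        }
        return params;
    }

    public static void main(String[] args) {
        String bazaar = String.join("\n",
                "load_system_dawg F",
                "load_freq_dawg F",
                "user_words_suffix user-words",
                "user_patterns_suffix user-patterns");
        parse(bazaar).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

A name that comes back with an empty value, or a name split in an unexpected place, would point to a formatting problem in the file rather than in the Java code.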

The BasicExample (code below) compiles OK, but when I run it, all text in the scanned image is returned, whereas I was expecting only the words in the user-words file.

Any help appreciated.

BasicExample.java

import java.nio.*;
import org.bytedeco.javacpp.*;
import org.bytedeco.leptonica.*;
import org.bytedeco.tesseract.*;
import static org.bytedeco.leptonica.global.lept.*;
import static org.bytedeco.tesseract.global.tesseract.*;

public class BasicExample {
    public static void main(String[] args) {
        BytePointer outText;

        TessBaseAPI api = new TessBaseAPI();

        StringGenericVector pars = new StringGenericVector();
        pars.addPut(new STRING("user_words_suffix"));
        pars.addPut(new STRING("load_system_dawg"));
        pars.addPut(new STRING("load_freq_dawg"));
        pars.addPut(new STRING("load_punc_dawg"));
        pars.addPut(new STRING("load_number_dawg"));
        pars.addPut(new STRING("load_unambig_dawg"));
        pars.addPut(new STRING("load_bigram_dawg"));
        StringGenericVector parsValues = new StringGenericVector();
        parsValues.addPut(new STRING("user-words"));
        parsValues.addPut(new STRING("0"));
        parsValues.addPut(new STRING("0"));
        parsValues.addPut(new STRING("0"));
        parsValues.addPut(new STRING("0"));
        parsValues.addPut(new STRING("0"));
        parsValues.addPut(new STRING("0"));

        // if (api.Init(null, "eng") != 0) {
        if (api.Init(null, "eng", PSM_AUTO_OSD, (ByteBuffer)null, 0, pars, parsValues, false) != 0) {
            System.err.println("Could not initialize tesseract.");
            System.exit(1);
        }
        api.ReadConfigFile("bazaar");
        // Open input image with leptonica library
        PIX image = pixRead(args.length > 0 ? args[0] : "/usr/src/tesseract/testing/phototest.tif");
        api.SetImage(image);
        api.SetSourceResolution(300);

        // Get OCR result
        outText = api.GetUTF8Text();
        System.out.println("OCR output:\n" + outText.getString());

        // Destroy used object and release memory
        api.End();
        outText.deallocate();
        pixDestroy(image);
    }
}
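
The listing above passes parameter names and values to Init as two separate vectors, which only works if entry i of one matches entry i of the other. One way to keep them aligned is to declare the pairs once in an ordered map and derive both lists from it. The sketch below uses plain `List<String>` as a stand-in for `StringGenericVector`; `InitParamsSketch` and `noDawgsWithUserWords` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; plain Lists stand in for StringGenericVector.
final class InitParamsSketch {
    final List<String> names = new ArrayList<>();
    final List<String> values = new ArrayList<>();

    // Walks the map once, so names.get(i) always belongs to values.get(i).
    InitParamsSketch(Map<String, String> params) {
        for (Map.Entry<String, String> e : params.entrySet()) {
            names.add(e.getKey());
            values.add(e.getValue());
        }
    }

    // The same seven settings as in the listing above, in the same order.
    static Map<String, String> noDawgsWithUserWords() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("user_words_suffix", "user-words");
        for (String dawg : new String[] {"system", "freq", "punc", "number", "unambig", "bigram"}) {
            p.put("load_" + dawg + "_dawg", "0");
        }
        return p;
    }
}
```

Each `names.get(i)` and `values.get(i)` could then be wrapped in `new STRING(...)` and added with `addPut`, as the listing already does by hand.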

saudet · 2019-04-24T00:51:41Z

Do you have a working example in C++?

GraemeWellington · 2019-04-24T02:03:49Z

No, I do not. I have just looked at the relevant docs and code samples. I think this line has no effect:

api.ReadConfigFile("bazaar");

I did a bit more looking around and found a comment on tesseract-ocr/tesseract#960. Necklaces (https://github.com/Necklaces) commented on Aug 14, 2018:

We tried using strace on tesseract 4.0.0-beta.4-26-gfd49 and it seems that the user-patterns and user-words files only get opened in legacy mode (using --oem 0).

Maybe this functionality is not working?

import org.bytedeco.javacpp.*;
import org.bytedeco.leptonica.*;
import org.bytedeco.tesseract.*;
import static org.bytedeco.leptonica.global.lept.*;
import static org.bytedeco.tesseract.global.tesseract.*;

public class BasicExample {
    public static void main(String[] args) {
        BytePointer outText;

        TessBaseAPI api = new TessBaseAPI();
        if (api.Init(null, "eng") != 0) {
            System.err.println("Could not initialize tesseract.");
            System.exit(1);
        }
        api.ReadConfigFile("bazaar");
        // Open input image with leptonica library
        PIX image = pixRead(args.length > 0 ? args[0] : "/usr/src/tesseract/testing/phototest.tif");
        api.SetImage(image);
        api.SetSourceResolution(300);

        // Get OCR result
        outText = api.GetUTF8Text();
        System.out.println("OCR output:\n" + outText.getString());

        // Destroy used object and release memory
        api.End();
        outText.deallocate();
        pixDestroy(image);
    }
}

saudet · 2019-04-24T02:25:38Z

Instead of PSM_AUTO_OSD in your code above, try OEM_TESSERACT_ONLY, that should put it in "legacy mode".
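
The third argument of that Init call is the OCR engine mode, not the page segmentation mode. The original call compiles with PSM_AUTO_OSD in that position, so both kinds of constant are evidently plain ints in the Java bindings. The sketch below copies the first few values of each enum locally. Those values are my assumption about Tesseract 4.x and should be checked against the installed version; `EngineModeSketch` and `oemName` are hypothetical names. Under that assumption, PSM_AUTO_OSD has the same numeric value as OEM_LSTM_ONLY, which would explain why the engine was not in legacy mode.

```java
// Local copies of the enum values, for illustration only.
// Assumption: these match Tesseract 4.x. Real code should use the constants
// from org.bytedeco.tesseract.global.tesseract instead.
final class EngineModeSketch {
    // OcrEngineMode
    static final int OEM_TESSERACT_ONLY = 0;
    static final int OEM_LSTM_ONLY = 1;
    static final int OEM_TESSERACT_LSTM_COMBINED = 2;
    static final int OEM_DEFAULT = 3;

    // PageSegMode (first two values only)
    static final int PSM_OSD_ONLY = 0;
    static final int PSM_AUTO_OSD = 1;

    // Names the engine mode that a given int selects when passed as the oem argument.
    static String oemName(int oem) {
        switch (oem) {
            case OEM_TESSERACT_ONLY: return "OEM_TESSERACT_ONLY";
            case OEM_LSTM_ONLY: return "OEM_LSTM_ONLY";
            case OEM_TESSERACT_LSTM_COMBINED: return "OEM_TESSERACT_LSTM_COMBINED";
            case OEM_DEFAULT: return "OEM_DEFAULT";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        // What the original call effectively asked for.
        System.out.println(oemName(PSM_AUTO_OSD));
        // What the suggested fix asks for.
        System.out.println(oemName(OEM_TESSERACT_ONLY));
    }
}
```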

GraemeWellington · 2019-04-24T03:37:05Z

OK thanks, good progress. I went back to the basic tesseract command line info and found this gem while googling: https://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for

From that answer (Bartłomiej Uliasz, answered Feb 28 '18, edited Nov 13 '18, https://stackoverflow.com/a/49030935):

To use whitelist in a config file or using the -c tessedit_char_whitelist=... command-line switch, in the newest 4.0 version you will have to set OCR Engine mode to the "Original Tesseract only". This is because the new "Neural nets LSTM" mode doesn't respect the whitelist setting. Example of proper command-line for 4.0 version:

tesseract input_file output_file --oem 0 -c tessedit_char_whitelist=abc123

UPDATE: In newer versions (4.0) there's corrupted eng.traineddata file installed by default by Windows and some Linux installers. Temporary solution is to replace tessdata\eng.traineddata file with one from older version. This file should be about 30MB. Otherwise you'll get Error: "Tesseract couldn't load any languages!" or similar.

So the first thing I noticed was that my eng.traineddata was 4077 KB [~4 MB] where it should be in the vicinity of 30 MB. I downloaded the first one I could find, which is about 22 MB.

I then tried the following command line test and I got close to the results I expected:

"tesseract" "test.png" "BT-TessOCR-XXXX-bazaar" -l eng --psm 3 --dpi 300 --oem OEM_TESSERACT_ONLY "bazaar"

I also confirmed this by adding a whitelist in the bazaar config file [tessedit_char_whitelist 0123456789] and the returned string contained only those numbers.

So going forward, is there any prospect of getting the new "Neural nets LSTM" mode to respect the whitelist setting?

Anyway, a good result. Thanks again saudet for your quick response and accurate analysis!
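
The whitelist test described above (OCR output containing only the whitelisted digits) can be turned into an automatic check on the returned string. This is a minimal sketch: `WhitelistCheckSketch` and `onlyWhitelisted` are hypothetical names, and ignoring whitespace is my assumption, since OCR output normally contains spaces and newlines that are not in the whitelist.

```java
// Hypothetical helper for illustration; works on any string, no Tesseract needed.
final class WhitelistCheckSketch {
    // True if every non-whitespace character of ocrText appears in whitelist.
    static boolean onlyWhitelisted(String ocrText, String whitelist) {
        return ocrText.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .allMatch(cp -> whitelist.indexOf(cp) >= 0);
    }

    public static void main(String[] args) {
        String digits = "0123456789";
        System.out.println(onlyWhitelisted("2019 04\n24\n", digits)); // digits and whitespace only
        System.out.println(onlyWhitelisted("Room 42", digits));       // contains letters
    }
}
```

Running such a check on the string from `GetUTF8Text()` in both engine modes would show directly whether the whitelist is being applied.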

saudet · 2019-05-01T14:33:42Z

Happy to hear it's working! Some settings don't work with LSTM models, and that seems to be a limitation of Tesseract that is not going to get fixed, but if this is important to you, make sure to report upstream.

saudet added the question label May 1, 2019

saudet closed this as completed May 1, 2019
