Trying to use user-words with Tesseract and bazaar config file has no effect #721

Closed
GraemeWellington opened this issue Apr 24, 2019 · 5 comments
GraemeWellington commented Apr 24, 2019

I am working with the BasicExample.java in the Tesseract sample directory which I am modifying to try and get my user-words file implemented.
All files are present in tessdata [eng.traineddata, eng.user-words, eng.user-patterns, configs/bazaar].
I have the bazaar config set up as follows:
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
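
For reference, the layout described above (a `configs/bazaar` file under tessdata holding four `name value` lines) can be sketched as a small stand-alone Java helper. This is only an illustration: `BazaarConfigWriter`, `configLines`, and `writeConfig` are hypothetical names, not part of Tesseract or the JavaCPP presets, and the default `tessdata` path is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tesseract or the JavaCPP presets.
final class BazaarConfigWriter {
    /** The four settings quoted above, one "name value" pair per line. */
    public static List<String> configLines() {
        return Arrays.asList(
            "load_system_dawg F",
            "load_freq_dawg F",
            "user_words_suffix user-words",
            "user_patterns_suffix user-patterns");
    }

    /** Writes tessdata/configs/bazaar and returns the path of the written file. */
    public static Path writeConfig(Path tessdata) {
        try {
            Path configs = Files.createDirectories(tessdata.resolve("configs"));
            return Files.write(configs.resolve("bazaar"), configLines(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the tessdata directory is the first argument, else ./tessdata.
        Path tessdata = Paths.get(args.length > 0 ? args[0] : "tessdata");
        System.out.println("Wrote " + writeConfig(tessdata));
    }
}
```

The suffix values match the file names listed earlier: `user_words_suffix user-words` pairs with `eng.user-words`, and `user_patterns_suffix user-patterns` pairs with `eng.user-patterns`.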

The BasicExample (code below) compiles OK, but when I run it, all text in the scanned image is returned, whereas I expected only the words in the user-words file.

Any help appreciated.

BasicExample.java

import java.nio.*;
import org.bytedeco.javacpp.*;
import org.bytedeco.leptonica.*;
import org.bytedeco.tesseract.*;
import static org.bytedeco.leptonica.global.lept.*;
import static org.bytedeco.tesseract.global.tesseract.*;

public class BasicExample {
    public static void main(String[] args) {
        BytePointer outText;

        TessBaseAPI api = new TessBaseAPI();

        StringGenericVector pars = new StringGenericVector();
        pars.addPut(new STRING("user_words_suffix"));
        pars.addPut(new STRING("load_system_dawg"));
        pars.addPut(new STRING("load_freq_dawg"));
        pars.addPut(new STRING("load_punc_dawg"));
        pars.addPut(new STRING("load_number_dawg"));
        pars.addPut(new STRING("load_unambig_dawg"));
        pars.addPut(new STRING("load_bigram_dawg"));
        StringGenericVector parsValues = new StringGenericVector();
        parsValues.addPut(new STRING("user-words"));
        parsValues.addPut(new STRING("0"));
        parsValues.addPut(new STRING("0"));
        parsValues.addPut(new STRING("0"));
        parsValues.addPut(new STRING("0"));
        parsValues.addPut(new STRING("0"));
        parsValues.addPut(new STRING("0"));

        // if (api.Init(null, "eng") != 0) {
        if (api.Init(null, "eng", PSM_AUTO_OSD, (ByteBuffer)null, 0, pars, parsValues, false) != 0) {
            System.err.println("Could not initialize tesseract.");
            System.exit(1);
        }
        api.ReadConfigFile("bazaar");
        // Open input image with leptonica library
        PIX image = pixRead(args.length > 0 ? args[0] : "/usr/src/tesseract/testing/phototest.tif");
        api.SetImage(image);
        api.SetSourceResolution(300);

        // Get OCR result
        outText = api.GetUTF8Text();
        System.out.println("OCR output:\n" + outText.getString());

        // Destroy used object and release memory
        api.End();
        outText.deallocate();
        pixDestroy(image);
    }
}
Member

saudet commented Apr 24, 2019

Do you have a working example in C++?

Author

GraemeWellington commented Apr 24, 2019 via email

Member

saudet commented Apr 24, 2019

Instead of PSM_AUTO_OSD in your code above, try OEM_TESSERACT_ONLY, that should put it in "legacy mode".
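
A likely reason the original call stayed on the LSTM engine: the third argument of `Init` is an OCR engine mode, not a page segmentation mode, and both are plain `int` constants in the Java bindings. The sketch below mirrors the enum values from Tesseract 4's `publictypes.h` as local constants so it runs without the native library; `EngineModeCheck` and `engineModeName` are hypothetical names used only for illustration.

```java
// Hypothetical illustration; the constants are local copies of Tesseract 4 enum values.
final class EngineModeCheck {
    // tesseract::PageSegMode (first two values)
    public static final int PSM_OSD_ONLY = 0;
    public static final int PSM_AUTO_OSD = 1;
    // tesseract::OcrEngineMode
    public static final int OEM_TESSERACT_ONLY = 0;
    public static final int OEM_LSTM_ONLY = 1;
    public static final int OEM_TESSERACT_LSTM_COMBINED = 2;
    public static final int OEM_DEFAULT = 3;

    /** Names the engine mode that a raw int would select as Init()'s third argument. */
    public static String engineModeName(int oem) {
        switch (oem) {
            case OEM_TESSERACT_ONLY: return "OEM_TESSERACT_ONLY";
            case OEM_LSTM_ONLY: return "OEM_LSTM_ONLY";
            case OEM_TESSERACT_LSTM_COMBINED: return "OEM_TESSERACT_LSTM_COMBINED";
            case OEM_DEFAULT: return "OEM_DEFAULT";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        // PSM_AUTO_OSD has the same numeric value as OEM_LSTM_ONLY.
        System.out.println(engineModeName(PSM_AUTO_OSD));       // prints OEM_LSTM_ONLY
        System.out.println(engineModeName(OEM_TESSERACT_ONLY)); // prints OEM_TESSERACT_ONLY
    }
}
```

Under that reading, passing `PSM_AUTO_OSD` to `Init` requests the LSTM engine, while `OEM_TESSERACT_ONLY` requests the legacy engine. If automatic page segmentation with OSD is still wanted, it would be set separately, for example with `api.SetPageSegMode(PSM_AUTO_OSD)` after `Init`.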

Author

GraemeWellington commented Apr 24, 2019 via email

saudet added the question label May 1, 2019
Member

saudet commented May 1, 2019

Happy to hear it's working! Some settings don't work with LSTM models, and that seems to be a limitation of Tesseract that is not going to get fixed, but if this is important to you, make sure to report upstream.

saudet closed this as completed May 1, 2019