Slow OCR with tesseract 4.00alpha #285

Closed
innir opened this issue Jan 5, 2018 · 30 comments

Comments

@innir
Contributor

innir commented Jan 5, 2018

Hi,

as tesseract 4.00-dev was uploaded to Debian (why is another question), gImageReader was rebuilt against it. Now recognition is about 10 times (!) slower than with tesseract 3.05. The command-line tesseract does not show this slowdown.
Do you have any idea what goes wrong here? I'll happily test patches :-)

Best,
Philip

@manisandro
Owner

Uh no idea, there are actually only three calls to tesseract really, see [1], so I'm not doing anything terribly fancy client side. Might be worth asking upstream.

[1] https://github.com/manisandro/gImageReader/blob/master/gtk/src/Recognizer.cc#L490

@innir
Contributor Author

innir commented Jan 5, 2018

Hm, strange, the difference between command-line tesseract and gImageReader is even bigger: 2.4s vs. 1:30 min! Both use all available CPU cores. Yes, asking upstream seems reasonable. Probably someone should double-check first?

@manisandro
Owner

Are you testing in plain text or hOCR mode or both? Any difference between the two?

@innir
Contributor Author

innir commented Jan 5, 2018

Plain text before, now hOCR: tesseract: 2.7s vs. gImageReader 1:11 min

@manisandro
Owner

:\ That's pretty disastrous

@manisandro
Owner

Can you try playing with the page segmentation modes in the menu of the recognition button?

@innir
Contributor Author

innir commented Jan 5, 2018

No changes in speed, but option two (from top) gives a crash: https://pastebin.com/Xwxqxbes while tesseract --psm 1 ... doesn't crash

On the console:

Error: Illegal Parameter specification!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75

@manisandro
Owner

I'll build the latest tesseract git and do some testing.

@innir
Contributor Author

innir commented Jan 5, 2018

The main difference I see is that tesseract 4.00 uses multiple threads while 3.04 does not.
I can also confirm that gImageReader 3.2.0 built against tesseract 3.04 does not show this speed decrease. Command line and gImageReader are both ~12s (different computer).

@manisandro
Owner

There are indeed reports of openmp causing slowdowns, i.e. tesseract-ocr/tesseract#961

@innir
Contributor Author

innir commented Jan 5, 2018

Well, but that doesn't explain why tesseract 4.00 with openmp is fast for me

@manisandro
Owner

Uhm, so testing latest gimagereader git and tesseract-4.00.gitbc668da, recognizing a 10 page document in hOCR mode takes 46 seconds, while 3.05.00 takes 38 seconds. So not that much of a slowdown. Tesseract compiled without passing any particular options to configure, which resulted in an OpenMP-enabled but OpenCL-disabled build. But I only see 2/8 cores fully used.

@innir
Contributor Author

innir commented Jan 5, 2018

Using upstream's tesseract without modification didn't change anything. I used gImageReader 3.2.3 (the gtk version) though, and it uses all 4 of my cores.

@manisandro
Owner

Does it depend on the language used? Any particularly complex document? Can you try with latest gimagereader-git?

@innir
Contributor Author

innir commented Jan 5, 2018

No, language doesn't change anything. The test image is here: https://imgur.com/a/dkpHz - not complex at all.
I'll try to build gimagereader-git

@manisandro
Owner

Hmm, it processes in about 3 secs here

@innir
Contributor Author

innir commented Jan 5, 2018

Latest git doesn't change anything, but setting Contrast to 25 reduces the time to ~4 secs

@manisandro
Owner

Which leptonica version do you have?

@innir
Contributor Author

innir commented Jan 5, 2018

leptonica 1.74.4

Strange, applying contrast (e.g. removing the gray background) with gimp doesn't change the recognition time. But changing anything in the image processing tab of gimagereader reduces the time to ~4 secs.
Even doing it once changes the time for subsequent images. So maybe some missing initialization somewhere?

@manisandro
Owner

The image processing code uses OpenMP itself [1] (and is actually the only part of gImageReader itself that does). I'm not sufficiently knowledgeable about the OpenMP internals, but it looks like merely executing that OpenMP block before tesseract does somehow affects how the OpenMP blocks are run in tesseract. To confirm this theory, you could try adding a dummy loop somewhere in main.cpp like

#pragma omp parallel for schedule(static)
for(int i = 0; i < 4; ++i) { std::cout << "Hello" << std::endl; }

and see whether it affects tesseract performance. Also, would be interesting to see how many threads are used in tesseract with/without the OpenMP blocks executing in gImageReader.

[1] https://github.com/manisandro/gImageReader/blob/master/gtk/src/DisplayRenderer.cc#L41
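As a rough illustration of the experiment suggested above, the following C++ sketch wraps the dummy parallel region in a function and also reports how many threads the OpenMP runtime used for it. The names `warm_up_openmp` and `max_openmp_threads` are hypothetical and not part of gImageReader or tesseract, and the `_OPENMP` guard is an assumption added so the file still compiles without `-fopenmp`.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not part of gImageReader): runs one trivial OpenMP
// parallel region and returns the number of threads the runtime used for it.
// Returns 1 when the file is compiled without -fopenmp.
int warm_up_openmp() {
    int threads_used = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        // Only one thread records the team size, which avoids a data race.
        #pragma omp single
        threads_used = omp_get_num_threads();
    }
#endif
    return threads_used;
}

// Hypothetical helper: upper bound on the team size of the next parallel
// region started from the calling thread.
int max_openmp_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Calling `warm_up_openmp()` once near the start of `main`, and printing both values with and without that call, would show whether the warm-up changes the thread count seen later. Whether it has any effect on tesseract itself is exactly the open question in this thread.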

innir added a commit to innir/gImageReader that referenced this issue Jan 5, 2018
@innir
Contributor Author

innir commented Jan 5, 2018

Indeed, that fixes the problem. I pushed a minimal fix here: #286

@manisandro
Owner

Thanks - I'll try and make a minimal reproducer and then I'll report the issue upstream. If there is no satisfactory resolution in time for the upcoming 3.3.0 release, I'll add the workaround.

@manisandro
Owner

I'm still having trouble actually reproducing this. Can you reproduce the slowdown with a minimal example like the one below? Possibly try adding some random omp-parallelized loop before and after the tess.Recognize call, adding -fopenmp to the compile flags.

// g++ -o test test.cpp $(pkg-config --cflags --libs cairomm-1.0 gdkmm-3.0 gtkmm-3.0 tesseract)

#include <cairomm/cairomm.h>
#include <gdkmm.h>
#include <gtkmm.h>
#include <iostream>
#define USE_STD_NAMESPACE
#include <tesseract/baseapi.h>
#include <tesseract/ocrclass.h>
#undef USE_STD_NAMESPACE

int main(int argc, char* argv[]) {
    Gtk::Main main;

    // Load the test image and paint it onto an ARGB32 Cairo surface.
    Glib::RefPtr<Gdk::Pixbuf> pixbuf = Gdk::Pixbuf::create_from_file("eurotext.tif");
    Cairo::RefPtr<Cairo::ImageSurface> surf = Cairo::ImageSurface::create(Cairo::FORMAT_ARGB32, pixbuf->get_width(), pixbuf->get_height());
    Cairo::RefPtr<Cairo::Context> ctx = Cairo::Context::create(surf);
    Gdk::Cairo::set_source_pixbuf(ctx, pixbuf);
    ctx->paint();

    // Run OCR on the surface data (4 bytes per pixel, row stride taken from Cairo).
    tesseract::TessBaseAPI tess;
    ETEXT_DESC desc;
    tess.Init(nullptr, "eng");
    tess.SetImage(surf->get_data(), surf->get_width(), surf->get_height(), 4, surf->get_stride());
    tess.Recognize(&desc);

    // GetUTF8Text() returns a new[]-allocated buffer that the caller must delete[].
    char* text = tess.GetUTF8Text();
    std::cout << text << std::endl;
    delete[] text;

    return 0;
}
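To compare timings between variants of a reproducer like this one, a small wall-clock timer can help. The sketch below is only an illustration: `seconds_taken` is a hypothetical helper name, and it assumes `std::chrono::steady_clock` is precise enough for runs that take seconds.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in seconds.
template <typename Work>
double seconds_taken(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

In the example above, the recognition step could then be wrapped as `seconds_taken([&] { tess.Recognize(&desc); })` and the result printed, once with and once without an extra OpenMP loop before it.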

@innir
Contributor Author

innir commented Jan 7, 2018

No, I can't reproduce the slowdown with this example, regardless of where (or whether) I put an omp-parallelized loop (and linking with -fopenmp).

@Shreeshrii
Contributor

Also see tesseract-ocr/tesseract#898 (comment)

@jbarlow83

I've observed that Tesseract 4 on machines without AVX2 (pre-Haswell, 2013) runs about 10 times slower than Tesseract 3.05, especially when the "best" training data is used.

This is reproducible by compiling without -mavx2.

Details here:
https://bugs.debian.org/cgi-bin/bugreport.cgi?att=0;bug=888917;msg=58
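To check whether a given machine falls into the non-AVX2 case, something like the following C++ sketch could be used. It is an assumption-laden illustration: `cpu_has_avx2` is a hypothetical name, and it relies on the GCC/Clang builtin `__builtin_cpu_supports`, which only exists on x86 targets. On other compilers or architectures the sketch simply returns false.

```cpp
// Hypothetical helper: true if the CPU running this program reports AVX2
// support. On non-x86 targets or other compilers it conservatively
// returns false, because the builtin used here is not available there.
bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // initialise the CPU feature data before querying it
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}
```

This only describes the CPU. Whether the installed tesseract build actually uses AVX2 code paths is a separate question that the sketch does not answer.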

@innir
Copy link
Contributor Author

innir commented Feb 24, 2018

Ah, that actually explains it :-) I have an i5-3470 (pre-Haswell) here. Added this info to the upstream bug report: tesseract-ocr/tesseract#1278 . Still odd that this simple 'fix' #286 works...

@napasa

napasa commented May 31, 2018

I built gImageReader with the latest Tesseract and tested it with the fast data under Fedora, and did not see the slow-speed issue. But slow OCR does occur when I build gImageReader with the latest Tesseract compiled by MinGW for Windows. Does anyone have an idea how to solve it?

@manisandro
Owner

Closing since this is not a gImageReader issue.

@Freredaran

> The main difference I see is that tesseract 4.00 uses multiple threads while 3.04 does not. I can also confirm that gImageReader 3.2.0 built against tesseract 3.04 does not show this speed decrease. Command line and gImageReader are both ~12s (different computer).

After updating to Ubuntu 22.04, gImageReader became incredibly slow for me. Dev manisandro was very helpful and led me to a quick and dirty CLI solution for running on a single thread. Works wonderfully for me.
In a terminal, type:

export OMP_THREAD_LIMIT=1

If you want to check that you actually are running on one thread, type:

echo $OMP_THREAD_LIMIT

Then run gImageReader:

gimagereader-gtk

Et voilà :o)
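For anyone who wants to confirm from inside a program that the variable was inherited, a minimal C++ sketch follows. The helper names are hypothetical; `omp_get_thread_limit()` is a standard OpenMP runtime call, and the `_OPENMP` guard is an assumption so the file also builds without `-fopenmp`.

```cpp
#include <cstdlib>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: value of OMP_THREAD_LIMIT as seen by this process,
// or an empty string if the variable is not set.
std::string omp_thread_limit_from_env() {
    const char* value = std::getenv("OMP_THREAD_LIMIT");
    return value ? std::string(value) : std::string();
}

// Hypothetical helper: thread limit reported by the OpenMP runtime linked
// into this program. Returns 1 when compiled without -fopenmp.
int effective_thread_limit() {
#ifdef _OPENMP
    return omp_get_thread_limit();
#else
    return 1;
#endif
}
```

The variable has to be exported before the program starts, as in the shell commands above. Changing it from inside an already running process is not expected to change what the runtime reports.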
