Slow OCR with tesseract 4.00alpha #285

innir · 2018-01-05T09:25:35Z

Hi,

As tesseract 4.00-dev was uploaded to Debian (why is another question), gImageReader was rebuilt against it. Now recognition is about 10 times (!) slower than with tesseract 3.05. The tesseract command line does not show this slowdown.
Do you have any idea what goes wrong here? I'll happily test patches :-)

Best,
Philip

manisandro · 2018-01-05T09:32:35Z

Uh no idea, there are actually only three calls to tesseract really, see [1], so I'm not doing anything terribly fancy client side. Might be worth asking upstream.

[1] https://github.com/manisandro/gImageReader/blob/master/gtk/src/Recognizer.cc#L490

innir · 2018-01-05T10:08:40Z

Hm, strange: the difference between command-line tesseract and gImageReader is even bigger, 2.4s vs. 1:30 min! Both use all available CPU cores. Yes, asking upstream seems reasonable. Probably someone should double-check first?

manisandro · 2018-01-05T10:10:58Z

Are you testing in plain text or hOCR mode or both? Any difference between the two?

innir · 2018-01-05T10:18:22Z

Plain text before, now hOCR: tesseract: 2.7s vs. gImageReader 1:11 min

manisandro · 2018-01-05T10:22:35Z

:\ That's pretty disastrous

manisandro · 2018-01-05T10:27:06Z

Can you try playing with the page segmentation modes in the menu of the recognition button?

innir · 2018-01-05T10:45:55Z

No changes in speed, but option two (from top) gives a crash: https://pastebin.com/Xwxqxbes while tesseract --psm 1 ... doesn't crash

On the console:

Error: Illegal Parameter specification!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75

manisandro · 2018-01-05T10:54:15Z

I'll build the latest tesseract git and do some testing.

innir · 2018-01-05T10:58:47Z

The main difference I see is that tesseract 4.00 uses multiple threads while 3.04 does not.
I can also confirm that gImageReader 3.2.0 built against tesseract 3.04 does not show this slowdown. Command line and gImageReader both take ~12s (different computer).

manisandro · 2018-01-05T11:03:32Z

There are indeed reports of OpenMP causing slowdowns, e.g. tesseract-ocr/tesseract#961

innir · 2018-01-05T11:12:45Z

Well, but that doesn't explain why tesseract 4.00 with OpenMP is fast for me

manisandro · 2018-01-05T12:13:28Z

Uhm, so testing latest gimagereader git and tesseract-4.00.gitbc668da, recognizing a 10-page document in hOCR mode takes 46 seconds, while 3.05.00 takes 38 seconds. So not that much of a slowdown. Tesseract was compiled without passing any particular options to configure, which resulted in an OpenMP-enabled but OpenCL-disabled build. But I only see 2 of 8 cores fully used.

innir · 2018-01-05T12:55:33Z

Using upstream's tesseract without modification didn't change anything. I used gImageReader 3.2.3 (the gtk version) though, and it uses all 4 of my cores.

manisandro · 2018-01-05T13:04:12Z

Does it depend on the language used? Any particularly complex document? Can you try with latest gimagereader-git?

innir · 2018-01-05T13:07:43Z

No, language doesn't change anything. The test image is here: https://imgur.com/a/dkpHz - not complex at all.
I'll try to build gimagereader-git

manisandro · 2018-01-05T13:09:33Z

Hmm, it processes in about 3 secs here

innir · 2018-01-05T13:23:16Z

Latest git doesn't change anything, but setting Contrast to 25 reduces the time to ~4 secs.

manisandro · 2018-01-05T13:26:21Z

Which leptonica version do you have?

innir · 2018-01-05T13:30:01Z

leptonica 1.74.4

Strange, applying contrast (e.g. removing the gray background) with gimp doesn't change the recognition time. But changing anything in the image processing tab of gimagereader reduces the time to ~4 secs.
Even doing it once changes the time for subsequent images. So maybe some missing initialization somewhere?

manisandro · 2018-01-05T13:37:03Z

The image processing code uses OpenMP itself [1] (and is actually the only part of gImageReader itself that does). I'm not sufficiently knowledgeable about the OpenMP internals, but it looks like merely executing that OpenMP block before tesseract does somehow affects how the OpenMP blocks are run in tesseract. To confirm this theory, you could try adding a dummy loop somewhere in main.cpp like

#pragma omp parallel for schedule(static)
for(int i = 0; i < 4; ++i) { std::cout << "Hello" << std::endl; }

and see whether it affects tesseract performance. It would also be interesting to see how many threads tesseract uses with and without the OpenMP blocks executing in gImageReader.

[1] https://github.com/manisandro/gImageReader/blob/master/gtk/src/DisplayRenderer.cc#L41

innir · 2018-01-05T14:24:04Z

Indeed, that fixes the problem. I pushed a minimal fix here: #286

manisandro · 2018-01-05T14:31:39Z

Thanks - I'll try and make a minimal reproducer and then I'll report the issue upstream. If there is no satisfactory resolution in time for the upcoming 3.3.0 release, I'll add the workaround.

manisandro · 2018-01-06T17:46:03Z

I'm still having trouble actually reproducing this. Can you reproduce the slowdown with a minimal example like the one below? Possibly try adding some random omp-parallelized loop before and after the tess.Recognize call, adding -fopenmp to the compile flags.

// g++ -o test test.cpp $(pkg-config --cflags --libs cairomm-1.0 gdkmm-3.0 gtkmm-3.0 tesseract)

#include <cairomm/cairomm.h>
#include <gdkmm.h>
#include <gtkmm.h>
#include <iostream>
#define USE_STD_NAMESPACE
#include <tesseract/baseapi.h>
#include <tesseract/ocrclass.h>
#undef USE_STD_NAMESPACE

int main(int argc, char* argv[]) {
    Gtk::Main main;

    Glib::RefPtr<Gdk::Pixbuf> pixbuf = Gdk::Pixbuf::create_from_file("eurotext.tif");
    Cairo::RefPtr<Cairo::ImageSurface> surf = Cairo::ImageSurface::create(Cairo::FORMAT_ARGB32, pixbuf->get_width(), pixbuf->get_height());
    Cairo::RefPtr<Cairo::Context> ctx = Cairo::Context::create(surf);
    Gdk::Cairo::set_source_pixbuf(ctx, pixbuf);
    ctx->paint();

    tesseract::TessBaseAPI tess;
    ETEXT_DESC desc;
    tess.Init(nullptr, "eng");
    tess.SetImage(surf->get_data(), surf->get_width(), surf->get_height(), 4, surf->get_stride());
    tess.Recognize(&desc);

    char* text = tess.GetUTF8Text();
    std::cout << text << std::endl;
    delete[] text;

    return 0;
}

innir · 2018-01-07T12:05:52Z

No, I can't reproduce the slowdown with this example, regardless of where (or whether) I put an omp-parallelized loop, and with -fopenmp when linking.

Shreeshrii · 2018-01-11T11:21:53Z

Also see tesseract-ocr/tesseract#898 (comment)

jbarlow83 · 2018-02-23T21:30:20Z

I've observed that on machines without AVX2 (pre-Haswell, 2013), Tesseract 4 runs about 10 times slower than Tesseract 3.05, especially when the "best" training data is used.

This is reproducible by building without -mavx2.

Details here:
https://bugs.debian.org/cgi-bin/bugreport.cgi?att=0;bug=888917;msg=58

innir · 2018-02-24T12:01:31Z

Ah, that actually explains it :-) I have an i5-3470 (pre-Haswell) here. Added this info to the upstream bug report: tesseract-ocr/tesseract#1278. Still odd why this simple 'fix' #286 works...

napasa · 2018-05-31T10:00:01Z

I built gImageReader with the latest Tesseract and tested it with the fast data under Fedora; I do not have the slow-speed issue. But slow OCR does occur when I build gImageReader with the latest Tesseract compiled by MinGW for Windows. Does anyone have an idea how to solve it?

manisandro · 2019-07-28T22:15:19Z

Closing since this is not a gImageReader issue.

Freredaran · 2023-02-18T17:09:23Z

> The main difference I see is that tesseract 4.00 uses multiple threads while 3.04 does not. I can also confirm that gImageReader 3.2.0 built against tesseract 3.04 does not show this slowdown. Command line and gImageReader both take ~12s (different computer).

After updating to Ubuntu 22.04, gImageReader became incredibly slow for me. Dev manisandro was very helpful and led me to a quick and dirty CLI solution: running on a single thread. Works wonderfully for me.
In a terminal, type:

export OMP_THREAD_LIMIT=1

If you want to check that you actually are running on one thread, type:

echo $OMP_THREAD_LIMIT

Then run gImageReader:

gimagereader-gtk

Et voilà :o)

innir added a commit to innir/gImageReader that referenced this issue Jan 5, 2018

Run OpenMP once before calling tesseract (fixes: manisandro#285)

122ed08

innir mentioned this issue Feb 24, 2018

Significant speed drop on Tesseract4 vs 3 with identical image tesseract-ocr/tesseract#1278

Closed

manisandro mentioned this issue Jun 13, 2018

Windows gImageReader Version is three times slower than that of linux #345

Closed

manisandro closed this as completed Jul 28, 2019
