Tesseract fails to process a large image with missing resolution data #756

stweil · 2017-03-10T20:32:35Z

The original image in JPEG 2000 format includes two pages from a newspaper. This image is processed correctly by the latest Tesseract. Tesseract fails with the same image in TIFF, JPEG or PNG format and reports two empty pages.

This also happens with older versions of Tesseract.

jbreiden · 2017-03-10T20:41:26Z

checking...

jbreiden · 2017-03-10T20:50:43Z

Quick side note. It's good to see that there is resolution metadata in the JP2. Remember to carry that over to other formats during conversion. It did not make it to the PNG file.

$ jhove ~/Downloads/0604.jp2  | grep -i sampling
      SamplingFrequencyUnit: centimeter
      XSamplingFrequency: 118.11
      YSamplingFrequency: 118.11
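
For readers comparing this with the dpi values discussed later in the thread: jhove reports pixels per centimeter here, not dots per inch. A minimal conversion sketch (the function name is made up for illustration) shows that 118.11 pixels/cm corresponds to the usual 300 dpi scan setting:

```python
CM_PER_INCH = 2.54


def pixels_per_cm_to_dpi(pixels_per_cm: float) -> float:
    """Convert a sampling frequency in pixels/cm to dots per inch."""
    return pixels_per_cm * CM_PER_INCH


dpi = pixels_per_cm_to_dpi(118.11)
print(round(dpi))  # 300
```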

jbreiden · 2017-03-10T21:01:35Z

Tesseract is able to find text when resolution metadata is properly set. Result is 54 megabytes, so a little too big to attach. But it works and you should be able to reproduce.

$ mogrify -density 300x300 -units PixelsPerInch 0604.png
$ tesseract -l ger 0604.png 0604 pdf

stweil · 2017-03-10T21:15:59Z

Do you think that Tesseract could handle missing resolution information in a more user friendly way? I created the test images using convert 0604.jp2 0604.png (or similar for other formats). I could imagine Tesseract trying 300 dpi in addition to the 70 dpi which it claims to use:

tesseract 0604.png /tmp/0604-png
Info in bmfCreate: Generating pixa of bitmap fonts from string
Tesseract Open Source OCR Engine v4.00.00alpha-332-g4c5d0b5 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Empty page!!
Empty page!!

Does 70 dpi as default value make sense at all? And why does the resolution matter? Will Tesseract detect only characters of a certain size?

jbreiden · 2017-03-10T22:02:15Z

Maybe 70 was chosen because it was screen resolution back when dinosaurs walked the earth, and Tesseract was first written? Why does resolution matter? I'm guessing there are complicated heuristics somewhere in the code that try to guess at likely font sizes. For example, if I crop out a small piece of the newspaper and set it to 0 dpi, we get results. Sounds like investigation is needed. Or we can ask Ray.

PS. Irrespective of this bug, try to use good hygiene with resolution metadata. Maybe some day later you'll want to know what size the fonts are. Or something where you might regret losing the resolution metadata. I've seen it happen far too many times.

$ tesseract -l ger_old /tmp/foo.png -
Warning. Invalid resolution 0 dpi. Using 70 instead.
Magdeburg. [55996]

In das iit heute
bei der _ unter RNr. 151 verzeichneten
Fort-
fchritt, eingetragene
mit befohräufter Heofipflicht' in Dl.
venftedt eingetragen worden: Die Ge-
nofenfhaft ift durd BVefhluf der Ge-
neralverfammlung vom 16. Uuguft 1920
aufgelöft. Anuguft Üterwedde und Leo
Krötfi, beide in Olvenfiedt, find zu
Liquidatoren bejielt.

Magdeburg, dem 19. AÄuguft 1920.
OVa& IAmtenericht A A

stweil · 2017-04-23T12:51:47Z

Or we can ask Ray.

@theraysmith, the current code includes a hard-coded value of 70 dpi as the minimum resolution and raises any smaller resolution to that value. This is also done for images which don't include any resolution information ("0 dpi"). Maybe it would be better to assume 300 dpi for that special case. Why does the resolution matter at all?
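
To make the two behaviours concrete, here is a rough sketch of the clamping as described in this comment next to the suggested special case for missing metadata. It is not Tesseract's actual code; the function names and the constant are illustrative only:

```python
MIN_CREDIBLE_RESOLUTION = 70  # dpi; the hard-coded minimum described above


def effective_resolution(metadata_dpi: int) -> int:
    """Behaviour as described: anything below the minimum is raised to it."""
    if metadata_dpi < MIN_CREDIBLE_RESOLUTION:
        print(f"Warning. Invalid resolution {metadata_dpi} dpi. "
              f"Using {MIN_CREDIBLE_RESOLUTION} instead.")
        return MIN_CREDIBLE_RESOLUTION
    return metadata_dpi


def proposed_resolution(metadata_dpi: int, assumed_when_missing: int = 300) -> int:
    """Suggested variant: treat 0 dpi (no metadata) as a typical scan resolution."""
    if metadata_dpi == 0:
        return assumed_when_missing
    return max(metadata_dpi, MIN_CREDIBLE_RESOLUTION)
```

With these definitions, an image without metadata is processed at 70 dpi by the first function and at 300 dpi by the second, while an implausibly low but non-zero value is raised to 70 in both.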

amitdo · 2017-04-23T13:16:09Z

IMO, assuming Tesseract really needs to know the resolution, when the dpi is absent or seems suspicious, the program should not try to guess the dpi and ocr the page. It should just print an error message.

stweil · 2017-04-23T13:53:28Z

Maybe. It is not clear why the dpi information is needed at all. I can read text of any dpi (just have to adapt the reading distance or get some glasses) without knowing the actual dpi value, and ideally OCR software can do that, too.

If the dpi value is important, we need an option to set it for images without (or with wrong) resolution metadata.

theraysmith · 2017-04-25T01:10:18Z

The resolution is only used by layout analysis. It sets the threshold size at which to call possible text so ridiculously small that it can't possibly be text. I.e. it helps to distinguish text from noise. There is also some auto scaling somewhere in the preprocessing to magnify low resolution text that is not needed by the LSTM engine, but is needed by the legacy engine.
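
As an illustration of why the assumed resolution changes what counts as "too small to be text": the same blob height in pixels maps to very different physical sizes. The cutoff below is a hypothetical value chosen for the example, not one taken from Tesseract's layout analysis:

```python
POINTS_PER_INCH = 72
MIN_TEXT_HEIGHT_PT = 4.0  # hypothetical cutoff, not a value from Tesseract


def height_in_points(height_px: float, dpi: float) -> float:
    """Physical height of a blob in typographic points at the given dpi."""
    return height_px / dpi * POINTS_PER_INCH


def looks_like_noise(height_px: float, dpi: float) -> bool:
    """True if the blob would be smaller than any plausible text."""
    return height_in_points(height_px, dpi) < MIN_TEXT_HEIGHT_PT


# The same 12-pixel blob is plausible text at 70 dpi but tiny at 300 dpi.
print(looks_like_noise(12, 70), looks_like_noise(12, 300))
```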


jbreiden · 2017-04-25T02:25:58Z

Sounds like implementation is disagreeing with intention. Where's the layout analysis resolution code?

amitdo · 2017-04-25T12:20:12Z

https://github.com/tesseract-ocr/tesseract/search?q=resolution

theraysmith · 2017-04-25T15:36:54Z

At the time that 70 minimum was set, it was a question of how to cover the most probable case.
Back in the day when most inputs came from a flatbed scanner, the resolution was provided.
Most images that did not have a resolution were screenshots at ~70ppi. In theory processing a 300ppi image at 70 should be less damaging to accuracy than processing a 70ppi image at 300, but that seems not to be the case in the original post. It would be worth taking a look at why that happens. There might be an easy fix.
Incidentally, most monitors today still give you not much more than 70ppi, (maybe 150) but they give you a bigger screen with even more small text on it. Only phones manage ~300ppi and maybe my new laptop, which has more pixels than my 24" monitors in less than half the area.

Now, when a lot of images come from camera phones, the resolution is largely unknown, and layout analysis requires some more work.

Incidentally, there is an easy way to set the resolution: set it in the Pix before passing it to TessBaseAPI.

theraysmith · 2017-04-26T20:01:45Z

I now have a reasonably general fix for the resolution issue. There are multiple unsolved problems with the original 0604 image though:

There are large gaps between words, but tiny gaps between columns. That was causing column finding to fail, causing the blank page determination. The problem is that it sees the large gaps between words, which at 70 ppi look huge, and decides that it shouldn't merge them into textlines. Although that should be fixed, it is a highly dangerous thing to try without very careful testing.
The columns aren't straight. The layout analysis is fundamentally broken in such cases. It can't cut a straight line (even at an angle) through the very narrow bent gap between columns.

A general fix for resolution is to estimate the resolution based on the measured body text size, which is available before the column finder is constructed. That makes for an easy fix. On the original 0604 image, it estimates the resolution to be 470 ppi but still generates a poor layout analysis, due to the above problems.
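
The idea of estimating resolution from body text size can be sketched as follows. This is not the actual fix; it assumes that body text has some typical physical height (the 8 pt figure is a made-up assumption) and that the median measured line height in pixels is a fair proxy for body text:

```python
from statistics import median

POINTS_PER_INCH = 72
ASSUMED_BODY_TEXT_PT = 8.0  # hypothetical typical body text height


def estimate_resolution(line_heights_px: list[float]) -> int:
    """Guess the dpi from measured text heights, assuming typical body text."""
    body_height_px = median(line_heights_px)
    return round(body_height_px * POINTS_PER_INCH / ASSUMED_BODY_TEXT_PT)


# Heights in pixels of a few measured text lines (invented sample values).
print(estimate_resolution([33, 34, 32, 35, 33]))  # 297
```

Using the median keeps a few headlines or specks from skewing the estimate, but the result is only as good as the assumed text size.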


stweil · 2017-04-27T04:58:43Z

Many thanks for this analysis and your efforts.

jeroen · 2017-06-09T18:53:51Z

+1 this is biting me as well. I have a small demo which was working a year ago, but now it is giving:

> text <- ocr("http://jeroen.github.io/files/inlove.png")
Warning. Invalid resolution 0 dpi. Using 70 instead.
Too few characters. Skipping this page

I guess the problem is that the default resolution is too low?

brlin-tw · 2018-01-06T18:36:03Z

Hello. I'd like to propose adding a command-line option that lets the user specify the DPI manually, since the person who scanned the image usually knows what DPI was used.

jbreiden · 2018-01-06T20:27:14Z

Why not use a command line tool that modifies the dpi of an image file? Like mogrify -density.

brlin-tw · 2018-01-06T20:51:58Z

@jbreiden
Thanks for the pointer. In fact, I had no idea the mogrify command existed. I still feel it should be viable and simple to directly specify a parameter that the program cannot determine by itself, though.

jbreiden · 2018-01-07T17:50:56Z

An override in Tesseract would produce PDF output with inconsistent metadata. The image's embedded resolution would disagree with the PDF image object metadata.

brlin-tw · 2018-01-07T19:06:35Z

What if we only allow overriding the DPI when the image doesn't have any embedded resolution info so the inconsistency won't occur?
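
A sketch of the precedence rule being proposed here, with illustrative names only: the embedded value always wins, and the user-supplied value is consulted only when the image carries no resolution at all.

```python
from typing import Optional


def resolve_dpi(embedded_dpi: int, user_dpi: Optional[int] = None) -> Optional[int]:
    """Pick the dpi to use; 0 for embedded_dpi means no metadata in the image."""
    if embedded_dpi > 0:
        return embedded_dpi  # never contradict what the image itself says
    return user_dpi  # may be None if the user gave no override either
```

Under this rule an override of 150 on an image tagged 300 dpi is ignored, so image metadata and output metadata cannot disagree.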

amitdo · 2018-01-07T19:30:27Z

... when the image doesn't have any embedded resolution info

a18620c

jbreiden · 2018-01-07T23:57:23Z

@amitdo is pointing out we already do exactly this. Which is a pretty good point. I still don't like it though. Somebody somewhere is inevitably going to build a document scanning product with this code, outputting PDF. Then someone else is going to re-OCR that data by extracting the images. And it will all go downhill due to missing or inconsistent resolution metadata. I've had so many problems with this sort of thing in life that I definitely don't want to encourage inconsistency. But I'm just one person with an opinion, and reasonable people can disagree.

stweil · 2018-01-08T08:13:20Z

Should Tesseract simply refuse to handle images without resolution metadata (instead of guessing the resolution and potentially producing wrong results)? That would solve my reported problem, too.

jbreiden · 2018-01-08T18:42:05Z

@stweil It's tempting. Let's think about this. We would lose the ability to OCR certain types of image files that don't support resolution metadata, like pnm. And it's a little hard to predict the chaos this might cause in the 341 packages that now depend on Tesseract. A possible compromise is to make PDF output fail when image metadata is unset, since that's the most problematic scenario. Honestly I'm not sure what is best.

asmaier · 2018-02-23T12:49:40Z

If you are just looking for a workaround with the Java API of Tesseract, try this:

import static org.bytedeco.javacpp.tesseract.TessBaseAPI;

TessBaseAPI api = init();
tesseract.TessBaseAPISetImage2(api, image);
tesseract.TessBaseAPISetSourceResolution(api, 70);

This will simply set the resolution of the image to 70 dpi.
See https://stackoverflow.com/questions/47268601/suppress-warning-on-console-when-using-tess4j-for-ocring

zdenop · 2018-09-28T07:31:33Z

IMO there are two easy ways to improve the situation:

implement kMinCredibleResolution as a parameter that the user can modify
implement an option for the tesseract app to set the dpi of the input image (e.g. --dpi 300) with pixSetResolution

zdenop · 2018-09-28T18:36:17Z

Commit 345e5ee allows the user to specify the dpi, e.g.
tesseract 0604.jp2 0604_jp2 -l deu --dpi 300 pdf
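
For anyone driving Tesseract from a script, the same invocation can be assembled programmatically. This is just a sketch; the helper name is hypothetical, and it assumes a Tesseract build that includes the --dpi option:

```python
import shlex


def tesseract_command(image: str, output_base: str, lang: str, dpi: int,
                      output_format: str = "pdf") -> list[str]:
    """Build the argument list for a tesseract run with an explicit dpi."""
    return ["tesseract", image, output_base,
            "-l", lang, "--dpi", str(dpi), output_format]


cmd = tesseract_command("0604.jp2", "0604_jp2", "deu", 300)
print(shlex.join(cmd))  # tesseract 0604.jp2 0604_jp2 -l deu --dpi 300 pdf
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to run the OCR.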

brlin-tw · 2018-09-28T18:51:55Z

I assume it is commit a0564fd?

zdenop · 2018-09-28T18:55:11Z

Yes - I copied the wrong commit ;-)

jbreiden closed this as completed Mar 10, 2017

stweil changed the title ~~Tesseract fails to process a large image (depending on image format)~~ Tesseract fails to process a large image with missing resolution data Mar 10, 2017

jbreiden reopened this Mar 10, 2017

amitdo mentioned this issue Aug 9, 2017

Change default resolution from 70 to 300 dpi #1070

Merged

zdenop closed this as completed Sep 28, 2018

amitdo mentioned this issue Sep 29, 2018

Don't use DPI as a way to refer to word size in documentation #1846

Closed

amitdo added the image resolution label Mar 19, 2021
