Tesseract fails to process a large image with missing resolution data #756

Closed
stweil opened this issue Mar 10, 2017 · 29 comments

Comments

@stweil
Member

stweil commented Mar 10, 2017

The original image in JPEG 2000 format includes two pages from a newspaper. This image is processed correctly by the latest Tesseract. With the same image in TIFF, JPEG or PNG format, Tesseract fails and reports two empty pages.

This also happens with older versions of Tesseract.

@jbreiden
Contributor

checking...

@jbreiden
Contributor

Quick side note. It's good to see that there is resolution metadata in the JP2. Remember to carry that over to other formats during conversion. It did not make it to the PNG file.

$ jhove ~/Downloads/0604.jp2  | grep -i sampling
      SamplingFrequencyUnit: centimeter
      XSamplingFrequency: 118.11
      YSamplingFrequency: 118.11

@jbreiden
Contributor

Tesseract is able to find text when resolution metadata is properly set. Result is 54 megabytes, so a little too big to attach. But it works and you should be able to reproduce.

$ mogrify -density 300x300 -units PixelsPerInch 0604.png
$ tesseract -l ger 0604.png 0604 pdf
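
To confirm that a PNG actually carries resolution metadata after a step like the mogrify call above, one can look for its pHYs chunk. The following is a minimal Python sketch using only the standard library. read_png_dpi is a hypothetical helper name, not part of Tesseract or ImageMagick, and it handles PNG only.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METERS_PER_INCH = 0.0254


def read_png_dpi(data: bytes) -> Optional[Tuple[float, float]]:
    """Return (x_dpi, y_dpi) from the pHYs chunk, or None if there is none."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, body, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs" and len(body) == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:
                # Unit 0 means only the aspect ratio is known.
                return None
            # Unit 1 means pixels per meter.
            return (x_ppu * METERS_PER_INCH, y_ppu * METERS_PER_INCH)
        if chunk_type in (b"IDAT", b"IEND"):
            # pHYs must appear before the image data, so stop here.
            return None
        pos += 12 + length
    return None
```

Calling read_png_dpi on the bytes of a file converted without resolution metadata should return None, which corresponds to the "0 dpi" case discussed in this issue.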

@stweil
Member Author

stweil commented Mar 10, 2017

Do you think that Tesseract could handle missing resolution information in a more user friendly way? I created the test images using convert 0604.jp2 0604.png (or similar for other formats). I could imagine Tesseract trying 300 dpi in addition to the 70 dpi which it claims to use:

tesseract 0604.png /tmp/0604-png
Info in bmfCreate: Generating pixa of bitmap fonts from string
Tesseract Open Source OCR Engine v4.00.00alpha-332-g4c5d0b5 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Empty page!!
Empty page!!

Does 70 dpi as default value make sense at all? And why does the resolution matter? Will Tesseract detect only characters of a certain size?

@stweil stweil changed the title from "Tesseract fails to process a large image (depending on image format)" to "Tesseract fails to process a large image with missing resolution data" on Mar 10, 2017
@jbreiden
Contributor

jbreiden commented Mar 10, 2017

Maybe 70 was chosen because it was screen resolution back when dinosaurs walked the earth, and Tesseract was first written? Why does resolution matter? I'm guessing there are complicated heuristics somewhere in the code that try to guess at likely font sizes. For example, if I crop out a small piece of the newspaper and set it to 0 dpi, we get results. Sounds like investigation is needed. Or we can ask Ray.

PS. Irrespective of this bug, try to use good hygiene with resolution metadata. Maybe some day later you'll want to know what size the fonts are. Or something where you might regret losing the resolution metadata. I've seen it happen far too many times.
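
One way to see why resolution could matter for font-size heuristics (a guess in this thread, not a confirmed description of Tesseract's code): the same glyph height in pixels corresponds to very different point sizes depending on the assumed dpi. A small Python sketch, with a glyph height of 40 pixels chosen purely for illustration:

```python
POINTS_PER_INCH = 72  # typographic points per inch


def pixel_height_to_points(height_px: float, dpi: float) -> float:
    """Convert a height in pixels to typographic points at a given dpi."""
    return height_px * POINTS_PER_INCH / dpi


# Same 40-pixel glyph, interpreted at a scanner-like and at the fallback dpi.
for dpi in (300, 70):
    print(dpi, round(pixel_height_to_points(40, dpi), 1))
```

With these numbers, 40 pixels is about 9.6 pt at 300 dpi but about 41.1 pt at 70 dpi, so body text from a scan could look like headline-sized type if the resolution is assumed to be 70.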

$ tesseract -l ger_old /tmp/foo.png -
Warning. Invalid resolution 0 dpi. Using 70 instead.
Magdeburg. [55996]

In das iit heute
bei der _ unter RNr. 151 verzeichneten
Fort-
fchritt, eingetragene
mit befohräufter Heofipflicht' in Dl.
venftedt eingetragen worden: Die Ge-
nofenfhaft ift durd BVefhluf der Ge-
neralverfammlung vom 16. Uuguft 1920
aufgelöft. Anuguft Üterwedde und Leo
Krötfi, beide in Olvenfiedt, find zu
Liquidatoren bejielt.

Magdeburg, dem 19. AÄuguft 1920.
OVa& IAmtenericht A A

foo

@jbreiden jbreiden reopened this Mar 10, 2017
@stweil
Member Author

stweil commented Apr 23, 2017

Or we can ask Ray.

@theraysmith, the current code includes a hard-coded value of 70 dpi as the minimum resolution and raises any smaller resolution to that value. This is also done for images which don't include any resolution information ("0 dpi"). Maybe it would be better to assume 300 dpi for that special case. Why does the resolution matter at all?
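
The behaviour described here, and the suggested alternative, can be sketched as follows. This is an illustrative Python model only, not Tesseract's actual code; the constant and function names are made up for the example.

```python
MIN_CREDIBLE_RESOLUTION = 70  # the hard-coded minimum described in this comment


def current_behaviour(reported_dpi: int) -> int:
    """Anything below the minimum, including a missing value (0), becomes 70."""
    return max(reported_dpi, MIN_CREDIBLE_RESOLUTION)


def proposed_behaviour(reported_dpi: int, assumed_dpi: int = 300) -> int:
    """Treat a missing resolution (0) as a special case and assume 300 dpi."""
    if reported_dpi == 0:
        return assumed_dpi
    return max(reported_dpi, MIN_CREDIBLE_RESOLUTION)


print(current_behaviour(0), proposed_behaviour(0))
```

Both functions agree for any image that reports a resolution; they differ only for the "0 dpi" case that triggers the warning quoted earlier.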

@amitdo
Collaborator

amitdo commented Apr 23, 2017

IMO, assuming Tesseract really needs to know the resolution, when the dpi is absent or seems suspicious, the program should not try to guess the dpi and ocr the page. It should just print an error message.

@stweil
Member Author

stweil commented Apr 23, 2017

Maybe. It is not clear why the dpi information is needed at all. I can read text of any dpi (just have to adapt the reading distance or get some glasses) without knowing the actual dpi value, and ideally OCR software can do that, too.

If the dpi value is important, we need an option to set it for images without (or with wrong) resolution metadata.

@theraysmith
Contributor

theraysmith commented Apr 25, 2017 via email

@jbreiden
Contributor

Sounds like implementation is disagreeing with intention. Where's the layout analysis resolution code?

@amitdo
Collaborator

amitdo commented Apr 25, 2017

@theraysmith
Contributor

At the time that 70 minimum was set, it was a question of how to cover the most probable case.
Back in the day when most inputs came from a flatbed scanner, the resolution was provided.
Most images that did not have a resolution were screenshots at ~70ppi. In theory, processing a 300ppi image at 70 should be less damaging to accuracy than processing a 70ppi image at 300, but that seems not to be the case in the original post. It would be worth taking a look at why that happens. There might be an easy fix.
Incidentally, most monitors today still give you not much more than 70ppi (maybe 150), but they give you a bigger screen with even more small text on it. Only phones manage ~300ppi, and maybe my new laptop, which has more pixels than my 24" monitors in less than half the area.

Now, when a lot of images come from camera phones, the resolution is largely unknown, and layout analysis requires some more work.

Incidentally, there is an easy way to set the resolution: set it in the Pix before passing it to TessBaseAPI.

@theraysmith
Contributor

theraysmith commented Apr 26, 2017 via email

@stweil
Member Author

stweil commented Apr 27, 2017

Many thanks for this analysis and your efforts.

@jeroen
Contributor

jeroen commented Jun 9, 2017

+1 this is biting me as well. I have a small demo which was working a year ago, but now it is giving:

> text <- ocr("http://jeroen.github.io/files/inlove.png")
Warning. Invalid resolution 0 dpi. Using 70 instead.
Too few characters. Skipping this page

I guess the problem is that the default resolution is too low?

@brlin-tw
Contributor

brlin-tw commented Jan 6, 2018

Hello. I'd like to propose adding a command-line option that lets the user specify the DPI manually, since the person who scanned the image usually knows what DPI was used.

@jbreiden
Contributor

jbreiden commented Jan 6, 2018

Why not use a command line tool that modifies the dpi of an image file? Like mogrify -density.

@brlin-tw
Contributor

brlin-tw commented Jan 6, 2018

@jbreiden
Thanks for the pointer; I wasn't aware of the mogrify command in the first place. I still feel that it should be possible, and simple, to tell the program directly about a parameter it cannot determine itself.

@jbreiden
Contributor

jbreiden commented Jan 7, 2018

An override in Tesseract would produce PDF output with inconsistent metadata. The image's embedded resolution would disagree with the PDF image object metadata.

@brlin-tw
Contributor

brlin-tw commented Jan 7, 2018

What if we only allow overriding the DPI when the image doesn't have any embedded resolution info so the inconsistency won't occur?
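
The rule proposed here can be written down as a small decision function. This is a Python sketch of the proposal only, under the assumption that a missing embedded resolution is reported as 0; choose_resolution is a hypothetical name and does not correspond to any Tesseract function.

```python
from typing import Optional


def choose_resolution(embedded_dpi: int, user_dpi: Optional[int]) -> Optional[int]:
    """Embedded metadata always wins; the user value only fills a gap."""
    if embedded_dpi > 0:
        # The image carries its own resolution, so any override is ignored
        # and the output metadata stays consistent with the image.
        return embedded_dpi
    # No embedded resolution: use the user's value, or None if none was given.
    return user_dpi


print(choose_resolution(0, 300), choose_resolution(118, 300))
```

A return value of None marks the case where neither source provides a resolution, which the caller would still have to handle (by guessing, warning, or refusing).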

@amitdo
Collaborator

amitdo commented Jan 7, 2018

... when the image doesn't have any embedded resolution info

a18620c

@jbreiden
Contributor

jbreiden commented Jan 7, 2018

@amitdo is pointing out we already do exactly this. Which is a pretty good point. I still don't like it though. Somebody somewhere is inevitably going to build a document scanning product with this code, outputting PDF. Then someone else is going to re-OCR that data by extracting the images. And it will all go downhill due to missing or inconsistent resolution metadata. I've had so many problems with this sort of thing in life that I definitely don't want to encourage inconsistency. But I'm just one person with an opinion, and reasonable people can disagree.

@stweil
Member Author

stweil commented Jan 8, 2018

Should Tesseract simply refuse to handle images without resolution metadata (instead of guessing the resolution and potentially producing wrong results)? That would solve my reported problem, too.

@jbreiden
Contributor

jbreiden commented Jan 8, 2018

@stweil It's tempting. Let's think about this. We would lose the ability to OCR certain types of image files that don't support resolution metadata, like pnm. And it's a little hard to predict the chaos this might cause in the 341 packages that now depend on Tesseract. A possible compromise is to make PDF output fail when image metadata is unset, since that's the most problematic scenario. Honestly I'm not sure what is best.

@asmaier

asmaier commented Feb 23, 2018

If you are just searching for a workaround with the Java API of Tesseract, try this:

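import org.bytedeco.javacpp.tesseract;
import static org.bytedeco.javacpp.tesseract.TessBaseAPI;

// init() is presumably the poster's own helper that creates and initializes the API handle.
TessBaseAPI api = init();
tesseract.TessBaseAPISetImage2(api, image);
// Set the resolution explicitly after the image has been set.
tesseract.TessBaseAPISetSourceResolution(api, 70);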

This will simply set the resolution of the image to 70 dpi.
See https://stackoverflow.com/questions/47268601/suppress-warning-on-console-when-using-tess4j-for-ocring

@zdenop
Contributor

zdenop commented Sep 28, 2018

IMO there are two ways we can easily improve the situation:

  1. implement kMinCredibleResolution as a parameter that the user can modify
  2. implement an option for the tesseract app to set the dpi of the input image (e.g. --dpi 300) with pixSetResolution

@zdenop
Contributor

zdenop commented Sep 28, 2018

Commit 345e5ee allows the user to specify the dpi, e.g.
tesseract 0604.jp2 0604_jp2 -l deu --dpi 300 pdf
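
For scripted use, the command line above can be assembled programmatically. A minimal Python sketch, assuming a Tesseract build that includes the --dpi option; tesseract_command is a hypothetical helper, and nothing here actually runs Tesseract:

```python
import shlex
from typing import List, Optional


def tesseract_command(
    image: str,
    output_base: str,
    lang: str = "deu",
    dpi: Optional[int] = None,
    output_format: str = "pdf",
) -> List[str]:
    """Build an argument list in the same order as the example in this comment."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if dpi is not None:
        # Only pass --dpi when the caller actually knows the resolution.
        cmd += ["--dpi", str(dpi)]
    cmd.append(output_format)
    return cmd


print(shlex.join(tesseract_command("0604.jp2", "0604_jp2", dpi=300)))
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine where Tesseract is installed.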

@zdenop zdenop closed this as completed Sep 28, 2018
@brlin-tw
Contributor

I assume it is commit a0564fd?

@zdenop
Contributor

zdenop commented Sep 28, 2018

Yes - I copied the wrong commit ;-)

8 participants