Tesseract fails to process a large image with missing resolution data #756
Comments
checking... |
Quick side note. It's good to see that there is resolution metadata in the JP2. Remember to carry that over to other formats during conversion. It did not make it to the PNG file.
|
Tesseract is able to find text when resolution metadata is properly set. Result is 54 megabytes, so a little too big to attach. But it works and you should be able to reproduce.
|
Do you think that Tesseract could handle missing resolution information in a more user friendly way? I created the test images using
Does 70 dpi as a default value make sense at all? And why does the resolution matter? Will Tesseract detect only characters of a certain size? |
Maybe 70 was chosen because it was screen resolution back when dinosaurs walked the earth, and Tesseract was first written? Why does resolution matter? I'm guessing there are complicated heuristics somewhere in the code that try to guess at likely font sizes. For example, if I crop out a small piece of the newspaper and set it to 0 dpi, we get results. Sounds like investigation is needed. Or we can ask Ray. PS. Irrespective of this bug, try to use good hygiene with resolution metadata. Maybe some day later you'll want to know what size the fonts are. Or something where you might regret losing the resolution metadata. I've seen it happen far too many times.
|
@theraysmith, the current code includes a hard-coded value of 70 dpi as the minimum resolution and sets any smaller resolution to that value. This is also done for images which don't include any resolution information ("0 dpi"). Maybe it would be better to assume 300 dpi for that special case. Why does the resolution matter at all? |
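To make the two behaviours concrete, here is a minimal sketch of the clamping described above next to the proposed special case. It is an illustration only, not Tesseract's actual code: the function names and constants are hypothetical, and only the values 70 and 300 come from this thread.

```python
# Hypothetical names; only the numbers 70 and 300 are taken from the thread.
MIN_DPI = 70                # hard-coded minimum described above
PROPOSED_DEFAULT_DPI = 300  # suggested assumption when metadata is missing


def current_policy(dpi: int) -> int:
    """Any resolution below the minimum, including 0 (missing), becomes 70."""
    return max(dpi, MIN_DPI)


def proposed_policy(dpi: int) -> int:
    """Treat 0 as 'unknown' and assume 300; still clamp other low values."""
    if dpi <= 0:
        return PROPOSED_DEFAULT_DPI
    return max(dpi, MIN_DPI)


for dpi in (0, 50, 300):
    print(dpi, current_policy(dpi), proposed_policy(dpi))
```

The only input where the two policies differ is a missing resolution (0 dpi).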
IMO, assuming Tesseract really needs to know the resolution, when the dpi is absent or seems suspicious, the program should not try to guess the dpi and ocr the page. It should just print an error message. |
Maybe. It is not clear why the dpi information is needed at all. I can read text of any dpi (just have to adapt the reading distance or get some glasses) without knowing the actual dpi value, and ideally OCR software can do that, too. If the dpi value is important, we need an option to set it for images without (or with wrong) resolution metadata. |
The resolution is only used by layout analysis.
It sets the threshold size at which to call possible text so ridiculously
small that it can't possibly be text. I.e. it helps to distinguish text
from noise.
There is also some auto scaling somewhere in the preprocessing to magnify
low resolution text that is not needed by the LSTM engine, but is needed by
the legacy engine.
…On Sun, Apr 23, 2017 at 6:53 AM, Stefan Weil ***@***.***> wrote:
Maybe. It is not clear why the dpi information is needed at all. I can
read text of any dpi (just have to adapt the reading distance or get some
glasses) without knowing the actual dpi value, and ideally OCR software can
do that, too.
If the dpi value is important, we need an option to set it for images
without (or with wrong) resolution metadata.
--
Ray.
|
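A rough sketch of why the assumed resolution changes the text-versus-noise decision described above. The 4 pt cutoff and the function names are assumptions made for illustration and are not values from Tesseract; the point is only that a size threshold expressed in physical units turns into a different pixel threshold at each dpi.

```python
POINTS_PER_INCH = 72.0


def min_text_height_px(dpi: float, min_text_points: float = 4.0) -> float:
    """Convert a physical size cutoff (in points) to pixels at the given dpi.

    min_text_points is a made-up cutoff for this sketch.
    """
    return min_text_points * dpi / POINTS_PER_INCH


def looks_like_noise(blob_height_px: float, dpi: float) -> bool:
    """A blob smaller than the cutoff is considered too small to be text."""
    return blob_height_px < min_text_height_px(dpi)


# The same 5-pixel blob is judged differently depending on the assumed dpi.
print(looks_like_noise(5, dpi=70))
print(looks_like_noise(5, dpi=300))
```

With these assumed numbers, a 5-pixel blob passes as possible text at 70 dpi but is rejected as noise at 300 dpi, so a wrong resolution shifts the boundary between the two.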
Sounds like implementation is disagreeing with intention. Where's the layout analysis resolution code? |
At the time the minimum of 70 was set, it was a question of how to cover the most probable case. Now, when a lot of images come from camera phones, the resolution is largely unknown, and layout analysis requires some more work. Incidentally, there is an easy way to set the resolution: set it in the Pix before passing it to TessBaseAPI. |
I now have a reasonably general fix for the resolution issue.
There are multiple unsolved problems with the original 0604 image though:
1. There are large gaps between words, but tiny gaps between columns. That was
causing column finding to fail, causing the blank page determination. The
problem is that it sees the large gaps between words, which at 70 ppi look
huge, and decides that it shouldn't merge them into textlines. Although
that should be fixed, it is a highly dangerous thing to try without very
careful testing.
2. The columns aren't straight. The layout analysis is fundamentally broken in
such cases. It can't cut a straight line (even at an angle) through the
very narrow bent gap between columns.
A general fix for resolution is to estimate the resolution based on the
measured body text size, which is available before the column finder is
constructed. That makes for an easy fix.
On the original 0604 image, it estimates the resolution to be 470 ppi but
still generates a poor layout analysis, due to the above problems.
…On Tue, Apr 25, 2017 at 5:20 AM, Amit D. ***@***.***> wrote:
https://github.com/tesseract-ocr/tesseract/search?q=resolution
--
Ray.
|
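The idea of estimating resolution from the measured body text size can be sketched roughly as follows. This is not the fix that went into Tesseract: the function name, the assumed 10 pt body size and the assumed x-height ratio of 0.5 are all hypothetical choices for illustration.

```python
import statistics
from typing import Sequence

POINTS_PER_INCH = 72.0


def estimate_ppi(
    xheights_px: Sequence[float],
    assumed_body_points: float = 10.0,   # assumption: typical body text size
    assumed_xheight_ratio: float = 0.5,  # assumption: x-height / point size
) -> float:
    """Estimate pixels per inch from the median x-height of text blobs."""
    if not xheights_px:
        raise ValueError("need at least one measured text height")
    median_px = statistics.median(xheights_px)
    xheight_points = assumed_body_points * assumed_xheight_ratio
    return median_px * POINTS_PER_INCH / xheight_points


# Measured x-heights in pixels for a handful of blobs (made-up numbers).
print(estimate_ppi([24, 25, 26]))
```

Using the median keeps a few oversized headline glyphs or specks of noise from skewing the estimate, but the result is only as good as the assumption about the physical text size.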
Many thanks for this analysis and your efforts. |
+1 this is biting me as well. I have a small demo which was working a year ago, but now it is giving:
> text <- ocr("http://jeroen.github.io/files/inlove.png")
Warning. Invalid resolution 0 dpi. Using 70 instead.
Too few characters. Skipping this page
I guess the problem is that the default resolution is too low? |
Hello. I'd like to propose adding a command-line option that lets the user specify the DPI manually, since the person who scanned the image usually knows what DPI was used. |
Why not use a command line tool that modifies the dpi of an image file? Like mogrify -density. |
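For PNG files specifically, fixing the metadata in the file itself can also be sketched with the Python standard library alone. PNG stores resolution in the `pHYs` chunk as pixels per metre, and the chunk must come before the first `IDAT`. The helper names below are hypothetical, and this is a minimal sketch rather than a robust PNG tool (it does not verify the CRCs of the input, for instance).

```python
import struct
import zlib
from typing import Iterator, Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METERS_PER_INCH = 0.0254


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (type, data) for every chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def read_png_dpi(png: bytes) -> Optional[Tuple[int, int]]:
    """Return (x_dpi, y_dpi), or None when no metric pHYs chunk is present."""
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"pHYs":
            x_ppm, y_ppm, unit = struct.unpack(">IIB", data)
            if unit != 1:  # 1 means "pixels per metre"; 0 is aspect ratio only
                return None
            return round(x_ppm * METERS_PER_INCH), round(y_ppm * METERS_PER_INCH)
    return None


def set_png_dpi(png: bytes, dpi: int) -> bytes:
    """Return a copy of the PNG with its pHYs chunk set to the given dpi."""
    ppm = round(dpi / METERS_PER_INCH)
    phys = make_chunk(b"pHYs", struct.pack(">IIB", ppm, ppm, 1))
    out = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"pHYs":
            continue  # drop any existing resolution chunk
        out.append(make_chunk(chunk_type, data))
        if chunk_type == b"IHDR":
            out.append(phys)  # right after the header, so before any IDAT
    return b"".join(out)
```

A typical use would be to read the file as bytes, check `read_png_dpi`, and write back the result of `set_png_dpi(data, 300)` only when the check returns `None`.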
@jbreiden |
An override in Tesseract would produce PDF output with inconsistent metadata. The image's embedded resolution would disagree with the PDF image object metadata. |
What if we only allow overriding the DPI when the image doesn't have any embedded resolution info so the inconsistency won't occur? |
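The rule proposed here (a user-supplied DPI counts only when the image carries no resolution of its own) could look roughly like this. The names are hypothetical and this is not how any Tesseract option is implemented; raising an error when neither value is available is just one possible choice for the remaining case.

```python
from typing import Optional


class MissingResolutionError(ValueError):
    """Raised when neither the image nor the user provides a resolution."""


def choose_dpi(embedded_dpi: int, user_dpi: Optional[int] = None) -> int:
    """Prefer embedded metadata; fall back to the user's value only if absent."""
    if embedded_dpi > 0:
        # The image has its own resolution, so the override is ignored and
        # the output metadata cannot disagree with the image.
        return embedded_dpi
    if user_dpi is not None and user_dpi > 0:
        return user_dpi
    raise MissingResolutionError("image has no resolution and none was given")


print(choose_dpi(300, user_dpi=150))  # embedded value wins
print(choose_dpi(0, user_dpi=150))    # override fills the gap
```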
|
@amitdo is pointing out we already do exactly this. Which is a pretty good point. I still don't like it though. Somebody somewhere is inevitably going to build a document scanning product with this code, outputting PDF. Then someone else is going to re-OCR that data by extracting the images. And it will all go downhill due to missing or inconsistent resolution metadata. I've had so many problems with this sort of thing in life that I definitely don't want to encourage inconsistency. But I'm just one person with an opinion, and reasonable people can disagree. |
Should Tesseract simply refuse to handle images without resolution metadata (instead of guessing the resolution and potentially producing wrong results)? That would solve my reported problem, too. |
@stweil It's tempting. Let's think about this. We would lose the ability to OCR certain types of image files that don't support resolution metadata, like pnm. And it's a little hard to predict the chaos this might cause in the 341 packages that now depend on Tesseract. A possible compromise is to make PDF output fail when image metadata is unset, since that's the most problematic scenario. Honestly I'm not sure what is best. |
If you are just searching for a workaround with the Java API of Tesseract, try this:
This will simply set the resolution of the image to 70 dpi. |
IMO there are two ways we can easily improve the situation:
|
Commit 345e5ee allows the user to specify the dpi, e.g. |
I assume it is commit a0564fd? |
Yes - I copied wrong commit ;-) |
The original image in JPEG 2000 format includes two pages from a newspaper. This image is processed correctly by the latest Tesseract. Tesseract fails with the same image in TIFF, JPEG or PNG format and reports two empty pages.
This also happens with older versions of Tesseract.