Resolution information in PNG files is ignored #453
Comments
"Units: Undefined" is not so great. If you set it, things work correctly.
|
But then, why does tesseract behave inconsistently between tif and png when both have |
Haven't had time to look at TIFF, but the PNG behaviour looks right. Spec says we know nothing about image resolution. Common practice from time immemorial is to default to some hopelessly wrong value. I could go trace code to find out what number was used, but honestly this is a garbage in, garbage out situation. Not sure it is worth spending time on. Are you in contact with the authors of the program that is producing the bad metadata? Fixing that is top priority.
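To make the "Units: Undefined" point concrete: in a PNG, the pHYs chunk stores pixels per unit together with a unit specifier, where 0 means "unknown" (only the pixel aspect ratio is meaningful) and 1 means "meter". The sketch below shows how a reader might turn that into DPI. `PhysToDpi` and `PhysUnit` are hypothetical names for illustration, not Tesseract or Leptonica APIs, and returning 0 for "no usable resolution" is an assumption of this sketch.

```cpp
#include <cstdint>

// Unit specifier values of the PNG pHYs chunk.
enum class PhysUnit : std::uint8_t { kUnknown = 0, kMeter = 1 };

// Hypothetical helper: converts a pHYs "pixels per unit" value to DPI.
// Returns 0 when the file carries no usable physical resolution, which is
// the case for "Units: Undefined".
int PhysToDpi(std::uint32_t pixels_per_unit, PhysUnit unit) {
  if (unit != PhysUnit::kMeter || pixels_per_unit == 0) {
    return 0;  // Only an aspect ratio is known, not a physical size.
  }
  // One inch is 0.0254 m; round to the nearest whole DPI.
  return static_cast<int>(pixels_per_unit * 0.0254 + 0.5);
}
```

With these definitions, 11811 pixels per meter comes out as 300 dpi, while any value tagged with the unknown unit yields 0, leaving the caller to pick a fallback.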
|
I think I know why the units are Undefined. pdfsandwich does a 2-step conversion from a PDF page to tif:
Which gives:
And then:
Which results in:
I'll open a ticket with pdfsandwich to add |
|
Yup, asked both in the ticket there. |
Because there's a use of
That said, libav says that it handles png and tiff, if I read it correctly. |
@mbirth I'm the author of ocrmypdf, which is similar to pdfsandwich. <plug>ocrmypdf is extremely careful in its handling of DPI and handles a lot of edge cases that pdfsandwich does not.</plug> It handles your file without issue. @jbreiden I think it would be helpful for tesseract to issue a warning when the DPI is nonsense. Lots of programs don't handle this metadata correctly, so it's easy for a workflow to discard it. Wrong DPI isn't just a display/printing issue; in the case of, say, scanned maps, losing scale information can change the interpretation. |
That is a very good idea. Hope I remember once the turkey coma wears off. |
This looks like a spot where we should emit the warning, but it is not executed. Line 167 in a75ab45
This spot thinks the resolution is 0. tesseract/ccmain/thresholder.cpp Line 175 in 9c7e99b
Oh, oh, maybe here. Line 2226 in 7b5b167
|
Looks like we have kMinCredibleResolution defined in two places. Only the
--- tesseract/api/baseapi.cpp 2016-11-07 07:44:03.000000000 -0800
+++ tesseract/api/baseapi.cpp 2016-11-28 11:23:48.000000000 -0800
@@ -2226,6 +2226,8 @@
   if (y_res < kMinCredibleResolution || y_res > kMaxCredibleResolution) {
     // Use the minimum default resolution, as it is safer to under-estimate
     // than over-estimate resolution.
+    tprintf("Warning. Invalid resolution %d dpi. Using %d instead.\n",
+            y_res, kMinCredibleResolution);
     thresholder_->SetSourceYResolution(kMinCredibleResolution);
   }
   PageSegMode pageseg_mode =
--- tesseract/ccmain/osdetect.cpp 2016-11-07 07:44:03.000000000 -0800
+++ tesseract/ccmain/osdetect.cpp 2016-11-28 11:31:13.000000000 -0800
@@ -164,8 +164,14 @@
   int vertical_y = 1;
   tesseract::TabVector_LIST v_lines;
   tesseract::TabVector_LIST h_lines;
-  int resolution = (kMinCredibleResolution > pixGetXRes(pix)) ?
-      kMinCredibleResolution : pixGetXRes(pix);
+  int resolution;
+  if (kMinCredibleResolution > pixGetXRes(pix)) {
+    resolution = kMinCredibleResolution;
+    tprintf("Warning. Invalid resolution %d dpi. Using %d instead.\n",
+            pixGetXRes(pix), resolution);
+  } else {
+    resolution = pixGetXRes(pix);
+  }
   tesseract::LineFinder::FindAndRemoveLines(resolution, false, pix,
                                             &vertical_x, &vertical_y,
|
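Both patched sites repeat the same "clamp and warn" check, so one way to keep them consistent would be a single shared helper. The following is only a standalone sketch of that idea, not code from the Tesseract tree: `CredibleResolution` is a hypothetical name, the bounds of 70 and 2400 dpi are assumed values for the two constants, and `std::fprintf` stands in for `tprintf`. It also applies the upper bound everywhere, which the osdetect.cpp hunk does not do.

```cpp
#include <cstdio>

// Assumed values for illustration; check the Tesseract headers for the
// authoritative definitions.
constexpr int kMinCredibleResolution = 70;
constexpr int kMaxCredibleResolution = 2400;

// Hypothetical helper: returns `res` if it is plausible, otherwise warns
// and falls back to the minimum, since under-estimating resolution is
// safer than over-estimating it.
int CredibleResolution(int res) {
  if (res < kMinCredibleResolution || res > kMaxCredibleResolution) {
    std::fprintf(stderr,
                 "Warning. Invalid resolution %d dpi. Using %d instead.\n",
                 res, kMinCredibleResolution);
    return kMinCredibleResolution;
  }
  return res;
}
```

A caller would then write something like `int resolution = CredibleResolution(pixGetXRes(pix));` at each site, so the warning text and the thresholds cannot drift apart.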
As per @jbreiden's comment in #373, here's a problem I noticed with tesseract 3.04.01 (from the Ubuntu Yakkety package).
This is the original PDF without text as it is created by my scanner: scanned.pdf
I've used pdfsandwich with the -debug flag to get the intermediate files. The image it uses to feed into tesseract is the tif in this tif.zip. And this works just fine. Here's the identify information from that tif:
Running
gives me a perfectly fine PDF in format: A4, Portrait (210 x 296 mm).
outputtif.pdf
And now I converted that tif to png with simply:
png.zip
The new png file shows the same resolution information and print size:
However, running
gives me a PDF in format: 900 × 1270 mm paper size.
outputpng.pdf
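The two page sizes are consistent with the same pixel data being laid out at two different resolutions. The identify output did not survive in this thread, so the numbers in this sketch are assumptions: an A4 scan at 300 dpi (about 2480 pixels wide) and a fallback of 70 dpi when the PNG resolution is rejected. `PixelsToMm` is a hypothetical helper for the arithmetic only.

```cpp
// Hypothetical helper: physical length in millimetres of `pixels` pixels
// rendered at `dpi` dots per inch (1 inch = 25.4 mm).
double PixelsToMm(int pixels, int dpi) {
  return pixels * 25.4 / dpi;
}

// Assumed inputs, not taken from the attached files.
constexpr int kAssumedWidthPx = 2480;   // A4 width at 300 dpi
constexpr int kAssumedScanDpi = 300;    // resolution honoured for the tif
constexpr int kAssumedFallbackDpi = 70; // resolution used for the png
```

Under those assumptions, `PixelsToMm(kAssumedWidthPx, kAssumedScanDpi)` is roughly 210 mm, and `PixelsToMm(kAssumedWidthPx, kAssumedFallbackDpi)` is roughly 900 mm, which matches the widths of the two PDFs reported above.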