Resolution information in PNG files is ignored #453

Closed
mbirth opened this issue Oct 28, 2016 · 11 comments

@mbirth

mbirth commented Oct 28, 2016

As per @jbreiden's comment in #373, here's a problem I noticed with tesseract 3.04.01 (from the Ubuntu Yakkety package).

This is the original PDF without text as it is created by my scanner: scanned.pdf

I've used pdfsandwich with the -debug flag to get the intermediate files. The image it uses to feed into tesseract is the tif in this tif.zip. And this works just fine. Here's the identify information from that tif:

Image: extractedtif.tif
  Format: TIFF (Tagged Image File Format)
  Mime type: image/tiff
  Class: DirectClass
  Geometry: 2479x3500+0+0
  Resolution: 300x300
  Print size: 8.26333x11.6667
  Units: Undefined
  Type: Grayscale
  Endianess: LSB
  Colorspace: Gray
  Depth: 8-bit
  Channel depth:
    gray: 8-bit

Running

tesseract extractedtif.tif outputtif -l deu pdf

gives me a perfectly fine PDF in format: A4, Portrait (210 x 296 mm).

outputtif.pdf

And now I converted that tif to png with simply:

convert extractedtif.tif pngfromtif.png

png.zip

The new png file shows the same resolution information and print size:

Image: pngfromtif.png
  Format: PNG (Portable Network Graphics)
  Mime type: image/png
  Class: PseudoClass
  Geometry: 2479x3500+0+0
  Resolution: 300x300
  Print size: 8.26333x11.6667
  Units: Undefined
  Type: Grayscale
  Endianess: Undefined
  Colorspace: Gray
  Depth: 8-bit
  Channel depth:
    gray: 8-bit

However, running

tesseract pngfromtif.png outputpng -l deu pdf

gives me a PDF in format: 900 × 1270 mm paper size.

outputpng.pdf
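
The two page sizes can be reconciled with a little arithmetic. The sketch below is only an illustration (the helper names `page_size_mm` and `implied_dpi` are made up here, not part of tesseract): it converts pixel dimensions to a physical size for a given resolution, and works backwards from a reported page size to the resolution that would produce it.

```python
MM_PER_INCH = 25.4


def page_size_mm(width_px, height_px, dpi):
    """Return the physical page size in millimetres at the given resolution."""
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


def implied_dpi(pixels, size_mm):
    """Return the resolution that maps `pixels` onto `size_mm` millimetres."""
    return pixels / (size_mm / MM_PER_INCH)


# Size of the 2479x3500 scan at the intended 300 dpi.
print(page_size_mm(2479, 3500, 300))
# Resolution implied by the 1270 mm page height of outputpng.pdf.
print(implied_dpi(3500, 1270))
```

At 300 dpi the scan comes out at roughly 210 × 296 mm, matching the PDF made from the tif. The 1270 mm height of the PDF made from the png corresponds to about 70 dpi, which suggests (this is inferred from the numbers, not confirmed here) that a low fallback resolution was used instead of the 300 stored in the file.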

@jbreiden
Contributor

"Units: Undefined" is not so great. If you set it, things work correctly. I will have to look more carefully at what the unit possibilities are for PDF to see whether we want to make a code change or not.

$ mogrify -units PixelsPerInch -density 300x300 pngfromtif.png
$ identify -verbose pngfromtif.png
  Geometry: 2479x3500+0+0
  Resolution: 118.11x118.11
  Print size: 20.9889x29.6334
  Units: PixelsPerCentimeter

$ tesseract  pngfromtif.png correct pdf
$ pdfinfo correct.pdf
Producer:       Tesseract 3.04.00
Page size:      594.96 x 840 pts
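
The numbers in that output are consistent with each other, which can be checked by hand. A minimal sketch, assuming nothing beyond standard unit conversions (the function names are hypothetical): 300 pixels per inch is the same density as 118.11 pixels per centimetre, and a PDF point is 1/72 inch.

```python
CM_PER_INCH = 2.54
POINTS_PER_INCH = 72


def ppi_to_ppcm(ppi):
    """Convert pixels per inch to pixels per centimetre."""
    return ppi / CM_PER_INCH


def page_size_pts(width_px, height_px, ppi):
    """Return the page size in PDF points for an image at `ppi`."""
    return (width_px * POINTS_PER_INCH / ppi, height_px * POINTS_PER_INCH / ppi)


print(round(ppi_to_ppcm(300), 2))
print(page_size_pts(2479, 3500, 300))
```

So identify reporting 118.11 PixelsPerCentimeter after asking for 300 PixelsPerInch is not a loss of information, and a 2479x3500 image at that density gives the 594.96 x 840 pts page size shown by pdfinfo.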

@mbirth
Author

mbirth commented Oct 29, 2016

But then, why does tesseract behave inconsistently between tif and png when both have Units: Undefined?

@jbreiden
Contributor

jbreiden commented Oct 29, 2016

Haven't had time to look at TIFF, but the PNG behaviour looks right. Spec says we know nothing about image resolution. Common practice from time immemorial is to default to some hopelessly wrong value. I could go trace code to find out what number was used, but honestly this is a garbage in, garbage out situation. Not sure it is worth spending time on. Are you in contact with the authors of the program that is producing the bad metadata? Fixing that is top priority.

The following values are legal for the unit specifier:
   0: unit is unknown
   1: unit is the meter
When the unit specifier is 0, the pHYs chunk defines pixel aspect ratio only; the actual 
size of the pixels remains unspecified.
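
To check what a given PNG actually stores, the pHYs chunk can be read directly. The following is a rough sketch using only the Python standard library; `read_phys` and `describe` are hypothetical helper names, and CRCs are not verified. It walks the chunk list, returns the two pixels-per-unit values and the unit specifier, and treats a unit specifier of 0 as "no physical size known", as the spec text above describes.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METERS_PER_INCH = 0.0254


def read_phys(png_bytes):
    """Return (x_per_unit, y_per_unit, unit) from pHYs, or None if absent."""
    if png_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(png_bytes):
        # Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        data = png_bytes[pos + 8:pos + 8 + length]
        if ctype == b"pHYs":
            return struct.unpack(">IIB", data)
        if ctype == b"IDAT":
            break  # pHYs has to come before the image data
        pos += 12 + length
    return None


def describe(phys):
    """Summarise a pHYs tuple; unit 0 carries aspect ratio only."""
    if phys is None or phys[2] == 0:
        return "physical pixel size unspecified"
    x_per_m, y_per_m, _ = phys
    return "%.2f x %.2f dpi" % (x_per_m * METERS_PER_INCH,
                                y_per_m * METERS_PER_INCH)
```

Usage would be along the lines of `describe(read_phys(open("pngfromtif.png", "rb").read()))`. A file written with "Units: Undefined" would be expected to report an unspecified size, even though the stored numbers are 300x300.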

@mbirth
Author

mbirth commented Oct 29, 2016

I think I know why the units are Undefined. pdfsandwich does a 2-step conversion from a PDF page to tif:

convert -colorspace Gray -colors 256 -depth 8 -background white -flatten +matte -density 300x300 scanned.pdf[0] tmpfile.ppm

Which gives:

Image: tmpfile.ppm
  Format: PPM (Portable pixmap format (color))
  Mime type: image/x-portable-pixmap
  Class: DirectClass
  Geometry: 2479x3500+0+0
  Units: Undefined
  Type: Grayscale
  Endianess: Undefined
  Colorspace: Gray
  Depth: 8-bit
  Channel depth:
    gray: 8-bit

And then:

convert -density 300x300 tmpfile.ppm tmpfile.tif

Which results in:

Image: tmpfile.tif
  Format: TIFF (Tagged Image File Format)
  Mime type: image/tiff
  Class: DirectClass
  Geometry: 2479x3500+0+0
  Resolution: 300x300
  Print size: 8.26333x11.6667
  Units: Undefined
  Type: Grayscale
  Endianess: LSB
  Colorspace: Gray
  Depth: 8-bit
  Channel depth:
    gray: 8-bit

I'll open a ticket with pdfsandwich to add -units PixelsPerInch to the convert command.

EDIT: https://sourceforge.net/p/pdfsandwich/bugs/14/

@jbreiden
Contributor

  1. Why use two convert commands instead of just one?
  2. I suggest PNG over uncompressed TIFF. Filesize of the PDF should be smaller, and because Tesseract can skip transcoding the image, there will be some CPU savings as well.

@mbirth
Author

mbirth commented Oct 31, 2016

Yup, asked both in the ticket there.

@Jmuccigr

Jmuccigr commented Nov 7, 2016

  1. Why use two convert commands instead of just one?

Because unpaper is run in between them. Its info file says:

The image-file formats accepted by unpaper are those that libav can handle. In particular it supports the whole PNM-family: PBM, PGM and PPM. This ensures interoperability with the SANE tools under Linux. Support for TIFF and other complex file formats is not guaranteed.

That said, libav says that it handles png and tiff, if I read it correctly.

@zdenop zdenop added the PDF label Nov 24, 2016
@jbarlow83

@mbirth I'm the author of ocrmypdf, which is similar to pdfsandwich. <plug>ocrmypdf is extremely careful in its handling of DPI and handles a lot of edge cases that pdfsandwich does not.</plug> It handles your file without issue.

@jbreiden I think it would be helpful for tesseract to issue a warning when the DPI is nonsense. Lots of programs don't handle this metadata correctly, so it's easy for a workflow to discard it. Wrong DPI isn't just a display/printing issue; in the case of, say, scanned maps, losing scale information can change the interpretation.

@jbreiden
Contributor

That is a very good idea. Hope I remember once the turkey coma wears off.

@jbreiden
Contributor

jbreiden commented Nov 28, 2016

This looks like a spot where we should emit the warning, but it is not executed.

int resolution = (kMinCredibleResolution > pixGetXRes(pix)) ?

This spot thinks the resolution is 0.

estimated_res_ = yres_ = pixGetYRes(pix_);

Oh, oh, maybe here.

thresholder_->SetSourceYResolution(kMinCredibleResolution);

@jbreiden
Contributor

jbreiden commented Nov 28, 2016

Looks like we have kMinCredibleResolution defined in two places. Only the
one in baseapi.cpp is active for this test case.

--- tesseract/api/baseapi.cpp	2016-11-07 07:44:03.000000000 -0800
+++ tesseract/api/baseapi.cpp	2016-11-28 11:23:48.000000000 -0800
@@ -2226,6 +2226,8 @@
   if (y_res < kMinCredibleResolution || y_res > kMaxCredibleResolution) {
     // Use the minimum default resolution, as it is safer to under-estimate
     // than over-estimate resolution.
+    tprintf("Warning. Invalid resolution %d dpi. Using %d instead.\n",
+            y_res, kMinCredibleResolution);
     thresholder_->SetSourceYResolution(kMinCredibleResolution);
   }
   PageSegMode pageseg_mode =
--- tesseract/ccmain/osdetect.cpp	2016-11-07 07:44:03.000000000 -0800
+++ tesseract/ccmain/osdetect.cpp	2016-11-28 11:31:13.000000000 -0800
@@ -164,8 +164,14 @@
   int vertical_y = 1;
   tesseract::TabVector_LIST v_lines;
   tesseract::TabVector_LIST h_lines;
-  int resolution = (kMinCredibleResolution > pixGetXRes(pix)) ?
-      kMinCredibleResolution : pixGetXRes(pix);
+  int resolution;
+  if (kMinCredibleResolution > pixGetXRes(pix)) {
+    resolution = kMinCredibleResolution;
+    tprintf("Warning. Invalid resolution %d dpi. Using %d instead.\n",
+            pixGetXRes(pix), resolution);
+  } else {
+    resolution = pixGetXRes(pix);
+  }
 
   tesseract::LineFinder::FindAndRemoveLines(resolution, false, pix,
                                             &vertical_x, &vertical_y,
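
For readers who do not want to parse the diff, the behaviour added to baseapi.cpp boils down to "clamp and warn". Here is a small Python sketch of that logic, not the tesseract code itself: the function name `credible_resolution` is hypothetical, and the two constants are assumed values (70 fits the 900 × 1270 mm page reported at the top of this issue; 2400 is an assumption). Note that the osdetect.cpp hunk only checks the lower bound.

```python
# Assumed values; the real constants live in the tesseract sources.
K_MIN_CREDIBLE_RESOLUTION = 70
K_MAX_CREDIBLE_RESOLUTION = 2400


def credible_resolution(res, warn=print):
    """Return `res` if it is plausible, else warn and fall back to the minimum."""
    if res < K_MIN_CREDIBLE_RESOLUTION or res > K_MAX_CREDIBLE_RESOLUTION:
        # Under-estimating is treated as safer than over-estimating.
        warn("Warning. Invalid resolution %d dpi. Using %d instead."
             % (res, K_MIN_CREDIBLE_RESOLUTION))
        return K_MIN_CREDIBLE_RESOLUTION
    return res
```

With an image whose resolution is read as 0, as in the PNG case above, this would print the warning and continue with the minimum value, so the user at least learns why the page size looks wrong.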

@zdenop zdenop closed this as completed in ed4c4c6 Dec 7, 2016
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021