
Output PDFs have decreased quality #125

Closed · Wikinaut opened this issue Jan 17, 2017 · 13 comments

@Wikinaut

@jbarlow83

When comparing the visual quality of PDF files output by OCRmyPDF, I noticed a degradation in almost every case I tried.

It looks as if somewhere in the processing chain either a lossy compression or some other image-quality-degrading step is performed.

Could you please check this in your workflow?

I would always prefer the PDF output quality to be exactly the same as the input quality.

See also tesseract-ocr/tesseract#660 .

@Wikinaut
Author

(I am committed to helping fix this issue, if I can.)

@Wikinaut
Author

Wikinaut commented Jan 17, 2017

Original (input):

[screenshot: 20170117-03 03 34_auswahl]

Output PDF after OCRmyPDF:

[screenshot: 20170117-03 03 58_auswahl]

I would always prefer the PDF output quality to be exactly the same as the input quality. Currently, this is not the case.

Wikinaut reopened this Jan 17, 2017
@jbarlow83
Collaborator

If you are using --force-ocr (-f) (as in your previous example), there will be some degradation, because this mode rasterizes the PDF to one image per page and builds a new PDF from those images. If an image was originally stored losslessly, it will be saved losslessly again (possibly losing some data); if it was a JPEG, it will be re-encoded as a new JPEG (definitely lossy).

The normal behavior is to error out on pages that already contain text rather than presume what the user wants. If --skip-text is given, OCR is not performed on any page that already has text, and that page is copied to the output without modification. Is the intention with these files to redo OCR that was done by an older Tesseract?
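
For illustration, the three behaviors described above map to the command line like this (input.pdf and output.pdf are placeholder names):

$ ocrmypdf input.pdf output.pdf              # default: error out on pages that already contain text
$ ocrmypdf --skip-text input.pdf output.pdf  # copy pages with existing text to the output unmodified
$ ocrmypdf --force-ocr input.pdf output.pdf  # rasterize each page and rebuild the PDF (re-encodes all images)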

@Wikinaut
Author

I just want to point out: do not use lossy compression when recompressing. Even when the input was JPEG (lossy), do not use JPEG again; recompressing will worsen the image quality.

This issue has been discussed many times in the past.

@Wikinaut
Author

Wikinaut commented Jan 19, 2017

Or, best of all: use the original image as the output image, as discussed in tesseract-ocr/tesseract#660, if this becomes possible.

@Wikinaut
Author

Just for the record:

$ ocrmypdf --help

explains this:

OCRmyPDF attempts to keep the output file at about the same size.  If a file
contains losslessly compressed images, the output file will be losslessly
compressed as well.

@jbarlow83
Collaborator

Lossy compression on output is only enabled when lossy compression was used on the input and it is not possible to transfer the input image to the output unchanged (because of --force-ocr, or because preprocessing altered the image anyway).

JPEG recompression is certainly not ideal, but keeping file sizes similar is quite important for my users and clients.
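
As a quick way to check whether a given file's images were lossy to begin with, poppler's pdfimages tool (assumed installed; it is not part of OCRmyPDF) lists each image's encoding:

$ pdfimages -list input.pdf   # the "enc" column reads "jpeg" for DCT-encoded (lossy) images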

@Wikinaut
Author

Wikinaut commented Jan 20, 2017

@jbarlow83 I understand that some people don't like big files; however, bits are cheap today. For those of us who, like me, work in the archiving business, quality matters. When adding an OCR layer, the original scans (I mean the images in the PDF) should be left untouched.

So we should both come to the conclusion that both

  1. compact (mixed-mode output) files, and/or
  2. the highest available output image quality (bitwise exactly the same as the input, without a single introduced artefact)

should be user-selectable. This is why I am now so interested in the recent discussions, which I triggered, and of course in solutions. I will help to find them.

This is also why I installed the whole toolchain (Ghostscript, unpaper, Tesseract, OCRmyPDF) in the latest available versions ("bleeding edge"), and why I want to work with you to find a solution for 1. and 2.

@jbarlow83
Collaborator

jbarlow83 commented Jan 21, 2017

Since you explained your workflow in the Tesseract forums, I now understand why you're using --force-ocr. I will look into adding an option to discard any existing OCR text for the purpose of redoing OCR. With that in place, I can transfer input PDF pages to the output while grafting on the invisible text layer.

--force-ocr needs to stay in place because it has a different use case. It's useful as a big hammer to normalize weird PDFs.
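
In later OCRmyPDF releases, the option described above shipped as --redo-ocr, which discards existing OCR text, keeps the original page images, and grafts on a new text layer (input.pdf and output.pdf are placeholder names):

$ ocrmypdf --redo-ocr input.pdf output.pdf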

@Wikinaut
Author

Yes, you've got it.

@Wikinaut
Author

Please let me know if you can "help" me (by passing through the original image / original image quality). Until then, I am closing this issue so that everyone can concentrate on other issues.

@jbarlow83
Collaborator

jbarlow83 commented Jan 23, 2017 via email

@sojusnik

sojusnik commented Apr 17, 2024

@Wikinaut

I'm having the same issue as you, and I wonder whether you have found a way to preserve the original image quality when using --force-ocr.

Searching through the issues here and applying --optimize 0 and/or --output-type pdf and/or --pdfa-image-compression lossless didn't help.
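
For reference, one possible combination of the flags mentioned (file names are placeholders; note that --pdfa-image-compression only takes effect together with --output-type pdfa):

$ ocrmypdf --force-ocr --optimize 0 --output-type pdf input.pdf output.pdf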
