
Output PDFs have decreased quality #125

Closed · Wikinaut opened this issue Jan 17, 2017 · 13 comments

@Wikinaut

@jbarlow83

When comparing the visual quality of PDF files output by OCRmyPDF, I noticed a degradation in almost every case I tried.

It looks as if somewhere in the processing chain either a lossy compression or some other image-quality-degrading step is performed.

Could you please check this in your workflow?

I would always prefer the PDF output quality to be exactly the same as the input quality.

See also tesseract-ocr/tesseract#660 .

@Wikinaut
Author

(I am committed to helping fix this issue, if I can.)

@Wikinaut
Author

Wikinaut commented Jan 17, 2017

Original (input):

[screenshot: 20170117-03 03 34_auswahl]

Output PDF after OCRmyPDF:

[screenshot: 20170117-03 03 58_auswahl]

I would always prefer the PDF output quality to be exactly the same as the input quality. Currently, this is not the case.

Wikinaut reopened this Jan 17, 2017
@jbarlow83
Collaborator

If you are using --force-ocr (-f) (as in your previous example), there will be some degradation, because this mode rasterizes the PDF to one image per page and builds a new PDF from those images. If an image was originally stored losslessly, it will be saved losslessly again (possibly losing some data); if it was a JPEG, it will be re-encoded as a new JPEG (definitely lossy).

The normal behavior is to error out on pages that already contain text rather than presume what the user wants. If --skip-text is given, OCR is not performed on any page that already has text, and that page is copied to the output without modification. Is the intention with these files to redo OCR that was done by an older Tesseract?
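
For illustration, the three behaviors described above map to the command line like this (input.pdf and output.pdf are placeholder names):

$ ocrmypdf input.pdf output.pdf              # default: error out on pages that already contain text
$ ocrmypdf --skip-text input.pdf output.pdf  # copy pages with existing text to the output unmodified
$ ocrmypdf --force-ocr input.pdf output.pdf  # rasterize each page and rebuild the PDF (re-encodes all images)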

@Wikinaut
Author

I just want to point out: do not use lossy compression when recompressing. Even when the input was JPEG (lossy), do not use JPEG again; recompressing will worsen the image quality.

This issue has been discussed many times in the past.

@Wikinaut
Author

Wikinaut commented Jan 19, 2017

Or, best of all: use the original image as the output image, as discussed in tesseract-ocr/tesseract#660, if this becomes possible.

@Wikinaut
Author

Just for the record:

$ ocrmypdf --help

explains this:

OCRmyPDF attempts to keep the output file at about the same size.  If a file
contains losslessly compressed images, the output file will be losslessly
compressed as well.

@jbarlow83
Collaborator

Lossy compression on output is only enabled when lossy compression was used on the input and it is not possible to transfer the input image to the output unchanged (because of --force-ocr, or because preprocessing altered the image anyway).

JPEG recompression is certainly not ideal, but keeping file sizes similar is quite important for my users and clients.
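
As a quick way to check whether a given file's images were lossy to begin with, poppler's pdfimages tool (assumed installed; it is not part of OCRmyPDF) lists each image's encoding:

$ pdfimages -list input.pdf   # the "enc" column reads "jpeg" for DCT-encoded (lossy) images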

@Wikinaut
Author

Wikinaut commented Jan 20, 2017

@jbarlow83 I understand that some people don't like big files; however, bits are cheap today. For those of us who, like me, work in the archiving business, quality matters. When adding an OCR layer, the original scans (I mean the images in the PDF) should be left untouched.

So we should both come to the conclusion that both

  1. compact (mixed-mode output) files, and/or
  2. the highest available output image quality (bitwise exactly the same as the input, without a single introduced artefact)

should be user-selectable. This is why I am now so interested in the recent discussions, which I triggered, and of course in solutions. I will help to find them.

This is also why I installed the whole toolchain (Ghostscript, unpaper, Tesseract, OCRmyPDF) in the latest available versions ("bleeding edge"), and why I want to work with you to find a solution for 1. and 2.

@jbarlow83
Collaborator

jbarlow83 commented Jan 21, 2017

Since you explained your workflow in the Tesseract forums, I now understand why you're using --force-ocr. I will look into adding an option to discard any existing OCR text for the purpose of redoing OCR. With that in place, I can transfer input PDF pages to the output while grafting on the invisible text layer.

--force-ocr needs to stay in place because it has a different use case. It's useful as a big hammer to normalize weird PDFs.
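
In later OCRmyPDF releases, the option described above shipped as --redo-ocr, which discards existing OCR text, keeps the original page images, and grafts on a new text layer (input.pdf and output.pdf are placeholder names):

$ ocrmypdf --redo-ocr input.pdf output.pdf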

@Wikinaut
Author

Yes, you've got it.

@Wikinaut
Author

Please let me know if you can "help" me (by passing through the original image / original image quality). Until then, I am closing this issue so that everyone can concentrate on other issues.

@jbarlow83
Collaborator

jbarlow83 commented Jan 23, 2017 via email

@sojusnik

sojusnik commented Apr 17, 2024

@Wikinaut

I'm having the same issue as you, and I wonder whether you have found a way to preserve the original image quality when using --force-ocr.

Searching through the issues here and applying --optimize 0 and/or --output-type pdf and/or --pdfa-image-compression lossless didn't help.
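
For reference, one possible combination of the flags mentioned (file names are placeholders; note that --pdfa-image-compression only takes effect together with --output-type pdfa):

$ ocrmypdf --force-ocr --optimize 0 --output-type pdf input.pdf output.pdf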
