Output PDFs have decreased quality #125
(I am dedicating myself to helping fix the issues, if I can.) |
If you are using The normal behavior is to error out on pages that contain text rather than presume what the user wants. Or if |
I just want to point out: do not use lossy compression when recompressing. Even when the input was JPEG (lossy), do not use JPEG again. That step will worsen the image quality further. This issue has been discussed many times in the past. |
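To illustrate the point above, here is a minimal sketch using only the Python standard library. It is not OCRmyPDF code and does not run a real JPEG codec: `lossy_roundtrip` is a hypothetical toy stand-in that quantizes 8-bit samples, and the sample data is synthetic. It only shows why a second lossy pass adds error on top of the first, while a Deflate round trip (the scheme behind PNG and PDF FlateDecode) returns its input unchanged.

```python
import zlib


def lossy_roundtrip(pixels, step):
    """Toy stand-in for a lossy codec (assumption, not real JPEG):
    quantize 8-bit samples to multiples of `step`, clamped to 255."""
    return [min(255, round(p / step) * step) for p in pixels]


def lossless_roundtrip(pixels):
    """Compress and decompress with Deflate; the output equals the input."""
    return list(zlib.decompress(zlib.compress(bytes(pixels))))


def max_error(a, b):
    """Largest absolute per-sample difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


# Synthetic "scan" samples covering all 8-bit values (illustrative data only).
original = [(i * 37) % 256 for i in range(1024)]

first_gen = lossy_roundtrip(original, step=8)      # the input was already lossy
second_gen = lossy_roundtrip(first_gen, step=12)   # recompressed with another setting
kept = lossless_roundtrip(first_gen)               # recompressed losslessly instead

print("max error after first lossy pass: ", max_error(original, first_gen))
print("max error after second lossy pass:", max_error(original, second_gen))
print("lossless copy identical to its input:", kept == first_gen)
```

The lossless path cannot recover what the first lossy pass discarded, but it adds no further damage, which is the behavior being asked for here.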
Or, the best method: use the original image as the output image, as discussed in tesseract-ocr/tesseract#660, if this becomes possible. |
Just for the record:
explains this
|
Lossy compression on output is only enabled when lossy compression was used on the input and it is not possible to transfer the input image to the output (because of JPEG recompression is certainly not ideal but keeping file sizes similar is quite important for my users and clients. |
@jbarlow83 I understand that some people don't like big files; however, bits are cheap today. For those poor people like me who are interested in and work in the archiving business, quality matters. When adding an OCR layer, the original scans (I mean the images in the PDF) should be left untouched. So we both should come to the conclusion that both
This is also why I installed the whole toolchain (ghostscript, unpaper, tesseract, ocrmypdf) in the latest available versions ("bleeding edge"), and why I wish to work together with you to find a solution for 1. and 2. |
Since you explained your workflow in the tesseract forums I now understand why you're using
|
Yes, you've got it. |
Please let me know if you can "help" me (passing through the original image/original image quality). Until then, I will close this issue so that everyone can concentrate on other issues. |
Can and will, but it may take a few weeks.
On Sat, Jan 21, 2017 at 16:44 Wikinaut wrote:
Closed #125.
|
I'm having the same issue as you and wonder if you have found a way to preserve the original image quality when doing Searching through the issues here and applying |
@jbarlow83
When comparing the visual quality of PDF files output by OCRmyPDF, I noticed a degradation in almost every case I tried.
It looks as if somewhere in the processing chain either a lossy compression or some other step that decreases image quality is performed.
Could you please check this in your workflow?
I always prefer that the PDF output quality be exactly the same as the input quality.
See also tesseract-ocr/tesseract#660.