OCR Resolution and PDF output #2108

Closed
CanadianHusky opened this issue Dec 6, 2018 · 7 comments

@CanadianHusky

This post is more of a feature request, or maybe a little-known command-line option already exists that I am not aware of and that can handle the following scenarios.

Scenario 1:
Input PNG/JPG/TIF at DIN A3 size and 600/1200 dpi, with high-quality images and text, is fed to tesseract with the "pdf" switch added to the command line.

The output PDF is mostly fine, but because of the high input resolution, processing takes a long time.

For testing purposes, the same content is downsampled and fed to tesseract at 150/200/300 dpi.
The output has a --higher-- level of OCR accuracy, and processing is considerably faster, as expected.
The only problem is that the output PDF has now lost the quality of the original input.

Scenario 2:
Input image at Arch E (36x48 inch), 400 dpi, color scan. That is 14,400x19,200 pixels of RGB data at 3 bytes per pixel. The memory allocation requirement is so high that tesseract (and almost any other 'regular' software too) fails even before starting.
If the image is downsampled to 150/200 dpi, tesseract is able to generate a large-format PDF with extremely good OCR accuracy and reasonably good processing speed, but once again the fidelity of the original is lost due to downsampling.
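
To put numbers on the memory requirement above, here is a minimal Python sketch of the arithmetic. It only counts one uncompressed RGB copy of the raster; what tesseract actually allocates internally is not covered by this estimate, and the helper name `raw_rgb_bytes` is made up for illustration.

```python
def raw_rgb_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Bytes needed for one uncompressed raster of a sheet of the given size."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Arch E sheet (36x48 inch) at the resolutions mentioned in the scenario.
for dpi in (400, 200, 150):
    size = raw_rgb_bytes(36, 48, dpi)
    print(f"{dpi} dpi: {size:,} bytes ({size / 2**20:.0f} MiB)")
```

Halving the resolution cuts the raw size to a quarter, which is consistent with the downsampled image being so much easier to process.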

Suggestion A:
Having a -c OCRResolution=... option on the command line that internally downsamples the input to that dpi, performs OCR on it, but uses the -original- image for PDF output would solve the concerns above very easily. Even better would be an additional switch, -c Outputresolution=...., that lets the user choose the output resolution. It may very well be that the input is 1200 dpi, the optimum OCR resolution is 200 dpi, and the optimum output resolution is 600 dpi for a smaller file size.

Suggestion B:
Alternatively, having command-line options -c OCRImage=..... and -c PDFOutputImage=...... would allow the user to specify separate input images: one to be used for OCR processing and one to be embedded into the PDF.

Either of the suggestions would make a valuable addition, in my opinion.

Thank you for the consideration

@CanadianHusky
Author

CanadianHusky commented Dec 6, 2018

Correction.
My test above was based on an older RC version.
The 20181030 version is able to allocate enough memory for a 36x48 inch sheet and produces output with almost frightening accuracy.

I also tried the --dpi command-line setting, which does seem to do something, but I am not sure exactly what.

When I supply --dpi 200 with the intention of doing OCR at that resolution while the input is 400 dpi, I get an output PDF file 200% larger than the input. It's not a big deal for me to resize that PDF in code because the content is correct, but I would like to understand what exactly --dpi is trying to do internally.

@stweil
Member

stweil commented Dec 6, 2018

--dpi 200 overrides the resolution information from the image's metadata. Use it for images without any resolution information.
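
A small Python sketch of why an overridden resolution could change the size of the PDF page. It assumes the page dimensions are derived as pixel count divided by the resolution in effect, and that the "200% larger" report refers to page dimensions; neither point is confirmed in this thread, and the helper name is made up for illustration.

```python
def pdf_page_size_inches(width_px, height_px, dpi):
    """Physical page size implied by a pixel raster at a given resolution."""
    return width_px / dpi, height_px / dpi


# 36x48 inch sheet scanned at 400 dpi.
width_px, height_px = 14400, 19200

# Resolution taken from the image metadata.
print(pdf_page_size_inches(width_px, height_px, 400))

# Same pixels, but with the resolution overridden to 200 dpi.
print(pdf_page_size_inches(width_px, height_px, 200))
```

Under that assumption the pixels are unchanged and only the declared resolution differs, so the page comes out twice as wide and twice as tall.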

@zdenop
Contributor

zdenop commented Dec 14, 2018

I am afraid this is out of the tesseract project's scope (but you can contribute code :-)).
tesseract is focused on OCR, and the current philosophy is not to modify the input image for PDF output (e.g. there was already a request to decrease the size of the PDF by down-sampling the image, and there could be other reasonable actions such as deskewing or dewarping input images...).

IMO there are other projects where your request would make more sense, e.g. https://github.com/jbarlow83/ocrmypdf or https://github.com/OpaitSoftware/TesseractStudio.Net

I will keep it open for comment from @jbreiden (who contributed the PDF output) for a final decision. For the moment I am labeling it as WontFix.

@zdenop zdenop added the wontfix label Dec 14, 2018
@jbreiden
Contributor

jbreiden commented Dec 17, 2018 via email

@CanadianHusky
Author

I reviewed the comments in issue #660, and it completely covers what I had in mind.
The textonly_pdf option works perfectly fine, as explained, in the 20181030 final 4.0 release.
I also read @Wikinaut's comments and saw that he is rather unhappy with that solution, but my processing has the same goal he is trying to achieve, and the suggested method works fine. I therefore did not understand what he is unhappy about, but I can confirm that the suggested method works and that, as suggested by @jbreiden, "unbagging" the original images out of the PDF without rendering guarantees that there is no quality loss; it is also much faster than (re-)rendering at a possibly high resolution.
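
For anyone landing here later, a rough Python sketch of the OCR half of that workflow: it only builds the tesseract command line that produces a text-only PDF from a downsampled image. The file names and the helper `textonly_pdf_command` are placeholders, and combining the resulting text layer with the original full-resolution image is a separate step with an external PDF tool that is not shown here.

```python
import shlex


def textonly_pdf_command(ocr_image, output_base, dpi=None):
    """Build a tesseract command that writes <output_base>.pdf without the image."""
    cmd = ["tesseract", ocr_image, output_base]
    if dpi is not None:
        # Only needed when the image carries no (or wrong) resolution metadata.
        cmd += ["--dpi", str(dpi)]
    # textonly_pdf=1 keeps the invisible text layer and omits the image.
    cmd += ["-c", "textonly_pdf=1", "pdf"]
    return cmd


# Hypothetical downsampled copy of the scan, used for OCR only.
cmd = textonly_pdf_command("page_200dpi.png", "page_text", dpi=200)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

The command is built as a list so it can be passed to `subprocess.run` without shell quoting problems.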

I closed this issue. Thank you again.

@Wikinaut
Contributor

@CanadianHusky thanks for reporting this. I have not used the (my) specific Tesseract workflow in the last few months, but I will now check it again.

@jbreiden
Contributor

jbreiden commented Dec 18, 2018 via email


6 participants