OCR Resolution and PDF output #2108

CanadianHusky · 2018-12-06T14:25:09Z

This post is mostly a feature request, although there may already be a little-known command line option I am not aware of that covers the following scenarios.

Scenario 1:
Input PNG/JPG/TIF at DIN A3 size and 600/1200 dpi resolution, with high quality images and text, is fed to tesseract with the "pdf" switch added to the command line.

The output pdf is mostly fine but because of the high input resolution, processing takes a long time.

For testing purposes, the same content is downsampled and fed to tesseract at 150/200/300 dpi.
The output has a --higher-- level of OCR accuracy and, as expected, processing is considerably faster.
The only problem is that the output PDF has now lost the quality of the original input.

Scenario 2:
Input image at Arch E (36x48 inch), 400 dpi, color scan. That is 14,400x19,200 pixels of RGB data with 3 bytes per pixel. The memory allocation requirement is so high that tesseract (and almost any other 'regular' software too) fails even before starting.
If the image is downsampled to 150/200 dpi, tesseract is able to generate a large-format pdf with extremely good accuracy of the OCR text and reasonably good processing speed, but once again the fidelity of the original is lost due to downsampling.
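To put a number on Scenario 2, here is a small Python sketch of the arithmetic. It only counts one uncompressed copy of the pixel data at 3 bytes per pixel, as stated above; the function name is made up for illustration, and the real memory need of any OCR tool will be higher because of working copies.

```python
def raw_image_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, size_in_bytes) for one uncompressed copy."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# Arch E sheet (36x48 inch) scanned in color at 400 dpi
w, h, size = raw_image_bytes(36, 48, 400)
print(w, h, size, f"{size / 2**20:.0f} MiB")  # 14400 19200 829440000 791 MiB

# Halving the resolution to 200 dpi cuts the pixel data to a quarter
print(raw_image_bytes(36, 48, 200)[2])  # 207360000
```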

Suggestion A:
Having a -c OCRResolution=... option at the command line that internally downsamples the input to that dpi, performs OCR on it, but uses the -original- image for pdf output would solve the above concerns very easily. Even better would be an additional switch, -c Outputresolution=..., that lets the user choose the resolution of the embedded image. It may very well be that the input is 1200 dpi, the optimum OCR resolution is 200 dpi, and the optimum output resolution is 600 dpi for a smaller file size.

Suggestion B:
Alternatively, having command line options -c OCRImage=..... and -c PDFOutputImage=...... that allow the user to specify separate input images, one to be used for OCR processing and one to be embedded into the pdf.

Either one of the suggestions would make a valuable addition in my opinion.

Thank you for the consideration

CanadianHusky · 2018-12-06T15:02:56Z

Correction.
My test above was based on an older RC version.
The 20181030 version is able to allocate enough memory for a 36x48 inch sheet and produces output with almost frightening accuracy.

I also tried the --dpi command line setting, which does seem to be doing something, but I am not sure exactly what it does.

When I supply --dpi 200 with the intention of doing OCR at that resolution while the input is 400 dpi, I get an output pdf file 200% larger than the input. It's not a big deal for me to resize that pdf in code because the content is correct, but I would like to understand what exactly --dpi is trying to do internally.

stweil · 2018-12-06T15:06:51Z

--dpi 200 overrides the resolution information from the image's metadata. Use it for images without any resolution information.
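A rough Python sketch of why overriding the resolution changes the PDF page size. It assumes the page size is simply the pixel dimensions divided by the resolution in effect, which is consistent with the doubling reported above but is not taken from the tesseract source; the function name is hypothetical.

```python
def pdf_page_size_inches(width_px, height_px, dpi):
    """Physical page size implied by pixel dimensions and a resolution."""
    return width_px / dpi, height_px / dpi


# 36x48 inch sheet scanned at 400 dpi, read with its true resolution
print(pdf_page_size_inches(14400, 19200, 400))  # (36.0, 48.0)

# Same pixels, but resolution overridden with --dpi 200:
# the page comes out twice as wide and twice as tall
print(pdf_page_size_inches(14400, 19200, 200))  # (72.0, 96.0)
```

In other words, --dpi does not resample anything; it only changes how many inches the existing pixels are taken to cover.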

zdenop · 2018-12-14T09:30:28Z

I am afraid this is out of the tesseract project's scope (but you can contribute code :-)).
tesseract is focused on OCR, and the current philosophy is not to modify the input image for pdf output (e.g. there was already a request for decreasing the size of the pdf by down-sampling the image, and there could be other reasonable actions like deskewing or dewarping input images...)

IMO there are other projects where your request would make more sense: e.g. https://github.com/jbarlow83/ocrmypdf or https://github.com/OpaitSoftware/TesseractStudio.Net

I will keep it open for comment from @jbreiden (who contributed the pdf output) for a final decision. At the moment I label it as WontFix.

jbreiden · 2018-12-17T03:45:00Z

This is a pretty advanced scenario, and I think the best way is using the textonly_pdf option. Then you can merge the high resolution images into the resulting PDF using an external tool. See issue #660 for an example of one such tool.
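A minimal Python sketch of the first half of that workflow: building the tesseract command line that writes a text-only PDF from a downsampled copy of the scan. The file names and the helper function are hypothetical; `textonly_pdf` and `--dpi` are the options discussed in this thread. Merging the original high resolution image back in is left to an external tool as described above and is not shown.

```python
import subprocess


def tesseract_textonly_pdf_cmd(image_path, output_base, dpi=None):
    """Build the argv for a text-only (invisible text, no image) PDF run."""
    cmd = ["tesseract", image_path, output_base]
    if dpi is not None:
        # Only needed when the image carries no resolution metadata
        cmd += ["--dpi", str(dpi)]
    cmd += ["-c", "textonly_pdf=1", "pdf"]
    return cmd


# Hypothetical file names: a 200 dpi copy of the 400 dpi original
cmd = tesseract_textonly_pdf_cmd("scan_200dpi.png", "scan_text", dpi=200)
print(" ".join(cmd))
# tesseract scan_200dpi.png scan_text --dpi 200 -c textonly_pdf=1 pdf

# Uncomment to actually run it (requires tesseract on PATH):
# subprocess.run(cmd, check=True)
```

If the downsampled copy carries the correct resolution, the text-only page should have the same physical size as the original scan, which is what makes the later merge line up.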

CanadianHusky · 2018-12-18T07:20:25Z

I reviewed the comments in issue #660 and it completely covers what I had in mind.
The textonly_pdf option works perfectly fine, as explained, in the 20181030 final 4.0 release.
I also read @Wikinaut's comments and saw that he is rather unhappy with that solution, but my processing has the same goal he is trying to achieve and the suggested method works fine. I did not understand what he is unhappy about, but I can confirm that the suggested method works. As suggested by @jbreiden, "unbagging" the original images out of the pdf without rendering guarantees that there is no quality loss, and it is much faster than (re-)rendering at a possibly high resolution.

I closed this issue. Thank you again.

Wikinaut · 2018-12-18T07:38:40Z

@CanadianHusky thanks for reporting this. I have not used my specific Tesseract workflow in the last few months, but now I will check this again.

jbreiden · 2018-12-18T15:49:22Z

If you are feeling generous, please find the documentation for this feature and improve it. Glad that it works for you!

CanadianHusky mentioned this issue Dec 6, 2018

Using tesseract for generating searchable PDF with images containing multiple orientation text blocks #2055

Open

stweil added the feature request label Dec 6, 2018

zdenop added the wontfix label Dec 14, 2018

CanadianHusky closed this as completed Dec 18, 2018

Galunid mentioned this issue Oct 18, 2020

Scanned pdf: text selection accuracy depends on screen dpi and book dpi koreader/koreader#3688

Open

amitdo added the PDF label Mar 21, 2021
