OCR Resolution and PDF output #2108
Comments
Correction: I also tried the --dpi command-line setting, which does seem to do something, but I am not sure exactly what. When I supply --dpi 200 with the intention of running OCR at that resolution while the input is 400 dpi, I get an output PDF file 200% larger than the input. It's not a big deal for me to resize that PDF in code because the content is correct, but I would like to understand what exactly --dpi is trying to do internally.
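One plausible reading of the behavior described above, offered as an assumption rather than a statement about tesseract's internals: if the PDF page size is derived from the pixel dimensions and the declared DPI, then halving the declared DPI doubles the page size without touching the pixels. The sketch below only does that arithmetic; the function name and the sample pixel dimensions are hypothetical.

```python
def pdf_page_size_points(width_px, height_px, declared_dpi):
    """Page size in PDF points (1/72 inch), assuming size = pixels / DPI.

    This formula is an assumption used for illustration, not a quote of
    tesseract's PDF renderer.
    """
    return (width_px * 72.0 / declared_dpi, height_px * 72.0 / declared_dpi)


# Hypothetical scan: 3308 x 4676 pixels, actually captured at 400 dpi.
true_size = pdf_page_size_points(3308, 4676, 400)

# Same pixels, but the resolution is declared as 200 dpi (as with --dpi 200).
overridden_size = pdf_page_size_points(3308, 4676, 200)

# Ratio of page widths: the declared-200 page is twice as wide.
scale = overridden_size[0] / true_size[0]
print(true_size, overridden_size, scale)
```

Under this assumption, --dpi changes what resolution the image is declared to have and does not resample it, which would match a page that comes out at 200% of the expected size.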
I am afraid this is outside the scope of the tesseract project (but you can contribute code :-)). IMO there are other projects where your request would make more sense, e.g. https://github.com/jbarlow83/ocrmypdf or https://github.com/OpaitSoftware/TesseractStudio.Net. I will keep it open for comment from @jbreiden (who contributed the PDF output) for a final decision. At the moment I am labeling it WontFix.
This is a pretty advanced scenario, and I think the best way is using the
textonly_pdf option. Then you can merge the high resolution images into the
resulting PDF using an external tool. See issue #660 for an example of one
such tool.
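The first half of that workflow could be sketched as follows. This is a minimal illustration, not a documented recipe: the helper name and file names are hypothetical, and it only assembles the command line (using the textonly_pdf option and the --dpi setting mentioned in this thread) for a downsampled copy of the scan. Merging the original high-resolution image into the resulting text-only PDF is left to the external tool and is not shown.

```python
def build_textonly_cmd(image_path, output_base, dpi=None):
    """Build a tesseract command that writes a text-only PDF.

    image_path  -- the downsampled image used for OCR (hypothetical name)
    output_base -- output path without the .pdf extension
    dpi         -- optional resolution to declare for the OCR image
    """
    cmd = ["tesseract", image_path, output_base]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    # textonly_pdf=1 asks for the invisible text layer without the page image.
    cmd += ["-c", "textonly_pdf=1", "pdf"]
    return cmd


cmd = build_textonly_cmd("scan_200dpi.png", "scan_text", dpi=200)
print(" ".join(cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)`, assuming tesseract is installed and on the PATH.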
I reviewed the comments in issue #660 and they completely cover what I had in mind. I closed this issue. Thank you again.
@CanadianHusky thanks for reporting this. I have not used my specific Tesseract workflow in the last few months, but I will check this again now.
If you are feeling generous, please find the documentation for this feature
and improve it. Glad that it works for you!
This post is more of a feature request, or maybe a little-known command-line option already exists that I am not aware of that can accomplish the following scenarios.
Scenario 1:
An input PNG/JPG/TIF at DIN A3 size and 600/1200 dpi resolution, with high-quality images and text, is fed to tesseract with the "pdf" switch added on the command line.
The output PDF is mostly fine, but because of the high input resolution, processing takes a long time.
The same content is downsampled and fed to tesseract at 150/200/300 dpi for testing purposes.
The output has a --higher-- level of OCR accuracy, and processing is considerably faster, as expected.
The only problem is that the output PDF has now lost the quality of the original input.
Scenario 2:
An input image at Arch E size (36x48 inch), 400 dpi, color scan. That is 14,400 x 19,200 pixels of RGB data at 3 bytes per pixel. The memory allocation requirement is so high that tesseract (and almost any other 'regular' software too) fails even before starting.
If the image is downsampled to 150/200 dpi, tesseract is able to generate a large-format PDF with extremely good OCR accuracy and reasonably good processing speed, but once again the fidelity of the original is lost due to downsampling.
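To put numbers on Scenario 2, here is a small sketch of the arithmetic. It counts a single uncompressed RGB copy of the page only; any working buffers the software allocates on top of that are not modeled, and the function name is hypothetical.

```python
def raw_rgb_size(width_in, height_in, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, total_bytes) for one uncompressed copy."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# Arch E (36 x 48 inch) at the original 400 dpi and downsampled to 200 dpi.
full = raw_rgb_size(36, 48, 400)
reduced = raw_rgb_size(36, 48, 200)

for label, (w, h, nbytes) in (("400 dpi", full), ("200 dpi", reduced)):
    print(f"{label}: {w} x {h} px, {nbytes:,} bytes ({nbytes / 2**20:.1f} MiB)")
```

Halving the resolution cuts the pixel count, and therefore the raw buffer, to one quarter: 829,440,000 bytes at 400 dpi versus 207,360,000 bytes at 200 dpi.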
Suggestion A:
Having a -c OCRResolution=... option on the command line that internally downsamples the input to that dpi, performs OCR on it, but uses the -original- image for PDF output would solve the above concerns very easily. Even better would be an additional switch, -c Outputresolution=..., where the user can choose the output resolution. It may very well be that the input is 1200 dpi, the optimum OCR resolution is 200, and the optimum output resolution is 600 for a smaller file size.
Suggestion B:
Alternatively, having command-line options -c OCRImage=..... -c PDFOutputImage=...... that allow the user to specify separate input images, one to be used for OCR processing and one to be embedded into the PDF.
Either one of the suggestions would make a valuable addition in my opinion.
Thank you for the consideration