Don't use DPI as a way to refer to word size in documentation #1846

Closed
albertoandreottiATgmail opened this issue Aug 16, 2018 · 8 comments

Comments

@albertoandreottiATgmail

Hi,

here,
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#rescaling

you recommend using 300 DPI images. That doesn't make any sense, images don't have a DPI until you print them.
You should give your recommendation in terms of a minimum number of pixels, for example for the height of a word. I can have an image where the letter 'a' is 20 pixels high, or one where it is 200 pixels high. The two images will give different results in terms of performance.
Independently of that, I can print both images at 300 dpi.

Am I missing something?

Alberto.
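
For readers following the thread, the link between scan resolution and pixel height can be sketched with a little arithmetic. This is only an illustration, assuming the text comes from a scanned paper document. The function name `glyph_height_px` and the 10 pt font size are made up for the example and do not come from the Tesseract documentation. The result is the height of the font's em box; a lowercase 'a' is usually around half of that.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def glyph_height_px(font_size_pt: float, scan_dpi: float) -> float:
    """Approximate pixel height of the em box of a font scanned at scan_dpi."""
    # Physical height in inches, multiplied by pixels sampled per inch.
    return font_size_pt / POINTS_PER_INCH * scan_dpi


# Hypothetical 10 pt body text scanned at three different resolutions.
for dpi in (70, 150, 300):
    print(f"{dpi} dpi: {glyph_height_px(10, dpi):.1f} px")
```

The same printed letter ends up with very different pixel heights depending on the scan resolution, so a DPI figure only says something about pixel size once the physical size of the text is known.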

@H-Bluhm

H-Bluhm commented Aug 17, 2018

I would assume that dpi and ppi are used interchangeably here.
Since, as you laid out, the technical meaning of dpi does not make a lot of sense in this case, I think ppi is what was meant.

@Shreeshrii
Collaborator

Shreeshrii commented Aug 17, 2018 via email

@jbreiden
Contributor

jbreiden commented Sep 6, 2018

> That doesn't make any sense, images don't have a DPI until you print them.

JPEG & PNG both support resolution metadata. Please use it.
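
As an illustration of what that metadata looks like in PNG, here is a minimal sketch that builds a `pHYs` chunk using only the Python standard library. PNG stores the density in pixels per metre, so a DPI value has to be converted first. The function name `png_phys_chunk` is hypothetical, and the sketch only builds the chunk bytes. It does not insert them into a file, where the chunk would have to come before the first `IDAT` chunk. In practice an image library would normally handle this.

```python
import struct
import zlib

METRES_PER_INCH = 0.0254


def png_phys_chunk(dpi: float) -> bytes:
    """Build a PNG pHYs chunk declaring square pixels at the given DPI."""
    ppm = round(dpi / METRES_PER_INCH)       # PNG stores pixels per metre
    data = struct.pack(">IIB", ppm, ppm, 1)  # x, y, unit specifier 1 = metre
    chunk_type = b"pHYs"
    # The CRC covers the chunk type and data, not the length field.
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


chunk = png_phys_chunk(300)
print(len(chunk), struct.unpack(">II", chunk[8:16]))
```

Note that this metadata is a declaration about the intended physical size. It does not change the pixel data.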

@Charlie313

How do I just remove myself completely from all of this?

@zdenop
Contributor

zdenop commented Sep 28, 2018

@albertoandreottiATgmail: I do not know what your aim is, but you are approaching this from the wrong end.
If you are digitizing a paper document, you will get different image quality (= different OCR quality) depending on whether you scan at 300 dpi or at 70 dpi, even though the size of the letters on the paper is the same. That is why the suggestion is given in terms of image dpi: you cannot change the size of printed letters.
Anyway, please use the tesseract user forum for discussion.

zdenop closed this as completed Sep 28, 2018

@stweil
Member

stweil commented Sep 29, 2018

@albertoandreottiATgmail and @zdenop, you are talking about different things. Yes, of course it makes a difference whether scanning is done at a high or a low resolution. But that is only a relative value. Scanning a large poster at 70 dpi will give the same picture as scanning a small printout of the poster at 300 dpi. A human won't see any difference when viewing the resulting image file on a screen and will be able to read the text in both cases. So I'd expect that it also makes no difference for Tesseract. Currently it does! An image which was converted from 300 dpi to 600 dpi gives a different (typically better) result with Tesseract, although no information was added and the quality of the image does not improve through such a conversion. Other OCR software does not need or use the resolution information from the input image as far as I know.
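
The poster comparison above can be checked with a short calculation. This is a sketch with made-up sizes: the 30 inch and 7 inch widths and the name `scanned_width_px` are assumptions chosen so the numbers come out evenly, not values from this thread.

```python
def scanned_width_px(width_in: float, dpi: float) -> int:
    """Pixel width produced by scanning something width_in inches wide."""
    return round(width_in * dpi)


poster = scanned_width_px(30.0, 70)    # hypothetical 30-inch-wide poster
printout = scanned_width_px(7.0, 300)  # the same poster printed 7 inches wide
print(poster, printout)
```

Both scans give the same pixel width, so the images carry the same amount of detail even though their nominal resolutions differ by a factor of more than four.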

@amitdo
Collaborator

amitdo commented Sep 29, 2018

The explanation the OP expects is already present in another wiki page.
See my above link.

Also see Ray's remark in #756 (comment)
