wrap scale estimation as separate processor for DPI estimation #34

Open
bertsky opened this issue Dec 18, 2019 · 0 comments


bertsky commented Dec 18, 2019

It would be useful to have a dedicated processor for DPI estimation in OCR-D, because we need DPI metadata but cannot rely on it. (Most Ocropy segmentation steps now zoom according to the annotated DPI value instead of assuming 300 DPI. The situation is probably similar in other modules.)

Tesseract already has such functionality, based on its internal line segmentation: first the average scale gets estimated, then it gets multiplied by a constant to yield the DPI. This rests on the assumption that xheight is more or less homogeneous across the page. (Which it is not!) But Tesseract's API does not export that estimate, and does not give access to the TO_BLOCK_LIST which holds the average line_size.

So it's probably best to use ocrolib.psegutils.estimate_scale for this in the same fashion.

But since we know that pages can have widely varying font sizes, we should look at scales more locally, and then find a better statistic than just the median to give us the mean xheight of a 12pt text line.

This could be achieved as follows: in estimate_scale, we add an option to look at the np.histogram of blob sizes (square root of box areas for connected components), trying to filter out both the tiny boxes originating from noise and the huge boxes from headings and drop-caps. Then we use that in a dedicated processor ocrd-cis-ocropy-estimate-density, multiplying the estimated scale by a configurable constant factor (defaulting to, e.g., 10) to yield the DPI estimate. We annotate this in PAGE-XML under PcGts/Page/@imageXResolution and PcGts/Page/@imageYResolution with PcGts/Page/@imageResolutionUnit="PPI". A future OcrdExif in core can then use that information to override the EXIF data found in the binary image.
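To make the idea more concrete, here is a rough sketch of the histogram step, not a finished implementation. It assumes the connected components have already been extracted and are passed in as (height, width) bounding boxes. The function names `estimate_scale_from_boxes` and `estimate_dpi`, the thresholds `min_size` and `max_ratio`, and the choice of the histogram peak as the statistic are all placeholders for illustration; none of them are part of ocrolib.

```python
import numpy as np


def estimate_scale_from_boxes(boxes, min_size=3.0, max_ratio=4.0, bins=32):
    """Estimate the dominant blob size from bounding boxes.

    boxes: iterable of (height, width) in pixels, one per connected component.
    min_size, max_ratio, bins: hypothetical tuning parameters.
    """
    boxes = np.asarray(list(boxes), dtype=float)
    if boxes.size == 0:
        raise ValueError("no connected components given")
    # blob size = square root of the bounding box area
    sizes = np.sqrt(boxes[:, 0] * boxes[:, 1])
    # drop tiny boxes, which mostly originate from noise
    sizes = sizes[sizes >= min_size]
    if sizes.size == 0:
        raise ValueError("only noise-sized components found")
    # drop huge boxes (headings, drop-caps) relative to the median size
    sizes = sizes[sizes <= max_ratio * np.median(sizes)]
    # take the centre of the most populated histogram bin as the scale
    counts, edges = np.histogram(sizes, bins=bins)
    peak = int(np.argmax(counts))
    return float(0.5 * (edges[peak] + edges[peak + 1]))


def estimate_dpi(scale, factor=10.0):
    """Multiply the estimated scale by a configurable constant factor."""
    return int(round(scale * factor))


if __name__ == "__main__":
    # synthetic page: many glyph-sized boxes, some specks, a few drop-caps
    boxes = [(20, 20)] * 200 + [(1, 1)] * 50 + [(200, 200)] * 5
    scale = estimate_scale_from_boxes(boxes)
    print(scale, estimate_dpi(scale))
```

Using the histogram peak instead of the median is only one option. The same filtered `sizes` array could also be evaluated per region to look at scales more locally.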
