
Relax requirements on pixel density in image metadata #129

Merged
merged 1 commit into OCR-D:master on Nov 5, 2019

Conversation

kba
Member

@kba kba commented Nov 1, 2019

A concession to a world full of erroneous metadata.

Comment on lines +33 to +36
However, since technical metadata about pixel density is so often lost in
conversion or inaccurate, processors should assume **300 ppi** for images with
missing or suspiciously low pixel density metadata.

Collaborator

What (exactly/programmatically) is suspiciously low?

Is 72? (72 is often used for non-print use/export. Like dfg-viewer.)

How about suspiciously high?

Shouldn't that suspicion be relative to pixel resolution, too? (An image with low density and low resolution is not as suspicious as an image with low density but high resolution. Et cetera.)

(Fundamentally, I disagree with this solution. What is the point in having these meta-data and making them a requirement, if processors cannot use them anyway? And how is this supposed to work for implementors? If the algorithm needs an accurate DPI value to function properly or to expose dimension parameters to the user, but the framework does not provide it, then results will be bad without any indication of the cause.)

Member Author

> Is 72? (72 is often used for non-print use/export. Like dfg-viewer.)

No, 72 is low but not suspiciously low. There is lots of older digitized material that has, by default, only the low-quality export images available.

> What is the point in having these meta-data and making them a requirement, if processors cannot use them anyway?

Essentially, the point would be to not make this metadata a requirement because of reality. The content we are targeting primarily, historical prints digitized as part of the VD efforts, would require at least 300 ppi when adhering to the DFG guidelines on digitisation. But whether that is the case, whether these high-res images are exposed to the public in METS files and, if not, whether they even still exist, is far from clear. Without accurate measurements of the digitized object (which should be part of the technical metadata but, you guessed it, often are not) there is no way to sensibly guess the pixel density AFAICS.

> results will be bad without any indication of the cause.

Low PPI would still be a warning when validating the results. Image heuristics were also part of the cancelled quality assurance bid.

I understand your frustration but I see no other way to move forward than to relax the rules and have implementors assume that images have a certain pixel density.

Contributor

@kba @bertsky What about the following approach: We implement a processor (maybe as an extension to the exif wrapper) with a set of heuristics to detect suspicious (DPI) values. The processor also implements an auto-repair strategy (cf. tesseract-ocr/tesseract#1702 (comment)) which adds an automatically detected DPI value to the metadata (of course with a hint that the value was indeed automatically detected). Btw. in the original project layout, this was the task of module 1. We now see why the DFG would have done well in funding the corresponding proposal.
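A minimal sketch of what such a heuristic-plus-auto-repair step could look like, assuming Pillow for image access; the function name, thresholds, and the assumed page height are illustrative only and not taken from any existing OCR-D processor:

```python
from PIL import Image

def detect_and_repair_dpi(path, assumed_page_height_cm=29.7):
    """Flag suspicious DPI metadata and, if suspicious, estimate a replacement.

    The estimate crudely assumes the scan shows a roughly A4-high page; a real
    processor would use better heuristics (page type, layout, rulers, ...).
    """
    img = Image.open(path)
    xdpi, ydpi = (float(v) for v in img.info.get("dpi", (0, 0)))
    suspicious = xdpi < 50 or xdpi > 2400 or abs(xdpi - ydpi) > 0.1 * max(xdpi, 1)
    if not suspicious:
        return {"xdpi": xdpi, "ydpi": ydpi, "auto_detected": False}
    estimated = round(img.height / (assumed_page_height_cm / 2.54))
    # keep the hint that the value was auto-detected, as suggested above
    return {"xdpi": estimated, "ydpi": estimated, "auto_detected": True}
```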

Collaborator

> No, 72 is low but not suspiciously low. There is lots of older digitized material that has, by default, only the low-quality export images available.

Then the above formulation is too vague IMO. And it does not allow implementors (like core) to act on the problem reported on our GT, where an image was in fact 600 DPI but its meta-data said 72. We must be able to deal with this extreme divergence. You did not yet address the point about the relation between pixel density and pixel resolution (which would be relevant here).

> Essentially, the point would be to not make this metadata a requirement because of reality.

But there's a big difference between a datum as a requirement and a datum being correct as a requirement. The currently proposed formulation does not say when the annotated density can/must be ignored.

> I understand your frustration but I see no other way to move forward than to relax the rules and have implementors assume that images have a certain pixel density.

...have a certain fixed density? That would sacrifice quality on a large scale. (For many algorithms in segmentation and recognition, the actual DPI makes a big difference. If input data varies between 72 and 600 but we always must assume 300, results will be unnecessarily bad.)

I do see the problem of not being able to make assumptions about our input data, especially since your bids for quality assurance and image characterization have been declined – which is coming back to bite everyone now anyway.

But that's why I think the spec should specify very clearly when implementors can/must ignore the meta-data or even reject the input altogether. The exact heuristic formula could be left to the implementors, but its intuition should be explained here at least. So core can implement the formula within its own (first approximation to) image characterization.

Collaborator

@wrznr yes, that's exactly what I think is the only way to deal with this situation. In fact it's what I proposed for core already (still without auto-repair). But here we first have to sanction and require that behaviour.

Contributor

But the way out of the maze is the auto-correction! Having something like that will enable the processors to rely on DPI information.

Collaborator

> You can achieve the same (sans the auto-correction) by doing workspace validation on the input data. This will raise errors if pixel density is below 70 or not provided. But what then?

It's not the same thing: workspace validation can only give you an error message, but dynamic OcrdExif attributes (or static METS NISO/MIX annotation) could actually repair your workflow (by either guessing the density, setting it to unknown / 1, or raising an exception).

> You did not yet address the point about the relation between pixel density and pixel resolution (which would be relevant here).

> One way I see to detect incorrectly high pixel density would be to approximate the page dimensions from density and resolution and warn if the page would be smaller than 10x10 cm or larger than 40x40 cm or similar.

Yes, that's exactly what I meant above – except for warn. The exact heuristics could be much more tolerant than that, and probably take the type of medium into account (newspapers can be expected to be larger etc).
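For illustration, the plausibility check discussed in the last two comments could be as simple as the sketch below; the 10x10 cm and 40x40 cm bounds are the figures mentioned above, while the function name and everything else is assumed:

```python
def density_is_plausible(width_px, height_px, ppi, min_cm=10.0, max_cm=40.0):
    """Approximate the physical page size implied by resolution and density."""
    width_cm = width_px / ppi * 2.54
    height_cm = height_px / ppi * 2.54
    return min_cm <= width_cm <= max_cm and min_cm <= height_cm <= max_cm

# Example: a 4960x7016 px scan whose metadata claims 72 ppi would measure about
# 175x247 cm (implausible), whereas at 600 ppi it comes out at roughly
# 21x29.7 cm, i.e. A4.
```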

Member Author

> auto-correction

It doesn't solve the problem though. It means another step in the workflow and conceals the uncertainty about the pixel density. Right now, users get errors for this when validating the workspace and possibly warnings from OcrdExif. A dedicated initial step to "auto-correct" pixel density would only hide these errors IMHO. Plus that is more a question of implementation. What do we expect from processor implementers? If I understand you correctly, you'd prefer to have strict rules in the specification ("If PPI < 300 or not provided, raise exception")?

Member Author

> It's not the same thing

Yes, that's why I wrote (sans the auto-correction)

> take the type of medium into account

@tboenig What would be a reliable way to achieve that from commonly used METS?

Collaborator

@bertsky bertsky Nov 4, 2019

> > It's not the same thing
>
> Yes, that's why I wrote (sans the auto-correction)

No, it's not the same thing regardless of auto-repair. You can get better workflow behaviour without extra validation steps (started by the workflow engine or the processor). Everything is better than a mere warning that simply drowns in the many errors and warnings you usually see in the log output – as I said: exception (in the workflow), fallback to unknown/1, fallback to repair.

> > auto-correction
>
> It doesn't solve the problem though. It means another step in the workflow and conceals the uncertainty about the pixel density.

Not necessarily. @wrznr also said earlier:

> (of course with a hint that the value was indeed automatically detected)

Most processors would not care to do their own plausibility checks or even repair heuristics – they need something that works better than both the wrong value and the 300 DPI assumption on average. But some processors will offer more clever mechanisms, and they can be informed about this via extra annotation/attribute (e.g. uncertain).

> Plus that is more a question of implementation. What do we expect from processor implementers? If I understand you correctly, you'd prefer to have strict rules in the specification ("If PPI < 300 or not provided, raise exception")?

No, that's not what I said or meant. `not provided` is still fine and `PPI < 300` is too simplistic – that was my very point! If density is implausible w.r.t. resolution – and this could be laid out with or without concrete figures or rules here, but it needs to mention both – or density is implausible on its own (i.e. extreme), then act: repair, or mark as implausible, or even raise exception. I can live with any of those 3, provided it is sanctioned by the spec and therefore reliable.

(And if implementors find this hard to achieve, they could always use core, even partially.)
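The three-way behaviour argued for here (repair, mark as implausible, or raise) could be expressed roughly as follows; `resolve_density` and the policy names are hypothetical, and the plausibility test is the sketch from the earlier comment, not spec text:

```python
def resolve_density(ppi, width_px, height_px, policy="mark"):
    """Return (density, uncertain) according to a spec-sanctioned policy."""
    if ppi is None:
        return 300, True        # not provided: fall back to the default, but flag it
    if density_is_plausible(width_px, height_px, ppi):
        return ppi, False       # metadata looks trustworthy w.r.t. resolution
    if policy == "raise":
        raise ValueError("implausible pixel density: %s ppi" % ppi)
    if policy == "repair":
        return 300, True        # or re-estimate, cf. the auto-repair sketch above
    return ppi, True            # "mark": keep the value but flag it as uncertain
```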

@kba kba requested a review from cneud November 4, 2019 11:27
@cneud
Member

cneud commented Nov 4, 2019

My 2cts:

  • image metadata such as provided by EXIF etc. can never be trusted
  • OCR-D processors should therefore always throw a warning if image ppi is <150 or not set, but continue processing the document
  • imho there is no reliable way to determine or even "repair" actual ppi post-digitization (and it is also well beyond the scope of OCR-D to fix this)
  • of course, the GT should not include such images, but assets may well include such data for testing
  • commonly OCR classifiers are optimized for 300 ppi, which is what OCR-D should follow too

Member

@cneud cneud left a comment

Fine with assuming 300 ppi as default, but we should throw a warning if ppi is <150 or not set in image metadata (if this is not already in core)?

@kba kba merged commit 7a568fe into OCR-D:master Nov 5, 2019
@bertsky
Collaborator

bertsky commented Nov 5, 2019

Merging in the middle of a dispute where more interested parties have not even had the chance to participate: This is bad style!

Besides, this is an important decision with long-term consequences. I don't see the point in rushing this so hard.

@@ -30,6 +30,10 @@ $> exiftool output.tif |grep 'X Resolution'
 "150"
 ```
+However, since technical metadata about pixel density is so often lost in
+conversion or inaccurate, processors should assume **300 ppi** for images with
Contributor

@kba I thought that we agreed that I would have the chance to reformulate that a bit...

@kba
Member Author

kba commented Nov 5, 2019

I did not want to kill the conversation. I just wanted to make pixel density a warning, not an error, in core. Otherwise we can only recommend users to skip this error altogether (as we do in CI for assets, with the consequence that we still don't have reliably correct GT), which is not what we want.

So please consider this an ongoing task that will be fine-tuned with input from @bertsky @wrznr @cneud et al. While I'm determined to release 2.0.0 today, there's no reason a 2.1.0 or similar cannot remedy your grievances.

@cneud
Member

cneud commented Nov 5, 2019

100% ACK @kba.

Please @bertsky @wrznr consider the time pressure on us as we push for a release with the requested API changes & additions ready for the library pilots starting from tomorrow. Certainly your feedback is very welcome for >2.0.0, but we just had to act fast in this case.

bertsky referenced this pull request in ocr-d-modul-2-segmentierung/ocrd-pixelclassifier-segmentation Nov 19, 2019
Most actual functionality is moved to the page-segmentation project.
This project now only implements the ocrd-compliant wrapper around it.