
Relax requirements on pixel density in image metadata #129

Merged
merged 1 commit into OCR-D:master on Nov 5, 2019

Conversation

kba
Member

@kba kba commented Nov 1, 2019

A concession to a world full of erroneous metadata.

Comment on lines +33 to +36
However, since technical metadata about pixel density is so often lost in
conversion or inaccurate, processors should assume **300 ppi** for images with
missing or suspiciously low pixel density metadata.

Collaborator

What (exactly/programmatically) is suspiciously low?

Is 72? (72 is often used for non-print use/export. Like dfg-viewer.)

How about suspiciously high?

Shouldn't that suspicion be relative to pixel resolution, too? (An image with low density and low resolution is not as suspicious as an image with low density but high resolution. Et cetera.)

(Fundamentally, I disagree with this solution. What is the point in having these meta-data and making them a requirement, if processors cannot use them anyway? And how is this supposed to work for implementors? If the algorithm needs an accurate DPI value to function properly or to expose dimension parameters to the user, but the framework does not provide it, then results will be bad without any indication of the cause.)

Member Author

> Is 72? (72 is often used for non-print use/export. Like dfg-viewer.)

No, 72 is low but not suspiciously low. There is lots of older digitized material that has, by default, only the low-quality export images available.

> What is the point in having these meta-data and making them a requirement, if processors cannot use them anyway?

Essentially, the point would be to not make this metadata a requirement because of reality. The content we are targeting primarily, historical prints digitized as part of the VD efforts, would require at least 300 ppi when adhering to the DFG guidelines on digitisation. But whether that is the case, whether these high-res images are exposed to the public in METS files and, if not, whether they even still exist, is far from clear. Without accurate measurements of the digitized object (which should be part of the technical metadata but, you guessed it, often are not) there is no way to sensibly guess the pixel density AFAICS.

> results will be bad without any indication of the cause.

Low PPI would still be a warning when validating the results. Image heuristics were also part of the cancelled quality assurance bid.

I understand your frustration but I see no other way to move forward than to relax the rules and have implementors assume that images have a certain pixel density.

Contributor

@kba @bertsky What about the following approach: We implement a processor (maybe as an extension to the exif wrapper) with a set of heuristics to detect suspicious (DPI) values. The processor also implements an auto-repair strategy (cf. tesseract-ocr/tesseract#1702 (comment)) which adds an automatically detected DPI value to the metadata (of course with a hint that the value was indeed automatically detected). Btw. in the original project layout, this was the task of module 1. We now see why the DFG would have done well in funding the corresponding proposal.
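A minimal sketch of what such a heuristic-plus-auto-repair step could look like, assuming Pillow for image access; the function name, thresholds, and the assumed page height are illustrative only and not taken from any existing OCR-D processor:

```python
from PIL import Image

def detect_and_repair_dpi(path, assumed_page_height_cm=29.7):
    """Flag suspicious DPI metadata and, if suspicious, estimate a replacement.

    The estimate crudely assumes the scan shows a roughly A4-high page; a real
    processor would use better heuristics (page type, layout, rulers, ...).
    """
    img = Image.open(path)
    xdpi, ydpi = (float(v) for v in img.info.get("dpi", (0, 0)))
    suspicious = xdpi < 50 or xdpi > 2400 or abs(xdpi - ydpi) > 0.1 * max(xdpi, 1)
    if not suspicious:
        return {"xdpi": xdpi, "ydpi": ydpi, "auto_detected": False}
    estimated = round(img.height / (assumed_page_height_cm / 2.54))
    # keep the hint that the value was auto-detected, as suggested above
    return {"xdpi": estimated, "ydpi": estimated, "auto_detected": True}
```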

Collaborator

> No, 72 is low but not suspiciously low. There is lots of older digitized material that has, by default, only the low-quality export images available.

Then the above formulation is too vague IMO. And it does not allow implementors (like core) to act on the problem reported on our GT, where an image was in fact 600 DPI but its meta-data said 72. We must be able to deal with this extreme divergence. You did not yet address the point about the relation between pixel density and pixel resolution (which would be relevant here).

> Essentially, the point would be to not make this metadata a requirement because of reality.

But there's a big difference between a datum as a requirement and a datum being correct as a requirement. The currently proposed formulation does not say when the annotated density can/must be ignored.

> I understand your frustration but I see no other way to move forward than to relax the rules and have implementors assume that images have a certain pixel density.

...have a certain fixed density? That would sacrifice quality on a large scale. (For many algorithms in segmentation and recognition, the actual DPI makes a big difference. If input data varies between 72 and 600 but we always must assume 300, results will be unnecessarily bad.)

I do see the problem of not being able to make assumptions about our input data, especially since your bids for quality assurance and image characterization have been declined – which is coming back to bite everyone now anyway.

But that's why I think the spec should specify very clearly when implementors can/must ignore the meta-data or even reject the input altogether. The exact heuristic formula could be left to the implementors, but its intuition should be explained here at least. So core can implement the formula within its own (first approximation to) image characterization.

Collaborator

@wrznr yes, that's exactly what I think is the only way to deal with this situation. In fact it's what I proposed for core already (still without auto-repair). But here we first have to sanction and require that behaviour.

Contributor

But the way out of the maze is the auto-correction! Having something like that will enable the processors to rely on DPI information.

Collaborator

> You can achieve the same (sans the auto-correction) by doing workspace validation on the input data. This will raise errors if pixel density is below 70 or not provided. But what then?

It's not the same thing: workspace validation can only give you an error message, but dynamic OcrdExif attributes (or static METS NISO/MIX annotation) could actually repair your workflow (by either guessing the density, setting it to unknown / 1, or raising an exception).

> You did not yet address the point about the relation between pixel density and pixel resolution (which would be relevant here).

> One way I see to detect incorrectly high pixel density would be to approximate the page dimensions from density and resolution and warn if the page would be smaller than 10x10 cm or larger than 40x40 cm or similar.

Yes, that's exactly what I meant above – except for warn. The exact heuristics could be much more tolerant than that, and probably take the type of medium into account (newspapers can be expected to be larger etc).
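For illustration, the plausibility check discussed in the last two comments could be as simple as the sketch below; the 10x10 cm and 40x40 cm bounds are the figures mentioned above, while the function name and everything else is assumed:

```python
def density_is_plausible(width_px, height_px, ppi, min_cm=10.0, max_cm=40.0):
    """Approximate the physical page size implied by resolution and density."""
    width_cm = width_px / ppi * 2.54
    height_cm = height_px / ppi * 2.54
    return min_cm <= width_cm <= max_cm and min_cm <= height_cm <= max_cm

# Example: a 4960x7016 px scan whose metadata claims 72 ppi would measure about
# 175x247 cm (implausible), whereas at 600 ppi it comes out at roughly
# 21x29.7 cm, i.e. A4.
```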

Member Author

> auto-correction

It doesn't solve the problem though. It means another step in the workflow and conceals the uncertainty about the pixel density. Right now, users get errors for this when validating the workspace and possibly warnings from OcrdExif. A dedicated initial step to "auto-correct" pixel density would only hide these errors IMHO. Plus that is more a question of implementation. What do we expect from processor implementers? If I understand you correctly, you'd prefer to have strict rules in the specification ("If PPI < 300 or not provided, raise exception")?

Member Author

> It's not the same thing

Yes, that's why I wrote (sans the auto-correction)

> take the type of medium into account

@tboenig What would be a reliable way to achieve that from commonly used METS?

Collaborator

@bertsky bertsky Nov 4, 2019

> > It's not the same thing
>
> Yes, that's why I wrote (sans the auto-correction)

No, it's not the same thing regardless of auto-repair. You can get better workflow behaviour without extra validation steps (started by the workflow engine or the processor). Everything is better than a mere warning that simply drowns in the many errors and warnings you usually see in the log output – as I said: exception (in the workflow), fallback to unknown/1, fallback to repair.

> > auto-correction
>
> It doesn't solve the problem though. It means another step in the workflow and conceals the uncertainty about the pixel density.

Not necessarily. @wrznr also said earlier:

> (of course with a hint that the value was indeed automatically detected)

Most processors would not care to do their own plausibility checks or even repair heuristics – they need something that works better than both the wrong value and the 300 DPI assumption on average. But some processors will offer more clever mechanisms, and they can be informed about this via extra annotation/attribute (e.g. uncertain).

> Plus that is more a question of implementation. What do we expect from processor implementers? If I understand you correctly, you'd prefer to have strict rules in the specification ("If PPI < 300 or not provided, raise exception")?

No, that's not what I said or meant. `not provided` is still fine and `PPI < 300` is too simplistic – that was my very point! If density is implausible w.r.t. resolution – and this could be laid out with or without concrete figures or rules here, but it needs to mention both – or density is implausible on its own (i.e. extreme), then act: repair, or mark as implausible, or even raise exception. I can live with any of those 3, provided it is sanctioned by the spec and therefore reliable.

(And if implementors find this hard to achieve, they could always use core, even partially.)
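The three-way behaviour argued for here (repair, mark as implausible, or raise) could be expressed roughly as follows; `resolve_density` and the policy names are hypothetical, and the plausibility test is the sketch from the earlier comment, not spec text:

```python
def resolve_density(ppi, width_px, height_px, policy="mark"):
    """Return (density, uncertain) according to a spec-sanctioned policy."""
    if ppi is None:
        return 300, True        # not provided: fall back to the default, but flag it
    if density_is_plausible(width_px, height_px, ppi):
        return ppi, False       # metadata looks trustworthy w.r.t. resolution
    if policy == "raise":
        raise ValueError("implausible pixel density: %s ppi" % ppi)
    if policy == "repair":
        return 300, True        # or re-estimate, cf. the auto-repair sketch above
    return ppi, True            # "mark": keep the value but flag it as uncertain
```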

@kba kba requested a review from cneud November 4, 2019 11:27
@cneud
Member

cneud commented Nov 4, 2019

My 2cts:

  • image metadata such as provided by EXIF etc. can never be trusted
  • OCR-D processors should therefore always throw a warning if image ppi is <150 or not set, but continue processing the document
  • imho there is no reliable way to determine or even "repair" actual ppi post-digitization (and it is also well beyond the scope of OCR-D to fix this)
  • of course, the GT should not include such images, but assets may well include such data for testing
  • commonly OCR classifiers are optimized for 300 ppi, which is what OCR-D should follow too

Member

@cneud cneud left a comment

Fine with assuming 300 ppi as default, but we should throw a warning if ppi is <150 or not set in image metadata (if this is not already in core)?

@kba kba merged commit 7a568fe into OCR-D:master Nov 5, 2019
@bertsky
Collaborator

bertsky commented Nov 5, 2019

Merging in the middle of a dispute where more interested parties have not even had the chance to participate: This is bad style!

Besides, this is an important decision with long-term consequences. I don't see the point in rushing this so hard.

@@ -30,6 +30,10 @@ $> exiftool output.tif |grep 'X Resolution'
 "150"
 ```
+However, since technical metadata about pixel density is so often lost in
+conversion or inaccurate, processors should assume **300 ppi** for images with
Contributor

@kba I thought that we agreed that I would have the chance to reformulate that a bit...

@kba
Member Author

kba commented Nov 5, 2019

I did not want to kill the conversation. I just wanted to make pixel density a warning, not an error, in core. Otherwise we can only recommend users to skip this error altogether (as we do in CI for assets, with the consequence that we still don't have reliably correct GT), which is not what we want.

So please consider this an ongoing task that will be fine-tuned with input from @bertsky @wrznr @cneud et al. While I'm determined to release 2.0.0 today, there's no reason a 2.1.0 or similar cannot remedy your grievances.

@cneud
Member

cneud commented Nov 5, 2019

100% ACK @kba.

Please @bertsky @wrznr consider the time pressure on us as we push for a release with the requested API changes & additions ready for the library pilots starting from tomorrow. Certainly your feedback is very welcome for >2.0.0, but we just had to act fast in this case.

bertsky referenced this pull request in ocr-d-modul-2-segmentierung/ocrd-pixelclassifier-segmentation Nov 19, 2019
Most actual functionality is moved to the page-segmentation project.
This project now only implements the ocrd-compliant wrapper around it.