Relax requirements on pixel density in image metadata #129
````diff
@@ -30,6 +30,10 @@ $> exiftool output.tif |grep 'X Resolution'
 "150"
 ```
 
+However, since technical metadata about pixel density is so often lost in
+conversion or inaccurate, processors should assume **300 ppi** for images with
+missing or suspiciously low pixel density metadata.
+
````
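For illustration only, a minimal sketch (Python, using Pillow) of what the proposed wording could mean inside a processor. The 72 ppi threshold below is a placeholder, since "suspiciously low" is not defined in the diff; only the 300 ppi default comes from the proposed text.

```python
from PIL import Image  # Pillow; any library exposing the resolution tag would do

ASSUMED_PPI = 300      # fallback taken from the proposed wording
SUSPICIOUS_PPI = 72    # placeholder -- "suspiciously low" is not defined in the diff

def effective_ppi(image_path):
    """Return the pixel density a processor should actually work with."""
    with Image.open(image_path) as img:
        # Pillow exposes the TIFF/JPEG resolution tag as an (x, y) tuple in
        # info["dpi"], but only if the file carries such a tag at all.
        dpi = img.info.get("dpi")
    if not dpi:
        return ASSUMED_PPI           # metadata missing
    x_ppi = float(dpi[0])
    if x_ppi <= SUSPICIOUS_PPI:
        return ASSUMED_PPI           # metadata present but suspiciously low
    return x_ppi
```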
Comment on lines +33 to +36:
What (exactly/programmatically) is "suspiciously low"? Is 72? (72 is often used for non-print use/export, like dfg-viewer.) How about suspiciously high? Shouldn't that suspicion be relative to pixel resolution, too? (An image with low density and low resolution is not as suspicious as an image with low density but high resolution. Et cetera.)

(Fundamentally, I disagree with this solution. What is the point in having these metadata and making them a requirement if processors cannot use them anyway? And how is this supposed to work for implementors? If the algorithm needs an accurate DPI value to function properly or to expose dimension parameters to the user, but the framework does not provide it, then results will be bad without any indication of the cause.)
No, 72 is low but not suspiciously low. Lots of older digitized material only has the low-quality export images available by default.
Essentially, the point would be to not make this metadata a requirement, because of the reality of existing data. The content we are primarily targeting, historical prints digitized as part of the VD efforts, would require at least 300 ppi when adhering to the DFG guidelines on digitisation. But whether that is the case, whether these high-res images are exposed to the public in METS files, and if not, whether they even still exist, is far from clear. Without accurate measurements of the digitized object (which should be part of the technical metadata but, you guessed it, are not consistently provided) there is no way to sensibly guess the pixel density AFAICS.
Low PPI would still be a warning when validating the results. Image heuristics were also part of the cancelled quality assurance bid. I understand your frustration, but I see no other way to move forward than relaxing the rules and having implementors assume that images have a certain pixel density.

@kba @bertsky What about the following approach: we implement a processor (maybe as an extension to the exif wrapper) which implements a set of heuristics to detect suspicious (DPI) values. The processor also implements an auto-repair strategy (cf. tesseract-ocr/tesseract#1702 (comment)) which adds an automatically detected DPI value to the metadata (of course with a hint that the value was indeed automatically detected). Btw., in the original project layout this was the task of module 1. We now see why the DFG would have done well in funding the corresponding proposal.
Then the above formulation is too vague, IMO. And it does not allow implementors (like core) to act on the problem reported on our GT, where an image was in fact 600 DPI but its metadata said 72. We must be able to deal with this extreme divergence. You did not yet address the point about the relation between pixel density and pixel resolution (which would be relevant here).
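One way to make "relative to pixel resolution" concrete is to check the physical page size that the annotated density implies: a 5000 px wide image annotated with 72 ppi would imply a page almost 70 inches wide, which is implausible for digitized prints. A hedged sketch; the plausibility bounds and pixel dimensions are invented placeholders, not values from the spec or this thread:

```python
def density_is_suspicious(width_px, height_px, ppi,
                          min_inches=2.0, max_inches=30.0):
    """Flag a density whose implied physical page size is implausible.

    The bounds are placeholders for "anything we would expect from
    digitized prints"; a real heuristic might also use known object
    dimensions from the technical metadata, where available.
    """
    if not ppi or ppi <= 0:
        return True                      # missing or nonsensical
    implied_width = width_px / ppi       # inches
    implied_height = height_px / ppi
    return not (min_inches <= implied_width <= max_inches and
                min_inches <= implied_height <= max_inches)

# The GT case mentioned above: a 600 DPI scan whose metadata says 72.
# For e.g. 5000 x 7000 px, 72 ppi implies a ~69 x 97 inch page -> suspicious.
assert density_is_suspicious(5000, 7000, 72) is True
assert density_is_suspicious(5000, 7000, 600) is False
```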
But there's a big difference between requiring a datum to be present and requiring it to be correct. The currently proposed formulation does not say when the annotated density can/must be ignored.
...have a certain fixed density? That would sacrifice quality on a large scale. (For many algorithms in segmentation and recognition, actual DPI makes a big difference. If input data varies between 72 and 600 but we must always assume 300, results will be unnecessarily bad.) I do see the problem of not being able to make assumptions on our input data, especially since your bids for quality assurance and image characterization have been declined – which comes back to bite everyone now anyway. But that's why I think the spec should specify very clearly when implementors can/must ignore the metadata or even reject the input altogether. The exact heuristic formula could be left to the implementors, but its intuition should be explained here at least, so core can implement the formula within its own (first approximation to) image characterization.

But the way out of the maze is the auto-correction! Having something like that will enable the processors to rely on DPI information.
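A hedged sketch of what such an auto-correction could do: trust the annotation only when the page size it implies is plausible, otherwise estimate a value and mark it as estimated (the "hint that the value was automatically detected"). The assumed page width, the bounds, and all names here are hypothetical; a production version might estimate from text line height instead, as Tesseract does.

```python
from dataclasses import dataclass

@dataclass
class PixelDensity:
    ppi: float
    estimated: bool   # True if the value was derived, not read from the image metadata

def repair_density(width_px, annotated_ppi, assumed_page_width_in=7.0):
    """Return a usable density, auto-estimating one if the annotation is dubious.

    The annotation is trusted only if the page width it implies is plausible
    for digitized prints (bounds are placeholders).  The fallback estimate
    simply divides the pixel width by an assumed physical page width of
    ~18 cm -- crude, but better than trusting an obviously wrong 72.
    """
    if annotated_ppi and annotated_ppi > 0:
        implied_width_in = width_px / annotated_ppi
        if 2.0 <= implied_width_in <= 30.0:
            return PixelDensity(ppi=float(annotated_ppi), estimated=False)
    return PixelDensity(ppi=width_px / assumed_page_width_in, estimated=True)

# 5000 px annotated as 72 ppi: rejected, re-estimated as ~714 ppi with estimated=True.
print(repair_density(5000, 72))
```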
It's not the same thing: workspace validation can only give you an error message, but dynamic OcrdExif attributes (or static METS NISO/MIX annotation) could actually repair your workflow (by either guessing the density, setting it to unknown / 1, or raising an exception).
Yes, that's exactly what I meant above – except for …
It doesn't solve the problem, though. It means another step in the workflow, and concealing the uncertainty about the pixel density. Right now, users get errors for this when validating the workspace and possibly warnings from OcrdExif. A dedicated initial step to "auto-correct" pixel density only hides these errors, IMHO. Plus, that is more a question of implementation. What do we expect from processor implementers? If I understand you correctly, you'd prefer to have strict rules in the specification ("If PPI < 300 or not provided, raise exception")?
Yes, that's why I wrote …
@tboenig What would be a reliable way to achieve that from commonly used METS?
No, it's not the same thing, regardless of auto-repair. You can get better behaviour of the workflow without extra validation steps (started by the workflow engine or the processor). Everything is better than a mere warning that simply drowns in the many errors and warnings you usually see in log output – as I said: exception (in the workflow), fallback to unknown/1, fallback to repair.
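To make the three behaviours named here concrete (exception in the workflow, fallback to unknown/1, fallback to repair), a hypothetical policy switch; neither OcrdExif nor core has such a parameter, this is only a sketch of the design space:

```python
class DensityError(ValueError):
    """Raised when pixel density metadata is missing or implausible."""

def resolve_density(annotated_ppi, policy="repair", fallback_ppi=300):
    """Decide what the framework hands to processors for a dubious density.

    policy="raise":   abort the workflow with an exception
    policy="unknown": expose the density as unknown, encoded as 1
    policy="repair":  fall back to a guessed value (here simply fallback_ppi)
    """
    plausible = annotated_ppi is not None and annotated_ppi > 72   # placeholder check
    if plausible:
        return annotated_ppi
    if policy == "raise":
        raise DensityError(f"implausible pixel density: {annotated_ppi!r}")
    if policy == "unknown":
        return 1
    return fallback_ppi
```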
Not necessarily. @wzrnr also said earlier:
Most processors would not care to do their own plausibility checks or even repair heuristics – they need something that works better than both the wrong value and the 300 DPI assumption on average. But some processors will offer more clever mechanisms, and they can be informed about this via extra annotation/attribute (e.g. …)
No, that's not what I said or meant. (And if implementors find this hard to achieve, they could always use core, even partially.)
## Unique ID for the document processed

METS provided to the MP must be uniquely addressable within the global library community.
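For illustration, a sketch of how a validator might look for such an identifier. Which METS field is supposed to carry it (the OBJID attribute, a mods:recordIdentifier, a PURL/URN, ...) is exactly what still needs to be specified, so the fields checked below are assumptions, not requirements.

```python
from lxml import etree

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

def find_unique_id(mets_path):
    """Return a candidate globally unique identifier from a METS file, if any.

    Checks mets:mets/@OBJID first, then mods:recordIdentifier and
    mods:identifier elements -- which of these the spec will actually
    require is still an open question.
    """
    tree = etree.parse(mets_path)
    objid = tree.getroot().get("OBJID")
    if objid and objid.strip():
        return objid.strip()
    for xpath in (".//mods:recordIdentifier", ".//mods:identifier"):
        for el in tree.iterfind(xpath, NS):
            if el.text and el.text.strip():
                return el.text.strip()
    return None
```

A workspace validator could then treat a missing identifier as an error, analogous to the pixel density checks discussed above.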
Comment on this section:
@kba I thought that we agreed that I would have the chance to reformulate that a bit...