more geometry heuristics for validate/repair #5

Open
bertsky opened this issue Jun 25, 2019 · 14 comments
Labels
enhancement New feature or request

Comments

@bertsky
Collaborator

bertsky commented Jun 25, 2019

We should have heuristics to check for

  • polygon containment (overlapping regions, word outside line etc.)
  • artifacts from annotation like point or line-like regions
  • lines with (way) too much whitespace (bad cropping, or bad segmentation)
  • probably even: missing @orientation

Originally posted by @kba in OCR-D/assets#28 (comment)

@bertsky
Collaborator Author

bertsky commented Jun 26, 2019

BTW, shapely.geometry.polygon.Polygon has a very nice API for the first 2 tasks, including the contains() method and the area property.
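A minimal sketch of how the first two checks could look with shapely. The function name `check_geometry` and the `min_area` threshold are made up for illustration; they are not part of any existing validator.

```python
from shapely.geometry import Polygon


def check_geometry(child: Polygon, parent: Polygon, min_area: float = 1.0):
    """Return a list of problem descriptions for one child/parent pair.

    `min_area` (in square pixels) is an arbitrary illustrative threshold.
    """
    problems = []
    if not child.is_valid:
        # Invalid outlines (e.g. self-intersections) make further checks unreliable.
        problems.append("child polygon is invalid")
        return problems
    if child.area < min_area:
        problems.append("child polygon is point- or line-like")
    if not parent.contains(child):
        problems.append("child is not contained in parent")
    return problems


region = Polygon([(0, 0), (100, 0), (100, 50), (0, 50)])
line_ok = Polygon([(10, 10), (90, 10), (90, 20), (10, 20)])
line_out = Polygon([(10, 40), (120, 40), (120, 48), (10, 48)])
sliver = Polygon([(10, 10), (90, 10), (90, 10.001)])

for name, poly in [("ok", line_ok), ("out", line_out), ("sliver", sliver)]:
    print(name, check_geometry(poly, region))
```

Here `area` is read as a property, while `contains()` is a method call. Whether a real check should use `contains()` or tolerate a few pixels of overshoot is a separate design question.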

The third could be achieved with ad-hoc binarization and some simple NumPy statistics like count_nonzero() (i.e. pixel-counting), or nonzero() followed by amin() and amax() to get non-white bounds (i.e. area-counting).
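Roughly what both variants could look like, assuming a grayscale line image as a 2-D `uint8` array (0 = black, 255 = white) and a fixed global threshold. The function name `whitespace_stats` and the threshold value are assumptions for illustration only.

```python
import numpy as np


def whitespace_stats(gray: np.ndarray, threshold: int = 128):
    """Return (ink_ratio, bbox_ratio) for a grayscale line image.

    ink_ratio:  share of foreground pixels (pixel-counting)
    bbox_ratio: share of the image covered by the bounding box
                of all foreground pixels (area-counting)
    """
    ink = gray < threshold            # ad-hoc global binarization
    n_ink = np.count_nonzero(ink)
    if n_ink == 0:
        return 0.0, 0.0               # completely white image
    ys, xs = np.nonzero(ink)
    height = np.amax(ys) - np.amin(ys) + 1
    width = np.amax(xs) - np.amin(xs) + 1
    return float(n_ink / ink.size), float(height * width / ink.size)


# Synthetic example: a 20x100 white image with a small block of ink.
img = np.full((20, 100), 255, dtype=np.uint8)
img[5:15, 10:30] = 0
print(whitespace_stats(img))
```

A line would then be flagged when either ratio falls below some cutoff, which would have to be tuned on real data.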

And orientation checking could be done in a similar way to deskewing (i.e. entropy-based), but with some kind of confidence measure.

@bertsky
Collaborator Author

bertsky commented Jul 18, 2019

The validation error classes in Aletheia, p. 118/119, are a good reference for additional checks.

@kba
Member

kba commented Aug 7, 2019

@bertsky
Collaborator Author

bertsky commented Aug 14, 2019

@kba
Member

kba commented Aug 15, 2019

https://github.com/OCR-D/ocrd_segment is a better place for this.

@kba kba closed this as completed Aug 15, 2019
@bertsky bertsky transferred this issue from OCR-D/core Aug 15, 2019
@bertsky bertsky changed the title from "extend PAGE validator with geometry heuristics" to "more geometry heuristics for validate/repair" Aug 15, 2019
@bertsky bertsky added the enhancement New feature or request label Aug 15, 2019
@bertsky
Collaborator Author

bertsky commented Aug 15, 2019

Moved the original issue from core here to have a better reminder of what is left to do.

Out of the original list, we are still somewhere in the first item I think. (We do not yet check whether elements are properly contained within their parents' outline.)

@bertsky bertsky reopened this Aug 15, 2019
@bertsky
Collaborator Author

bertsky commented Aug 16, 2019

(We do not yet check whether elements are properly contained within their parents' outline.)

And the question then is: what does repair look like in that case? Shrink the element's polygon or extend the parent's polygon?

@bertsky
Collaborator Author

bertsky commented Nov 12, 2019

Out of the original list, we are still somewhere in the first item I think. (We do not yet check whether elements are properly contained within their parents' outline.)

And the question then is: what does repair look like in that case? Shrink the element's polygon or extend the parent's polygon?

With #15 we now have covered the first item, except for repair. So far, we can only repair:

  • overlapping regions (with plausibilize=True) when near-equal or properly contained
    (but not near-contained or partial overlap)
  • lines extending from regions (with sanitize=True) by overwriting the region polygon with a hull of the lines
    (but not the other way, and not on the other levels)
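For illustration, the second repair could be sketched like this with shapely. This assumes a convex hull, which may not be the kind of hull actually used, and the helper name `hull_of_lines` is hypothetical.

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union


def hull_of_lines(line_polygons):
    """Return a polygon enclosing all given line polygons.

    Assumption: a convex hull is good enough; a tighter (concave)
    outline would need a different approach.
    """
    return unary_union(line_polygons).convex_hull


region = Polygon([(0, 0), (100, 0), (100, 30), (0, 30)])
lines = [
    Polygon([(0, 0), (100, 0), (100, 10), (0, 10)]),
    Polygon([(0, 20), (120, 20), (120, 30), (0, 30)]),  # sticks out of the region
]

new_region = hull_of_lines(lines)
print(all(new_region.covers(line) for line in lines))
```

The downside of a convex hull is that it can grow into neighbouring regions, so a repair like this may create new overlaps elsewhere.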

@wrznr
Collaborator

wrznr commented Nov 12, 2019

Partial Overlap of region a and b

  1. Merge a and b if of same type
  2. Shrink b to non-overlapping part (i.e. difference) if a is of type text
  3. Vice versa: shrink a if b is of type text
  4. Else?
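As a rough sketch, these rules could be written with shapely's set operations as follows. The function name and the plain string type labels are assumptions, rule 4 is left as "do nothing", and reading order is ignored entirely.

```python
from shapely.geometry import box


def repair_partial_overlap(a, b, type_a, type_b):
    """Resolve a partial overlap of regions a and b.

    Returns a list of (polygon, type) pairs. Type labels are plain
    strings here (e.g. "text", "image"), purely for illustration.
    """
    if a.intersection(b).area == 0:
        return [(a, type_a), (b, type_b)]           # nothing to repair
    if type_a == type_b:
        return [(a.union(b), type_a)]               # rule 1: merge
    if type_a == "text":
        return [(a, type_a), (b.difference(a), type_b)]   # rule 2: shrink b
    if type_b == "text":
        return [(a.difference(b), type_a), (b, type_b)]   # rule 3: shrink a
    return [(a, type_a), (b, type_b)]               # rule 4: undecided, keep as is


a = box(0, 0, 10, 10)
b = box(5, 0, 15, 10)
print(repair_partial_overlap(a, b, "text", "text"))
print(repair_partial_overlap(a, b, "text", "image"))
```

Note that `difference()` can return a MultiPolygon or an empty geometry, so a real repair would need to handle those cases before writing coordinates back.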

@bertsky
Collaborator Author

bertsky commented Nov 12, 2019

Partial Overlap of region a and b

1. Merge `a` and `b` if of same type

Yes, but for text regions we would need to bring in the concept of Allowable Merge (w.r.t. ReadingOrder and @readingDirection|@readingOrientation) first:

A merge is allowed iff a and b are direct successors in the reading order, they have the same reading direction, and the axis of that reading direction (i.e. horizontal vs vertical) is orthogonal to the axis on which both bounding boxes deviate most.
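One possible reading of that rule as code. This sketch assumes "deviate most" means the axis with the larger distance between bounding box centres; the `Region` class and all function names are hypothetical.

```python
from dataclasses import dataclass

from shapely.geometry import Polygon, box


@dataclass
class Region:
    polygon: Polygon
    ro_index: int            # position in the reading order
    reading_direction: str   # e.g. "left-to-right" or "top-to-bottom"


HORIZONTAL = {"left-to-right", "right-to-left"}


def direction_axis(direction: str) -> str:
    return "x" if direction in HORIZONTAL else "y"


def deviation_axis(a: Polygon, b: Polygon) -> str:
    """Axis along which the bounding box centres differ most (an assumption)."""
    ax0, ay0, ax1, ay1 = a.bounds
    bx0, by0, bx1, by1 = b.bounds
    dx = abs((ax0 + ax1) / 2 - (bx0 + bx1) / 2)
    dy = abs((ay0 + ay1) / 2 - (by0 + by1) / 2)
    return "x" if dx > dy else "y"


def merge_allowed(a: Region, b: Region) -> bool:
    return (
        abs(a.ro_index - b.ro_index) == 1
        and a.reading_direction == b.reading_direction
        and direction_axis(a.reading_direction) != deviation_axis(a.polygon, b.polygon)
    )


# Two paragraphs stacked vertically vs. two columns side by side.
top = Region(box(0, 0, 100, 40), 0, "left-to-right")
bottom = Region(box(0, 35, 100, 80), 1, "left-to-right")
right = Region(box(95, 0, 200, 40), 1, "left-to-right")
print(merge_allowed(top, bottom), merge_allowed(top, right))
```

With horizontal reading direction, the stacked pair may be merged while the side-by-side pair (probably two columns) may not.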

And if a merge is not allowed between two overlapping text regions, then the intersecting foreground should somehow fall into that region which it is most consistent with (i.e. regarding its alignment and center of mass).

  2. Shrink b to non-overlapping part (i.e. difference) if a is of type text

  3. Vice versa: shrink a if b is of type text

  4. Else?

If a and b are of different, both non-text type, I'd say it does not matter.

BTW, do we want to go into the complexities of using PAGE-XML's Layers? (Then we could avoid changing the coordinates altogether, and would merely have to decide on @zIndex ordering...)

@wrznr
Collaborator

wrznr commented Nov 12, 2019

Layers

I fear this implies drastic changes to core. Let's not do that for now.

ReadingOrder

We have to distinguish here: Right now, we do not have any RO computation. It is more or less arbitrary! Maybe if the DFKI guys deliver, this will change. I think we should sanitize and fix the RO ad hoc.

@bertsky
Collaborator Author

bertsky commented Nov 12, 2019

Layers

I fear this implies drastic changes to core. Let's not do that for now.

Agreed. (The way this is formalised in PAGE-XML, it would still be impossible to separate/suppress foreground automatically.)

ReadingOrder

We have to distinguish here: Right now, we do not have any RO computation. It is more or less arbitrary!

I disagree. Even if we don't know the reading order, that's a separate problem. No RO equals default RO (i.e. XML element order), right? Whatever the RO in the document, the repair decision always depends on it.

Maybe if the DFKI guys deliver, this will change. I think we should sanitize and fix the RO ad hoc.

Fixing RO is another problem/step. And especially when we have overlapping regions, this becomes circular if all we can do is heuristics.

IMHO a good RO detection would have to be data-driven, and informed by the precise @type (and possible @custom sub-type) of the regions.

@wrznr
Collaborator

wrznr commented Nov 12, 2019

No RO equals default RO

Actually, I think that indeed RO = default RO. But you're right, we should not base hacks on hacks.

@bertsky
Collaborator Author

bertsky commented Nov 12, 2019

No RO equals default RO

But you're right, we should not base hacks on hacks.

Well, or maybe just a little: Let's say we have a region segmentation like Tesseract that can output reading direction within regions (via orientation analysis), but is really bad on reading order between regions – creating XML elements more or less in random order. (The same could happen with a NN module without RO.)

Now, strictly speaking, when repairing we would be unable to merge or split most of the time (because 2 neighbouring/overlapping regions are XML successors only by chance). But we could still repair the unambiguous cases if we first added a new RO based on a top-down-left-to-right assumption (treating overlapping regions as neighbours), ... I think. At least as an extra option for the desperate.
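A naive version of such a fallback could simply sort regions by the top-left corner of their bounding boxes. This is only a sketch under that assumption; the function name and the dict-of-polygons input are made up for illustration, and it assumes image coordinates with y growing downwards.

```python
from shapely.geometry import box


def fallback_reading_order(regions):
    """Order region IDs top-down, then left-to-right.

    `regions` maps region IDs to shapely polygons.
    """
    def sort_key(item):
        minx, miny, _, _ = item[1].bounds
        return (miny, minx)

    return [region_id for region_id, _ in sorted(regions.items(), key=sort_key)]


# Regions created in more or less random element order.
regions = {
    "r3": box(0, 30, 100, 60),
    "r2": box(60, 0, 100, 20),
    "r1": box(0, 0, 50, 20),
}
print(fallback_reading_order(regions))
```

This breaks down on multi-column layouts, where a strict top-down sort interleaves the columns, so it really is an option for the desperate only.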
