Polygonalize segments by shrinking to children #162

Merged: 10 commits, Dec 10, 2020

bertsky
Collaborator

@bertsky bertsky commented Dec 5, 2020

This implements an idea I had a long time ago to at least get some polygons out of Tesseract, reducing the large overlap between bounding boxes. If you set shrink_polygons, then segment* and recognize will post-process their segmentation by projecting the convex hull upwards from textequiv_level to segmentation_level.

For example, ocrd-tesserocr-segment now yields:
[image: tesserocr-polygons]

After segmentation is done for the page, enter the
(most granular) lowest hierarchy level which has been
processed (i.e. textequiv_level) and project the hull
of its constituent outlines upwards to the highest
(least granular) hierarchy level on which segments
have been added (i.e. segmentation_level). In effect,
segments will have tight polygons instead of coarse
bounding boxes, with fewer unnecessary overlaps
between neighbours.
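
A minimal sketch of the bottom-up projection described in this commit message, not the code from the PR. It assumes segments are plain dicts with hypothetical `polygon` and `children` keys (the real processor works on PAGE objects), and it computes the hull with Andrew's monotone chain rather than whatever geometry library the processor uses.

```python
def convex_hull(points):
    """Convex hull of 2-D integer points via Andrew's monotone chain."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); the sign gives the turn direction
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last point of each chain is the first point of the other one
    return lower[:-1] + upper[:-1]


def shrink_to_children(segment):
    """Replace a segment's polygon by the hull of its children, bottom-up.

    `segment` is a hypothetical dict: {"polygon": [(x, y), ...], "children": [...]}.
    Segments without children keep their outline unchanged.
    """
    children = segment.get("children", [])
    if not children:
        return segment["polygon"]
    points = []
    for child in children:
        points.extend(shrink_to_children(child))
    segment["polygon"] = convex_hull(points)
    return segment["polygon"]
```

In this toy model, a region whose bounding box spans the whole page but whose words only cover a corner ends up with a polygon hugging those words, which is the effect `shrink_polygons` is after.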
@codecov

codecov bot commented Dec 5, 2020

Codecov Report

Merging #162 (9e1c4c2) into master (056d30d) will decrease coverage by 9.40%.
The diff coverage is 19.49%.

@@            Coverage Diff             @@
##           master     #162      +/-   ##
==========================================
- Coverage   40.58%   31.18%   -9.41%     
==========================================
  Files          11       11              
  Lines        1126     1209      +83     
  Branches      236      277      +41     
==========================================
- Hits          457      377      -80     
- Misses        585      758     +173     
+ Partials       84       74      -10     
| Impacted Files | Coverage Δ | |
|---|---|---|
| ocrd_tesserocr/segment.py | 37.50% <0.00%> (-1.64%) | ⬇️ |
| ocrd_tesserocr/segment_table.py | 36.00% <0.00%> (ø) | |
| ocrd_tesserocr/recognize.py | 31.13% <16.96%> (-17.16%) | ⬇️ |
| ocrd_tesserocr/segment_line.py | 96.00% <100.00%> (ø) | |
| ocrd_tesserocr/segment_region.py | 96.29% <100.00%> (ø) | |
| ocrd_tesserocr/segment_word.py | 96.00% <100.00%> (ø) | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bertsky
Collaborator Author

bertsky commented Dec 5, 2020

Note that the big downside of implementing shrink_polygons as a post-processor iterating through the PAGE hierarchy, instead of directly through the Tesseract hierarchy, is that you only get actual polygons if you use multi-level segmentation (e.g. regions and lines and words). But in that case, you will already avoid conflicting/overlapping lines. Where shrinking would be most needed, however, is when (for some reason) you can only employ Tesseract for a single level, e.g. ocrd-tesserocr-segment-region or ocrd-tesserocr-segment-line.

Here's an illustration:

  • region segmentation only (nothing to shrink to; only bounding boxes)
    [image: tesserocr-polygons2-block]
  • region and line segmentation (shrinks regions to lines; coarse polygons)
    [image: tesserocr-polygons2-line]
  • region, line and word segmentation (shrinks lines to words, then regions to lines; better polygons)
    [image: tesserocr-polygons2-word]
  • region, line, word and glyph segmentation (shrinks words to glyphs, lines to words, then regions to lines; tight polygons)
    [image: tesserocr-polygons2]

To get tight polygons when segmenting regions only, going top-down instead of bottom-up, we would have to query the Tesseract result iterator down to the glyph level for each block and get the hull of all segments, but then reset the iterator to the start of the block. Unfortunately, Tesseract's API does not contain a reset function for the block level, only RestartParagraph, RestartRow, BeginWord. And since iterators are CPython objects, we also cannot just make a copy of them in another state...

Instead of post-processing on the PAGE hierarchy, get
convex hull polygon from all constituent symbols/glyphs
(i.e. the lowest level) on the Tesseract iterator hierarchy
directly.
Thus, segments will have tight polygons not only when
processing across multiple levels, but also when annotating
results for a single PAGE level. So the more granular
results from Tesseract will at least be used implicitly.

Also exposes this option as parameter to the other processors.
@bertsky
Collaborator Author

bertsky commented Dec 6, 2020

> To get tight polygons when segmenting regions only, going top-down instead of bottom-up, we would have to query the Tesseract result iterator down to the glyph level for each block and get the hull of all segments, but then reset the iterator to the start of the block. Unfortunately, Tesseract's API does not contain a reset function for the block level, only RestartParagraph, RestartRow, BeginWord. And since iterators are CPython objects, we also cannot just make a copy of them in another state...

Good news: I did find a solution for this. I emulated a RestartBlock (not defined in Tesseract) and BeginWord (not exposed in tesserocr) by simply resetting the iterator to the next higher level and moving forward to the current position again. Instead of post-processing PAGE by projecting upwards from level to level, it now always takes the hull of the lowest PageIterator level (symbols) directly.
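
To illustrate the "reset and replay" idea, here is a rough sketch on a toy iterator. `ToyIterator` and all of its methods are hypothetical stand-ins, not the tesserocr or Tesseract API; the only point is that a missing "restart this block" can be emulated by rewinding one level higher (here: the page start) and stepping forward to the remembered block again. The sketch assumes every block contains at least one symbol.

```python
class ToyIterator:
    """Hypothetical forward-only iterator over blocks of symbols."""

    def __init__(self, blocks):
        self.blocks = blocks  # list of non-empty lists of symbols
        self.b = 0            # index of the current block
        self.s = 0            # index of the current symbol inside the block

    def begin(self):
        """Rewind to the first symbol of the first block."""
        self.b, self.s = 0, 0

    def next_block(self):
        """Move to the start of the next block; False if there is none."""
        if self.b + 1 >= len(self.blocks):
            return False
        self.b, self.s = self.b + 1, 0
        return True

    def next_symbol(self):
        """Move to the next symbol, crossing into the next block if needed."""
        if self.s + 1 < len(self.blocks[self.b]):
            self.s += 1
            return True
        return self.next_block()

    def at_last_symbol_of_block(self):
        return self.s == len(self.blocks[self.b]) - 1

    def symbol(self):
        return self.blocks[self.b][self.s]


def restart_block(it, block_index):
    """Emulated restart: rewind to the page start, then replay block steps."""
    it.begin()
    for _ in range(block_index):
        it.next_block()


def symbols_per_block(it):
    """Collect all symbols of each block, restoring the block start afterwards."""
    it.begin()
    block_index = 0
    result = []
    while True:
        symbols = [it.symbol()]
        while not it.at_last_symbol_of_block():
            it.next_symbol()
            symbols.append(it.symbol())
        # back to the block start, so block-level queries would still work here
        restart_block(it, block_index)
        result.append(symbols)
        if not it.next_block():
            break
        block_index += 1
    return result
```

The replay costs one pass over the preceding blocks per restart, which is the price of not being able to copy the iterator state.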

@bertsky
Collaborator Author

bertsky commented Dec 6, 2020

The bad news is that I also made a big conceptual mistake in #158: My parameterization segmentation_level for segmentation entry point and textequiv_level for segmentation exit point – as well as recognition oplevel – (with none for no recognition but maximal segmentation) makes it impossible to get single-level segmentation without recognition. So for example, ocrd-tesserocr-segment-region suddenly gives you lines, words and glyphs, but ocrd-tesserocr-recognize -P segmentation_level region -P textequiv_level region computes OCR and annotates text. Sorry about that!

Here's an idea how to fix this (without throwing it all away): We could separate the question of whether to do only segmentation (AnalyseLayout()) or segmentation plus recognition (Recognize()) from the textequiv_level parameter, and let the model parameter control this implicitly: If model is empty, then only segmentation, otherwise segmentation+recognition. (AFAIK the layout analysis can live with any model, so we could keep the current default, which loads the last installed model, or take eng.) So:

  • ocrd-tesserocr-segment-region would use textequiv_level='region' model=None
  • ocrd-tesserocr-segment-line would use textequiv_level='line' model=None
  • ocrd-tesserocr-segment-word would use textequiv_level='word' model=None
  • ocrd-tesserocr-segment would use textequiv_level='word' model=None
  • ocrd-tesserocr-recognize would use textequiv_level='word' model=last by default or whatever you told it to

Opinions?

@bertsky
Collaborator Author

bertsky commented Dec 7, 2020

We should round this off before merging 0.10 in ocrd_all and adapting all documentation. @kba @stweil @wrznr

@kba
Member

kba commented Dec 7, 2020

> The bad news is that I also made a big conceptual mistake in #158: My parameterization segmentation_level for segmentation entry point and textequiv_level for segmentation exit point – as well as recognition oplevel – (with none for no recognition but maximal segmentation) makes it impossible to get single-level segmentation without recognition. So for example, ocrd-tesserocr-segment-region suddenly gives you lines, words and glyphs, but ocrd-tesserocr-recognize -P segmentation_level region -P textequiv_level region computes OCR and annotates text. Sorry about that!

Oops. But good that you noticed before ocrd_all merge/documentation update.

> Here's an idea how to fix this (without throwing it all away): We could separate the question of whether to do only segmentation (AnalyseLayout()) or segmentation plus recognition (Recognize()) from the textequiv_level parameter, and let the model parameter control this implicitly: If model is empty, then only segmentation, otherwise segmentation+recognition. (AFAIK the layout analysis can live with any model, so we could keep the current default, which loads the last installed model, or take eng.) So:
>
>   • ocrd-tesserocr-segment-region would use textequiv_level='region' model=None
>   • ocrd-tesserocr-segment-line would use textequiv_level='line' model=None
>   • ocrd-tesserocr-segment-word would use textequiv_level='word' model=None
>   • ocrd-tesserocr-segment would use textequiv_level='word' model=None
>   • ocrd-tesserocr-recognize would use textequiv_level='word' model=last by default or whatever you told it to
>
> Opinions?

Makes sense. If users want text recognition they will also explicitly provide the model(s) they want to use for recognition, so this is a reasonable convention. From my side, I'd be content with this segmentation_level/textequiv_level/model logic.

- recognize: do not attempt `Recognize()` (only `AnalyseLayout()`)
  or modify `TextEquiv` if `model` parameter is empty (just like for
  `textequiv_level==none`, except the latter still attempts maximal
  segmentation)
- segment-{region,line,word}: Apply single-level segmentation (i.e.
  `textequiv_level!=none`) without recognition (i.e. empty `model`)
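
The resulting parameter logic can be summarised in a few lines. This is a simplified model of the decision as discussed in this thread, not the processor's actual code: the function `plan` and its return keys are made up for illustration, only the four levels named above are modelled, and cases where `segmentation_level` lies below `textequiv_level` are outside its scope.

```python
LEVELS = ["region", "line", "word", "glyph"]


def plan(segmentation_level, textequiv_level, model):
    """Decide which Tesseract call to make and which levels to segment.

    - empty `model`: layout analysis only, no text is annotated
    - `textequiv_level == "none"`: no recognition, but maximal segmentation
    - otherwise: segment from `segmentation_level` down to `textequiv_level`
      and recognize text at `textequiv_level`
    """
    deepest = "glyph" if textequiv_level == "none" else textequiv_level
    recognize = bool(model) and textequiv_level != "none"
    first = LEVELS.index(segmentation_level)
    last = LEVELS.index(deepest)
    return {
        "api_call": "Recognize" if recognize else "AnalyseLayout",
        "segment_levels": LEVELS[first:last + 1],
        "annotate_text": recognize,
    }
```

For example, `plan("region", "region", "")` models ocrd-tesserocr-segment-region (layout analysis on the region level only), while `plan("region", "word", "eng")` models a recognize run that segments regions, lines and words and annotates text.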
@bertsky
Collaborator Author

bertsky commented Dec 7, 2020

> Makes sense. If users want text recognition they will also explicitly provide the model(s) they want to use for recognition, so this is a reasonable convention. From my side, I'd be content with this segmentation_level/textequiv_level/model logic.

See d1ffd70. I still kept the textequiv_level='none' logic as an alternative.

Tesseract's `PageIterator` / `ResultIterator` navigation has 2 bugs
which make these iterators hardly usable: they are not equipped to cleanly
handle the case when
- a non-text block is entered at the PARA/TEXTLINE/WORD level, or
- an empty word (only rejections) is entered at the SYMBOL level.
In particular, `IsAtFinalElement` and `Empty` prematurely signal
True at the lower (but not higher!) levels, the latter w.r.t. the
current and the former w.r.t. the next segment.
Using the API in a naïve / straightforward way would cause all
follow-up segments to be lost, respectively.
This contains a workaround on the API level.
@@ -72,7 +73,7 @@ class TesserocrRecognize(Processor):

    def __init__(self, *args, **kwargs):
        kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]
        kwargs['version'] = OCRD_TOOL['version']
Member

Good idea! We should probably also do that for ocrd_calamari

@kba
Member

kba commented Dec 10, 2020

I've tested this against a few examples from our GT and did not notice any problems with shrink_polygons, even noticed an error in the GT:

[image]

So from my point of view, this can be merged and we can update the documentation and ocrd_all.

@bertsky bertsky merged commit a5e8b84 into OCR-D:master Dec 10, 2020