Polygonalize segments by shrinking to children #162

bertsky · 2020-12-05T10:34:51Z

This implements an idea I had a long time ago to at least get some polygons out of Tesseract, reducing the large overlap between bounding boxes. If you set shrink_polygons, then segment* and recognize will post-process their segmentation by projecting the convex hull upwards from textequiv_level to segmentation_level.

For example, ocrd-tesserocr-segment now yields:

After segmentation is done for the page, enter the (most granular) lowest hierarchy level which has been processed (i.e. textequiv_level) and project the hull of its constituent outlines upwards to the highest (least granular) hierarchy level on which segments have been added (i.e. segmentation_level). In effect, segments will have tight polygons instead of coarse bounding boxes, with fewer unnecessary overlaps between neighbours.
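To make the bottom-up projection concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the `Segment` class, `convex_hull` and `shrink_to_children` are hypothetical names for illustration, and the hull is computed with Andrew's monotone chain rather than whatever geometry library the processors actually use.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int]


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain: return the hull vertices in order around the hull."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> int:
        # z-component of (a - o) x (b - o); <= 0 means no left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last point of each chain is the first point of the other one
    return lower[:-1] + upper[:-1]


class Segment:
    """Hypothetical stand-in for a PAGE segment (region, line, word or glyph)."""

    def __init__(self, outline: List[Point], children: Optional[List["Segment"]] = None):
        self.outline = outline
        self.children = children or []


def shrink_to_children(segment: Segment) -> List[Point]:
    """Replace each parent's outline by the hull of its children's outlines, bottom-up."""
    if not segment.children:
        return segment.outline  # lowest level: nothing to shrink to
    points: List[Point] = []
    for child in segment.children:
        points.extend(shrink_to_children(child))
    segment.outline = convex_hull(points)
    return segment.outline


# A line whose bounding box is much larger than its two staggered words
words = [Segment([(0, 0), (40, 0), (40, 20), (0, 20)]),
         Segment([(60, 20), (100, 20), (100, 40), (60, 40)])]
line = Segment([(0, 0), (100, 0), (100, 40), (0, 40)], words)
shrink_to_children(line)
print(line.outline)
```

In this toy case the empty corners of the line's bounding box, (100, 0) and (0, 40), are no longer part of the outline, which is the kind of overlap reduction the option is aiming for.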

codecov · 2020-12-05T10:37:22Z

Codecov Report

Merging #162 (9e1c4c2) into master (056d30d) will decrease coverage by 9.40%.
The diff coverage is 19.49%.

@@            Coverage Diff             @@
##           master     #162      +/-   ##
==========================================
- Coverage   40.58%   31.18%   -9.41%     
==========================================
  Files          11       11              
  Lines        1126     1209      +83     
  Branches      236      277      +41     
==========================================
- Hits          457      377      -80     
- Misses        585      758     +173     
+ Partials       84       74      -10

| Impacted Files | Coverage Δ | |
|---|---|---|
| ocrd_tesserocr/segment.py | `37.50% <0.00%> (-1.64%)` | ⬇️ |
| ocrd_tesserocr/segment_table.py | `36.00% <0.00%> (ø)` | |
| ocrd_tesserocr/recognize.py | `31.13% <16.96%> (-17.16%)` | ⬇️ |
| ocrd_tesserocr/segment_line.py | `96.00% <100.00%> (ø)` | |
| ocrd_tesserocr/segment_region.py | `96.29% <100.00%> (ø)` | |
| ocrd_tesserocr/segment_word.py | `96.00% <100.00%> (ø)` | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 056d30d...9e1c4c2. Read the comment docs.

bertsky · 2020-12-05T16:40:16Z

Note that the big downside of implementing shrink_polygons as a post-processor iterating through the PAGE hierarchy, instead of directly through the Tesseract hierarchy, is that you only get actual polygons if you use multi-level segmentation (e.g. regions and lines and words). But in that case, you will already avoid conflicting/overlapping lines. Shrinking would be needed most, however, when (for some reason) you can only employ Tesseract for a single level, e.g. ocrd-tesserocr-segment-region or ocrd-tesserocr-segment-line.

Here's an illustration:

- region segmentation only (nothing to shrink to; only bounding boxes)
- region and line segmentation (shrinks regions to lines; coarse polygons)
- region, line and word segmentation (shrinks lines to words, then regions to lines; better polygons)
- region, line, word and glyph segmentation (shrinks words to glyphs, lines to words, then regions to lines; tight polygons)

To get tight polygons when segmenting regions only, going top-down instead of bottom-up, we would have to query the Tesseract result iterator down to the glyph level for each block and get the hull of all segments, but then reset the iterator to the start of the block. Unfortunately, Tesseract's API does not contain a reset function for the block level, only RestartParagraph, RestartRow, BeginWord. And since iterators are CPython objects, we also cannot just make a copy of them in another state...

Instead of post-processing on the PAGE hierarchy, get the convex hull polygon from all constituent symbols/glyphs (i.e. the lowest level) on the Tesseract iterator hierarchy directly. Thus, segments will have tight polygons not only when processing across multiple levels, but also when annotating results for a single PAGE level. So the more granular results from Tesseract will at least be used implicitly. Also exposes this option as parameter to the other processors.

bertsky · 2020-12-06T20:36:09Z

> To get tight polygons when segmenting regions only, going top-down instead of bottom-up, we would have to query the Tesseract result iterator down to the glyph level for each block and get the hull of all segments, but then reset the iterator to the start of the block. Unfortunately, Tesseract's API does not contain a reset function for the block level, only RestartParagraph, RestartRow, BeginWord. And since iterators are CPython objects, we also cannot just make a copy of them in another state...

Good news: I did find a solution for this. I emulated a RestartBlock (not defined in Tesseract) and BeginWord (not exposed in tesserocr) by simply resetting the iterator to the next higher level and moving forward to the current position again. Instead of post-processing PAGE projecting from level to level upwards, it now always takes the hull of the lowest PageIterator level (symbols) directly.
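The reset-and-replay trick can be sketched with a toy iterator. Everything here is hypothetical (`ToyIterator`, its methods and `symbols_of_block` are not tesserocr or Tesseract API, and the sketch assumes every block has at least one symbol); it only shows the idea of descending to the lowest level, then resetting one level up and stepping forward to where we were.

```python
class ToyIterator:
    """Hypothetical cursor over symbols grouped into blocks; it has no 'restart block'."""

    def __init__(self, blocks):
        self.blocks = blocks  # list of non-empty lists of symbols
        self.block = 0
        self.symbol = 0

    def begin(self):
        """Reset to the first symbol of the page (the only reset one level up)."""
        self.block, self.symbol = 0, 0

    def next_block(self):
        """Move to the first symbol of the next block; False if there is none."""
        if self.block + 1 >= len(self.blocks):
            return False
        self.block, self.symbol = self.block + 1, 0
        return True

    def next_symbol(self):
        """Move one symbol forward within the current block; False at its end."""
        if self.symbol + 1 >= len(self.blocks[self.block]):
            return False
        self.symbol += 1
        return True

    def current(self):
        return self.blocks[self.block][self.symbol]


def symbols_of_block(it, block_index):
    """Collect all symbols of the current block, then put the iterator back at its start."""
    symbols = [it.current()]
    while it.next_symbol():
        symbols.append(it.current())
    # Emulated "restart block": reset at the higher level, then replay the steps
    it.begin()
    for _ in range(block_index):
        it.next_block()
    return symbols


it = ToyIterator([["a", "b"], ["c", "d", "e"]])
index = 0
visited = []
while True:
    # descend to the symbols (e.g. to build a hull), ending up at the block start again
    visited.append((index, symbols_of_block(it, index), it.current()))
    if not it.next_block():
        break
    index += 1
print(visited)
```

The replay costs one extra pass over the preceding blocks per block, which is the price for not having a native reset at that level.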

bertsky · 2020-12-06T20:51:25Z

The bad news is that I also made a big conceptual mistake in #158: My parameterization segmentation_level for segmentation entry point and textequiv_level for segmentation exit point – as well as recognition oplevel – (with none for no recognition but maximal segmentation) makes it impossible to get single-level segmentation without recognition. So for example, ocrd-tesserocr-segment-region suddenly gives you lines, words and glyphs, but ocrd-tesserocr-recognize -P segmentation_level region -P textequiv_level region computes OCR and annotates text. Sorry about that!

Here's an idea how to fix this (without throwing it all away): We could separate the question of whether to do only segmentation (AnalyseLayout()) or segmentation plus recognition (Recognize()) from the textequiv_level parameter, and let the model parameter control this implicitly: If model is empty, then only segmentation, otherwise segmentation+recognition. (AFAIK the layout analysis can live with any model, so we could keep the current default, which loads the last installed model, or take eng.) So:

- ocrd-tesserocr-segment-region would use textequiv_level='region' model=None
- ocrd-tesserocr-segment-line would use textequiv_level='line' model=None
- ocrd-tesserocr-segment-word would use textequiv_level='word' model=None
- ocrd-tesserocr-segment would use textequiv_level='word' model=None
- ocrd-tesserocr-recognize would use textequiv_level='word' model=last by default or whatever you told it to

Opinions?

bertsky · 2020-12-07T10:56:21Z

We should round this off before merging 0.10 in ocrd_all and adapting all documentation. @kba @stweil @wrznr

kba · 2020-12-07T11:10:59Z

> The bad news is that I also made a big conceptual mistake in #158: My parameterization segmentation_level for segmentation entry point and textequiv_level for segmentation exit point – as well as recognition oplevel – (with none for no recognition but maximal segmentation) makes it impossible to get single-level segmentation without recognition. So for example, ocrd-tesserocr-segment-region suddenly gives you lines, words and glyphs, but ocrd-tesserocr-recognize -P segmentation_level region -P textequiv_level region computes OCR and annotates text. Sorry about that!

Oops. But good that you noticed before ocrd_all merge/documentation update.

> Here's an idea how to fix this (without throwing it all away): We could separate the question of whether to do only segmentation (AnalyseLayout()) or segmentation plus recognition (Recognize()) from the textequiv_level parameter, and let the model parameter control this implicitly: If model is empty, then only segmentation, otherwise segmentation+recognition. (AFAIK the layout analysis can live with any model, so we could keep the current default, which loads the last installed model, or take eng.) So:
>
> - ocrd-tesserocr-segment-region would use textequiv_level='region' model=None
> - ocrd-tesserocr-segment-line would use textequiv_level='line' model=None
> - ocrd-tesserocr-segment-word would use textequiv_level='word' model=None
> - ocrd-tesserocr-segment would use textequiv_level='word' model=None
> - ocrd-tesserocr-recognize would use textequiv_level='word' model=last by default or whatever you told it to
>
> Opinions?

Makes sense. If users want text recognition they will also explicitly provide the model(s) they want to use for recognition, so this is a reasonable convention. From my side, I'd be content with this segmentation_level/textequiv_level/model logic.

- recognize: do not attempt `Recognize()` (only `AnalyseLayout()`) or modify `TextEquiv` if `model` parameter is empty (just like for `textequiv_level==none`, except the latter still attempts maximal segmentation)
- segment-{region,line,word}: Apply single-level segmentation (i.e. `textequiv_level!=none`) without recognition (i.e. empty `model`)
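The convention agreed on above can be summarised as a small decision function. This is only a sketch of the logic as described in this thread, not the processor's code: `plan_run` and its return keys are hypothetical names, and treating `glyph` as the "maximal" exit level for `textequiv_level == 'none'` is an assumption.

```python
def plan_run(textequiv_level: str, model: str = "") -> dict:
    """Decide what a run would do, following the convention discussed in this thread."""
    # Recognition only happens with a non-empty model and a real textequiv_level
    recognize = bool(model) and textequiv_level != "none"
    return {
        # Tesseract API entry point named in the discussion
        "api_call": "Recognize" if recognize else "AnalyseLayout",
        # TextEquiv is only touched when recognition runs
        "annotate_text": recognize,
        # 'none' keeps its old meaning: segment as deep as possible (assumed: glyph)
        "exit_level": "glyph" if textequiv_level == "none" else textequiv_level,
    }


# segment-region style call: single-level segmentation, no text
print(plan_run("region"))
# recognize style call with an explicit model
print(plan_run("word", model="eng"))
# alternative kept from before: no recognition, maximal segmentation
print(plan_run("none", model="eng"))
```

Keeping `textequiv_level='none'` as a separate case is what lets "no text, but segment everything" coexist with "no text, stop at this level".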

bertsky · 2020-12-07T13:31:06Z

> Makes sense. If users want text recognition they will also explicitly provide the model(s) they want to use for recognition, so this is a reasonable convention. From my side, I'd be content with this segmentation_level/textequiv_level/model logic.

See d1ffd70. I still kept the textequiv_level='none' logic as an alternative.

Tesseract's `PageIterator` / `ResultIterator` navigation has 2 bugs which make them hardly usable: they are not equipped to cleanly handle the case when
- a non-text block is entered at the PARA/TEXTLINE/WORD level, or
- an empty word (only rejections) is entered at the SYMBOL level.

In particular, `IsAtFinalElement` and `Empty` prematurely signal True at the lower (but not higher!) levels, the latter w.r.t. the current and the former w.r.t. the next segment. Using the API in a naïve / straightforward way would cause all follow-up segments to be lost in either case. This contains a workaround on the API level.

kba · 2020-12-10T11:37:04Z

ocrd_tesserocr/recognize.py

@@ -72,7 +73,7 @@ class TesserocrRecognize(Processor):

    def __init__(self, *args, **kwargs):
        kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]
-        kwargs['version'] = OCRD_TOOL['version']


Good idea! We should probably also do that for ocrd_calamari

kba · 2020-12-10T14:13:24Z

I've tested this against a few examples from our GT and did not notice any problems with shrink_polygons; I even noticed an error in the GT:

So from my point of view, this can be merged and we can update the documentation and ocrd_all.

bertsky added 2 commits December 5, 2020 11:05

segment*/recognize: reduce minimal height of detected blocks

bf63228

segment*/recognize: docstrings/logging for shrinking

381fe76

bertsky force-pushed the polygonalize branch from 224c870 to d1ffd70 Compare December 7, 2020 13:29

bertsky added 4 commits December 8, 2020 02:50

segment*/recognize: when shrink_polygons, skip empty result iterators

c6b43e2

segment*/recognize: add Tesseract version to meta-data

aa6eae8

segment*/recognize: fix segmentation exit point for cell

505aa80

kba reviewed Dec 10, 2020

recognize: expose new parameter tesseract_parameters (dict)

9e1c4c2

bertsky merged commit a5e8b84 into OCR-D:master Dec 10, 2020

bertsky mentioned this pull request Feb 8, 2021

segment-line: annotate polygon or clipped image #127

Open
