Fix separator line detector #3082

Merged 3 commits on Aug 29, 2020

bertsky commented Aug 24, 2020

This fixes 2 minor issues with foreground line separator detection.

Here is an example (using the ALTO renderer and PageViewer for display):

  • before (misses a small vertical line and a large horizontal line):
    [image: fb2_-007-_srgb tesseract-orig]
  • after (fixed):
    [image: fb2_-007-_srgb tesseract-fixed-lineseg]

…ngth

When detecting vertical separators, the blob aligner is used to glue
line segments (often segmented due to artificial cracks).
But (unlike LineFinder) it has many parameters that are not
relative to pixel density/resolution.
This change decreases the minimum absolute length in pixels
for vertical separators.
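
To illustrate the point about parameters that are not relative to resolution: the sketch below contrasts a fixed pixel threshold with one scaled by the image resolution. It is not what this change does (the change only lowers the absolute minimum), and all names, as well as the 300 dpi reference resolution, are assumptions made for illustration, not Tesseract identifiers.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of Tesseract: converts a length that was
// tuned at a reference resolution into pixels at the actual resolution.
int ScaleLengthToResolution(int length_at_reference, int resolution_dpi,
                            int reference_dpi = 300) {
  assert(resolution_dpi > 0 && reference_dpi > 0);
  const double scaled = static_cast<double>(length_at_reference) *
                        resolution_dpi / reference_dpi;
  return static_cast<int>(std::lround(scaled));
}

// A segment qualifies as a vertical separator candidate only if its height
// reaches the minimum length after scaling to the image resolution.
bool IsLongEnoughForSeparator(int segment_height_px,
                              int min_length_at_reference,
                              int resolution_dpi) {
  return segment_height_px >=
         ScaleLengthToResolution(min_length_at_reference, resolution_dpi);
}
```

With these assumed values, a 60 px segment in a 150 dpi scan passes a minimum of 100 px defined at 300 dpi, whereas a fixed 100 px threshold would reject it.
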
…tion

When checking horizontal line partitions for
possible interpretation as underline formatting,
avoid confusing the hline partition itself with
an overlapping neighbour (which would delete it).
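
A minimal sketch of the idea, assuming a much simplified partition type: when searching for neighbours that overlap a horizontal line partition, the partition itself has to be skipped, because it trivially overlaps its own box. `Part`, `BoxesOverlap` and `FindOverlappingNeighbour` are hypothetical names for illustration; Tesseract's actual partition and grid search classes look different.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical simplified partition: an axis-aligned box plus a flag that
// marks horizontal-line partitions.
struct Part {
  int left, bottom, right, top;
  bool is_hline;
};

// True if the two boxes share a region of non-zero area.
bool BoxesOverlap(const Part& a, const Part& b) {
  return a.left < b.right && b.left < a.right &&
         a.bottom < b.top && b.bottom < a.top;
}

// Returns the index of the first partition that overlaps parts[self],
// or -1 if there is none. Without the `i == self` check, every partition
// would be reported as overlapping itself.
int FindOverlappingNeighbour(const std::vector<Part>& parts,
                             std::size_t self) {
  for (std::size_t i = 0; i < parts.size(); ++i) {
    if (i == self) continue;  // never treat the partition as its own neighbour
    if (BoxesOverlap(parts[self], parts[i])) return static_cast<int>(i);
  }
  return -1;
}
```

In this simplified model, an isolated hline yields -1 and is therefore kept, while a genuinely overlapping text partition is still found.
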

Shreeshrii commented Aug 26, 2020

@bertsky Since you are fixing line detection, can you also look into the incorrect segmentation when the script extends below the baseline? See example:

[image: 3_1L_pageview]

A zip file with sample images and hOCR and ALTO output is enclosed:
grantha.zip

bertsky changed the title from "Fix line detector" to "Fix separator line detector" on Aug 26, 2020

bertsky commented Aug 26, 2020

@Shreeshrii sorry, but this does not address textline detection directly. It's about the v/h-line (foreground separator) detection that is a preliminary to block/column detection. (I have renamed the title to make this difference more clear.)

But your case looks like merely an artefact of the layout extraction API or the ALTO representation: bboxes are too coarse here. We should try to allow retrieving polygons. There was a (slightly off-topic) discussion about this in #2971. No dedicated issue or PR yet.

Does the recognition result also degrade where line bboxes overlap? (This is currently the best indicator of the internal row structure IIUC.)
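
For checking this on a given page, a rough sketch that lists all pairs of overlapping line bounding boxes. It assumes the boxes have already been parsed from the hOCR or ALTO output into image coordinates (x0 < x1, y0 < y1, y growing downwards); `LineBox` and `OverlappingLinePairs` are made-up names, not part of any Tesseract API.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical line bounding box in image coordinates.
struct LineBox {
  int x0, y0, x1, y1;
};

// True if the two boxes share a region of non-zero area.
bool Overlaps(const LineBox& a, const LineBox& b) {
  return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Returns the index pairs (i, j) with i < j whose boxes overlap.
// Quadratic in the number of lines, which is fine for a single page.
std::vector<std::pair<std::size_t, std::size_t>> OverlappingLinePairs(
    const std::vector<LineBox>& lines) {
  std::vector<std::pair<std::size_t, std::size_t>> pairs;
  for (std::size_t i = 0; i < lines.size(); ++i) {
    for (std::size_t j = i + 1; j < lines.size(); ++j) {
      if (Overlaps(lines[i], lines[j])) pairs.emplace_back(i, j);
    }
  }
  return pairs;
}
```

The reported pairs could then be compared against the lines where the recognition result degrades.
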

Shreeshrii commented

Yes, the recognition result degrades in such cases, which has been reported earlier in other issues. I checked with PageViewer today and noticed the overlaps.

Should I open this as a separate issue?


bertsky commented Aug 26, 2020

> Should I open this as a separate issue?

Please do!

stweil merged commit 162f370 into tesseract-ocr:master on Aug 29, 2020

stweil commented Aug 29, 2020

Thank you, @bertsky.


stweil commented Oct 12, 2020

These commits break the resultiterator_test. I think that test requires an update. See issue #3122.


bertsky commented Oct 12, 2020

> These commits break the resultiterator_test. I think that test requires an update. See issue #3122.

As I have already said somewhere (I cannot find it now): sorry, I'll have a look at that unit test and make another PR as soon as I find some time.
