test of arabic_lines.c and misctest1.c #236

Shreeshrii · 2017-02-04T09:40:29Z

Ref: tesseract-ocr/tesseract#657 (comment)

I built leptonica from the latest code and ran both arabic_lines.c and misctest1.c on the sample Arabic and Devanagari pages.

The Devanagari page was split correctly by arabic_lines.c, while the Arabic page was split correctly by misctest1.c.

Devanagari line splitting using misctest1.c misses whole chunks of text.

Arabic line splitting using arabic_lines.c merged one pair of adjacent lines into a single line.

Files are attached.

Shreeshrii · 2017-02-04T09:48:33Z

Here is another sample using Telugu - another Indian language.

DanBloomberg · 2017-02-04T21:39:09Z

Yes, I am aware that the results from the two similar methods are different, and that the arabic_lines.c method joins those two lines in the first image.

These methods are not intended to be "industrial strength" line extractors; just illustrations of how binary morphology can be used to do these segmentation tasks. I believe that pixExtractTextlines() is not used in tesseract.
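As a rough illustration of the general idea (not leptonica's actual code, and not what arabic_lines.c or misctest1.c do internally), the sketch below smears a tiny binary image horizontally so that the separate character blobs on each text line fuse into one connected component. The image size, `dilate_horizontal()` and `count_components()` are hypothetical names made up for this example.

```c
#include <string.h>

#define IMG_W 12
#define IMG_H 5

/* Horizontal dilation with a flat structuring element of width 2*half + 1:
 * an output pixel is set if any source pixel within `half` columns on the
 * same row is set. Rows are never mixed, so separate lines stay separate. */
void dilate_horizontal(unsigned char src[IMG_H][IMG_W],
                       unsigned char dst[IMG_H][IMG_W], int half)
{
    for (int y = 0; y < IMG_H; y++) {
        for (int x = 0; x < IMG_W; x++) {
            dst[y][x] = 0;
            for (int dx = -half; dx <= half; dx++) {
                int xx = x + dx;
                if (xx >= 0 && xx < IMG_W && src[y][xx]) {
                    dst[y][x] = 1;
                    break;
                }
            }
        }
    }
}

/* Count 8-connected foreground components using an explicit stack.
 * Each pixel is marked when pushed, so the stack never overflows. */
int count_components(unsigned char img[IMG_H][IMG_W])
{
    unsigned char seen[IMG_H][IMG_W];
    int stack[IMG_H * IMG_W];
    int count = 0;

    memset(seen, 0, sizeof(seen));
    for (int y = 0; y < IMG_H; y++) {
        for (int x = 0; x < IMG_W; x++) {
            if (!img[y][x] || seen[y][x])
                continue;
            count++;
            int top = 0;
            stack[top++] = y * IMG_W + x;
            seen[y][x] = 1;
            while (top > 0) {
                int p = stack[--top];
                int py = p / IMG_W, px = p % IMG_W;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int ny = py + dy, nx = px + dx;
                        if (ny < 0 || ny >= IMG_H || nx < 0 || nx >= IMG_W)
                            continue;
                        if (img[ny][nx] && !seen[ny][nx]) {
                            seen[ny][nx] = 1;
                            stack[top++] = ny * IMG_W + nx;
                        }
                    }
                }
            }
        }
    }
    return count;
}
```

On an image with two rows of three small blobs each, `count_components()` reports six components before smearing and two after `dilate_horizontal(img, out, 1)`, one per text line. A real page needs far more care (skew, touching lines, diacritics above and below the baseline), which is presumably why results differ between scripts.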

With respect to the first image (Arabic, single column, with a few examples of a complex character): (1) what does it say, and (2) would there be any reason not to use this as a test file in leptonica/prog?

-- Dan

Shreeshrii · 2017-02-05T03:33:37Z

Dan, Thanks for your work on this. I am hoping it will lead to better line detection in tesseract.

I do not know Arabic. The image I used for testing was provided by @bmwmy in the issue titled 'Arabic lang. feature request', tesseract-ocr/tesseract#552.

The image is linked at https://drive.google.com/file/d/0B1JdJ8IXNweRcG1mRUpkU1dFMVJPUGV3YVlkT0JxYnBlTGhn/view?usp=sharing
and also provided as part of a trainingdata.zip file at https://drive.google.com/file/d/0B1JdJ8IXNweRX3NEMkZfX3gtdlk/view?usp=sharing

DanBloomberg · 2017-04-18T14:44:30Z

I've added another function for extracting textlines, pixExtractRawTextlines(), which is much more aggressive in joining components. It does not respect different columns of text (i.e., it aggressively joins them!)

It doesn't resolve this issue (actually, goes in the other direction), but is useful in other situations.
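To see why more aggressive joining can swallow column boundaries, here is a one-dimensional sketch. It is only an analogy under stated assumptions, not how pixExtractRawTextlines() is implemented: `count_runs_bridged()` is a hypothetical helper that counts foreground runs in a single pixel row after bridging every background gap of at most `max_gap` pixels, which behaves like a horizontal closing with a structuring element of width `max_gap + 1`.

```c
#include <stddef.h>

/* Count foreground runs in one binary row after bridging every background
 * gap of at most max_gap pixels that lies between two foreground pixels.
 * Leading and trailing background is never counted as a gap to bridge. */
int count_runs_bridged(const unsigned char *row, size_t n, size_t max_gap)
{
    int runs = 0;
    size_t gap = 0;   /* length of the current background stretch */
    int seen_fg = 0;  /* set once the first foreground pixel is found */

    for (size_t i = 0; i < n; i++) {
        if (row[i]) {
            /* Start a new run at the first foreground pixel, or when the
             * preceding gap was too wide to bridge. */
            if (!seen_fg || gap > max_gap)
                runs++;
            seen_fg = 1;
            gap = 0;
        } else {
            gap++;
        }
    }
    return runs;
}
```

For a row such as `1 1 0 1 1 0 0 0 0 0 1 1 0 1`, where the 1-pixel gaps separate characters and the 5-pixel gap separates two columns, `max_gap = 1` leaves two runs (the columns survive), while `max_gap = 5` leaves a single run that spans both columns.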

Closing.

DanBloomberg closed this as completed Apr 18, 2017

Shreeshrii mentioned this issue on Aug 15, 2018 in "Synthetical comparison with Abbyy", tesseract-ocr/tessdata#108 (open).
