test of arabic_lines.c and misctest1.c #236

Closed
Shreeshrii opened this issue Feb 4, 2017 · 4 comments
@Shreeshrii

Ref: tesseract-ocr/tesseract#657 (comment)

I built leptonica from the latest code and tried both arabic_lines.c and misctest1.c on the sample Arabic and Devanagari pages.

The Devanagari page was split correctly by arabic_lines.c, while the Arabic page was split correctly by misctest1.c.

Devanagari line splitting with misctest1.c misses whole chunks of text.

Arabic line splitting with arabic_lines.c merged one pair of lines.

Files are attached.

[Attached images: lines2, result (two sets)]

@Shreeshrii
Author

Here is another sample using Telugu, another Indian language.

[Attached images: lines2, result]

@DanBloomberg
Owner

Yes, I am aware that the results from the two similar methods are different, and that the arabic_lines.c method joins those two lines in the first image.

These methods are not intended to be "industrial strength" line extractors; just illustrations of how binary morphology can be used to do these segmentation tasks. I believe that pixExtractTextlines() is not used in tesseract.
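As a rough illustration of the binary-morphology idea (a minimal sketch in plain C, not leptonica's implementation and not what arabic_lines.c or pixExtractTextlines() actually do), a horizontal closing with a 1 x k brick fills gaps narrower than k, so the glyphs of one line fuse into a single run. The helper names `close_horizontal` and `count_runs`, the one-byte-per-pixel layout, and the edge handling are all assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Dilate (is_dilate != 0) or erode one row with a horizontal 1 x k brick.
 * Pixels are one byte each: 1 = ink, 0 = background.
 * Positions off the edge of the row are simply ignored. */
static void morph_row(const unsigned char *src, unsigned char *dst,
                      int w, int k, int is_dilate)
{
    int half = k / 2;
    for (int x = 0; x < w; x++) {
        int acc = is_dilate ? 0 : 1;
        for (int dx = -half; dx <= half; dx++) {
            int xx = x + dx;
            if (xx < 0 || xx >= w)
                continue;
            if (is_dilate)
                acc |= src[xx];   /* any ink under the brick */
            else
                acc &= src[xx];   /* all ink under the brick */
        }
        dst[x] = (unsigned char)acc;
    }
}

/* Horizontal closing (dilation, then erosion) of a w x h row-major image.
 * k is the brick width and is assumed to be odd.
 * Returns 0 on success, -1 if the scratch row cannot be allocated. */
int close_horizontal(const unsigned char *src, unsigned char *dst,
                     int w, int h, int k)
{
    assert(w > 0 && h > 0 && k >= 1 && k % 2 == 1);
    unsigned char *tmp = malloc((size_t)w);
    if (!tmp)
        return -1;
    for (int y = 0; y < h; y++) {
        morph_row(src + (size_t)y * w, tmp, w, k, 1);
        morph_row(tmp, dst + (size_t)y * w, w, k, 0);
    }
    free(tmp);
    return 0;
}

/* Count maximal horizontal runs of ink in a single row. */
int count_runs(const unsigned char *row, int w)
{
    int runs = 0, prev = 0;
    for (int x = 0; x < w; x++) {
        if (row[x] && !prev)
            runs++;
        prev = row[x];
    }
    return runs;
}
```

The brick width is the sensitive choice. On a row holding two groups of glyphs with 2-pixel gaps inside each group and a 6-pixel gap between the groups, k = 3 leaves two runs, while k = 7 also bridges the wider gap and leaves one. A structuring element tuned for one script or layout can therefore over-join or under-join on another.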

With respect to the first image (Arabic, single column, with a few examples of a complex character): (1) what does it say, and (2) would there be any reason not to use this as a test file in leptonica/prog?

-- Dan

@Shreeshrii
Author

Shreeshrii commented Feb 5, 2017

Dan, thanks for your work on this. I am hoping it will lead to better line detection in tesseract.

I do not know Arabic. The image I used for testing was provided by @bmwmy in the issue titled 'Arabic lang. feature request', tesseract-ocr/tesseract#552.

The image is linked at https://drive.google.com/file/d/0B1JdJ8IXNweRcG1mRUpkU1dFMVJPUGV3YVlkT0JxYnBlTGhn/view?usp=sharing
and also provided as part of a trainingdata.zip file at https://drive.google.com/file/d/0B1JdJ8IXNweRX3NEMkZfX3gtdlk/view?usp=sharing

@DanBloomberg
Owner

I've added another function for extracting textlines, pixExtractRawTextlines(), which is much more aggressive in joining components. It does not respect different columns of text (i.e., it aggressively joins them!).

It doesn't resolve this issue (it actually goes in the other direction), but it is useful in other situations.

Closing.
