test of arabic_lines.c and misctest1.c #236
Comments
Yes, I am aware that the results from the two similar methods are different, and that the arabic_lines.c method joins those two lines in the first image. These methods are not intended to be "industrial strength" line extractors; they are just illustrations of how binary morphology can be used to do these segmentation tasks. I believe that pixExtractTextlines() is not used in tesseract. With respect to the first image (Arabic, single column, with a few examples of a complex character): (1) what does it say, and (2) would there be any reason not to use this as a test file in leptonica/prog? -- Dan
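To make the binary-morphology idea concrete, here is a minimal sketch in plain C. It does not use leptonica and is not the method in arabic_lines.c, misctest1.c, or pixExtractTextlines(). It assumes a tiny hand-made bitmap, and every name in it (`Image`, `dilate_h`, `erode_h`, `close_h`, `count_components`, `lines_after_closing`, `SAMPLE_PAGE`) is hypothetical. It closes the image horizontally with a 1 x (2r+1) structuring element so that characters on the same line fuse, then counts 8-connected components as candidate text lines.

```c
#include <assert.h>
#include <string.h>

enum { W = 24, H = 5 };              /* fixed size of the toy bitmap */
typedef unsigned char Image[H][W];   /* 1 = foreground (ink), 0 = background */

/* Horizontal dilation: a pixel turns ON if any pixel within r columns is ON. */
static void dilate_h(Image src, Image dst, int r)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            dst[y][x] = 0;
            for (int dx = -r; dx <= r; dx++) {
                int xx = x + dx;
                if (xx >= 0 && xx < W && src[y][xx]) { dst[y][x] = 1; break; }
            }
        }
}

/* Horizontal erosion: a pixel stays ON only if every in-range pixel within
 * r columns is ON. Out-of-range neighbours are ignored; the sample bitmap
 * has margins wider than r, so the boundary convention does not matter here. */
static void erode_h(Image src, Image dst, int r)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            dst[y][x] = 1;
            for (int dx = -r; dx <= r; dx++) {
                int xx = x + dx;
                if (xx >= 0 && xx < W && !src[y][xx]) { dst[y][x] = 0; break; }
            }
        }
}

/* Horizontal closing = dilation followed by erosion; fills gaps up to 2r wide. */
static void close_h(Image src, Image dst, int r)
{
    Image tmp;
    dilate_h(src, tmp, r);
    erode_h(tmp, dst, r);
}

/* Count 8-connected foreground components with an explicit-stack flood fill. */
static int count_components(Image img)
{
    Image work;
    int stack[W * H];
    int count = 0;
    memcpy(work, img, sizeof(Image));
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            if (!work[y][x]) continue;
            int top = 0;
            count++;
            work[y][x] = 0;                 /* clear on push: each pixel is pushed once */
            stack[top++] = y * W + x;
            while (top > 0) {
                int p = stack[--top];
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++) {
                        int ny = p / W + dy, nx = p % W + dx;
                        if (ny < 0 || ny >= H || nx < 0 || nx >= W) continue;
                        if (!work[ny][nx]) continue;
                        work[ny][nx] = 0;
                        stack[top++] = ny * W + nx;
                    }
            }
        }
    return count;
}

/* Parse H rows of '0'/'1' characters, close horizontally, count components. */
static int lines_after_closing(const char *rows[], int r)
{
    Image src, closed;
    for (int y = 0; y < H; y++) {
        assert(strlen(rows[y]) == (size_t)W);
        for (int x = 0; x < W; x++)
            src[y][x] = (rows[y][x] == '1');
    }
    close_h(src, closed, r);
    return count_components(closed);
}

/* Two text lines in two columns: 1-pixel gaps between "characters",
 * a 6-pixel gutter between the columns. */
static const char *SAMPLE_PAGE[H] = {
    "000000000000000000000000",
    "000011011000000110110000",
    "000000000000000000000000",
    "000011011000000110110000",
    "000000000000000000000000",
};
```

On this toy page, `lines_after_closing(SAMPLE_PAGE, 1)` fuses the characters within each column but leaves the gutter open, while `lines_after_closing(SAMPLE_PAGE, 3)` bridges the gutter as well. The width of the structuring element is therefore the trade-off: too narrow and a line fragments, too wide and separate columns are joined.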
Dan, thanks for your work on this. I am hoping it will lead to better line detection in tesseract. I do not know Arabic. The image I used for testing was provided by @bmwmy in the issue titled 'Arabic lang. feature request', tesseract-ocr/tesseract#552. The image is linked at https://drive.google.com/file/d/0B1JdJ8IXNweRcG1mRUpkU1dFMVJPUGV3YVlkT0JxYnBlTGhn/view?usp=sharing
I've added another function for extracting textlines, pixExtractRawTextlines(), which is much more aggressive in joining components. It does not respect different columns of text (i.e., it aggressively joins them!). It doesn't resolve this issue (it actually goes in the other direction), but it is useful in other situations. Closing.
Ref: tesseract-ocr/tesseract#657 (comment)
I built leptonica with the latest code and tried both arabic_lines.c and misctest1.c on the sample Arabic and Devanagari pages.
The Devanagari page was split correctly by arabic_lines.c, while the Arabic page was split correctly by misctest1.c.
Devanagari line splitting with misctest1.c misses whole chunks of text.
Arabic line splitting with arabic_lines.c merged one pair of lines into one.
Files are attached.