Numbers in Arabic script are getting reversed #2263
Comments
The following is the recognition with the finetuned traineddata: الجفا . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ٢٧٨ |
For reference, here is the recognition with the official traineddata.
tessdata_fast الباب الحادي عشر : مطرّقات ...............2.2.2.2.2.2.2.2.2.222.2.2.2.2.2.2.... لالالم
tessdata_best الجا اال ااا ااا ...مر .ممم ...م.م تتم م.م تم .متي الالامْ |
This is not the case with all RTL languages. Hebrew numbers are recognized correctly. רַאש.ונָה ראשון אַחַת אֶחָד |
@jbreiden Can you please check whether Arabic TOC image is recognized correctly at Google? Thanks! |
What was the source for finetuning? Rendered text images via text2image or 'real' images? |
Hebrew uses 0123456789. What you have in the image is words, not numbers:
first (female form), first (male), one (f), one (m). Here is an example in Hebrew:
He was born in 1962 in Haifa |
text2image via tesstrain.sh |
OK. So then this kind of issue will not apply. EDIT: Here is a test for Hebrew using a cropped section from the image for issue #2207. The numbers are recognized correctly, except for the corner case where a line begins with a number (28 is recognized as 8). יתקיים ביום ראשון יייט במרחשון תשע'יט, |
@Shreeshrii |
I am still experimenting with finetuning. You can get the traineddata files from https://github.com/Shreeshrii/tessdata_arabic Note: the training_texts have not been updated in the repo yet. I have used numerals in both Arabic and English scripts, added Arabic punctuation, and added a few lines in the format of the Table of Contents. The training text is about 5000 lines, the eval text is approx. 500 lines, and I am doing plus-minus training using I finetuned with only one font at a time, so the latest files are
On my random eval set the error rate is 3-4%. However, as noted in this issue, the numerals are in reverse order.
I am now testing finetuning with multiple fonts. |
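Until the reversed order is fixed in the engine or the traineddata, the output can be patched after recognition. The following is only a stopgap sketch, not a fix in Tesseract itself: it assumes that each run of Arabic-Indic digits (U+0660 to U+0669) comes out exactly reversed and that the surrounding text is otherwise in correct logical order. The function name `reverse_digit_runs` is made up for this illustration.

```python
import re

# Runs of Arabic-Indic digits, U+0660 (zero) through U+0669 (nine).
ARABIC_INDIC_RUN = re.compile(r"[\u0660-\u0669]+")


def reverse_digit_runs(text: str) -> str:
    """Reverse every run of Arabic-Indic digits, leaving all other text untouched.

    Assumption: the OCR output has each digit run in reversed logical order.
    Numbers split by separators (for example a decimal mark) are treated as
    separate runs, so this will not repair them correctly.
    """
    return ARABIC_INDIC_RUN.sub(lambda m: m.group(0)[::-1], text)


if __name__ == "__main__":
    # Escapes are used so the example does not depend on bidi rendering.
    ocr_output = "\u0627\u0644\u0628\u0627\u0628 \u0668\u0667\u0662"
    print(reverse_digit_runs(ocr_output))
```

Latin digits and Arabic letters pass through unchanged, so the function is safe to apply to mixed lines such as table-of-contents entries.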
As a test, here is another TOC from an Arabic document with numbers in Latin script (image taken from https://tex.stackexchange.com/questions/213222/chapter-numbering-in-table-of-contents ). These seem to be recognized correctly with the finetuned traineddata. كلمة المتزجم
tessdata_best كلمة المترجم |
Fixed via #2270. Here is the display of the OCRed output in notepad++ in RTL view. The original image is linked at #2263 (comment) |
See #2270 (comment) for links to test image with numbers at beginning, middle and end of line and OCR results. Thanks @amitdo for reviewing. |
@Shreeshrii |
I would like to learn from you how to use the finetuned data to:
1. recognize Arabic punctuation
2. recognize Arabic-Indic digits
3. work on including more fonts
|
Please see
https://github.com/tesseract-ocr/tesstrain/wiki/Arabic-Handwriting
tesseract-ocr/tesstrain#176
and
tesseract-ocr/tesstrain#128
https://github.com/Shreeshrii/tesstrain-arabic-GS
It's been almost a year since I did that training, so I suggest that you
try with a small training set to resolve the issues with punctuation and
Arabic-Indic digits.
On Thu, Nov 12, 2020 at 6:57 PM mohdbm wrote:
so I would like to get from you the knowledge on how to use the finetuned
data to:
1. recognize Arabic punctuation
2. recognize Arabic-Indic Digits
3. working to include more fonts
|
I tried ara.traineddata, Arabic.traineddata and ara-Amiri.traineddata. None of them have the Arabic-Indic numbers; they only have the normal (English) numbers. |
Current 4.0.0-alpha traineddata for Arabic script do not recognize numerals in Arabic script. Traineddata finetuned to include these recognizes them but reverses the order. This is probably because tesseract is treating Arabic script numerals the same as Arabic script letters in terms of directionality.
However, as per Unicode Bidirectional Algorithm basics:
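The directionality point can be checked against the Unicode character data. The sketch below uses Python's standard `unicodedata` module to print the bidirectional class of a few sample characters; the choice of samples and the helper name `bidi_classes` are just for illustration. Arabic letters carry class `AL` (right-to-left Arabic letter), while Arabic-Indic digits carry class `AN` (Arabic number) and ASCII digits carry `EN` (European number). In the Unicode Bidirectional Algorithm, a run of `AN` or `EN` characters is laid out left to right even inside right-to-left text, so in logical order the most significant digit comes first, the same as for Latin-script numbers.

```python
import unicodedata

# Sample characters, written as escapes to avoid bidi rendering confusion.
SAMPLES = {
    "ARABIC LETTER ALEF": "\u0627",
    "ARABIC-INDIC DIGIT TWO": "\u0662",
    "DIGIT TWO": "2",
    "HEBREW LETTER ALEF": "\u05d0",
}


def bidi_classes(samples: dict) -> dict:
    """Map each sample name to the Unicode bidirectional class of its character."""
    return {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}


if __name__ == "__main__":
    for name, bidi in bidi_classes(SAMPLES).items():
        print(f"{name}: {bidi}")
```

If a recognizer emits Arabic-Indic digits in the same order it uses for `AL` letters, the digits end up reversed relative to this logical order, which matches the symptom described above.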