Numbers in Arabic script are getting reversed #2263

Shreeshrii · 2019-02-23T15:03:09Z

The current 4.0.0-alpha traineddata for Arabic script does not recognize Arabic-Indic numerals. Traineddata finetuned to include them recognizes the digits but reverses their order. This is probably because Tesseract treats Arabic-script numerals the same as Arabic-script letters in terms of directionality.

However, as per Unicode Bidirectional Algorithm basics:

Numbers
A quick word about numbers. Numbers in RTL scripts run left-to-right within the right-to-left flow, but they are handled by the bidi algorithm a little differently than words. They are said to have weak directionality. The two examples in the picture illustrate this difference.

one two ثلاثة 1234 خمسة AND one two ثلاثة ١٢٣٤ خمسة

Numeric digits run left-to-right, but don't break directional runs. See live demo.

The first example uses European digits, '1234', the second expresses the same number using Arabic-Indic digits, ١٢٣٤. In both cases, the digits in the number are read left-to-right.

Because it is weakly typed, the number is seen as part of the preceding Arabic text, so the two Arabic words that surround the number are treated as part of the same directional run - even though the sequence of digits runs LTR on screen.
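
The weak directionality described above can be inspected with Python's standard `unicodedata` module. This is a small illustrative sketch, not part of Tesseract: it only reports the Unicode bidi class of a few sample characters (the `samples` dict and the `bidi_classes` helper are names made up here). European digits report `EN` and Arabic-Indic digits report `AN`, both weak types, while Arabic and Hebrew letters report the strong RTL types `AL` and `R`.

```python
import unicodedata

# Sample characters; the labels are only for display.
samples = {
    "European digit": "1",
    "Arabic-Indic digit": "\u0661",  # ARABIC-INDIC DIGIT ONE
    "Arabic letter": "\u062b",       # ARABIC LETTER THEH
    "Hebrew letter": "\u05d0",       # HEBREW LETTER ALEF
    "Latin letter": "a",
}


def bidi_classes(chars):
    """Map each label to the Unicode bidirectional class of its character."""
    return {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}


if __name__ == "__main__":
    for name, cls in bidi_classes(samples).items():
        print(f"{name}: {cls}")
```

A recognizer that treats `AN` characters like `AL` characters would emit the digits in the same right-to-left order as the surrounding letters, which matches the reversal reported here.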

Shreeshrii · 2019-02-23T15:05:06Z

Shreeshrii · 2019-02-23T15:06:56Z

The following is the recognition with the finetuned traineddata:

الجفا . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ٢٧٨
غرام مُميت . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ٣٧٨
الفؤاد الكسير . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ٤٧٨
عَقيقَ في عقيق في عقيقي . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. ٥٧٨
الباب الحادي عشر: مُتفرزقات . . . . . . . . . . . . . . . . . . . . . . . . . ... ... .. ... ٧٧٨
الاي ..................................... ... .. .. ....... ٩٧٨
مَديح الشاي . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ... ٩٩٨٨
مشك الشاي . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. ... ٧٩٨٨
ليلة الشاي . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ٢٩٨٨
رجال السر . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ٣٨٨
في فضل الاجتماع . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . ٤٨٨
شحججة . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ٩٨٨
للهِ دَؤ بني رَوَاحة ............................................ ٦٨٨
خطة عَبْسِيةَ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ٧٨٨
مزايا الزمان . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. ... ٨٨٨
عَشرَاء . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ... ٩٨٨
قطع علاقة في عتاب ......................................... ٩٩٨
مُعاتبة . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . .. ... ٢٩٨
السمكة . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ... ٤٩٨
نظرة . . . . . . . . . . . . . . . . ... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. .. ٩٩٨
القطار . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ... ٦٩٨
المعالي . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... .. ٩٨٩٨
المصادر والمراجع . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ٠٠٩
الفهرس . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. .. ٢٠٩
١٩
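
For output that was already produced with reversed digits, a mechanical post-processing step might serve as a stopgap. The sketch below is an assumption about one possible workaround, not something used in this issue: `reverse_digit_runs` is a hypothetical helper that reverses every run of Arabic-Indic digits (U+0660 to U+0669) in a line and leaves everything else untouched.

```python
import re

# One or more consecutive Arabic-Indic digits (U+0660 to U+0669).
ARABIC_INDIC_RUN = re.compile(r"[\u0660-\u0669]+")


def reverse_digit_runs(line):
    """Reverse each run of Arabic-Indic digits; other text is unchanged."""
    return ARABIC_INDIC_RUN.sub(lambda m: m.group(0)[::-1], line)


if __name__ == "__main__":
    # Digits 2, 7, 8 in logical order become 8, 7, 2.
    print(reverse_digit_runs("\u0662\u0667\u0668"))
```

It should only be applied to output known to have the reversal, and the result still needs to be checked against the image.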

Shreeshrii · 2019-02-23T15:08:30Z

For reference, here is the recognition with the official traineddata.

tessdata_fast

الباب الحادي عشر : مطرّقات ‏ ...............2.2.2.2.2.2.2.2.2.222.2.2.2.2.2.2.... لالالم
شاي ........ي.ييييييييييييييييييي ...م022 020000262000222 0.... قلام
لح اع ا ع ا يي يي ا ا ع اي ا ا ا ا ا ا ع يا ا ا ا ا ا ا ع ا ل ا ا ا 00
مزايا الزمان ‏ ................ي.....يي..ية.ثثمة نمم نم م.م م.م .... للم
عَشُرَاء تتيييييييييمةمةث ممم ةم ة مم ةم ة ممم م ةلمم ل م ممم ملم م... 44م
قطم علاقة في عتاب ‏ ............2..2.2.22. .606660022622202 00006.... [اؤم
هعائية ......ث..ييييييييممي .مم ممم مم ممم مم ممم مم م م ملم م ممم م.. 17وقم
السمكة ...........ي.ييييييييي.ي.يي.م 60006200 606060666066200 06.... 4454
لظلرة ....ث..ي..ييثيءام ميم ممم ممم مم م 0666066 606660666660600 6... 4460م
المُعالي ............ي.ي.ي.ييييي.مييثثية تمي .ةمث مث م ننم مم ..... 444
المصادر والمراجع ‏ ...............يي.يييييييةيية. تت لل ا م ا ملل فو
القهرض ............ي.ي ...يي ي ةيم مهم مم مم ف ة ةن ة تت ةم ةن م .ل لانو
41

tessdata_best

الجا اال ااا ااا ...مر .ممم ...م.م تتم م.م تم .متي الالامْ
غرام ميت ....... ...م.م ...متم متتتتت ييل الام
الفؤادالكير ...................... ...م ...م.م ...م ...م.م ...تلت غلام
عَقِيقّ في عَقَيق في عقيقٍ ترتمم .م.م ت .مم متت م ممت .ممم ت .رمم يي لام
الباب الحادي عشر: مُطْرّقات .. ...اي الالام,/
الاي .ءءء ءءء لمملا حلام
من اليد ا الا
ملك الشاي ............. ...تت .......................... لهم
لبلةَالشاي ................................................ لظم
رجالالسّر .......... .اتا لات ...مم .ل تكلم
في فضل الاجتماع سه
ا
لل نر بي زَرَاكة ني اهم
خطةعَبسِية .... ...الما ...ا ...تلت ...لل امال ل لالم
سي
عَشْرَاء للا الالالال امال لالم لاما الام اكلم
قطع علاقة في عتاب .. ...ل احم
ا
السلمكة ...ءءء الالالال لال مالالا ااا كم
لطظْرة .ءءء ءءء الالال الل ءءء الالالال اماملا ا فكم
ا اا
الْمعَالي .لل ...ءءء مال ءءء مالالا ااا امكم,
المصادر والمراجع ......................ت...................... 0ف
الفهرس .................ا.ا ...رمم .م.م .مم رترت مم تيمم م.م يليل 970
47

Shreeshrii · 2019-02-23T16:19:19Z

This is not the case with all RTL languages. Hebrew numbers are recognized correctly.

רַאש.ונָה ראשון ‏ אַחַת אֶחָד
שְיֶה || שני | שְתַיִם | שְנם
שלישית שלִישי | שלש | שֶלשָה
רביעית | רביעי | אַרְבַּע | אַרְבְעָה
חמישית חֶמִישי | חֶמש | הַמשָה
ששית | ששי | שש | ששָה
שְבִיעית שְבִיעִי | שְבַע | שְבְעָה
שְמִנִית | שְמִנִי שְמונֶה | שְמוּנֶה
תְשיית תֶשיעִי | תַשַע | הַטְעָה

Shreeshrii · 2019-02-23T16:22:04Z

@jbreiden Can you please check whether the Arabic TOC image is recognized correctly at Google? Thanks!

amitdo · 2019-02-23T16:42:55Z

What was the source for finetuning?

Rendered text images via text2image or 'real' images?

amitdo · 2019-02-23T17:04:49Z

Hebrew uses 0123456789.

What you have in the image are words, not numbers:

רַאש.ונָה ראשון ‏ אַחַת אֶחָד

first (feminine form), first (masculine), one (f), one (m).

Here is an example in Hebrew:

הוא נולד בשנת 1962 בחיפה

He was born in 1962 in Haifa

Shreeshrii · 2019-02-23T17:22:21Z

What was the source for finetuning?

Rendered text images via text2image or 'real' images?

text2image via tesstrain.sh

Shreeshrii · 2019-02-23T17:23:43Z

Hebrew use 0123456789.

OK. So then this kind of issue will not apply.

EDIT: Here is a test for Hebrew using a cropped section of the image from issue #2207. The numbers are recognized correctly, except for the corner case where a line begins with a number (28 is recognized as 8).

יתקיים ביום ראשון יייט במרחשון תשע'יט,
8 באוקטובר 2018 בשעה 15:00 בקמפוס המזמין ברח'
מעגל בית המדרש 7, בעת הכרם, ירושלים. מקום
מפגש:ליד עמדת השומר בכניסה הראשית.
עד ליום שני כ'"ז במרחשוון תשע"ט, 5 בנובמבר 2018 עד
השעה 16:00,
עד ליום שני ייא בכסלו תשע'"ט, 19 בנובמבר 2018 עד
השעה 16:00,

Fahad-Alsaidi · 2019-02-23T18:01:38Z

@Shreeshrii
This result is great. Is this finetuned traineddata public? Where can I get it? Thanks.

Shreeshrii · 2019-02-23T18:24:31Z

I am still experimenting with finetuning. You can get the traineddata files from https://github.com/Shreeshrii/tessdata_arabic

Note: the training_texts have not been updated in the repo yet. I used numerals in both Arabic and English scripts, added Arabic punctuation, and added a few lines in the format of the Table of Contents. The training text is about 5000 lines and the eval text is about 500 lines. I am doing plus-minus training with script/Arabic.traineddata as the starting point.

I finetuned with only one font at a time, so the latest files are:

ara-Amiri
ara-Scheherazade
ara-Scheherazade-int (Integer model of above, much smaller file)

On my random eval set the error rate is 3-4%. However, as noted in this issue, the numerals are in reverse order.

ara-Amiri-invert (included sample with white text on black background)
ara-Amiri-invert-int (Integer model of above, much smaller file)
These have a much higher error rate.

I am now testing finetuning with multiple fonts.

Shreeshrii · 2019-02-24T05:21:49Z

As a test, here is another TOC from an Arabic document, this time with numbers in Latin script. The image is taken from https://tex.stackexchange.com/questions/213222/chapter-numbering-in-table-of-contents

These seem to be recognized correctly by the finetuned traineddata.

كلمة المتزجم
تقديم الكتاب
الباب الأول. التحليل التوافقي 1

مقدمة . . . . . .. ..... ..... . 1
للدالأسسيلعد . .. ..... ............: . 2
1.3 التباديل ......... 4
1.4 التوافيق .. ...... ..... 6
1.5 معاملات كثيرات الحدود ......... 11
"1.6 عدد الحلول الصحيحة للمعادلات 0 } } : . : : . : 0 : . : . . : . 15
ملخص الفصل ....... 19
مسائل . . . . . . . . .. . .. ... ..... ..... 19
تمارين نظرية .......... 23
اختبارات ذاتية في المسائل والتمارين . . . . . . . . . . . . . . . . 26
الباب الثاني. مسلمات الاحتمالات 29
2.1 مقئمة. . . . . . .. .. ..... ..... 29
2.2 فراغ العينة و الحوادث ..... ...... ...... 29
2.3 مسلمات الاحتملات . . . . . . . . . . .. ... ...... 35
2.4 بعض المبرهنات البسيطة ..... ...... . . 38
فراغات العينة بتائح متكافة الفرص . . . . . . . . . . .. . . . . 44

tessdata_best

كلمة المترجم
تقديم الكتاب
الباب الأول. التحليل التوافقي 1
1 مقلمة..... مام .امم مم .امي .1
2 يدا الأماني للعد تيتا متيل 20
3 الباديل يي 4
4 التوافيق لام م .م م يي 6
5 معاملات كثيرات الحدود 11
“1.6 عد الحلول الصحيحة للمعادلات .ايا ا ل 15
ملخص الفصل ااا 19
مسائل ...تت ...ل تت م يل 19
تمارين نظرية يني ليلل 23
اختبارات ذاتية في المسائل والتمارين . .اا ل 00 0.2000 26
الباب الثاني. مسلّمات الاحتمالات 29
1 مقلمة ا تلت تت تت لت تت .تت يي 29
2 فاغ العينة و الحوادث بي تت .م تت م ل .يل 29
3 مسلمات الاحصالات تت تت م م م لما 35
4 بض المبرهنات البسيطة مت . .م م م يي 38
5 فاغات العينة بتائج متكاففة القرضص ااا ااا 44

Shreeshrii · 2019-02-25T12:10:57Z

Fixed via #2270

Here is the display of the OCRed output in Notepad++ in RTL view.

Original image is linked at #2263 (comment)

Shreeshrii · 2019-02-28T09:01:38Z

See #2270 (comment) for links to test image with numbers at beginning, middle and end of line and OCR results.

Thanks @amitdo for reviewing.

mohdbm · 2020-11-12T13:19:58Z

@Shreeshrii
I would like to help finetune Arabic recognition in Tesseract 5. How can I contact you? I need your help understanding how to finetune and train Tesseract, how to generate a training set, and with many other questions. Thank you.

mohdbm · 2020-11-12T13:23:51Z

When I tried to recognize the above image with standard Tesseract v5.0.0.20190623, using only the Arabic language, I got the result below:

mohdbm · 2020-11-12T13:27:22Z

So I would like to learn from you how to use the finetuned data to:

recognize Arabic punctuation
recognize Arabic-Indic digits
include more fonts

Shreeshrii · 2020-11-12T15:07:23Z

Please see:

https://github.com/tesseract-ocr/tesstrain/wiki/Arabic-Handwriting
tesseract-ocr/tesstrain#176
tesseract-ocr/tesstrain#128
https://github.com/Shreeshrii/tesstrain-arabic-GS

It's been almost a year since I did that training, so I suggest that you try with a small training set to resolve the issues with punctuation and Arabic-Indic digits.

MostafaAbdElRasoul · 2023-11-01T14:16:47Z

I tried ara.traineddata, Arabic.traineddata, and ara-Amiri.traineddata. None of them has the Arabic (Indic) numerals; they only have the Western (English) digits.
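
One way to check whether a model can emit Arabic-Indic digits at all is to look at its unicharset. The sketch below rests on several assumptions: that the unicharset has been unpacked to a text file (for example with `combine_tessdata -u`), that its first line is the entry count, and that each following line starts with the symbol followed by a space. `missing_digits` is a hypothetical helper, not a Tesseract API.

```python
# Arabic-Indic digits U+0660 (zero) through U+0669 (nine).
ARABIC_INDIC_DIGITS = [chr(cp) for cp in range(0x0660, 0x066A)]


def missing_digits(unicharset_text):
    """Return the Arabic-Indic digits that do not appear as unicharset symbols.

    Assumes the first line holds the entry count and every other line
    begins with the symbol, followed by a space and its properties.
    """
    lines = unicharset_text.splitlines()[1:]  # skip the entry-count line
    symbols = {line.split(" ", 1)[0] for line in lines if line}
    return [d for d in ARABIC_INDIC_DIGITS if d not in symbols]
```

Reading the unpacked file as UTF-8 text and passing its contents to `missing_digits` gives an empty list when all ten digits are present. A digit that is absent from the unicharset cannot be produced by that model, regardless of any direction handling.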

This was referenced Feb 23, 2019

Add Indic numerals and missing punctuation to Arabic tesseract-ocr/langdata#131

Open

Arabic training data has room for improvement #2047

Open

Shreeshrii mentioned this issue Feb 25, 2019

DO NOT MERGE: Fix reversal of numerals in Arabic script #2266

Closed

Shreeshrii mentioned this issue Feb 28, 2019

Treat U_ARABIC_NUMBER as LTR #2270

Merged

Shreeshrii closed this as completed Feb 28, 2019

amitdo added the RTL label Mar 18, 2021

amitdo added the TOC label Feb 8, 2022

amitdo mentioned this issue Feb 8, 2022

Multiple punctuation issue #3748

Closed
