Arabic-Indic numerals #858

ibr123 · 2017-05-01T09:49:10Z

Hi,

I'm using tesseract 4.00alpha with leptonica 1.74.1 on Ubuntu 14 to create LSTM files for multiple Arabic fonts. Some of them use the common numeral system (1 2 3 4 ...), but others contain a different numeral system, which is usually more common in Arabic script:
( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩)
Yet this second set of numerals was not recognized; they came out as symbols such as ! instead of ١. Are these numerals not integrated in tesseract?
Thanks

Shreeshrii · 2017-05-02T04:12:01Z

Ref:

The Arabic numeral glyphs 0–9 are encoded in ASCII and Unicode at positions 0x30 to 0x39, matching up with the second hexadecimal digit for convenience.

The Eastern Arabic numerals (also called Arabic–Indic numerals and Arabic Eastern numerals) are the symbols used to represent the Hindu–Arabic numeral system, in conjunction with the Arabic alphabet.

Each numeral in the Persian variant has a different Unicode point even if it looks identical to the Eastern Arabic numeral counterpart. However the variants used with Urdu, Sindhi, and other South Asian languages are not encoded separately from the Persian variants.

See U+0660 through U+0669 and U+06F0 through U+06F9.

So, basically, there are three Unicode ranges with numerals used in Arabic, Persian, etc.:

0x30 to 0x39
U+0660 through U+0669
U+06F0 through U+06F9

If the fonts are putting Eastern Arabic numerals U+0660 through U+0669 in the Arabic numerals range of 0x30 to 0x39, that would cause confusion during training.
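To see which of the three ranges a given character actually belongs to, a minimal Python sketch like the following may help. The range labels and the `digit_range` helper are made up for illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

# The three digit ranges discussed above (labels are illustrative only).
DIGIT_RANGES = {
    "ascii": (0x30, 0x39),
    "arabic_indic": (0x0660, 0x0669),
    "extended_arabic_indic": (0x06F0, 0x06F9),
}

def digit_range(ch):
    """Return the label of the range containing ch, or None."""
    cp = ord(ch)
    for name, (lo, hi) in DIGIT_RANGES.items():
        if lo <= cp <= hi:
            return name
    return None

# The digit "one" in each of the three ranges.
for ch in "1\u0661\u06F1":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), digit_range(ch))
```

Running this on text typed with a suspect font shows whether the font's "Eastern" glyphs are really stored at U+0660 onwards or are just restyled 0x30 to 0x39 code points.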

https://github.com/tesseract-ocr/langdata/blob/master/ara/ara.training_text has the 'Arabic numerals' range of 0x30 to 0x39. You can check whether it has ( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩) and add them if you want to include them for training.
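One way to do that check is sketched below, assuming the training text has already been loaded into a Python string (for example from a local copy of the file). The helper name `missing_arabic_indic_digits` is hypothetical.

```python
# Arabic-Indic digits U+0660 through U+0669.
ARABIC_INDIC_DIGITS = [chr(cp) for cp in range(0x0660, 0x066A)]

def missing_arabic_indic_digits(text):
    """Return the Arabic-Indic digits that never occur in text."""
    present = set(text)
    return [d for d in ARABIC_INDIC_DIGITS if d not in present]

# Small sample containing ASCII digits plus U+0661, U+0662 and U+0663 only.
sample = "2017 \u0661\u0662\u0663"
missing = missing_arabic_indic_digits(sample)
print([f"U+{ord(d):04X}" for d in missing])
```

An empty result would mean every digit of the range occurs at least once; it says nothing about whether they occur often enough for training.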

amitdo · 2017-05-02T06:37:49Z

If those numerals are indeed missing from the official traineddata, I suggest opening a new issue in the langdata repo.

aboelmor · 2017-07-29T04:41:52Z

Did anyone fix this problem? I am not using Unix, so I can't train tesseract on new data, but I need to use the Eastern Arabic numerals. If someone has fixed it and has the traineddata file, please share it with us.

Thanks

reza1615 · 2017-08-05T14:37:41Z

Persian numbers mostly have the same shapes as the Arabic ones, but their Unicode code points are different!
Persian numbers= ۹ ۸ ۷ ۶ ۵ ۴ ۳ ۲ ۱ ۰
Arabic numbers = ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Persian numbers' Unicode= \u06F9 \u06F8 \u06F7 \u06F6 \u06F5 \u06F4 \u06F3 \u06F2 \u06F1 \u06F0
Arabic numbers' Unicode =\u0660 \u0661 \u0662 \u0663 \u0664 \u0665 \u0666 \u0667 \u0668 \u0669
you can check them here

Shreeshrii · 2017-08-05T14:47:05Z

@reza1615

Are these getting recognized in the best traineddata?
Are they being recognized as Arabic unicode numbers?

reza1615 · 2017-08-05T14:53:44Z

Yes, it mixed Persian and Arabic numbers (Unicode). For example, the image had these numbers:
۱-۲ and it recognized ۱ as a Persian number and ۲ as an Arabic number. Their shapes are the same, but for searching and in Unicode they are different.
On the other hand, the shapes of 3, 4, 5 and 6 are not the same; see below:
6 5 4 3
۶ ۵ ۴ ۳ > Persian
٣ ٤ ٥ ٦ > Arabic
You can check it here with the output txt file.
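A quick way to spot this kind of mixing in an OCR output file is to count how many digits fall in each Unicode block. This is only a sketch; `digit_block_counts` is a hypothetical helper and the labels are illustrative.

```python
from collections import Counter

def digit_block_counts(text):
    """Count Arabic-Indic vs. Extended Arabic-Indic digits in text."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if 0x0660 <= cp <= 0x0669:
            counts["arabic_indic"] += 1
        elif 0x06F0 <= cp <= 0x06F9:
            counts["extended_arabic_indic"] += 1
    return counts

# Same situation as the example above: Persian one (U+06F1), Arabic two (U+0662).
ocr_line = "\u06F1-\u0662"
counts = digit_block_counts(ocr_line)
is_mixed = len(counts) > 1
print(dict(counts), "mixed" if is_mixed else "consistent")
```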

reza1615 · 2017-08-05T15:05:08Z

For more information, see the Unicode 'Number, Decimal Digit' category.

Shreeshrii · 2017-08-05T15:08:45Z

@theraysmith
Please update the desired characters for Persian to use the Persian Unicode range of numbers, and ignore the Arabic Unicode number range for fas (Persian), as mentioned above. Thanks!

reza1615 · 2017-08-05T15:41:23Z

People often use a non-standard keyboard (an Arabic keyboard for typing Persian text), so there are many scanned images of Persian text that contain Arabic numbers like ٣ ٤ ٥ ٦, but the OCR should convert them to Persian Unicode.
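Until the traineddata does this itself, the conversion can be applied to the OCR output as a post-processing step. The sketch below is not part of tesseract; `to_persian_digits` is a hypothetical helper built on Python's `str.translate`.

```python
# Map each Arabic-Indic digit (U+0660..U+0669) to the Extended
# Arabic-Indic (Persian) digit with the same value (U+06F0..U+06F9).
ARABIC_TO_PERSIAN = str.maketrans(
    {chr(0x0660 + i): chr(0x06F0 + i) for i in range(10)}
)

def to_persian_digits(text):
    """Replace Arabic-Indic digits with their Persian counterparts."""
    return text.translate(ARABIC_TO_PERSIAN)

# Mixed input: Arabic one (U+0661), Persian two (U+06F2).
converted = to_persian_digits("\u0661-\u06F2")
print([f"U+{ord(c):04X}" for c in converted])
```

ASCII digits and all other characters pass through unchanged, so the step should be safe to run on whole Persian output files.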

Shreeshrii · 2017-08-08T01:50:07Z

Question from Ray in tesseract-ocr/langdata#72

Anyone know which digits are needed for the other Arabic languages?
kur_ara, pus, uig

amitdo · 2017-09-12T11:20:34Z

@zdenop, please close this issue.

The issue is related to the trained data, not the code.

As said, the right place for this issue is the langdata repo.
See tesseract-ocr/langdata#71, tesseract-ocr/langdata#72

This was referenced May 2, 2017

Add Arabic-Indic numerals to Arabic tesseract-ocr/langdata#71

Closed

Add Extended Arabic-Indic Digits to Persian, Urdu and Sindhi tesseract-ocr/langdata#72

Closed

zdenop closed this as completed Sep 12, 2017

tesseract-ocr deleted a comment from rockerbaba Oct 21, 2018
