Add Extended Arabic-Indic Digits to Persian, Urdu and Sindhi #72

Shreeshrii · 2017-05-02T08:01:53Z

Add 0-9 and

Perso-Arabic variant ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹

for Persian, Urdu and Sindhi

Please see tesseract-ocr/tesseract#858

Shreeshrii · 2017-05-11T09:08:33Z

tesseract-ocr/tesseract#894

The rightmost column in the image contains two-digit numbers, but most of the time only one digit seems to be recognized.

theraysmith · 2017-08-08T00:39:36Z

I've added them to my copy of desired_characters. I'll push them to github after testing.
Anyone know which digits are needed for the other Arabic languages?
kur_ara, pus, uig

reza1615 · 2017-08-08T07:20:41Z

@theraysmith https://en.wikipedia.org/wiki/Modern_Arabic_mathematical_notation#Variations
and
https://en.wikipedia.org/wiki/Eastern_Arabic_numerals#Numerals

ebraminio · 2017-08-08T09:27:39Z

Kurdish in Arabic script (kur) uses Arabic-Indic digits (١٢٣٤٥٦٧٨٩). Pashto (pus) uses either the same digits as Persian (۱۲۳۴۵۶۷۸۹) or Western Arabic (a.k.a. European, 123456789) digits. Uighur (uig) uses European digits.

You can check for yourself which digits a language uses: open your browser console and enter the following lines one at a time (this takes browser locale codes, usually two letters, rather than the three-letter codes Tesseract uses):

(123456.789).toLocaleString('ckb') // ١٢٣٬٤٥٦٫٧٨٩ (Arabic-Indic)
(123456.789).toLocaleString('ug') // 123,456.789
(123456.789).toLocaleString('ps') // Interesting that Safari gives "۱۲۳٬۴۵۶٫۷۸۹" (Extended Arabic-Indic similar to Persian) but Chrome "123,456.789"
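Because `toLocaleString` output depends on the browser's locale data (as the Pashto line shows), a deterministic mapping can be handy when preparing or checking text. The following is only a sketch: `DIGIT_ZERO` and `toDigits` are hypothetical names, not part of Tesseract or langdata. It relies on the fact that each of the three digit sets mentioned in this thread is a contiguous run of ten code points (starting at U+0030, U+0660 and U+06F0).

```javascript
// Code point of digit zero in each numbering system discussed in this thread.
const DIGIT_ZERO = new Map([
  ["european", 0x0030],            // 0-9
  ["arabicIndic", 0x0660],         // U+0660..U+0669
  ["extendedArabicIndic", 0x06f0], // U+06F0..U+06F9
]);

// Hypothetical helper: rewrite every decimal digit from any of the three
// systems into the `target` system, leaving all other characters untouched.
function toDigits(text, target) {
  const targetZero = DIGIT_ZERO.get(target);
  if (targetZero === undefined) {
    throw new RangeError(`unknown digit system: ${target}`);
  }
  return text.replace(/[0-9\u0660-\u0669\u06F0-\u06F9]/g, (ch) => {
    const cp = ch.codePointAt(0);
    // Find which system this digit belongs to, then keep its numeric value.
    const sourceZero = [...DIGIT_ZERO.values()].find(
      (zero) => cp >= zero && cp <= zero + 9
    );
    return String.fromCodePoint(targetZero + (cp - sourceZero));
  });
}

// Example: render a date with Persian-style digits, and normalise back.
console.log(toDigits("2017-05-02", "extendedArabicIndic"));
console.log(toDigits("\u06F1\u06F2\u06F3", "european"));
```

Which system to pick for a given language is a separate question (see the per-language notes above); the sketch only converts between the code point ranges.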

Please note that Urdu text may use digits with the same Unicode code points as Persian but a different appearance (although European-style digits seem to be used more often with Urdu nowadays). Open this in your browser (Urdu appearance of the Extended Arabic-Indic digits):

data:text/html;charset=utf8,<div lang="ur" style="font-family: Arial; font-size: 400%">۱۲۳۴۵۶۷۸۹

and compare it with this (the default, Persian appearance of the Extended Arabic-Indic digits):

data:text/html;charset=utf8,<div style="font-family: Arial; font-size: 400%">۱۲۳۴۵۶۷۸۹

Same Unicode code points, but different appearance. More precisely, a font that supports the OpenType language-tag feature handles this, and Pango, which is used to create the training data for Tesseract, can handle it for you if the language code is passed correctly.

roozgar · 2017-08-08T09:49:31Z

In Persian, zero to nine are listed correctly.
Also, "," is used for digit separation...

Shreeshrii · 2017-08-09T03:27:20Z

Thank you all for your helpful input.

theraysmith · 2017-08-10T18:37:10Z

+1 I've updated the desired_characters and the next training will use the correct digits. I'm implementing the same solution for vowels/points as Hebrew, so it should improve recognition of words with them. The difficulty is that Arabic seems a lot more complex than Hebrew because there are many languages that use different variants of the script with different characters, as well as the different display styles. I'm not sure about how that affects the use of point/vowels, or whether there are vowels that are unique to the different languages.

-- Ray.

reza1615 · 2017-08-10T20:58:24Z

@theraysmith
1- All Arabic-family characters are listed here.
I checked the table; besides the numbers, there are some other similar-looking characters that have different Unicode code points:

ۀ = \u06C0
ۂ = \u06C2
هٔ = \u0647 + \u0654

إ = \u0625
ٳ = \u0673

ٲ = \u0672
أ = \u0623
ٵ = \u0675

، = \u060C
٬ = \u066C
٫ = \u066B

\u064E
\u0659

ڼ = \u06BC
ڹ = \u06B9

\u06EC
\u06E0
\u06F0
\u0660
\u06DF
\u06EB
\u06EA
. = (dot)

\u0674
\u0655
\u0654
\u065F
\u0621

٭ = \u066D

= *

You can check their Unicode code points here.
2- At http://collation-charts.org/icu442/ there is a list of many languages and their official characters (you can find Persian, Pashto, Arabic, ...) separately like
3- The vowels (main vowels: [\u064B-\u0650\u0652\u0670]) each have a single Unicode code point that is shared by all members of the Arabic family.
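The points above can be checked in a browser console or Node. This is a rough sketch with hypothetical helper names (`codePoints`, `stripVowels`, `countVowels`). It dumps the code points of a string, so look-alike characters from item 1 can be told apart, and it applies the vowel character class quoted in item 3 exactly as written. Note that this class does not include U+0651.

```javascript
// List the code points of a string as "U+XXXX", one entry per character.
function codePoints(text) {
  return [...text].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// The "main vowels" class quoted in item 3 above (U+0651 is not included).
const ARABIC_VOWELS = /[\u064B-\u0650\u0652\u0670]/g;

// Remove the vowel marks, e.g. to compare pointed and unpointed text.
function stripVowels(text) {
  return text.replace(ARABIC_VOWELS, "");
}

// Count how many vowel marks a string contains.
function countVowels(text) {
  return (text.match(ARABIC_VOWELS) || []).length;
}

// One precomposed letter versus a base letter plus a combining mark:
console.log(codePoints("\u06C0"));       // single code point
console.log(codePoints("\u0647\u0654")); // two code points

// A fully pointed word with three vowel marks:
const pointed = "\u0643\u064E\u062A\u064E\u0628\u064E";
console.log(countVowels(pointed), codePoints(stripVowels(pointed)));
```

Dumping code points like this is also a quick way to see which of the similar-looking variants a given training text or OCR result actually contains.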

gheyret · 2017-08-22T06:19:29Z

The Uyghur (Uighur) language uses the digits 0123456789.

amitdo · 2021-02-26T00:41:53Z

This issue should be re-opened.

Shreeshrii mentioned this issue May 8, 2017

Error sequence in persian language tesseract-ocr/tesseract#894

Closed

theraysmith closed this as completed Aug 8, 2017

This was referenced Aug 8, 2017

Best Traineddata Feedback - Persian tesseract-ocr/tessdata#70

Open

Arabic-Indic numerals tesseract-ocr/tesseract#858

Closed

About Uyghur(Uighur) langdata #68

Open

ebraminio mentioned this issue Aug 10, 2017

Which set of digits should be used for Pashto? w3c/alreq#136

Closed

amitdo mentioned this issue Apr 24, 2020

The 'Tesseract' able to recognize 'Arabic' words but not 'Arabic' numerals from scanned Image using Python tesseract-ocr/tesseract#2955

Closed
