Add Extended Arabic-Indic Digits to Persian, Urdu and Sindhi #72
Comments
The rightmost column in the image has two-digit numbers, but most of the time only one digit seems to be recognized.
I've added them to my copy of desired_characters. I'll push them to GitHub after testing.
Kurdish in Arabic script (kur) uses Arabic-Indic digits (١٢٣٤٥٦٧٨٩). Pashto (pus) uses either the same digits as Persian (۱۲۳۴۵۶۷۸۹) or Western Arabic (a.k.a. European, 123456789) digits. Uighur (uig) uses European digits.
You can check for yourself which digits a language uses: open your browser console and enter the following, one line at a time (this needs browser locale codes such as 'ckb' or 'ug', not the three-letter codes that tesseract uses):
(123456.789).toLocaleString('ckb') // ١٢٣٬٤٥٦٫٧٨٩ (Arabic-Indic)
(123456.789).toLocaleString('ug') // 123,456.789
(123456.789).toLocaleString('ps') // Interestingly, Safari gives "۱۲۳٬۴۵۶٫۷۸۹" (Extended Arabic-Indic, as in Persian) but Chrome gives "123,456.789"
Please note that Urdu text may use digits with the same Unicode code points as Persian but a different appearance (although European-style digits seem to be used more often with Urdu nowadays). Open this in your browser (Urdu appearance of the Extended Arabic-Indic digits):
and compare it with (default, Persian appearance of the Extended Arabic-Indic digits):
Same Unicode, different appearance. OpenType, or more accurately a font that handles the OpenType language tag feature, takes care of this, and Pango, which you use to create the training dataset for tesseract, can handle it for you if the language code is passed correctly.
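The browser-console check above can also be scripted. The following is a minimal sketch, not something from this thread's tooling: it asks the runtime which numbering system it would pick by default for a few locales. The helper name defaultNumberingSystem and the locale list are assumptions for illustration. The answers depend on the ICU/CLDR data shipped with the browser or Node build, which is why Safari and Chrome can disagree, and an unsupported locale silently falls back to the runtime default.

```javascript
// Report which numbering system the runtime's locale data picks by default.
// The result comes from the ICU/CLDR data bundled with the browser or Node
// build, so different engines may give different answers.
function defaultNumberingSystem(locale) {
  return new Intl.NumberFormat(locale).resolvedOptions().numberingSystem;
}

// Hypothetical list of locales to inspect; adjust as needed.
for (const locale of ['fa', 'ur', 'sd', 'ckb', 'ps', 'ug']) {
  console.log(
    locale,
    defaultNumberingSystem(locale),
    (123456.789).toLocaleString(locale)
  );
}
```

A numbering system name such as arabext corresponds to the Extended Arabic-Indic digits, arab to the Arabic-Indic digits, and latn to the European digits.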
In Persian, zero to nine are listed correctly.
Thank you all for your helpful input.
+1 I've updated the desired_characters and the next training will use the
correct digits.
I'm implementing the same solution for vowels/points as Hebrew, so it
should improve recognition of words with them.
The difficulty is that Arabic seems a lot more complex than Hebrew because
there are many languages that use different variants of the script with
different characters, as well as the different display styles.
I'm not sure about how that affects the use of point/vowels, or whether
there are vowels that are unique to the different languages.
--
Ray.
|
@theraysmith
ۀ = \u06C0
إ = \u0625
ٲ = \u0672
، = \u060C
064E
ڼ = \u06BC
06EC
0674
٭ = \u066D
You can check their Unicode code points here.
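For anyone who wants to verify the code points listed above without an external site, here is a small sketch that can be pasted into a browser console or run with Node. The helper name codePoints is hypothetical, and the sample string only contains the six characters from the list that have a visible glyph.

```javascript
// Return a "U+XXXX" label for every character in a string.
function codePoints(text) {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// The six characters from the list above, written as escapes so the
// source file stays unambiguous regardless of editor or font.
const sample = '\u06C0\u0625\u0672\u060C\u06BC\u066D';
console.log(codePoints(sample).join(' '));
```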
The Uyghur (Uighur) language uses the digits 0123456789.
This issue should be re-opened.
Add 0-9 and
Perso-Arabic variant ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
for Persian, Urdu and Sindhi
Please see tesseract-ocr/tesseract#858
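As an illustration of the three digit sets discussed in this issue, the sketch below builds each run of ten digits from its starting code point and classifies a single digit by Unicode range. The function names digitRun and digitBlock are assumptions for this example. How the digits are actually added to desired_characters is not shown in this thread, so the one-character-per-line output is also only an assumption.

```javascript
// Build a run of ten consecutive digits starting at the given code point.
function digitRun(zeroCodePoint) {
  return Array.from({ length: 10 }, (_, i) =>
    String.fromCodePoint(zeroCodePoint + i)
  );
}

// Classify a single character by the digit range it falls in, or null.
function digitBlock(ch) {
  const cp = ch.codePointAt(0);
  if (cp >= 0x0030 && cp <= 0x0039) return 'european';
  if (cp >= 0x0660 && cp <= 0x0669) return 'arabic-indic';
  if (cp >= 0x06f0 && cp <= 0x06f9) return 'extended-arabic-indic';
  return null;
}

const european = digitRun(0x0030);            // U+0030..U+0039
const arabicIndic = digitRun(0x0660);         // U+0660..U+0669
const extendedArabicIndic = digitRun(0x06f0); // U+06F0..U+06F9

// Print the digits requested for Persian, Urdu and Sindhi, one per line
// (assumed layout, not confirmed by the thread).
console.log([...european, ...extendedArabicIndic].join('\n'));
```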