Arabic-Indic numerals #858

Closed
ibr123 opened this issue May 1, 2017 · 11 comments

Comments

@ibr123

ibr123 commented May 1, 2017

Hi,

I'm using Tesseract 4.00alpha with Leptonica 1.74.1 on Ubuntu 14 to create LSTM files for multiple Arabic fonts. Some of these fonts use the common numerals (1 2 3 4 ...), but others use a different numeral system that is more common in Arabic script:
( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩)
The second set of numerals is not recognized; the digits come out as symbols, such as ! instead of ١. Are these numerals not included in Tesseract?
Thanks

@Shreeshrii
Collaborator

Shreeshrii commented May 2, 2017

Ref:

The Arabic numeral glyphs 0–9 are encoded in ASCII and Unicode at positions 0x30 to 0x39, matching up with the second hexadecimal digit for convenience.

The Eastern Arabic numerals (also called Arabic–Indic numerals and Arabic Eastern numerals) are the symbols used to represent the Hindu–Arabic numeral system, in conjunction with the Arabic alphabet.

Each numeral in the Persian variant has a different Unicode point even if it looks identical to the Eastern Arabic numeral counterpart. However the variants used with Urdu, Sindhi, and other South Asian languages are not encoded separately from the Persian variants.

See U+0660 through U+0669 and U+06F0 through U+06F9.

So, basically, there are three Unicode ranges with numerals used in Arabic, Persian, etc.:

  • U+0030 through U+0039 (0x30 to 0x39)
  • U+0660 through U+0669
  • U+06F0 through U+06F9

If the fonts place Eastern Arabic numeral glyphs (which belong at U+0660 through U+0669) at the Arabic numeral code points 0x30 to 0x39, that would cause confusion during training.

https://github.com/tesseract-ocr/langdata/blob/master/ara/ara.training_text has the 'Arabic numerals' range of 0x30 to 0x39. You can check whether it has ( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩) and add them if you want to include them in training.
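One way to run that check is to count the digits in each of the three ranges listed above. The following is only a minimal sketch: the helper name `digit_range_counts` and the range labels are made up for illustration and are not part of Tesseract or its training tools.

```python
# Hypothetical helper (not part of Tesseract): count how many digits in a
# text fall into each of the three Unicode ranges discussed above.
DIGIT_RANGES = {
    "ascii": (0x0030, 0x0039),                  # 0-9
    "arabic_indic": (0x0660, 0x0669),           # Eastern Arabic numerals
    "extended_arabic_indic": (0x06F0, 0x06F9),  # Persian/Urdu variants
}


def digit_range_counts(text):
    """Return a dict mapping each range label to its digit count in text."""
    counts = {name: 0 for name in DIGIT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in DIGIT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts


# Escapes are used instead of literal digits to avoid bidi display issues.
sample = "12 \u0661\u0662 \u06F1"
print(digit_range_counts(sample))
```

To apply this to a local copy of the training text, read the file with `open(path, encoding="utf-8")` and pass its contents to the function; a count of zero for `arabic_indic` would suggest those digits are missing from the text.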

@amitdo
Collaborator

amitdo commented May 2, 2017

If those numerals are indeed missing from the official traineddata, I suggest opening a new issue in the langdata repo.

@aboelmor

Did anyone fix this problem? I am not on Unix, so I cannot train Tesseract on new data, but I need the Eastern Arabic numerals. If someone has fixed it and has the traineddata file, please share it with us.

Thanks

@reza1615

reza1615 commented Aug 5, 2017

The shapes of Persian numerals are mostly the same as the Arabic ones, but their Unicode code points are different!
Persian numbers= ۹ ۸ ۷ ۶ ۵ ۴ ۳ ۲ ۱ ۰
Arabic numbers = ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Persian numbers' Unicode= \u06F9 \u06F8 \u06F7 \u06F6 \u06F5 \u06F4 \u06F3 \u06F2 \u06F1 \u06F0
Arabic numbers' Unicode =\u0660 \u0661 \u0662 \u0663 \u0664 \u0665 \u0666 \u0667 \u0668 \u0669
you can check them here
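The difference between the two sets can also be inspected with Python's standard `unicodedata` module. This is a small sketch, not something from the Tesseract tooling, and the helper name `describe` is chosen only for illustration. It shows that both sets are decimal digits with the same numeric values but distinct code points and names.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category, digit value)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.digit(ch),
    )


# Walk both digit blocks: Arabic-Indic (U+0660..) and Extended
# Arabic-Indic (U+06F0..), the latter used for Persian.
for base in (0x0660, 0x06F0):
    for offset in range(10):
        print(*describe(chr(base + offset)))
```

Every character in both blocks reports the general category `Nd` (Number, Decimal Digit), which is why code that only tests "is this a digit" cannot tell the two sets apart.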

@Shreeshrii
Collaborator

@reza1615

Are these getting recognized in the best traineddata?
Are they being recognized as Arabic unicode numbers?

@reza1615

reza1615 commented Aug 5, 2017

Yes, it mixed Persian and Arabic numerals (Unicode). For example, the image had these numbers:
۱-۲ and it recognized ۱ as a Persian numeral and ۲ as an Arabic numeral. Their shapes are the same, but for searching and in Unicode they are different.
On the other hand, the shapes of 3, 4, 5 and 6 are not the same; see below:
6 5 4 3
۶ ۵ ۴ ۳ >Persian
٣ ٤ ٥ ٦ > Arabic
You can check it here, along with the output txt file.

@reza1615

reza1615 commented Aug 5, 2017

For more information, see the Unicode 'Number, Decimal Digit' category.

@Shreeshrii
Collaborator

@theraysmith
Please update the desired characters for Persian (fas) to use the Persian Unicode range of numerals, and ignore the Arabic Unicode numeral range for fas, as mentioned above. Thanks!

@reza1615

reza1615 commented Aug 5, 2017

People often use a non-standard keyboard (an Arabic keyboard for typing Persian text), so there are many scanned images of Persian text that contain Arabic numerals like ٣ ٤ ٥ ٦, but the OCR should convert them to Persian Unicode.
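Until the trained data handles this, one possible workaround is to normalize the OCR output afterwards. The sketch below is a post-processing step applied to the recognized text, not a Tesseract feature, and the names `ARABIC_TO_PERSIAN` and `to_persian_digits` are hypothetical. It assumes the text is known to be Persian, so every Arabic-Indic digit should become its Persian counterpart.

```python
# Map each Arabic-Indic digit (U+0660..U+0669) to the Extended
# Arabic-Indic (Persian) digit with the same value (U+06F0..U+06F9).
ARABIC_TO_PERSIAN = {0x0660 + i: 0x06F0 + i for i in range(10)}


def to_persian_digits(text):
    """Replace Arabic-Indic digits with Persian ones; leave the rest as-is."""
    return text.translate(ARABIC_TO_PERSIAN)


# Mixed input: Persian 1 (U+06F1), a hyphen, then Arabic 2 (U+0662).
mixed = "\u06F1-\u0662"
print(to_persian_digits(mixed) == "\u06F1-\u06F2")
```

ASCII digits are left untouched here; whether they should also be converted depends on the document, so that choice is left to the caller.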

@Shreeshrii
Collaborator

Question from Ray in tesseract-ocr/langdata#72

Anyone know which digits are needed for the other Arabic languages?
kur_ara, pus, uig

@amitdo
Collaborator

amitdo commented Sep 12, 2017

@zdenop, please close this issue.

The issue is related to the trained data, not the code.

As said, the right place for this issue is the langdata repo.
See tesseract-ocr/langdata#71, tesseract-ocr/langdata#72

@zdenop zdenop closed this as completed Sep 12, 2017
@tesseract-ocr tesseract-ocr deleted a comment from rockerbaba Oct 21, 2018