Arabic-Indic numerals #858
Comments
Ref:
So, basically, there are three Unicode ranges with numerals used in Arabic, Persian, etc.
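For reference, the three ranges are presumably the ASCII digits (U+0030 to U+0039), the Arabic-Indic digits (U+0660 to U+0669) and the Extended Arabic-Indic digits used for Persian and Urdu (U+06F0 to U+06F9). That reading is an assumption, since the comment does not list them. A minimal Python sketch that prints all three (`DIGIT_RANGES` and `digits_from` are names made up for this illustration):

```python
# Assumed reading of "three ranges": ASCII, Arabic-Indic, Extended Arabic-Indic.
# Each value is the code point of the digit zero in that block.
DIGIT_RANGES = {
    "ASCII (European) digits": 0x0030,
    "Arabic-Indic digits": 0x0660,
    "Extended Arabic-Indic digits (Persian, Urdu)": 0x06F0,
}


def digits_from(start):
    """Return the ten consecutive digit characters starting at code point `start`."""
    return "".join(chr(start + i) for i in range(10))


for name, start in DIGIT_RANGES.items():
    print(f"U+{start:04X}..U+{start + 9:04X}  {digits_from(start)}  {name}")
```

The Arabic-Indic and Extended Arabic-Indic glyphs look alike for most digits, which is why the two blocks are easy to confuse in training text and in OCR output.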
If the fonts are putting https://github.com/tesseract-ocr/langdata/blob/master/ara/ara.training_text has the 'Arabic numerals' range of 0x30 to 0x39. You can check whether it also has ( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩) and add them if you want to include them for training.
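One way to run that check is to count how many characters of the training text fall in each digit range. The sketch below works on any string; the helper name `count_digits` and the local file path in the usage comment are assumptions for illustration, not part of the langdata tooling.

```python
# Code point of the digit zero in each decimal-digit block of interest.
RANGES = {
    "ascii": 0x0030,
    "arabic_indic": 0x0660,
    "extended_arabic_indic": 0x06F0,
}


def count_digits(text):
    """Count how many characters of `text` fall in each digit range."""
    counts = {name: 0 for name in RANGES}
    for ch in text:
        for name, start in RANGES.items():
            if start <= ord(ch) <= start + 9:
                counts[name] += 1
    return counts


# Usage with a hypothetical local copy of the training text:
#     with open("ara.training_text", encoding="utf-8") as f:
#         print(count_digits(f.read()))
```

A count of zero for `arabic_indic` would suggest that those digits never appear in the training text and would have to be added before training.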
If those numerals are indeed missing from the official traineddata, I suggest opening a new issue in the langdata repo.
Did anyone fix this problem? I am not using Unix, so I cannot train tesseract on new data, but I need to use the Eastern Arabic numerals. If someone fixed it and has the traineddata file, please share it with us. Thanks
Persian digits mostly have the same shape as the Arabic ones, but their Unicode code points are different!
Are these getting recognized in the best traineddata?
Yes, it mixed Persian and Arabic digits (Unicode); for example, the image had these numbers
For more information, see the Unicode 'Number, Decimal Digit' category.
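Python's standard `unicodedata` module can show how the same digit is classified in each block. This is only an illustrative sketch; `describe` is a helper name invented here.

```python
import unicodedata


def describe(ch):
    """Return (code point, general category, digit value, Unicode name) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),  # "Nd" means Number, Decimal Digit
        unicodedata.digit(ch),
        unicodedata.name(ch),
    )


# The digit three in ASCII, Arabic-Indic and Extended Arabic-Indic form.
for ch in "3\u0663\u06F3":
    print(describe(ch))
```

All three characters share the category `Nd` and the digit value 3, but each has its own code point and name, which is why the recognized text can look right and still contain the "wrong" digits.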
@theraysmith
People often use a non-standard keyboard (an Arabic keyboard for typing Persian text), so there are many scanned images of Persian text that contain Arabic digits like ٣ ٤ ٥ ٦, but the OCR should convert them to the Persian Unicode digits.
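Until the trained data handles this, one possible workaround is to normalize the digits in the OCR output as a post-processing step. The sketch below is not something Tesseract does itself; `to_persian_digits` is a hypothetical helper that maps Arabic-Indic digits (U+0660 to U+0669) to the Extended Arabic-Indic digits used in Persian (U+06F0 to U+06F9).

```python
# Translation table: Arabic-Indic digit code point -> Persian (Extended
# Arabic-Indic) digit code point with the same numeric value.
ARABIC_TO_PERSIAN = {0x0660 + i: 0x06F0 + i for i in range(10)}


def to_persian_digits(text):
    """Replace Arabic-Indic digits in OCR output with their Persian counterparts."""
    return text.translate(ARABIC_TO_PERSIAN)


# The Arabic-Indic digits 3 4 5 6 from the comment above, written as escapes.
print(to_persian_digits("\u0663 \u0664 \u0665 \u0666"))
```

ASCII digits and all other characters pass through unchanged, so the function can be applied to a whole page of recognized text.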
Question from Ray in tesseract-ocr/langdata#72
@zdenop, please close this issue. It is related to the trained data, not the code. As said, the right place for this issue is the langdata repo.
Hi,
I'm using tesseract 4.00alpha with leptonica 1.74.1 on Ubuntu 14 to create LSTM files for multiple Arabic fonts. Some of them use the common numeral system (1 2 3 4 ...), but some of these fonts contain a different numeral system, which is usually more common in Arabic script:
( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩)
Yet this last set of numerals is not recognized, except as symbols such as ! instead of ١. Are these numerals not integrated in tesseract?
Thanks