LSTM: Training - Arabic - Add Top layer - Aborted (core dumped) #642
Comments
|
|
This seems to happen when an --eval_listfile is given. It seems to work when that option is omitted. See below:
Without --eval_listfile, the process continues.
|
@Shreeshrii I have noticed that the Arabic text in your log is reversed. Arabic is read and written from right to left (RTL). |
Thanks for pointing it out.
I neither know Arabic nor am familiar with bidi.
Is it just one line that is reversed or all?
I am using the training text from langdata, prefixed with a sample with
diacritics provided by @bmwmy, along with a few words copied from Wikipedia.
I had copied the error msg from the console. I could try to save the log in
a file to see if that is correct, since it is possible that my locale under
bash on Windows 10 does not support Arabic.
- excuse the brevity, sent from mobile
…On 10-Jan-2017 1:16 AM, "christophered" ***@***.***> wrote:
@Shreeshrii <https://github.com/Shreeshrii> I have noticed that the
Arabic text in your log is reversed,
Your log shows: مِيحِرَّلا نِمَحْرَّلا هِلَّلا مِسْبِ
It should be: بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
A representation of this mistake, example:
Correct: Peace Be Upon You
Wrong: uoY nopU eB ecaeP
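The reversal described above can be reproduced with a small sketch. It assumes the log shows text whose character order was flipped (visual order) instead of being stored in logical order; the helper names `is_rtl_text` and `reverse_codepoints` are hypothetical and are not part of Tesseract.

```python
import unicodedata


def is_rtl_text(text: str) -> bool:
    """Return True if the first strongly directional character is right-to-left."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-type or Arabic-letter direction
            return True
        if bidi == "L":  # Latin-type direction
            return False
    return False  # no strong character found


def reverse_codepoints(text: str) -> str:
    """Flip the code point order, which is what a 'reversed' log line looks like."""
    return text[::-1]


if __name__ == "__main__":
    print(reverse_codepoints("Peace Be Upon You"))
    print(is_rtl_text("\u0628\u0650\u0633\u0652\u0645\u0650"))  # Arabic sample
```

A check like `is_rtl_text` only tells you the script direction; whether a console displays the line correctly still depends on the terminal's bidi support, which may explain a reversed-looking log under bash on Windows.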
|
@Shreeshrii could you post some generated image files (tif) so we can check whether the Arabic text is rendered correctly? |
Please see attached, the zip file has the training text, box tiff pair and
unicharset.
ShreeDevi
|
|
I had attached the file via email. Maybe GitHub does not allow that. Will
upload on the forum.
- excuse the brevity, sent from mobile
…On 10-Jan-2017 5:30 PM, "christophered" ***@***.***> wrote:
@Shreeshrii <https://github.com/Shreeshrii>
- All the Arabic language lines are reversed.
- I have checked the samples from #552. The "Original_Text.txt" was encoded in UTF-8 with BOM (UTF-8-BOM) and everything seems okay.
- So attach the tif/box that you are using.
I am not seeing any zip files here.
|
Uploaded zip file with training data for a group of fonts which have coverage for Arabic on Windows. It is possible that the tesstrain.sh process is dropping diacritics as noise. I am trying to change config variables to see if I can get some improvement. |
Attached is a log file which shows verbose output for every iteration of training - from middle of current training session. |
@Shreeshrii Initial Observation:
When I used them in my training process, I merged the letter extender with the Arabic letter into one single box and used that Arabic letter as the character of the box. Basically, I was trying to train the engine to recognize that Arabic letter in its multiple positions. As you know, Arabic letters have multiple forms depending on their position in the word (beginning, middle, ending, isolated). This also means that ( كَـ ) is not ( ك + ـَ ), it is ( كَ ) |
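The labeling idea above can be sketched in Python. This is only an illustration under two assumptions: the "letter extender" is the Unicode tatweel (U+0640), and positional glyphs may show up as Arabic presentation forms. The helper `normalize_box_label` is hypothetical and is not a Tesseract function.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL, assumed here to be the "letter extender"


def normalize_box_label(label: str) -> str:
    """Map a box label to its base letter(s) plus diacritics.

    NFKC folds Arabic presentation forms (initial, medial, final, isolated)
    back to the base letter; the tatweel is then dropped so that an extended
    letter and a plain letter share one label.
    """
    folded = unicodedata.normalize("NFKC", label)
    return folded.replace(TATWEEL, "")


if __name__ == "__main__":
    # kaf + fatha + tatweel is labeled the same as kaf + fatha
    print(normalize_box_label("\u0643\u064E\u0640") == "\u0643\u064E")
```

With this kind of normalization, the four positional shapes of a letter collapse into one training label, which matches the goal of teaching the engine one letter in several positions.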
@theraysmith @amitdo @Shreeshrii
Tesseract 4.0 lstm puts the spaces between the words into boxes, as you know. |
(Control Panel / Clock, Language, and Region / Region / Administrative / Change system locale / Arabic "Saudi Arabia") Also, when using a txt file, the words are not in their correct order. In Google Chrome the words are correct, but once they are copied and pasted into a text file, the order changes. What a weird issue. |
@theraysmith @amitdo @Shreeshrii
|
It is possible that I copied some text from wikipedia which is incorrect. Please look at the training_text file and let me know which lines should be deleted.
Please share your training text and I can give it a try. |
Original problem, core dumped - Arabic related issues: Closing this issue. |
The langdata text files for all languages are saved using UTF-8 encoding. |
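Since the thread mentions both plain UTF-8 and UTF-8-BOM files, a minimal sketch for checking a training text before use may help. The function names are hypothetical, and nothing here is confirmed to be how tesstrain.sh reads its input.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the raw bytes start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def decode_training_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # The "utf-8-sig" codec strips the BOM when decoding and accepts
    # BOM-less input as well.
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + "\u0628\u0650\u0633\u0652\u0645\u0650".encode("utf-8")
    print(has_utf8_bom(sample))
    print(len(decode_training_text(sample)))
```

Decoding with `utf-8-sig` keeps a stray BOM from ending up as an extra character at the start of the first training line.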
I am trying to train or fine-tune Tesseract on my own dataset for the Farsi language. Can anyone please help me with this? |
While Add Top layer LSTM training worked for Latin unicharset based languages (eng, nor), it is failing for Arabic.
I am copying below the log for creating lstmf files and then for the training.