LSTM: Training - missing file /langdata/radical-stroke.txt #542

Shreeshrii · 2016-12-07T06:06:51Z

Other case É of é is not in unicharset
Setting unichar properties
Setting properties for script Common
Setting properties for script Latin
Failed to load radical-stroke info from: ../langdata/radical-stroke.txt
Warning: given outputs 105 not equal to unicharset of 113.

@theraysmith I am trying to run the commands given in the training tutorial.

The above messages are from basetrain.log.
Does the langdata repo need to be updated for 4.0 alpha?


theraysmith · 2016-12-07T16:59:49Z

Fixed in tesseract-ocr/langdata@3299c60.
I'm retesting now. It seems the tutorial works without it, so I imagine the accuracy numbers in the tutorial will come out different.

Shreeshrii · 2016-12-07T17:16:13Z

Thanks, Ray.


theraysmith · 2016-12-07T20:52:24Z

I've just updated the numbers in the training tutorial.
It seems to work apart from one thing that needs looking at - it doesn't run the eval from the trainer.
It doesn't harm the tutorial, but will be required before people start serious training.

Shreeshrii · 2016-12-08T03:54:14Z

Ray,

Please add the minimum requirements for LSTM training in terms of hardware, software, etc.

I realized after running the process that I needed to build Scrollview.jar. I am not sure whether it is REQUIRED or only optional for those who would like to see visual debugging output. It is not built as part of the regular make install of tesseract and training tools.

"It will stop at 5000 iterations, (in about half an hour)"

I think that probably depends on the hardware used. I did not see any progress for more than one and a half hours; I am not sure whether that was because I did not have Scrollview.jar at that point. I ran it later with 500 iterations.

I think it may be helpful to have just a single iteration as the first step in the tutorial, to make sure that the process is working.

Also, I think the case most people would want for LSTM training is fine-tuning to add a font to the existing traineddata. It would be helpful to have a separate wiki page for it.

It would also be great to know how to add training data based on scanned images for typefaces that are not available as fonts.

I will try to test 'finetuning' the Hindi traineddata for Sanskrit and post here.

soumen100 · 2022-11-22T18:31:32Z

Please help me out:

Config file is optional, continuing...
Failed to read data from: data/langdata/Apex/Apex.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1

soumen100 · 2022-11-22T18:31:55Z

tesseract "data/Apex-ground-truth/eng_64.tif" data/Apex-ground-truth/eng_64 --psm 13 lstm.train

tesseract data/Apex-ground-truth/eng_64.tif data/Apex-ground-truth/eng_64 --psm 13 lstm.train
set -x;
tesseract "data/Apex-ground-truth/eng_44.tif" data/Apex-ground-truth/eng_44 --psm 13 lstm.train
tesseract data/Apex-ground-truth/eng_44.tif data/Apex-ground-truth/eng_44 --psm 13 lstm.train
python3 shuffle.py 0 "data/Apex/all-lstmf"
head -n 90 data/Apex/all-lstmf
tail -n 10 data/Apex/all-lstmf
combine_lang_model
--input_unicharset data/Apex/unicharset
--script_dir data/langdata
--numbers data/Apex/Apex.numbers
--puncs data/Apex/Apex.punc
--words data/Apex/Apex.wordlist
--output_dir data

--lang Apex
Failed to read data from: data/Apex/Apex.wordlist
Failed to read data from: data/Apex/Apex.punc
Failed to read data from: data/Apex/Apex.numbers
Loaded unicharset of size 113 from file data/Apex/unicharset
Setting unichar properties
Other case É of é is not in unicharset
Other case FI of fi is not in unicharset
Setting script properties
Failed to load script unicharset from:data/langdata/Latin.unicharset
Warning: properties incomplete for index 3 = C
Warning: properties incomplete for index 4 = H
Warning: properties incomplete for index 5 = E
Warning: properties incomplete for index 6 = S
Warning: properties incomplete for index 7 = -
Warning: properties incomplete for index 8 = R
Warning: properties incomplete for index 9 = I
Warning: properties incomplete for index 10 = K

Ham714 · 2023-01-10T13:21:20Z

Did you find a solution?

nkrot · 2023-11-27T18:08:01Z

Please help me out:

Config file is optional, continuing...
Failed to read data from: data/langdata/Apex/Apex.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1

Before running make training, please run make tesseract-langdata. It will download the file radical-stroke.txt as well as many SCRIPT.unicharset files into your data/langdata directory.
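As a rough pre-flight check, the presence of those files can be verified before starting a long run. The sketch below is only an illustration: the function name `missing_langdata` is hypothetical, the default directory `data/langdata` and the script name `Latin` are taken from the logs in this thread, and the list of required files covers only the two that those logs complain about.

```python
from pathlib import Path


def missing_langdata(langdata_dir, script="Latin"):
    """Return the names of langdata files that are absent from langdata_dir.

    Only the two files named in the error logs of this thread are checked;
    a real training run may need more (this list is an assumption).
    """
    root = Path(langdata_dir)
    required = ["radical-stroke.txt", f"{script}.unicharset"]
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_langdata("data/langdata")
    if missing:
        print("Missing:", ", ".join(missing))
        print("Try running `make tesseract-langdata` before `make training`.")
    else:
        print("The checked langdata files are present.")
```

An empty result only means that these two files exist; it does not guarantee that the rest of the training setup is correct.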

manuthvann216 · 2024-01-22T05:21:41Z

@nkrot

Hello, thanks for your suggestion. I followed it and now have langdata downloaded into tesstrain, but when I run this command:
make training MODEL_NAME=Apex TESSDATA=tessdata_best MAX_ITERATIONS=10

I got this error

I'd greatly appreciate your help.

nkrot · 2024-03-01T10:17:24Z

@manuthvann216

I cannot give you an easy answer to your question. I am still learning how to train tesseract and I feel like Hercules in the Augean stables.

What is in your directory data/Apex-ground-truth/?

*.lstmf files are generated from *.box and *.tiff (or *.png) files by these lines in the Makefile: https://github.com/tesseract-ocr/tesstrain/blob/main/Makefile#L250-L263

%.lstmf: %.tif %.box
	tesseract "$<" $* --psm $(PSM) lstm.train

For each pair of box and image files there will be one *.lstmf file. PSM is set to 13 (as can be seen above in the Makefile). The file lstm.train is probably this file: https://github.com/tesseract-ocr/tessconfigs/blob/main/configs/lstm.train
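To answer the question about what is in the ground-truth directory, one option is to list every image and check whether its *.box and *.lstmf counterparts exist. This is a minimal sketch, not part of tesstrain: the function name `lstmf_status` and the set of image suffixes are assumptions, and the directory name is the one used in this thread.

```python
from pathlib import Path

# Image extensions mentioned in this thread (assumed to be the relevant ones).
IMAGE_SUFFIXES = (".tif", ".tiff", ".png")


def lstmf_status(ground_truth_dir):
    """Map each image file name to whether its .box and .lstmf files exist."""
    root = Path(ground_truth_dir)
    status = {}
    for image in sorted(root.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip box files, text files, and anything else
        status[image.name] = {
            "box": (root / f"{image.stem}.box").is_file(),
            "lstmf": (root / f"{image.stem}.lstmf").is_file(),
        }
    return status


if __name__ == "__main__":
    directory = Path("data/Apex-ground-truth")
    if directory.is_dir():
        for name, found in lstmf_status(directory).items():
            print(name, "box:", found["box"], "lstmf:", found["lstmf"])
    else:
        print("Directory not found:", directory)
```

An image with a box file but no lstmf file would suggest that the `tesseract ... lstm.train` step failed or has not run for that pair.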

Shreeshrii changed the title ~~LSTM: Tutorial - missing file /langdata/radical-stroke.txt~~ LSTM: Training - missing file /langdata/radical-stroke.txt Dec 7, 2016

theraysmith closed this as completed Dec 7, 2016

Shreeshrii mentioned this issue Jan 9, 2017

LSTM: Training - Eval not run from trainer #644

Closed
