LSTM: Training - missing file /langdata/radical-stroke.txt #542

Closed
Shreeshrii opened this issue Dec 7, 2016 · 10 comments

Comments

@Shreeshrii
Collaborator

Shreeshrii commented Dec 7, 2016

Other case É of é is not in unicharset
Setting unichar properties
Setting properties for script Common
Setting properties for script Latin
Failed to load radical-stroke info from: ../langdata/radical-stroke.txt
Warning: given outputs 105 not equal to unicharset of 113.

@theraysmith I am trying to run the commands given in the training tutorial.

  1. The above messages are from basetrain.log.
    Does the langdata repo need to be updated for 4.0 alpha?

@Shreeshrii Shreeshrii changed the title LSTM: Tutorial - missing file /langdata/radical-stroke.txt LSTM: Training - missing file /langdata/radical-stroke.txt Dec 7, 2016
@theraysmith
Contributor

Fixed in tesseract-ocr/langdata@3299c60.
I'm retesting now. It seems the tutorial works without it, so I imagine the accuracy numbers in the tutorial will come out different.

@Shreeshrii
Collaborator Author

Shreeshrii commented Dec 7, 2016 via email

@theraysmith
Contributor

I've just updated the numbers in the training tutorial.
It seems to work apart from one thing that needs looking at - it doesn't run the eval from the trainer.
It doesn't harm the tutorial, but will be required before people start serious training.

@Shreeshrii
Collaborator Author

Shreeshrii commented Dec 8, 2016

Ray,

Please add the minimum requirements for LSTM training in terms of hardware, software, etc.

I realized after running the process that I needed to build Scrollview.jar. I am not sure whether it is REQUIRED or only optional for those who would like to see visual debugging output. It is not built as part of the regular make install of tesseract and training tools.

The tutorial says: "It will stop at 5000 iterations, (in about half an hour)"

I think that probably depends on the hardware used. I saw no progress for more than an hour and a half; I am not sure whether that was because I did not have Scrollview.jar at that point. I ran it again later with 500 iterations.

I think it may be helpful to have just a single iteration as the first step in the tutorial, to make sure that the process is working.

Also, I think the most common use case for LSTM training will be fine-tuning to add a font to the existing traineddata. It would be helpful to have a separate wiki page for it.

It would also be great to know how to add training data based on scanned images for typefaces that are not available as fonts.

I will try to test 'finetuning' the Hindi traineddata for Sanskrit and post here.

@soumen100

Please help me out:

Config file is optional, continuing...
Failed to read data from: data/langdata/Apex/Apex.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1

@soumen100

tesseract "data/Apex-ground-truth/eng_64.tif" data/Apex-ground-truth/eng_64 --psm 13 lstm.train
+ tesseract data/Apex-ground-truth/eng_64.tif data/Apex-ground-truth/eng_64 --psm 13 lstm.train
set -x;
tesseract "data/Apex-ground-truth/eng_44.tif" data/Apex-ground-truth/eng_44 --psm 13 lstm.train
+ tesseract data/Apex-ground-truth/eng_44.tif data/Apex-ground-truth/eng_44 --psm 13 lstm.train
python3 shuffle.py 0 "data/Apex/all-lstmf"
+ head -n 90 data/Apex/all-lstmf
+ tail -n 10 data/Apex/all-lstmf
combine_lang_model
  --input_unicharset data/Apex/unicharset
  --script_dir data/langdata
  --numbers data/Apex/Apex.numbers
  --puncs data/Apex/Apex.punc
  --words data/Apex/Apex.wordlist
  --output_dir data
  --lang Apex
Failed to read data from: data/Apex/Apex.wordlist
Failed to read data from: data/Apex/Apex.punc
Failed to read data from: data/Apex/Apex.numbers
Loaded unicharset of size 113 from file data/Apex/unicharset
Setting unichar properties
Other case É of é is not in unicharset
Other case FI of fi is not in unicharset
Setting script properties
Failed to load script unicharset from:data/langdata/Latin.unicharset
Warning: properties incomplete for index 3 = C
Warning: properties incomplete for index 4 = H
Warning: properties incomplete for index 5 = E
Warning: properties incomplete for index 6 = S
Warning: properties incomplete for index 7 = -
Warning: properties incomplete for index 8 = R
Warning: properties incomplete for index 9 = I
Warning: properties incomplete for index 10 = K
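
For a long log like this one, it can help to list only the files that could not be read. The following is a minimal sketch, not part of tesseract or tesstrain: the function name `failed_paths` is made up for illustration, and the patterns only cover the "Failed to read data from:" and "Failed to load ... from:" messages that appear in this thread.

```shell
#!/bin/sh
# Print the unique paths that a training log reports as unreadable.
# Assumption: the log uses the message wording seen in this thread.
failed_paths() {
    sed -n \
        -e 's/^[[:space:]]*Failed to read data from:[[:space:]]*//p' \
        -e 's/^[[:space:]]*Failed to load .* from:[[:space:]]*//p' \
        "$1" | LC_ALL=C sort -u
}

# Example use (train.log is a hypothetical file holding the output above):
#   failed_paths train.log
```

The resulting list shows at a glance which paths under data/langdata and data/Apex the tools tried to open.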

@Ham714

Ham714 commented Jan 10, 2023

Did you find the solution?

@nkrot

nkrot commented Nov 27, 2023

Please help me out:

Config file is optional, continuing...
Failed to read data from: data/langdata/Apex/Apex.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1

Before running make training, please run make tesseract-langdata. It will download the file radical-stroke.txt as well as many SCRIPT.unicharset files into your data/langdata directory.
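
That order of operations could be sketched as below. This is only an illustration: `check_langdata` is a hypothetical helper, and it checks just the two files that the logs in this thread complain about, so other models may need more. The `make` targets in the comments are the ones named in this thread and are not executed by the snippet.

```shell
#!/bin/sh
# Report which of the langdata files named in the logs above are missing
# or empty. Returns 0 when both are present, 1 otherwise.
check_langdata() {
    dir="$1"
    missing=0
    for f in radical-stroke.txt Latin.unicharset; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Typical use inside a tesstrain checkout (assumed layout: data/langdata):
#   check_langdata data/langdata || make tesseract-langdata
#   make training MODEL_NAME=Apex
```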

@manuthvann216

@nkrot

Hello, thanks for your suggestion. I followed it and now have langdata downloaded into tesstrain, but when I run this command:
make training MODEL_NAME=Apex TESSDATA=tessdata_best MAX_ITERATIONS=10

I get this error:
image

I'd greatly appreciate your help.

@nkrot

nkrot commented Mar 1, 2024

@manuthvann216

I cannot give you an easy answer to your question. I am still learning how to train tesseract and I feel like Hercules in the Augean stables.

What is in your directory data/Apex-ground-truth/?

*.lstmf files are generated from *.box and *.tif (or *.png) files by these lines in the Makefile: https://github.com/tesseract-ocr/tesstrain/blob/main/Makefile#L250-L263

%.lstmf: %.tif %.box
	tesseract "$<" $* --psm $(PSM) lstm.train

For each pair of box and image files there will be one *.lstmf file. PSM is set to 13 (as can be seen above in the Makefile). The file lstm.train is probably this file: https://github.com/tesseract-ocr/tessconfigs/blob/main/configs/lstm.train
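
To answer the question about the ground-truth directory, a small check along these lines may help. It is a sketch under assumptions: `report_pairs` is a hypothetical helper, it only looks at *.tif images, and it assumes the box and lstmf files share the image's base name as in the rule above.

```shell
#!/bin/sh
# For every *.tif in a directory, report a missing *.box or *.lstmf file
# with the same base name. Prints nothing when every image has both.
report_pairs() {
    dir="$1"
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue          # the directory has no *.tif files
        base="${img%.tif}"
        [ -f "$base.box" ]   || echo "no box:   $base.box"
        [ -f "$base.lstmf" ] || echo "no lstmf: $base.lstmf"
    done
}

# Example use (path taken from the commands in this thread):
#   report_pairs data/Apex-ground-truth
```

An image reported with "no box" cannot satisfy the `%.lstmf: %.tif %.box` rule, so no *.lstmf file would be produced for it.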
