[Question] How to generate tif line images from tif pages / How to train with no specified language? #302

Marco-Parente · 2022-01-31T17:00:13Z

Hello, good morning :)

I'm looking for some general help because I'm a bit lost, so sorry if this is something dumb.

I want to train a Tesseract model to be good at reading Brazilian car license-plate characters, so I used a regex text generator to generate 100,000 lines of characters in a format similar to the one we use here...
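A minimal sketch of such a generator, for illustration only. It assumes the two common Brazilian layouts (three letters plus four digits, and the Mercosul pattern of three letters, a digit, a letter and two digits) and emits only [A-Z] and [0-9]; the function names and the 50/50 mix are hypothetical, not taken from the post.

```python
import random
import string

LETTERS = string.ascii_uppercase
DIGITS = string.digits


def random_plate(rng, mercosul=False):
    """Return one 7-character plate-like string (no hyphen)."""
    # "L" stands for a letter, "N" for a digit.
    pattern = "LLLNLNN" if mercosul else "LLLNNNN"
    return "".join(rng.choice(LETTERS if c == "L" else DIGITS) for c in pattern)


def generate_lines(count, seed=0):
    """Return `count` plate-like lines, mixing both layouts (assumed 50/50)."""
    rng = random.Random(seed)
    return [random_plate(rng, mercosul=rng.random() < 0.5) for _ in range(count)]


if __name__ == "__main__":
    # Print a few samples; write the full list to a training text file as needed.
    for line in generate_lines(5):
        print(line)
```

Seeding the generator keeps the training text reproducible between runs.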

Then I downloaded the lstm files and, inside the eng lstm folder, replaced the content of eng.training_text with the plate-like text I generated, because the previous content had characters and a text format I won't use (I only need the [A-Z] and [0-9] characters).

After that, I ran the following:

python3 src/training/tesstrain.py --fonts_dir /Documents/dev/tesseract-tutorial/fonts --fontlist 'FE-Font' 'Mandatory' --lang eng --linedata_only --langdata_dir /Documents/dev/tesseract-tutorial/langdata_lstm --tessdata_dir /Documents/dev/tesseract-tutorial/tesseract/tessdata --save_box_tiff --maxpages 200 --output_dir train

I passed eng as the language because one is required, but there's no actual language involved, since it's only license-plate characters... right?

After this command, I got tif, lstmf and box files for the fonts I used, but each of them has multiple lines and multiple pages (200).

After looking through the docs for a while, I saw that with the #7 script you can turn png pages into one-line tif images with their respective transcriptions... but I didn't see a way to do that with tif images.

So I wanted to ask the following:

Can I generate the tif files without specifying a language?
Can I take the multi-line / multi-page tif files and turn them into the one-line tifs that are needed for training?
Am I doing this training process the right way, or am I complicating things?

Thanks in advance!


Shreeshrii · 2022-02-05T13:40:43Z

tesstrain.py creates the lstmf files which can be directly used by lstmtraining. However, the tesstrain Makefile does not directly support those.
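As a rough illustration of "directly used by lstmtraining": lstmtraining takes its training data through `--train_listfile`, a text file with one .lstmf path per line. tesstrain.py may already write such a list into the output directory, so the sketch below is only needed if it has to be rebuilt; the helper name and the example paths are hypothetical.

```python
from pathlib import Path


def write_train_listfile(lstmf_dir, listfile):
    """Write one absolute .lstmf path per line; return how many were found."""
    paths = sorted(Path(lstmf_dir).glob("*.lstmf"))
    text = "".join(f"{p.resolve()}\n" for p in paths)
    Path(listfile).write_text(text, encoding="utf-8")
    return len(paths)


if __name__ == "__main__":
    # "train" is the --output_dir from the command in the question.
    count = write_train_listfile("train", "plates.training_files.txt")
    print(f"listed {count} lstmf files")
```

The resulting file is what would be passed as `--train_listfile` when fine-tuning.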

Please see
https://github.com/Shreeshrii/tess5train-fonts/blob/main/license_plate.sh and
https://github.com/Shreeshrii/tess5train-fonts/tree/main/data/BrazilPlates
https://github.com/Shreeshrii/tess5train-fonts/blob/main/data/BrazilPlates/plots/BrazilPlates-6.png

These show the results of a test training run I did by fine-tuning eng.traineddata.

As the plot shows, waiting for training to reach the target error rate leads to overfitting. Best results may be seen by using the traineddata files from the 400-700 checkpoints. You can test with real life images and verify results.
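To try an intermediate checkpoint on real images, it first has to be converted into a traineddata file. A minimal sketch, assuming the usual `lstmtraining --stop_training` conversion; the helper name and all file names below are hypothetical placeholders, not paths from the linked repository.

```python
import shlex


def checkpoint_to_traineddata_cmd(checkpoint, starter_traineddata, output):
    """Build the lstmtraining command that converts a checkpoint to traineddata."""
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", str(checkpoint),
        # The starter traineddata used for training (unicharset, recoder, ...).
        "--traineddata", str(starter_traineddata),
        "--model_output", str(output),
    ]


if __name__ == "__main__":
    cmd = checkpoint_to_traineddata_cmd(
        "checkpoints/plates_example.checkpoint",  # hypothetical checkpoint name
        "plates/plates.traineddata",              # hypothetical starter traineddata
        "plates_example.traineddata",             # hypothetical output name
    )
    # Print the command for review; run it with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Converting several checkpoints this way and scoring each on held-out real plates is one way to pick a model before overfitting sets in.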

Also, as @stweil mentioned recently in a related thread, you can fine-tune with 100+ real-life single-line images of license plates and their ground truth using the tesstrain Makefile.
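For that route, the tesstrain Makefile expects each line image to sit next to a transcription with the same base name and a `.gt.txt` extension (for example `plate1.tif` and `plate1.gt.txt`). A small sketch for checking a folder before training; the helper name and the folder name in the usage line are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".tif", ".png"}


def missing_transcriptions(ground_truth_dir):
    """Return the names of line images that have no matching .gt.txt file."""
    folder = Path(ground_truth_dir)
    images = [p for p in sorted(folder.iterdir()) if p.suffix.lower() in IMAGE_SUFFIXES]
    return [p.name for p in images if not p.with_suffix(".gt.txt").exists()]


if __name__ == "__main__":
    folder = Path("data/plates-ground-truth")  # hypothetical ground-truth folder
    if folder.is_dir():
        for name in missing_transcriptions(folder):
            print(f"no ground truth for {name}")
```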

stale · 2022-04-16T08:47:35Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the "stale" label (Issues which require input by the reporter which is not provided) on Apr 16, 2022

stale bot closed this as completed Apr 27, 2022
