[Question] How to generate tif line images from tif pages / How to train with no specified language? #302

Closed
Marco-Parente opened this issue Jan 31, 2022 · 2 comments
Labels
stale Issues which require input by the reporter which is not provided

Comments

@Marco-Parente

Hello, good morning :)

I'm looking for some general help because I'm a bit lost, so sorry if this is something dumb.

I want to train the Tesseract model to be good with Brazilian car license-plate characters, so I used a regex text generator to generate 100,000 lines of characters in a format similar to what we use here...
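For readers who want to reproduce this step, here is a minimal sketch of such a generator. The post does not show the exact pattern used, so the two layouts below (three letters plus four digits, and the Mercosur-style three letters, digit, letter, two digits) are assumptions, as are the function names `random_plate` and `write_training_text`.

```python
import random
import string


def random_plate(rng, mercosur_ratio=0.5):
    """Return one plate-like string using only A-Z and 0-9.

    Assumed layouts (not taken from the original post):
      older style:    LLLNNNN
      Mercosur style: LLLNLNN
    """
    letters = string.ascii_uppercase
    digits = string.digits
    head = "".join(rng.choice(letters) for _ in range(3))
    if rng.random() < mercosur_ratio:
        tail = rng.choice(digits) + rng.choice(letters)
        tail += "".join(rng.choice(digits) for _ in range(2))
    else:
        tail = "".join(rng.choice(digits) for _ in range(4))
    return head + tail


def write_training_text(path, count, seed=0):
    """Write `count` plate-like lines to `path`, one per line."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    with open(path, "w", encoding="ascii") as handle:
        for _ in range(count):
            handle.write(random_plate(rng) + "\n")
```

Calling `write_training_text("eng.training_text", 100000)` would produce a file of the size described above.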

Then I downloaded the lstm files and, inside the eng lstm folder, replaced the content of eng.training_text with the plate-like text I generated, because the previous content had characters and text formats I won't use (I only need the [A-Z] and [0-9] characters).

After that, I used the following:

python3 src/training/tesstrain.py --fonts_dir /Documents/dev/tesseract-tutorial/fonts --fontlist 'FE-Font' 'Mandatory' --lang eng --linedata_only --langdata_dir /Documents/dev/tesseract-tutorial/langdata_lstm --tessdata_dir /Documents/dev/tesseract-tutorial/tesseract/tessdata --save_box_tiff --maxpages 200 --output_dir train

I used the eng language because a language is required, but there is no specific language here, since it's only license-plate characters... right?

After this command, I got tif, lstmf and box files for the fonts I used, but they have multiple lines and multiple pages (200).

After looking through the docs for a while, I saw that with the #7 script you can transform png pages into one-line tif images with their respective transcriptions... but I didn't see a way to do that with tif images.

So I wanted to ask the following:

  • Can I generate the tif files without specifying a language?
  • Can I take the multi-line / multi-page tif files and transform them into the one-line tif files that are needed for training?
  • Am I doing this training process the right way or am I complicating things?
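
On the second question, one possible approach is to use the box files written next to each tif. The sketch below assumes the LSTM box format described in the Tesseract training docs: one `<symbol> <left> <bottom> <right> <top> <page>` entry per line, with an entry whose symbol is a tab marking the end of a text line. The helper names `group_box_lines` and `to_pil_crop` are hypothetical, and whether the files from this particular run follow that layout is not confirmed by the thread.

```python
def group_box_lines(box_text):
    """Group box entries into text lines.

    Returns a list of dicts with the page index, the line text and the
    union bounding box (left, bottom, right, top) in box-file coordinates.
    """
    lines = []
    chars, boxes, page = [], [], None

    def flush():
        if chars:
            lefts, bottoms, rights, tops = zip(*boxes)
            lines.append({
                "page": page,
                "text": "".join(chars),
                "bbox": (min(lefts), min(bottoms), max(rights), max(tops)),
            })
        chars.clear()
        boxes.clear()

    for raw in box_text.splitlines():
        # Split from the right so a space or tab symbol survives intact.
        parts = raw.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed entries
        symbol = parts[0]
        left, bottom, right, top, entry_page = map(int, parts[1:])
        if symbol == "\t":  # assumed end-of-line marker
            flush()
            continue
        if chars and entry_page != page:
            flush()  # a new page also ends the current line
        page = entry_page
        chars.append(symbol)
        boxes.append((left, bottom, right, top))
    flush()
    return lines


def to_pil_crop(bbox, image_height):
    """Convert a box-file bbox (origin bottom-left) to (left, upper, right, lower)."""
    left, bottom, right, top = bbox
    return (left, image_height - top, right, image_height - bottom)
```

Each returned entry could then be cropped out of the matching tif page with an imaging library such as Pillow and saved together with its text as the transcription. That last step is left out here and would need checking against real output.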

Thanks in advance!

@Shreeshrii
Collaborator

tesstrain.py creates the lstmf files which can be directly used by lstmtraining. However, the tesstrain Makefile does not directly support those.
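
As a rough illustration of using the lstmf files directly: lstmtraining reads its training data from a list file passed with `--train_listfile`, containing one lstmf path per line. tesstrain.py normally writes such a list itself, so the sketch below is only needed if that list has to be rebuilt. The helper name `write_lstmf_list` and the directory layout are assumptions.

```python
from pathlib import Path


def write_lstmf_list(lstmf_dir, list_path):
    """Write the absolute path of every .lstmf file in `lstmf_dir`, one per line.

    Returns the list of paths that were written.
    """
    paths = sorted(str(p.resolve()) for p in Path(lstmf_dir).glob("*.lstmf"))
    Path(list_path).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths
```

For the command in the question, `write_lstmf_list("train", "train/plates.training_files.txt")` would cover the files in the `train` output directory; the list file name here is made up.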

Please see
https://github.com/Shreeshrii/tess5train-fonts/blob/main/license_plate.sh and
https://github.com/Shreeshrii/tess5train-fonts/tree/main/data/BrazilPlates
https://github.com/Shreeshrii/tess5train-fonts/blob/main/data/BrazilPlates/plots/BrazilPlates-6.png

These show the results of a test training run I did by fine-tuning eng.traineddata.

As the plot shows, waiting for training to reach the target error rate leads to overfitting. Best results may be seen by using the traineddata files from the 400-700 checkpoints. You can test with real life images and verify results.
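
One way to make that choice less manual is to pick the checkpoint with the lowest error on held-out data instead of the last one. The sketch below assumes the error values have already been collected as `(iteration, train_error, eval_error)` tuples; how to extract them from the training log is not shown, and `best_iteration` is a hypothetical helper.

```python
def best_iteration(history):
    """Return the iteration whose evaluation error is lowest.

    history: iterable of (iteration, train_error, eval_error) tuples.
    Training error alone is not used, because it keeps falling even
    when the model starts to overfit.
    """
    rows = list(history)
    if not rows:
        raise ValueError("history is empty")
    return min(rows, key=lambda row: row[2])[0]
```

The result is only as good as the evaluation set, so testing the chosen traineddata on real plate images is still needed.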

Also, as @stweil mentioned recently in a related thread, you can fine-tune with 100+ real-life single-line images of license plates and their ground truth using the tesstrain Makefile.
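
Before running the Makefile on such images, it can help to verify that every line image has its transcription. The sketch below assumes the tesstrain convention of a ground-truth folder in which each `name.tif` is paired with a single-line `name.gt.txt`. The helper name `unpaired_line_images` is hypothetical.

```python
from pathlib import Path


def unpaired_line_images(ground_truth_dir):
    """Return names of .tif files lacking a single-line .gt.txt transcription."""
    problems = []
    for image in sorted(Path(ground_truth_dir).glob("*.tif")):
        gt = image.with_name(image.stem + ".gt.txt")
        if not gt.is_file():
            problems.append(image.name)
            continue
        # A transcription must be exactly one line of text.
        if len(gt.read_text(encoding="utf-8").splitlines()) != 1:
            problems.append(image.name)
    return problems
```

An empty result means every tif in the folder has a usable transcription.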

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issues which require input by the reporter which is not provided label Apr 16, 2022
@stale stale bot closed this as completed Apr 27, 2022