LSTM: Training - Box file format #670
Comments
When I use the WordStr format in one of the box files, I get an error (utf8 buffer too big) during processing, and the unicharset is not fully built: processing stops at that line and the other box files are not processed, although the program itself does not stop.
If I leave this box file out, the unicharset is built from all of the other box files.
Line 71 in a75ab45
@amitdo Thanks for pointing out that the string needs to be space delimited. I tried that version as well, and it also produces an error. Ref: https://github.com/amitdo/tesseract/issues/3#issuecomment-274262671
I updated the relevant wiki section.
@amitdo is correct. unicharset_extractor doesn't read the WordStr box file format.
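As a possible stopgap while unicharset_extractor cannot read WordStr lines, the set of symbols could be collected outside of Tesseract. The sketch below is only an illustration, not Tesseract's implementation. It assumes the two layouts described in the training wiki, a per-symbol line `<symbol> <left> <bottom> <right> <top> <page>` and a line of the form `WordStr <left> <bottom> <right> <top> <page> #<text>`. The function name `chars_from_box_lines` is hypothetical, and splitting the WordStr text into single code points is a simplification that may not suit scripts with combining characters.

```python
from typing import Iterable, Set


def chars_from_box_lines(lines: Iterable[str]) -> Set[str]:
    """Collect the distinct symbols named by box-file lines.

    Assumed layouts (taken from the wiki description, not from the
    Tesseract source code):
      per-symbol:  <symbol> <left> <bottom> <right> <top> <page>
      WordStr:     WordStr <left> <bottom> <right> <top> <page> #<text>
    """
    symbols: Set[str] = set()
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("WordStr "):
            # Everything after the first '#' is the transcription of the line.
            _, sep, text = line.partition("#")
            if sep:
                # Simplification: one symbol per non-space code point.
                symbols.update(ch for ch in text if not ch.isspace())
            continue
        # Per-symbol line: the last five fields are numbers, the rest is the symbol.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # not a recognisable box line
        symbol = parts[0].strip()
        if symbol:  # tab or space markers strip to nothing and are skipped
            symbols.add(symbol)
    return symbols


if __name__ == "__main__":
    sample = [
        "WordStr 0 0 640 48 0 #ab c",
        "\t 641 0 645 48 0",
        "x 1 2 3 4 0",
    ]
    print(sorted(chars_from_box_lines(sample)))
```

The result is only a plain set of symbols. Turning it into a real unicharset file would still need Tesseract's own tools.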
Related - #832
@theraysmith Please also see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/Xu4_aOCFhlQ/Yb2G59zTAgAJ about Will this be addressed when you update the unicharset_extractor? I am wondering whether there is a way to use the
Ray, please consider a new box format with a new name, '<...>-linebox', for training the LSTM engine. For example, see here:
@Shreeshrii @amitdo any updates regarding this matter?
Are there differences between the box file formats of Tesseract 3 and Tesseract 4? Or can we use box/tiff pairs from Tesseract 3 to train Tesseract 4, and can we use the starter traineddata of Tesseract 4 generated with the tesstrain.sh command?
@theraysmith
Two different types of box file format are mentioned in the Training Tesseract 4.0 wiki.
Please see the attached files and confirm the format (especially for the WordStr format). The lstmf files created from the two box/tiff pairs differ in size, even though they are for the same tif file.
frk.embedsiver.exp0.zip