Show names of failing lstmf files in error messages #3251

Merged: 2 commits, Jan 20, 2021

@stweil
Member

stweil commented Jan 20, 2021

No description provided.

@stweil
Member Author

stweil commented Jan 20, 2021

In particular, "Compute CTC targets failed" is reported quite often. Hopefully the enhanced error messages will help users fix the problems in their training data.

@Shreeshrii
Collaborator

This is a welcome addition. It will reduce a lot of user queries too. Thanks @stweil.

@Shreeshrii
Collaborator

Question: Will it also identify a line number for those who use multi-page TIFFs created by text2image via tesstrain.sh?

@egorpugin merged commit e285261 into tesseract-ocr:master on Jan 20, 2021
@stweil
Member Author

stweil commented Jan 20, 2021

> Question: Will it also identify a line number for those who use multi-page TIFFs created by text2image via tesstrain.sh?

A page number for multi-page TIFF files is also available, but it is currently not shown in the error messages. It can easily be added. What should the error message look like? Can we use the same message format for single-line images and multi-page TIFFs, or should it be different?

@Shreeshrii
Collaborator

Currently, during lstmtraining with multi-page TIFFs, debug info is shown as follows:

Iteration 25135: GROUND  TRUTH : ميلاعت تءاج دقو . ةيما ينب ديب تناك شيرق يف ايلعلا ةطلسلا نا ريغ لئابقلا
File /tmp/ara-2021-01-20.hRX/ara.Amiri.exp0.lstmf line 1057 (Perfect):
Mean rms=0.462%, delta=0.428%, train=1.369%(3.172%), skip ratio=0%
Iteration 25136: GROUND  TRUTH : ٢
File /tmp/ara-2021-01-20.hRX/ara.Amiri.exp0.lstmf line 5256 (Perfect):
Mean rms=0.462%, delta=0.428%, train=1.369%(3.172%), skip ratio=0%
Iteration 25137: GROUND  TRUTH : ةفاطللا ةياغ يف امهناف مرفسهاشلاو درولاک ادًج ةريثك ةرخبأ اهنم ينحل يتلا ءايشالا مشو . هطشمو
Iteration 25137: BEST OCR TEXT : ةفاطللا ةياغ يف امهناف مرفسهاشلاو درولاك ادج ةريثك ةرخبأ اهنم ينحل يتلا ءايشالا مشو . هطشمو
File /tmp/ara-2021-01-20.hRX/ara.Amiri.exp0.lstmf line 303 :
Mean rms=0.463%, delta=0.43%, train=1.372%(3.184%), skip ratio=0%
Iteration 25138: GROUND  TRUTH : نسحلا ىدل درو امب دورابلا حلمب مهتفرعم ديأتت نيذلا . يمالسالا ملاعلا - ٢
File /tmp/ara-2021-01-20.hRX/ara.Amiri.exp0.lstmf line 8383 (Perfect)

I think the page number is the same as the line number in the debug output above. A format consistent with that output should be OK.
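To make the format question concrete, here is a minimal sketch of a message builder that reuses the "File ... line ..." wording of the debug output for an error message. The function name `FormatTrainingError`, its parameters, and the exact wording are assumptions for illustration only; this is not the code merged in this pull request and not an existing Tesseract API.

```cpp
#include <string>

// Hypothetical helper (not part of Tesseract): builds an error message that
// names the lstmf file and the page index, following the
// "<file>.lstmf line <n>" wording used in the lstmtraining debug output.
// The same format is used for single-line images and multi-page TIFFs;
// that is a design assumption, not a decision taken in this thread.
std::string FormatTrainingError(const std::string &what,
                                const std::string &lstmf_name,
                                int page_number) {
  return what + " for " + lstmf_name + " line " +
         std::to_string(page_number) + "!";
}
```

Under these assumptions, calling `FormatTrainingError("Compute CTC targets failed", "/tmp/ara.Amiri.exp0.lstmf", 303)` would yield `Compute CTC targets failed for /tmp/ara.Amiri.exp0.lstmf line 303!`. Using one format for both cases would keep the messages easy to search for alongside the existing debug lines.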
