LSTM: Training - Box file format #670

Shreeshrii · 2017-01-21T11:22:37Z

Two different box file formats are mentioned in the Training Tesseract 4.0 wiki.

Please see the attached files and confirm the format (especially the WordStr format). The lstmf files created from the two box/tiff pairs differ in size, even though both are for the same tif file.

frk.embedsiver.exp0.zip

Shreeshrii · 2017-01-21T13:38:23Z

When using the WordStr format in one of the box files,

WordStr 1350 3106 1755 3190 0 #Personer.
WordStr 895 2861 1194 2927 0 #Møller. 
WordStr 895 2742 1528 2811 0 #Emilie, hans Kone.
WordStr 899 2618 1507 2691 0 #Birch, Cancelliraad.
WordStr 894 2497 1546 2567 0 #Laura, hans Datter.
WordStr 895 2377 1317 2447 0 #Fru Krogh.
WordStr 897 2256 1724 2329 0 #Otto Rosen, Fuldmægtig.
WordStr 896 2134 1759 2207 0 #Anders,Tjener hos Birch.
WordStr 898 2015 1669 2085 0 #EnTjener hos Møller.
WordStr 696 1746 2422 1821 0 #Handlingen foregaaer i Kjøbenhavn, i Slutningen af 1848.

I get an error (Utf8 buffer too big) during processing, and the unicharset is not built fully: extraction stops at that line and the other box files are not processed, although the run itself does not abort.

=== Phase UP: Generating unicharset and unichar properties files ===
[Sat Jan 21 18:53:04 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.hxOIFoYXPH/frk/ /tmp/tmp.hxOIFoYXPH/frk/frk.embedsiver.exp0.box /tmp/tmp.hxOIFoYXPH/frk/frk.embedsiverline.exp0.box /tmp/tmp.hxOIFoYXPH/frk/frk.UnifrakturMaguntia.exp0.box /tmp/tmp.hxOIFoYXPH/frk/frk.Walbaum-Fraktur.exp0.box
**Utf8 buffer too big, size=57 for Handlingen foregaaer i Kjøbenhavn, i Slutningen af 1848.**
Extracting unicharset from /tmp/tmp.hxOIFoYXPH/frk/frk.embedsiver.exp0.box
**Extracting unicharset from /tmp/tmp.hxOIFoYXPH/frk/frk.embedsiverline.exp0.box**
Extracting unicharset from /tmp/tmp.hxOIFoYXPH/frk/frk.UnifrakturMaguntia.exp0.box
Extracting unicharset from /tmp/tmp.hxOIFoYXPH/frk/frk.Walbaum-Fraktur.exp0.box
Wrote unicharset file /tmp/tmp.hxOIFoYXPH/frk//unicharset.
[Sat Jan 21 18:53:04 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.hxOIFoYXPH/frk/frk.unicharset -O /tmp/tmp.hxOIFoYXPH/frk/frk.unicharset -X /tmp/tmp.hxOIFoYXPH/frk/frk.xheights --script_dir=../langdata
Loaded unicharset of **size 48** from file /tmp/tmp.hxOIFoYXPH/frk/frk.unicharset

If I leave this box file out, the unicharset is built from all of the remaining box files:

=== Phase UP: Generating unicharset and unichar properties files ===
[Sat Jan 21 18:58:01 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.wyo1280N2G/frk/ /tmp/tmp.wyo1280N2G/frk/frk.embedsiver.exp0.box /tmp/tmp.wyo1280N2G/frk/frk.UnifrakturMaguntia.exp0.box /tmp/tmp.wyo1280N2G/frk/frk.Walbaum-Fraktur.exp0.box
Extracting unicharset from /tmp/tmp.wyo1280N2G/frk/frk.embedsiver.exp0.box
Extracting unicharset from /tmp/tmp.wyo1280N2G/frk/frk.UnifrakturMaguntia.exp0.box
Extracting unicharset from /tmp/tmp.wyo1280N2G/frk/frk.Walbaum-Fraktur.exp0.box
Wrote unicharset file /tmp/tmp.wyo1280N2G/frk//unicharset.
[Sat Jan 21 18:58:02 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.wyo1280N2G/frk/frk.unicharset -O /tmp/tmp.wyo1280N2G/frk/frk.unicharset -X /tmp/tmp.wyo1280N2G/frk/frk.xheights --script_dir=../langdata
Loaded unicharset of **size 143** from file /tmp/tmp.wyo1280N2G/frk/frk.unicharset
Setting unichar properties
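
For reference, the size=57 reported in the first log matches the UTF-8 byte length of the text after # on the last WordStr line, which suggests the whole line text was passed along as a single unichar. A minimal Python sketch to check this follows. Note that parse_wordstr_line is a hypothetical helper written for this note, not part of Tesseract, and the field order (left, bottom, right, top, page) is assumed from the box file lines quoted above.

```python
def parse_wordstr_line(line):
    """Parse 'WordStr <left> <bottom> <right> <top> <page> #<text>'.

    Hypothetical helper, not a Tesseract API. The field order is an
    assumption based on the box file lines quoted in this thread.
    """
    head, sep, text = line.partition("#")
    fields = head.split()
    if not sep or len(fields) != 6 or fields[0] != "WordStr":
        raise ValueError("not a WordStr line: %r" % line)
    left, bottom, right, top, page = (int(f) for f in fields[1:])
    return (left, bottom, right, top), page, text.strip()


line = ("WordStr 696 1746 2422 1821 0 "
        "#Handlingen foregaaer i Kjøbenhavn, i Slutningen af 1848.")
box, page, text = parse_wordstr_line(line)
n_chars = len(text)                   # number of Unicode code points
n_bytes = len(text.encode("utf-8"))   # "ø" takes two bytes in UTF-8
print(box, page, n_chars, n_bytes)
```

The text has 56 characters and 57 bytes, the same size the error message reports for this line.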

amitdo · 2017-01-21T17:29:06Z

tesseract/ccmain/applybox.cpp

Line 71 in a75ab45

* multi word line -> #m u l t i w o r d l i n e
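
A rough Python sketch of the transformation shown in that source comment, reproducing only the quoted example. It assumes whitespace is dropped and the remaining code points are joined with single spaces; whether word boundaries should be kept some other way is not clear from the quoted line. The name space_delimit is made up for illustration.

```python
def space_delimit(text):
    """Drop whitespace and join the remaining code points with single spaces.

    Assumption: this mirrors only the example quoted from applybox.cpp.
    It splits on code points, not grapheme clusters.
    """
    return " ".join(ch for ch in text if not ch.isspace())


delimited = "#" + space_delimit("multi word line")
print(delimited)
```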

Shreeshrii · 2017-01-22T11:48:15Z

@amitdo Thanks for pointing out that the string needs to be space-delimited. I tried that version as well, and it also produces an error.

Ref: https://github.com/amitdo/tesseract/issues/3#issuecomment-274262671

amitdo · 2017-01-23T11:01:49Z

I updated the relevant wiki section.

unicharset_extractor needs some more code to read the text-line-based (WordStr) box file format correctly.

theraysmith · 2017-01-23T20:01:57Z

@amitdo is correct. unicharset_extractor doesn't read the WordStr box file format.
Sorry, this is an untested path.
Furthermore, it isn't just a case of modifying unicharset_extractor.
For the Indic languages, the unicharset needs to know the syllable/grapheme clusters, and it can't get that from the WordStr box file format. The best it can do is extract the Unicode characters used in the WordStr box file; otherwise you start with an existing unicharset for that training path.
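
To illustrate the "extract the Unicode characters used" fallback, here is a minimal Python sketch. It is not how unicharset_extractor works; codepoints_in_wordstr is a hypothetical helper, and the sample lines are taken from the box file quoted earlier in this thread. Because it iterates over code points, it cannot recover syllable/grapheme clusters, which is exactly the limitation described above.

```python
def codepoints_in_wordstr(lines):
    """Collect the distinct non-space code points from WordStr box lines.

    Hypothetical helper: it yields single code points only, so combining
    marks in Indic scripts would come out as separate entries.
    """
    seen = set()
    for line in lines:
        if not line.startswith("WordStr "):
            continue  # ignore any non-WordStr lines
        _, sep, text = line.partition("#")
        if sep:
            seen.update(ch for ch in text.strip() if not ch.isspace())
    return sorted(seen)


sample = [
    "WordStr 895 2861 1194 2927 0 #Møller.",
    "WordStr 895 2377 1317 2447 0 #Fru Krogh.",
]
chars = codepoints_in_wordstr(sample)
print(len(chars), "".join(chars))
```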

Shreeshrii · 2017-05-11T11:21:21Z

Related - #832

Shreeshrii · 2017-08-11T03:51:07Z

@theraysmith Please also see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/Xu4_aOCFhlQ/Yb2G59zTAgAJ about "Training Tesseract4.0 (LSTM) on word level bounding boxes".

Will this be addressed when you update the unicharset_extractor?

I am wondering whether there is a way to use the text2image algorithm to create box files given image and ground truth files.

amitdo · 2017-08-11T05:28:42Z

Ray, please consider a new box format with a new name, '<...>-linebox', for training the LSTM engine. For an example, see here:
#832 (comment)

ghost · 2018-06-11T14:45:29Z

@Shreeshrii @amitdo any updates regarding this matter?

Chandra-cc · 2018-06-14T05:41:23Z

Are there differences between the box file formats of Tesseract 3 and Tesseract 4? Or can we use Tesseract 3 box/tiff pairs to train Tesseract 4, and can we use the Tesseract 4 starter trained data generated with the tesstrain.sh command?

Shreeshrii · 2019-02-25T12:33:52Z

WordStr format done via 15f2a4b

lstmbox format done via 2ae65b2

Shreeshrii mentioned this issue Mar 3, 2017

LSTM: Training - Compute CTC targets failed! #591

Closed

amitdo mentioned this issue May 2, 2017

LSTM : Training - Support WordStr Box file option #832

Closed

Shreeshrii closed this as completed May 11, 2017

Shreeshrii reopened this Aug 11, 2017

Shreeshrii mentioned this issue Feb 10, 2019

Add renderer to create WordStr box files from images #2231

Merged

Shreeshrii closed this as completed Feb 25, 2019

amitdo added the feature request label Mar 21, 2021
