LSTM: Training - Box file format #670

Closed
Shreeshrii opened this issue Jan 21, 2017 · 11 comments

Comments

@Shreeshrii
Collaborator

@theraysmith

Two different types of box file formats are mentioned in the Training Tesseract 4.0 wiki.

Please see the attached files and confirm the format (especially for the WordStr format). The lstmf files created from the two box/tiff pairs differ in size, even though they are for the same tif file.

frk.embedsiver.exp0.zip

@Shreeshrii
Collaborator Author

When using the WordStr format in one of the box files,

WordStr 1350 3106 1755 3190 0 #Personer.
WordStr 895 2861 1194 2927 0 #Møller. 
WordStr 895 2742 1528 2811 0 #Emilie, hans Kone.
WordStr 899 2618 1507 2691 0 #Birch, Cancelliraad.
WordStr 894 2497 1546 2567 0 #Laura, hans Datter.
WordStr 895 2377 1317 2447 0 #Fru Krogh.
WordStr 897 2256 1724 2329 0 #Otto Rosen, Fuldmægtig.
WordStr 896 2134 1759 2207 0 #Anders,Tjener hos Birch.
WordStr 898 2015 1669 2085 0 #EnTjener hos Møller.
WordStr 696 1746 2422 1821 0 #Handlingen foregaaer i Kjøbenhavn, i Slutningen af 1848.
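For readers unfamiliar with the layout, here is a minimal sketch of how one of the lines above can be split into its fields. It assumes the usual Tesseract box field order (left, bottom, right, top, page) followed by `#` and the line transcription; `WordStrBox` and `parse_wordstr_line` are hypothetical names for illustration, not part of Tesseract.

```python
from typing import NamedTuple


class WordStrBox(NamedTuple):
    """One text line from a WordStr box file (hypothetical helper type)."""
    left: int
    bottom: int
    right: int
    top: int
    page: int
    text: str


def parse_wordstr_line(line: str) -> WordStrBox:
    """Parse 'WordStr <left> <bottom> <right> <top> <page> #<text>'."""
    # Split on the first six spaces only, so the transcription keeps its own spaces.
    parts = line.rstrip().split(" ", 6)
    if len(parts) != 7 or parts[0] != "WordStr" or not parts[6].startswith("#"):
        raise ValueError(f"not a WordStr line: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:6])
    # Drop the leading '#' that marks the start of the transcription.
    return WordStrBox(left, bottom, right, top, page, parts[6][1:])


box = parse_wordstr_line("WordStr 895 2742 1528 2811 0 #Emilie, hans Kone.")
print(box.text)  # Emilie, hans Kone.
```

The sketch only handles `WordStr` lines; any other line type in the box file would need separate handling.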

I get an error (Utf8 buffer too big) during processing, and the unicharset is not fully built: extraction appears to stop at that line and the remaining box files do not contribute, although the run itself does not abort.

=== Phase UP: Generating unicharset and unichar properties files ===
[Sat Jan 21 18:53:04 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.hxOIFoYXPH/frk/ /tmp/tmp.hxOIFoYXPH/frk/frk.embedsiver.exp0.box /tmp/tmp.hxOIFoYXPH/frk/frk.embedsiverline.exp0.box /tmp/tmp.hxOIFoYXPH/frk/frk.UnifrakturMaguntia.exp0.box /tmp/tmp.hxOIFoYXPH/frk/frk.Walbaum-Fraktur.exp0.box
**Utf8 buffer too big, size=57 for Handlingen foregaaer i Kjøbenhavn, i Slutningen af 1848.**
Extracting unicharset from /tmp/tmp.hxOIFoYXPH/frk/frk.embedsiver.exp0.box
**Extracting unicharset from /tmp/tmp.hxOIFoYXPH/frk/frk.embedsiverline.exp0.box**
Extracting unicharset from /tmp/tmp.hxOIFoYXPH/frk/frk.UnifrakturMaguntia.exp0.box
Extracting unicharset from /tmp/tmp.hxOIFoYXPH/frk/frk.Walbaum-Fraktur.exp0.box
Wrote unicharset file /tmp/tmp.hxOIFoYXPH/frk//unicharset.
[Sat Jan 21 18:53:04 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.hxOIFoYXPH/frk/frk.unicharset -O /tmp/tmp.hxOIFoYXPH/frk/frk.unicharset -X /tmp/tmp.hxOIFoYXPH/frk/frk.xheights --script_dir=../langdata
Loaded unicharset of **size 48** from file /tmp/tmp.hxOIFoYXPH/frk/frk.unicharset
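
The `size=57` in the error above matches the UTF-8 byte length of the whole transcription (56 characters, with `ø` taking two bytes). That suggests the entire line was handled as a single unichar string. This is an inference from the log, not something confirmed against the Tesseract source. A quick check of the arithmetic:

```python
# Transcription from the last WordStr line, without the leading '#'.
# The escape \u00f8 is the precomposed letter 'ø'.
text = "Handlingen foregaaer i Kj\u00f8benhavn, i Slutningen af 1848."

n_chars = len(text)                  # number of Unicode code points
n_bytes = len(text.encode("utf-8"))  # number of bytes once UTF-8 encoded

print(n_chars, n_bytes)  # 56 57
```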

If I do not use this box file, the unicharset is built from all of the remaining box files:

=== Phase UP: Generating unicharset and unichar properties files ===
[Sat Jan 21 18:58:01 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.wyo1280N2G/frk/ /tmp/tmp.wyo1280N2G/frk/frk.embedsiver.exp0.box /tmp/tmp.wyo1280N2G/frk/frk.UnifrakturMaguntia.exp0.box /tmp/tmp.wyo1280N2G/frk/frk.Walbaum-Fraktur.exp0.box
Extracting unicharset from /tmp/tmp.wyo1280N2G/frk/frk.embedsiver.exp0.box
Extracting unicharset from /tmp/tmp.wyo1280N2G/frk/frk.UnifrakturMaguntia.exp0.box
Extracting unicharset from /tmp/tmp.wyo1280N2G/frk/frk.Walbaum-Fraktur.exp0.box
Wrote unicharset file /tmp/tmp.wyo1280N2G/frk//unicharset.
[Sat Jan 21 18:58:02 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.wyo1280N2G/frk/frk.unicharset -O /tmp/tmp.wyo1280N2G/frk/frk.unicharset -X /tmp/tmp.wyo1280N2G/frk/frk.xheights --script_dir=../langdata
Loaded unicharset of **size 143** from file /tmp/tmp.wyo1280N2G/frk/frk.unicharset
Setting unichar properties

@amitdo
Collaborator

amitdo commented Jan 21, 2017

* multi word line -> #m u l t i w o r d l i n e

@Shreeshrii
Collaborator Author

@amitdo Thanks for pointing out that the string needs to be space-delimited. I tried that version as well, and it also produces an error.

Ref: https://github.com/amitdo/tesseract/issues/3#issuecomment-274262671

@amitdo
Collaborator

amitdo commented Jan 23, 2017

I updated the relevant wiki section.

unicharset_extractor needs some more code to read the (WordStr) textline-based box file format correctly.

@theraysmith
Contributor

@amitdo is correct. unicharset_extractor doesn't read the WordStr box file format.
Sorry, this is an untested path.
Furthermore, it isn't just a case of modifying unicharset_extractor.
For the Indic languages, the unicharset needs to know the syllable/grapheme clusters, and it can't get that from the WordStr box file format. The best it can do is extract the Unicode code points used in the WordStr box file; otherwise, you start with an existing unicharset for that training path.
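
As a rough illustration of that fallback, the sketch below collects the distinct code points that appear in the transcriptions of a WordStr box file. `wordstr_characters` is a hypothetical helper, not a Tesseract tool, and it deliberately works on single code points only, so it does not recover the syllable/grapheme clusters that Indic scripts need.

```python
def wordstr_characters(box_lines):
    """Return the sorted distinct non-space code points in WordStr transcriptions."""
    chars = set()
    for line in box_lines:
        if not line.startswith("WordStr "):
            continue  # skip any line that is not a WordStr entry
        # The coordinates are numeric, so the first ' #' marks the transcription.
        _, sep, text = line.rstrip().partition(" #")
        if sep:
            chars.update(ch for ch in text if not ch.isspace())
    return sorted(chars)


lines = [
    "WordStr 1350 3106 1755 3190 0 #Personer.",
    "WordStr 895 2861 1194 2927 0 #M\u00f8ller. ",
]
print(wordstr_characters(lines))
```

The result is only a list of characters; unichar properties would still have to be set in a separate step.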

@Shreeshrii
Collaborator Author

Shreeshrii commented May 11, 2017

Related - #832

@Shreeshrii
Collaborator Author

@theraysmith Please also see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/Xu4_aOCFhlQ/Yb2G59zTAgAJ about
Training Tesseract4.0 (LSTM) on word level bounding boxes

Will this be addressed when you update the unicharset_extractor?

I am wondering whether there is a way to use the text2image algorithm to create box files, given an image and its ground-truth files.

@amitdo
Collaborator

amitdo commented Aug 11, 2017

Ray, please consider a new box format with a new name, '<...>-linebox', for training the LSTM engine. For an example, see here:
#832 (comment)

@ghost

ghost commented Jun 11, 2018

@Shreeshrii @amitdo any updates regarding this matter?

@Chandra-cc

Are there differences between the box file formats of Tesseract 3 and Tesseract 4? Or can we use Tesseract 3 box/tiff pairs to train Tesseract 4, and can we use the Tesseract 4 starter traineddata generated with the tesstrain.sh command?

@Shreeshrii
Collaborator Author

WordStr format done via 15f2a4b

lstmbox format done via 2ae65b2
