box.train completes with no errors but does not create .tr output #64

Closed
danageva opened this issue Jul 25, 2015 · 25 comments

@danageva

Hi,
I'm running 3.05.00dev on Ubuntu 14.04 LTS.
When running:
tesseract eng.Arial.exp0.tif eng.Arial.exp0 box.train
I'm getting a simple one-line output:
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

However, no .tr output file is created (anywhere in the filesystem).

My work dir listing is:

Arial.ttf
common.punc
eng.Arial.exp0.box
eng.Arial.exp0.tif
training-text.txt

Running with gdb doesn't give anything additional.

Is there anything I can look at for extra info? Any ideas what might be causing this?

Thanks
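
When a run like this prints nothing useful, one way to get a little more information is to look at the exit status and check for the expected output file explicitly. This is a minimal sketch, assuming the .tr file is written next to the output base given on the command line; the helper name check_tr is hypothetical and not part of Tesseract.

```shell
# Hypothetical helper: report whether <outputbase>.tr exists and is non-empty.
check_tr() {
  base=$1
  if [ -s "$base.tr" ]; then
    echo "found: $base.tr"
  else
    echo "missing or empty: $base.tr" >&2
    return 1
  fi
}

# Usage (needs tesseract and the training files in the current directory):
#   tesseract eng.Arial.exp0.tif eng.Arial.exp0 box.train
#   echo "exit status: $?"
#   check_tr eng.Arial.exp0
```

A non-zero exit status together with a missing .tr file points at a failure inside tesseract rather than at a misplaced output file.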

zdenop closed this as completed in 5dfb0cb Jul 25, 2015
@rodrigosalinas

I'm having exactly the same problem as danageva, with the same version but on OS X. Apparently zdenop closed the issue with a commit that deletes lines in two files. I don't have those lines in my installed version. Can you help me, please?

@amitdo
Collaborator

amitdo commented Jan 18, 2016

@rodrigosalinas
The issue was solved (again).
Try latest code from the repo.

@aiwaz

aiwaz commented Jan 27, 2016

I'm having exactly the same problem. I have the latest code from the repo.
Edit: It works with v3.04.00

@amitdo
Collaborator

amitdo commented Jan 27, 2016

You should provide more details.

"I have the latest code from the repo."

Which commit exactly? 1826ac1?

Which OS do you use, and which specific version? For example: Ubuntu 14.04, OS X 10.11, Windows 10.

Also, please provide the exact command you used and attach the tif and box files as a zip file.
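
The details requested here can be gathered in one go. The following is only a sketch: collect_env_report and the default file name are hypothetical, and each tool is called only if it is present on the machine.

```shell
# Hypothetical helper: write OS, commit, and Tesseract version details to a file.
collect_env_report() {
  out=${1:-tesseract-report.txt}
  {
    echo "== uname -a =="
    uname -a
    if command -v lsb_release >/dev/null 2>&1; then
      echo "== lsb_release -a =="
      lsb_release -a 2>/dev/null || true
    fi
    if command -v git >/dev/null 2>&1; then
      echo "== git rev-parse HEAD =="
      # Only meaningful when run inside the Tesseract source checkout.
      git rev-parse HEAD 2>/dev/null || echo "(not inside a git checkout)"
    fi
    if command -v tesseract >/dev/null 2>&1; then
      echo "== tesseract --version =="
      tesseract --version 2>&1 || true
    fi
  } > "$out"
}

# Usage:
#   collect_env_report report.txt
#   zip ult.zip ult.dejavu.exp0.tif ult.dejavu.exp0.box
```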

@aiwaz

aiwaz commented Jan 27, 2016

git rev-parse HEAD command gives me 1826ac1

uname -a
Linux ubuntu 3.19.0-42-generic #48-Ubuntu SMP Thu Dec 17 22:54:45 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 15.04
Release: 15.04
Codename: vivid

The exact command is:
tesseract ult.dejavu.exp0.tif ult.dejavu.exp0 box.train
The output:
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Segmentation fault (core dumped)
If I use it with sudo:
sudo tesseract ult.dejavu.exp0.tif ult.dejavu.exp0 box.train
The output:
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

Required files attached.
ult.zip

@amitdo
Collaborator

amitdo commented Jan 27, 2016

I have Ubuntu 14.04 64 bit.

Commit 1826ac1.

tesseract ult.dejavu.exp0.tif ult.dejavu.exp0 box.train

Output:

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
APPLY_BOXES:
Boxes read from boxfile: 1360
Found 1360 good blobs.
Generated training data for 292 words
Warning in pixReadMemTiff: tiff page 1 not found

ult.dejavu.exp0.tr.zip

My guess is that you didn't install tesseract properly.

Did you run all these commands last time you installed tesseract?

./autogen.sh
./configure --disable-cube
make
sudo make install
sudo ldconfig
make training
sudo make training-install

@aiwaz

aiwaz commented Jan 27, 2016

I installed it as:

./autogen.sh
./configure --prefix=/home/username/tessbin
make
make install
make training
make training-install

@amitdo
Collaborator

amitdo commented Jan 27, 2016

If you have previously installed tesseract system-wide (for example, under /usr/local), the command 'tesseract ...' will use that system-wide executable.

@amitdo
Collaborator

amitdo commented Jan 27, 2016

Try this command to see if you have another installation of Tesseract on your machine.

find /usr -name "tesseract"

@stweil
Member

stweil commented Jan 27, 2016

... or call /home/username/tessbin/bin/tesseract directly.
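
To see which executable a bare tesseract command would pick up, it can help to list every match on $PATH in lookup order. This is a sketch in plain POSIX shell; find_on_path is a hypothetical helper, not an existing tool.

```shell
# Hypothetical helper: print every executable named $1 found on $PATH,
# in the order the shell would search for it.
find_on_path() {
  name=$1
  old_ifs=$IFS
  IFS=:
  for dir in $PATH; do
    # An empty PATH entry means the current directory.
    [ -n "$dir" ] || dir=.
    if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
      echo "$dir/$name"
    fi
  done
  IFS=$old_ifs
}

# Usage:
#   find_on_path tesseract
#   command -v tesseract   # also reports an alias or shell function, if one is defined
```

The first line printed is the file the shell would run. An alias defined in .bashrc is only known to the interactive shell, so tools such as sudo or gdb resolve the name through $PATH instead.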

@aiwaz

aiwaz commented Jan 27, 2016

Nope, there's no other tesseract on my machine. find does not return anything.


@stweil
Member

stweil commented Jan 27, 2016

@aiwaz, could you try this command:

gdb --args tesseract ult.dejavu.exp0.tif ult.dejavu.exp0 box.train

(Enter run on the gdb command line and, when it reports an error, enter info stack.)
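
The same two gdb commands can also be run without an interactive session by using gdb's -batch and -ex options. A minimal sketch; the wrapper name gdb_stack is hypothetical.

```shell
# Hypothetical wrapper: run a command under gdb non-interactively and print
# the call stack if the program stops on a signal such as SIGSEGV.
gdb_stack() {
  gdb -batch -ex run -ex "info stack" --args "$@"
}

# Usage:
#   gdb_stack tesseract ult.dejavu.exp0.tif ult.dejavu.exp0 box.train
```

If the program exits normally there is no stack left to show, and gdb says so instead of printing frames.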

@aiwaz

aiwaz commented Jan 28, 2016

gdb --args tesseract ult.dejavu.exp0.tif ult.dejavu.exp0 box.train
GNU gdb (Ubuntu 7.9-1ubuntu1) 7.9
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../../bin/tesseract...done.
(gdb) run
Starting program: /home/azukausk/tessbin/bin/tesseract ult.dejavu.exp0.tif ult.dejavu.exp0 box.train
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

Program received signal SIGSEGV, Segmentation fault.
tesseract::TessResultRenderer::BeginDocument (this=this@entry=0x83f0b0, title=title@entry=0x7ffff78a8fea "")
    at renderer.cpp:57
57    bool ok = BeginDocumentHandler();
(gdb) info stack
#0  tesseract::TessResultRenderer::BeginDocument (this=this@entry=0x83f0b0, title=title@entry=0x7ffff78a8fea "")
    at renderer.cpp:57
#1  0x00007ffff76cee2f in tesseract::TessBaseAPI::ProcessPagesInternal (this=this@entry=0x7fffffffe2a0, 
    filename=<optimized out>, retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, 
    renderer=0x83f0b0) at baseapi.cpp:1166
#2  0x00007ffff76cf570 in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffffffe2a0, 
    filename=<optimized out>, retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, 
    renderer=<optimized out>) at baseapi.cpp:1074
#3  0x0000000000401f5c in main (argc=<optimized out>, argv=0x7fffffffe448) at tesseractmain.cpp:429
(gdb)

@aiwaz

aiwaz commented Jan 28, 2016

@stweil, I always call tesseract directly. It resides in /home/username/tessbin/bin/tesseract, just as you wrote. I created an alias for it in bashrc.

@amitdo
Collaborator

amitdo commented Jan 28, 2016

What's the output of:

printenv TESSDATA_PREFIX

tesseract --list-langs

@aiwaz

aiwaz commented Jan 28, 2016

TESSDATA_PREFIX is empty.

tesseract --list-langs
List of available languages (1):
eng

@amitdo
Collaborator

amitdo commented Jan 28, 2016

tesseract ult.dejavu.exp0.tif ult txt hocr

Does this command produce ult.txt and ult.hocr?

Maybe you should try to set the TESSDATA_PREFIX environment variable or use the --tessdata-dir parameter.

@aiwaz

aiwaz commented Jan 28, 2016

tesseract ult.dejavu.exp0.tif ult txt hocr
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Error opening data file /home/azukausk/tessbin/share/tessdata/osd.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'osd'
Tesseract couldn't load any languages!
Warning: Auto orientation and script detection requested, but osd language failed to load
Warning in pixReadMemTiff: tiff page 1 not found

Why does it want to load the "osd" language?

The files are produced:

ll
total 21500
drwxrwxr-x 2 azukausk azukausk     4096 Jan 27 13:49 configs
-rw-rw-r-- 1 azukausk azukausk 21876550 Jan 25 15:41 eng.traineddata
-rw-r--r-- 1 azukausk azukausk      568 Jan 27 13:49 pdf.ttf
drwxrwxr-x 2 azukausk azukausk     4096 Jan 27 13:49 tessconfigs
-rw-rw-r-- 1 azukausk azukausk     1649 Jan 25 15:56 training_text
-rw-rw-r-- 1 azukausk azukausk    38706 Jan 27 12:53 ult.dejavu.exp0.box
-rw-rw-r-- 1 azukausk azukausk    39984 Jan 27 12:53 ult.dejavu.exp0.tif
-rw-rw-r-- 1 azukausk azukausk    36383 Jan 28 12:03 ult.hocr
-rw-rw-r-- 1 azukausk azukausk     1685 Jan 28 12:03 ult.txt

When I set the TESSDATA_PREFIX variable the output is the same:

TESSDATA_PREFIX=/home/azukausk/tessbin/share/tessdata
echo $TESSDATA_PREFIX
/home/azukausk/tessbin/share/tessdata
tesseract ult.dejavu.exp0.tif ult txt hocr
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Error opening data file /home/azukausk/tessbin/share/tessdata/osd.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'osd'
Tesseract couldn't load any languages!
Warning: Auto orientation and script detection requested, but osd language failed to load
Warning in pixReadMemTiff: tiff page 1 not found

And one more question: when I configure tesseract with the --prefix option, I expect tesseract to be smart enough to know where my data is. Why do I have to set the TESSDATA_PREFIX variable explicitly?

@stweil
Member

stweil commented Jan 28, 2016

Setting TESSDATA_PREFIX is not needed as long as your tessdata directory is in the right place ($PREFIX/share/tessdata).

osd.traineddata will be used for auto orientation and script detection, no matter which language you have selected.
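
If the variable is set by hand anyway, two details of the earlier attempt may be worth checking. A plain assignment such as TESSDATA_PREFIX=... creates a shell variable that child processes only see once it is exported, and the error message asks for the parent directory of tessdata rather than the tessdata directory itself. The unchanged path in the second error output is consistent with the variable never reaching the tesseract process. A minimal sketch, assuming this build behaves as its error message describes; tessdata_parent is a hypothetical helper.

```shell
# Hypothetical helper: if the path ends in "tessdata", print its parent
# directory; otherwise print the path unchanged (minus any trailing slash).
tessdata_parent() {
  path=${1%/}
  case $path in
    */tessdata) dirname "$path" ;;
    *) printf '%s\n' "$path" ;;
  esac
}

# export makes the variable visible to child processes such as tesseract.
TESSDATA_PREFIX=$(tessdata_parent /home/azukausk/tessbin/share/tessdata)
export TESSDATA_PREFIX

# printenv only shows exported variables, so this confirms the export worked.
printenv TESSDATA_PREFIX
```

Alternatively, the --tessdata-dir parameter mentioned earlier in the thread avoids the environment variable altogether.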

@amitdo
Collaborator

amitdo commented Jan 28, 2016

OK, I was able to reproduce the issue.

I will try to figure out why this is happening when installing tesseract in /home/myusername/tessbin

@amitdo
Collaborator

amitdo commented Jan 28, 2016

I solved @aiwaz's issue. More info will follow later.

@amitdo
Collaborator

amitdo commented Jan 29, 2016

@aiwaz, please test my fix.

zdenop added a commit that referenced this issue Jan 31, 2016
@amitdo
Collaborator

amitdo commented Jan 31, 2016

zdenko merged my fix to master.

@aiwaz

aiwaz commented Feb 1, 2016

I tested the fix and can confirm that tesseract ult.dejavu.exp0.tif ult.dejavu.exp0 box.train now produces a .tr file.
Thank you guys.

@amitdo
Collaborator

amitdo commented Feb 1, 2016

👍
Thanks for reporting the issue.

zdenop pushed a commit that referenced this issue Feb 5, 2016
This commit is better than 06fc053. Hopefully, this is the last fix to box training issue.
amitdo added the bug label May 27, 2016
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021