Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

box.train segfault #57

Closed
jbreiden opened this issue Jul 20, 2015 · 14 comments
Closed

box.train segfault #57

jbreiden opened this issue Jul 20, 2015 · 14 comments

Comments

@jbreiden
Copy link
Contributor

I have no idea what the box.train config is supposed to do, or what
missing data it needs. I just don't like segfaults.

(gdb) run testing/phototest.tif - box.train
Starting program: /tmp/plang/tesseract-3.04.00/api/.libs/lt-tesseract testing/phototest.tif - box.train
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Page 1

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff760fe94 in ELIST_ITERATOR::set_to_list (this=0x7fffffffd3e0, list_to_iterate=0x8) at ../ccutil/elst.h:308
308   prev = list->last;
(gdb) backtrace
#0  0x00007ffff760fe94 in ELIST_ITERATOR::set_to_list (this=0x7fffffffd3e0, list_to_iterate=0x8)
    at ../ccutil/elst.h:308
#1  0x00007ffff77dd367 in PAGE_RES_IT::start_page (this=0x7fffffffd390, empty_ok=false) at pageres.cpp:1510
#2  0x00007ffff76116c7 in PAGE_RES_IT::restart_page (this=0x7fffffffd390) at ../ccstruct/pageres.h:681
#3  0x00007ffff76116a7 in PAGE_RES_IT::PAGE_RES_IT (this=0x7fffffffd390, the_page_res=0x0) at ../ccstruct/pageres.h:665
#4  0x00007ffff761e84f in tesseract::Tesseract::ApplyBoxTraining (this=0x808c00, fontname=..., page_res=0x0)
    at applybox.cpp:780
#5  0x00007ffff7609478 in tesseract::TessBaseAPI::Recognize (this=0x7fffffffd9f0, monitor=0x0) at baseapi.cpp:883
#6  0x00007ffff760a4d9 in tesseract::TessBaseAPI::ProcessPage (this=0x7fffffffd9f0, pix=0x83fa10, page_index=0, 
    filename=0x7fffffffe883 "testing/phototest.tif", retry_config=0x0, timeout_millisec=0, renderer=0x13850f0)
    at baseapi.cpp:1222
#7  0x00007ffff7609d4e in tesseract::TessBaseAPI::ProcessPagesMultipageTiff (this=0x7fffffffd9f0, 
    data=0x138fc08 "II*", size=38668, filename=0x7fffffffe883 "testing/phototest.tif", retry_config=0x0, 
    timeout_millisec=0, renderer=0x13850f0, tessedit_page_number=-1) at baseapi.cpp:1057
#8  0x00007ffff760a29b in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x7fffffffd9f0, 
    filename=0x7fffffffe883 "testing/phototest.tif", retry_config=0x0, timeout_millisec=0, renderer=0x13850f0)
    at baseapi.cpp:1176
#9  0x00007ffff7609dc5 in tesseract::TessBaseAPI::ProcessPages (this=0x7fffffffd9f0, 
    filename=0x7fffffffe883 "testing/phototest.tif", retry_config=0x0, timeout_millisec=0, renderer=0x13850f0)
    at baseapi.cpp:1074
#10 0x00000000004031a3 in main (argc=4, argv=0x7fffffffe5f8) at tesseractmain.cpp:316
...
@jimregan
Copy link
Contributor

Not enough input. In short, box.train needs both an image, and a box file, and from those it creates training data. For a more complete explanation, see the wiki: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#run-tesseract-for-training

@jbreiden
Copy link
Contributor Author

Tesseract should return an error when there is insufficient input, not segfault.

@jimregan
Copy link
Contributor

Re-opening, as requested.

@jimregan jimregan reopened this Jul 20, 2015
@jimregan
Copy link
Contributor

Something like this? (Using the variable also supports box.train.stderr)

diff --git a/api/tesseractmain.cpp b/api/tesseractmain.cpp
index e7abadf..7ddbc93 100644
--- a/api/tesseractmain.cpp
+++ b/api/tesseractmain.cpp
@@ -306,6 +306,11 @@ int main(int argc, char **argv) {
   if (b) renderers.push_back(new tesseract::TessBoxTextRenderer(outputbase));
   api.GetBoolVariable("tessedit_create_txt", &b);
   if (b) renderers.push_back(new tesseract::TessTextRenderer(outputbase));
+  api.GetBoolVariable("tessedit_train_from_boxes", &b);
+  if (b && !strcmp(outputbase, "-")) {
+    fprintf(stderr, "Box input from stdin not supported in box training.\n");
+    exit(1);
+  }
   if (!renderers.empty()) {
     // Since the PointerVector auto-deletes, null-out the renderers that are
     // added to the root, and leave the root in the vector.

@jbreiden
Copy link
Contributor Author

Pretty good! But even more robust is to locate the lower level function that is crashing
due to bad data. Then modify it to return an error instead of crashing. That protects us
even if it gets called from a different code path.

@jimregan
Copy link
Contributor

I think it's a matter for broader discussion.

On the one hand, it's The Right Thing, and I've already done The Wrong Thing by closing an issue that mentions a segfault... but it's an exceptional case. One, because it overloads what is normally the output file position to be a secondary input, and two, because it's not a frequent use case.

@zdenop
Copy link
Contributor

zdenop commented Jul 29, 2015

I tried it on opensuse 13.2 64bit and it did not crashed:

tesseract testing/phototest.tif - box.train
Page 1
APPLY_BOXES:
Boxes read from boxfile: 225
Found 225 good blobs.
Generated training data for 60 words

Warning in pixReadMemTiff: tiff page 1 not found

ls -t | head -n 1
-.tr

Just OCR to stdout worked as expected:

tesseract testing/phototest.tif -
Page 1
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.

The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

Warning in pixReadMemTiff: tiff page 1 not found

@jbreiden
Copy link
Contributor Author

TESSDATA_PREFIX=/usr/share/tesseract-ocr  valgrind api/.libs/lt-tesseract testing/phototest.tif testing/phototest.tif - box.train
==11666== Memcheck, a memory error detector
==11666== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==11666== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==11666== Command: api/.libs/lt-tesseract testing/phototest.tif testing/phototest.tif - box.train
==11666== 
Tesseract Open Source OCR Engine v3.05.00dev-11-gd937659 with Leptonica
Page 1
==11666== Invalid read of size 8
==11666==    at 0x4FA8E9C: ELIST_ITERATOR::set_to_list(ELIST*) (elst.h:308)
==11666==    by 0x514C108: PAGE_RES_IT::start_page(bool) (pageres.cpp:1510)
==11666==    by 0x4FAA6CE: PAGE_RES_IT::restart_page() (pageres.h:681)
==11666==    by 0x4FAA6AE: PAGE_RES_IT::PAGE_RES_IT(PAGE_RES*) (pageres.h:665)
==11666==    by 0x4FB781E: tesseract::Tesseract::ApplyBoxTraining(STRING const&, PAGE_RES*) (applybox.cpp:797)
==11666==    by 0x4FA2477: tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) (baseapi.cpp:883)
==11666==    by 0x4FA34D8: tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1222)
==11666==    by 0x4FA2D4D: tesseract::TessBaseAPI::ProcessPagesMultipageTiff(unsigned char const*, unsigned long, char const*, char const*, int, tesseract::TessResultRenderer*, int) (baseapi.cpp:1057)
==11666==    by 0x4FA329A: tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1176)
==11666==    by 0x4FA2DC4: tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1074)
==11666==    by 0x403192: main (tesseractmain.cpp:318)
==11666==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==11666== 
==11666== 
==11666== Process terminating with default action of signal 11 (SIGSEGV)
==11666==  Access not within mapped region at address 0x8
==11666==    at 0x4FA8E9C: ELIST_ITERATOR::set_to_list(ELIST*) (elst.h:308)
==11666==    by 0x514C108: PAGE_RES_IT::start_page(bool) (pageres.cpp:1510)
==11666==    by 0x4FAA6CE: PAGE_RES_IT::restart_page() (pageres.h:681)
==11666==    by 0x4FAA6AE: PAGE_RES_IT::PAGE_RES_IT(PAGE_RES*) (pageres.h:665)
==11666==    by 0x4FB781E: tesseract::Tesseract::ApplyBoxTraining(STRING const&, PAGE_RES*) (applybox.cpp:797)
==11666==    by 0x4FA2477: tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) (baseapi.cpp:883)
==11666==    by 0x4FA34D8: tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1222)
==11666==    by 0x4FA2D4D: tesseract::TessBaseAPI::ProcessPagesMultipageTiff(unsigned char const*, unsigned long, char const*, char const*, int, tesseract::TessResultRenderer*, int) (baseapi.cpp:1057)
==11666==    by 0x4FA329A: tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1176)
==11666==    by 0x4FA2DC4: tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1074)
==11666==    by 0x403192: main (tesseractmain.cpp:318)

@zdenop
Copy link
Contributor

zdenop commented Jul 30, 2015

@jbreiden: In openSUSE 13.2 I do not have api/.libs/lt-tesseract just api/.libs/tesseract. And you have two times, so output is testing/phototest.tif.tr

And I got this:

TESSDATA_PREFIX=/usr/share/ valgrind api/.libs/tesseract testing/phototest.tif testing/phototest.tif - box.train
==21845== Memcheck, a memory error detector
==21845== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==21845== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==21845== Command: api/.libs/tesseract testing/phototest.tif testing/phototest.tif - box.train
==21845== 
Tesseract Open Source OCR Engine v3.05.00dev-11-gd937659 with Leptonica
Page 1
APPLY_BOXES:
   Boxes read from boxfile:     225
   Found 225 good blobs.
Generated training data for 60 words
Warning in pixReadMemTiff: tiff page 1 not found
==21845== 
==21845== HEAP SUMMARY:
==21845==     in use at exit: 19,795,264 bytes in 33 blocks
==21845==   total heap usage: 874,639 allocs, 874,606 frees, 60,486,144 bytes allocated
==21845== 
==21845== LEAK SUMMARY:
==21845==    definitely lost: 0 bytes in 0 blocks
==21845==    indirectly lost: 0 bytes in 0 blocks
==21845==      possibly lost: 19,753,408 bytes in 24 blocks
==21845==    still reachable: 41,856 bytes in 9 blocks
==21845==         suppressed: 0 bytes in 0 blocks
==21845== Rerun with --leak-check=full to see details of leaked memory
==21845== 
==21845== For counts of detected and suppressed errors, rerun with: -v
==21845== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

For "stdout version" I got this:

TESSDATA_PREFIX=/usr/share/ valgrind api/.libs/tesseract testing/phototest.tif - box.train
==11238== Memcheck, a memory error detector
==11238== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==11238== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==11238== Command: api/.libs/tesseract ../tesseract-ocr/testing/phototest.tif - box.train
==11238== 
Page 1
APPLY_BOXES:
   Boxes read from boxfile:     225
   Found 225 good blobs.
Generated training data for 60 words









Warning in pixReadMemTiff: tiff page 1 not found
==11238== 
==11238== HEAP SUMMARY:
==11238==     in use at exit: 19,795,264 bytes in 33 blocks
==11238==   total heap usage: 874,629 allocs, 874,596 frees, 60,485,260 bytes allocated
==11238== 
==11238== LEAK SUMMARY:
==11238==    definitely lost: 0 bytes in 0 blocks
==11238==    indirectly lost: 0 bytes in 0 blocks
==11238==      possibly lost: 19,753,408 bytes in 24 blocks
==11238==    still reachable: 41,856 bytes in 9 blocks
==11238==         suppressed: 0 bytes in 0 blocks
==11238== Rerun with --leak-check=full to see details of leaked memory
==11238== 
==11238== For counts of detected and suppressed errors, rerun with: -v
==11238== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

@amitdo
Copy link
Collaborator

amitdo commented Jan 31, 2016

@jbreiden, please test with latest commit.

@jbreiden
Copy link
Contributor Author

jbreiden commented Feb 3, 2016

make
make install
gdb /usr/local/bin/tesseract
(gdb) run testing/phototest.tif - box.train

Program received signal SIGSEGV, Segmentation fault.
PAGE_RES_IT::start_page (this=this@entry=0x7fffffffde10, empty_ok=empty_ok@entry=false) at pageres.cpp:1510
1510      block_res_it.set_to_list(&page_res->block_res_list);
(gdb) backtrace
#0  PAGE_RES_IT::start_page (this=this@entry=0x7fffffffde10, empty_ok=empty_ok@entry=false) at pageres.cpp:1510
#1  0x00007ffff76e6f29 in restart_page (this=0x7fffffffde10) at ../ccstruct/pageres.h:681
#2  PAGE_RES_IT (the_page_res=<optimized out>, this=0x7fffffffde10) at ../ccstruct/pageres.h:665
#3  tesseract::Tesseract::ApplyBoxTraining (this=0x819810, fontname=..., page_res=<optimized out>) at applybox.cpp:797
#4  0x00007ffff76dd926 in tesseract::TessBaseAPI::Recognize (this=this@entry=0x7fffffffe450, monitor=monitor@entry=0x0) at baseapi.cpp:883
#5  0x00007ffff76ddc2a in tesseract::TessBaseAPI::ProcessPage (this=0x7fffffffe450, pix=0x84fd10, page_index=<optimized out>, filename=<optimized out>, 
    retry_config=0x0, timeout_millisec=0, renderer=0x0) at baseapi.cpp:1224
#6  0x00007ffff76de10b in tesseract::TessBaseAPI::ProcessPagesMultipageTiff (this=0x7fffffffe450, data=0x0, data@entry=0x13a0828 "II*", size=8, filename=0x0, 
    filename@entry=0x7fffffffe85a "testing/phototest.tif", retry_config=retry_config@entry=0x0, timeout_millisec=20909344, timeout_millisec@entry=0, 
    renderer=0x0, tessedit_page_number=-1) at baseapi.cpp:1057
#7  0x00007ffff76de5fe in tesseract::TessBaseAPI::ProcessPagesInternal (this=this@entry=0x7fffffffe450, filename=<optimized out>, 
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=0x0) at baseapi.cpp:1176
#8  0x00007ffff76dea40 in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffffffe450, filename=<optimized out>, 
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=<optimized out>) at baseapi.cpp:1074
#9  0x0000000000401dff in main (argc=<optimized out>, argv=0x7fffffffe5e8) at tesseractmain.cpp:432

@amitdo
Copy link
Collaborator

amitdo commented Feb 4, 2016

I can reproduce this.

I reread this issue. Jim's explanation is still true.

box.train needs both an image, and a box file, and from those it creates training data. For a more complete explanation, see the wiki: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#run-tesseract-for-training

@jbreiden
Copy link
Contributor Author

jbreiden commented Feb 4, 2016

I'd suggest something like this. I didn't check to see if we leak memory
if we go down this error path, but no matter what it is better than a segfault.

--- baseapi.cpp.orig    2016-02-04 01:09:07.790101916 +0000
+++ baseapi.cpp 2016-02-04 01:07:15.464620603 +0000
@@ -851,6 +851,9 @@
     page_res_ = new PAGE_RES(false,
                              block_list_, &tesseract_->prev_word_best_choice_);
   }
+  if (page_res_ == NULL) {
+    return -1;
+  }
   if (tesseract_->tessedit_make_boxes_from_boxes) {
     tesseract_->CorrectClassifyWords(page_res_);
     return 0;

@amitdo
Copy link
Collaborator

amitdo commented Feb 4, 2016

Tested. It works - no segfault.

@zdenop zdenop closed this as completed in b226275 Mar 4, 2016
zdenop added a commit that referenced this issue Mar 8, 2016
@amitdo amitdo added the bug label May 27, 2016
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants