PDF output mangling image for TIFF input #535

Closed
jbreiden opened this issue Dec 5, 2016 · 8 comments

@jbreiden
Contributor

jbreiden commented Dec 5, 2016

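The image in the PDF output comes out mangled for the attached TIFF input. This means api->GetInputImage() is giving us a processed image rather than the original input.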

test.tif.zip
test.pdf

@jbreiden
Contributor Author

jbreiden commented Dec 5, 2016

Emergency workaround while I go hunt down root cause.

--- tesseract/api/pdfrenderer.cpp	2016-11-21 08:45:47.000000000 -0800
+++ tesseract/api/pdfrenderer.cpp	2016-12-05 14:15:42.000000000 -0800
@@ -841,8 +841,8 @@
 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
   size_t n;
   char buf[kBasicBufSize];
-  Pix *pix = api->GetInputImage();
   char *filename = (char *)api->GetInputName();
+  Pix *pix = pixRead(filename);
   int ppi = api->GetSourceYResolution();
   if (!pix || ppi <= 0)
     return false;

@jbreiden
Contributor Author

jbreiden commented Dec 5, 2016

This change also does it, at the cost of memory. And probably leaks.

--- tesseract/api/baseapi.cpp	2016-12-05 08:51:32.000000000 -0800
+++ tesseract/api/baseapi.cpp	2016-12-05 14:47:16.000000000 -0800
@@ -523,7 +523,7 @@
   if (InternalSetImage()) {
     thresholder_->SetImage(imagedata, width, height,
                            bytes_per_pixel, bytes_per_line);
-    SetInputImage(thresholder_->GetPixRect());
+    SetInputImage(pixCopy(NULL, thresholder_->GetPixRect()));
   }
 }
 
@@ -545,7 +545,7 @@
 void TessBaseAPI::SetImage(Pix* pix) {
   if (InternalSetImage()) {
     thresholder_->SetImage(pix);
-    SetInputImage(thresholder_->GetPixRect());
+    SetInputImage(pixCopy(NULL, thresholder_->GetPixRect()));
   }
 }

@jbreiden
Contributor Author

jbreiden commented Dec 5, 2016

This one is probably best.

--- tesseract/ccmain/thresholder.cpp	2016-03-11 14:29:36.000000000 -0800
+++ tesseract/ccmain/thresholder.cpp	2016-12-05 15:00:46.000000000 -0800
@@ -225,7 +225,7 @@
 Pix* ImageThresholder::GetPixRect() {
   if (IsFullImage()) {
     // Just clone the whole thing.
-    return pixClone(pix_);
+    return pixCopy(NULL, pix_);
   } else {
     // Crop to the given rectangle.
     Box* box = boxCreate(rect_left_, rect_top_, rect_width_, rect_height_);
@@ -322,4 +322,3 @@
 }
 
 }  // namespace tesseract.
-
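
For readers unfamiliar with the distinction these patches rely on: in Leptonica, pixClone hands back another handle to the same pixel data, while pixCopy(NULL, pix) allocates fresh pixel data. The sketch below imitates that with a hypothetical FakePix type built on std::shared_ptr. It does not use Leptonica, and FakeClone, FakeCopy, InvertInPlace, and MakeInput are made-up names. It assumes some later step modifies the thresholded image in place, which is what the symptoms suggest.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a Leptonica Pix: a handle to shared pixel data.
struct FakePix {
  std::shared_ptr<std::vector<uint8_t>> data;
};

// Like pixClone: a new handle to the SAME pixel buffer.
FakePix FakeClone(const FakePix& src) {
  assert(src.data != nullptr);
  return FakePix{src.data};
}

// Like pixCopy(NULL, src): a new handle to a NEW pixel buffer.
FakePix FakeCopy(const FakePix& src) {
  assert(src.data != nullptr);
  return FakePix{std::make_shared<std::vector<uint8_t>>(*src.data)};
}

// Stand-in for any later processing step that writes to pixels in place.
void InvertInPlace(FakePix& pix) {
  for (uint8_t& v : *pix.data) v = static_cast<uint8_t>(255 - v);
}

// Builds a tiny three-pixel "input image" for the demonstration.
FakePix MakeInput() {
  return FakePix{std::make_shared<std::vector<uint8_t>>(
      std::vector<uint8_t>{0, 255, 0})};
}
```

Modifying the result of FakeClone changes the original as well, whereas modifying the result of FakeCopy leaves the original untouched. That is the difference between a mangled and an intact api->GetInputImage().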

@jbreiden
Contributor Author

jbreiden commented Dec 5, 2016

This bug happens when:

  • input image is binary, which causes us to corrupt api->GetInputImage()
  • and not (JPEG2000 || JPEG || PNG), which causes us to utilize api->GetInputImage()

For example, the attached test image is TIFF G4. Converting it to an identical-looking TIFF LZW
grayscale does not tickle this bug.
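
The two conditions above can be written as a small predicate. This is only a sketch for clarity, not code from Tesseract: InputFormat, PdfUsesInputImage, and HitsThisBug are hypothetical names.

```cpp
// Hypothetical enumeration of input formats, for illustration only.
enum class InputFormat { kJpeg, kJpeg2000, kPng, kTiff, kOther };

// Per the list above, api->GetInputImage() is utilized for PDF output
// only when the input is not JPEG2000, JPEG, or PNG.
bool PdfUsesInputImage(InputFormat format) {
  return format != InputFormat::kJpeg2000 &&
         format != InputFormat::kJpeg &&
         format != InputFormat::kPng;
}

// The bug needs both conditions: a binary input (so the input image gets
// corrupted) and a format for which that image is actually used.
bool HitsThisBug(bool input_is_binary, InputFormat format) {
  return input_is_binary && PdfUsesInputImage(format);
}
```

With these definitions, a TIFF G4 input (binary) gives true, while the grayscale TIFF LZW conversion gives false.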

@jbreiden
Contributor Author

jbreiden commented Dec 5, 2016

Ray found the exact spot. This is the final answer.

--- tesseract/ccmain/thresholder.cpp	2016-03-11 14:29:36.000000000 -0800
+++ tesseract/ccmain/thresholder.cpp	2016-12-05 15:27:45.000000000 -0800
@@ -181,8 +181,9 @@
 // Caller must use pixDestroy to free the created Pix.
 void ImageThresholder::ThresholdToPix(PageSegMode pageseg_mode, Pix** pix) {
   if (pix_channels_ == 0) {
-    // We have a binary image, so it just has to be cloned.
-    *pix = GetPixRect();
+    // We have a binary image, so it just has to be copied.
+    // Don't clone or you'll mess up api->GetInputImage()
+    *pix = pixCopy(NULL, GetPixRect());
   } else {
     OtsuThresholdRectToPix(pix_, pix);
   }
@@ -322,4 +323,3 @@
 }
 
 }  // namespace tesseract.
-

@jbreiden
Contributor Author

jbreiden commented Dec 5, 2016

Note that this bug affects all versions of Tesseract capable of producing PDF output, both 3.0.x and 4.x.

@jbreiden
Contributor Author

jbreiden commented Dec 5, 2016

... And the code above is leaky. Ray is doing the final final final version right now.
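
One plausible reading of the leak, assuming GetPixRect() returns a pixClone'd handle that the caller is expected to pixDestroy: in pixCopy(NULL, GetPixRect()) the temporary handle is never released, so the reference count on the source image never drops back. The sketch below models this with a hypothetical CountedPix type and made-up helpers (CloneHandle, CopyPixels, Release). It is not Leptonica code and not necessarily what the final fix looks like.

```cpp
#include <cassert>

// Hypothetical reference-counted image, standing in for a Leptonica Pix.
struct CountedPix {
  int refcount = 1;
};

// Like pixClone: same object, one more reference the caller must release.
CountedPix* CloneHandle(CountedPix* pix) {
  assert(pix != nullptr);
  ++pix->refcount;
  return pix;
}

// Like pixCopy(NULL, src): a brand-new object with its own reference.
CountedPix* CopyPixels(const CountedPix* src) {
  assert(src != nullptr);
  return new CountedPix;
}

// Like pixDestroy: drop one reference, free at zero, null the handle.
void Release(CountedPix** pix) {
  if (pix == nullptr || *pix == nullptr) return;
  assert((*pix)->refcount > 0);
  if (--(*pix)->refcount == 0) delete *pix;
  *pix = nullptr;
}

// Leaky shape: the cloned handle is passed straight through and lost.
CountedPix* LeakyThreshold(CountedPix* input) {
  return CopyPixels(CloneHandle(input));
}

// Balanced shape: keep the temporary handle, copy, then release it.
CountedPix* BalancedThreshold(CountedPix* input) {
  CountedPix* tmp = CloneHandle(input);
  CountedPix* out = CopyPixels(tmp);
  Release(&tmp);
  return out;
}
```

After LeakyThreshold the input's reference count stays one too high, so the input would never be freed. BalancedThreshold leaves the count where it found it.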

@theraysmith
Contributor

Fixed in commit 7744da9..025689f.

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021