Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

different results when same image is lossless-encoded at different bpp #242

Closed
bruvi opened this issue Feb 28, 2016 · 18 comments
Closed

different results when same image is lossless-encoded at different bpp #242

bruvi opened this issue Feb 28, 2016 · 18 comments
Assignees

Comments

@bruvi
Copy link

bruvi commented Feb 28, 2016

Hi,
I noticed different results when analyzing the same PNG image having 8 different colors when it is encoded with a colormap of 8 or 256 colors.

Version:

tesseract 3.05.00dev
 leptonica-1.73
  libjpeg 8d : libpng 1.2.49 : libtiff 4.0.3 : zlib 1.2.8

Demo:
french subtitle conversion from XFiles-S01E01 (DVB). The subtitle image extracted with ProjectX as a BMP has 8 distinct colors. It was converted to PNG with two different programs.
'ok.png', 256 entries colormap:
ok
Tesseract result is correct:

$ tesseract -l fra ok.png stdout
La mort a eu lieu
il y a 8 à 12 heures.

'nak.png' 8 entries colormap:
nak

Tesseract result is wrong:

$ tesseract -l fra nak.png stdout
La morte eu lieu
il y a 8 à 12 heures.

The image type is either recognized as 4 or 8 bpp but the information content is identical:

$ pngtopnm ok.png|md5sum
cbe58b4e03ebbe279958287e8d53f719  -
$ pngtopnm nak.png|md5sum
cbe58b4e03ebbe279958287e8d53f719  -

I suspect that tesseract internal processing keeps the original image bit depth and that some steps don't work as well at 4bpp as at 8bpp.

External workaround:

pngtopnm nak.png | tesseract -l fra stdin stdout
La mort a eu lieu
il y a 8 à 12 heures.

Quick patch to do this conversion internally in libtesseract:

 --- api/baseapi.cpp.orig   2016-02-18 08:48:00.000000000 +0100
+++ api/baseapi.cpp 2016-02-28 13:41:08.056037716 +0100
@@ -1195,6 +1195,12 @@
                               const char* retry_config, int timeout_millisec,
                               TessResultRenderer* renderer) {
   PERF_COUNT_START("ProcessPage")
+  // convert 4bit to 8bit data for improved processing
+  if (pix->d==4) {
+    Pix* pix2=pixConvert4To8(pix,(pix->colormap)!=NULL);
+    pixDestroy(&pix);
+    pix=pix2;
+  }
   SetInputName(filename);
   SetImage(pix);
   bool failed = false;

I am not sure if this is the correct strategy to deal with this issue, nor if it is the right place to change the data type. 4-color DVB subtitles exist, so 2bpp images should probably also be considered.

Thanks for this soft, works great for me.
-- Bruno

@amitdo
Copy link
Collaborator

amitdo commented Feb 28, 2016

https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/thresholder.cpp#L54

// SetImage makes a copy of all the image data, so it may be deleted
// immediately after this call.
// Greyscale of 8 and color of 24 or 32 bits per pixel may be given.
// Palette color images will not work properly and must be converted to
// 24 bit.
// Binary images of 1 bit per pixel may also be given but they must be
// byte packed with the MSB of the first byte being the first pixel, and a
// one pixel is WHITE. For binary images set bytes_per_pixel=0.
void ImageThresholder::SetImage(const unsigned char* imagedata,
                                int width, int height,
                                int bytes_per_pixel, int bytes_per_line) {

@bruvi
Copy link
Author

bruvi commented Feb 29, 2016

Mmm yes...

Well two approaches from my point of view:

  1. the user has to deal with the image format, but it should be mentioned in the man page and a warning should be issued when a palette color image is detected. Maybe the image should just be rejected.

or

  1. tesseract should do its best with the data it receives and convert them to the internal format it needs for optimal processing.

I really think something should be done, because people will just (wrongly) conclude that tesseract is bad at reading a really clear-looking picture...

( I was very close to that conclusion but I was saved by the fact that previous trials with much lower quality pictures (vobsub output of ProjectX) had given the correct result for that precise subtitle. But other errors in about 10% of subtitles.)

@amitdo
Copy link
Collaborator

amitdo commented Feb 29, 2016

I totally agree...
:-)

@amitdo
Copy link
Collaborator

amitdo commented Feb 29, 2016

Oops I made a mistake.

The relevant method takes 'const Pix*'.
https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/thresholder.cpp#L143

My previous comment linked to an overloaded method which takes 'const unsigned char*'.

// Pix vs raw, which to use? Pix is the preferred input for efficiency,
// since raw buffers are copied.
// SetImage for Pix clones its input, so the source pix may be pixDestroyed
// immediately after, but may not go away until after the Thresholder has
// finished with it.
void ImageThresholder::SetImage(const Pix* pix) {
  if (pix_ != NULL)
    pixDestroy(&pix_);
  Pix* src = const_cast<Pix*>(pix);
  int depth;
  pixGetDimensions(src, &image_width_, &image_height_, &depth);
  // Convert the image as necessary so it is one of binary, plain RGB, or
  // 8 bit with no colormap.
  if (depth > 1 && depth < 8) {
    pix_ = pixConvertTo8(src, false);
  } else if (pixGetColormap(src)) {
    pix_ = pixRemoveColormap(src, REMOVE_CMAP_BASED_ON_SRC);
  } else {
    pix_ = pixClone(src);
  }
  depth = pixGetDepth(pix_);
  pix_channels_ = depth / 8;
  pix_wpl_ = pixGetWpl(pix_);
  scale_ = 1;
  estimated_res_ = yres_ = pixGetYRes(src);
  Init();
}

@bruvi
Copy link
Author

bruvi commented Mar 2, 2016

The mystery thickens...

The above function seems to be correctly written to deal with colormaps:

  • 1st case (depth > 1 && depth < 8) : the image is either a low bit-depth greyscale or a colormapped image with at most 128 color indexes (well probably 16 colors in real cases). Then the image is converted to a 256-level gray-scale image (pixConvertTo8(...,false))
  • 2nd case: the bit depth is at least 8 and the image has a colormap. As more than 8 bit per channel images are not handled by leptonica, it is either a 256 indexes grayscale image or a 256 indexes RGB image. pixRemoveColormap produces a non-indexed 8bit grayscale or 24bit RGB image.
  • last case: at least 8bpp and no colormap, so it is a non-indexed 8bit grayscale or 24bit RGB image.

So the image is now a 8bit per channel raster with no colormap, it seems fit for thresholding.

I checked which way followed both my test images, and I extracted the thresholded image:

--- api/baseapi.cpp.orig    2016-02-18 08:48:00.000000000 +0100
+++ api/baseapi.cpp 2016-03-01 23:37:26.247132277 +0100
@@ -2151,6 +2152,9 @@
   }
   tesseract_->set_source_resolution(estimated_res);
   SavePixForCrash(estimated_res, *pix);
+pixWriteTiff("thres.tiff", *pix, IFF_TIFF,"w");
 }

The 'ok.png' fits the second case (8bit palette) and exits this function as 24bit RGB (well pixGetDepth(..)=32), here is the thresholded image:
ok-thresholded

The 'nak.png" fits the first case (4bit palette) and exits this function as a 8bit grayscale. Result of threshold:
nak-thresholded

Well, the letters in the second image are thinner, but in the first image there are some small peaks under the letters... And for the only letter that is not correctly recognized in these lines, the middle 'a' of the first line, I'm really not sure which version I prefer...

So:

  • the result differs if the color to gray conversion is done here or further in the threshold process.
  • in my case it is best that it is not done here, probably because it gives too thin letters
  • but maybe that with bold letters the "good" and "bad' cases would be inverted ?

For the moment I can't see a clear flaw in the design of this function and I am not sure of what is the correct strategy in the general case.
What bugs me is that the results differ according to image encoding and not image content, because it triggers different processing paths.
If we can show that keeping the image RGB at this stage is better, then this function should be corrected. If depends on the type of image, maybe there should be a command line option (--subtitles) ?

After some thoughts, as it concerns only color palette images, it will probably only concern subtitles and screen dumps (who would scan a page or take a picture in a gif-like format ?). So maybe "my case" is the "general case"

Off to bed...

@amitdo
Copy link
Collaborator

amitdo commented Mar 3, 2016

I converted your nak.png to rgb and gray images using GIMP.
242-bad-gimp-rgb
242-bad-gimp-gray

tesseract -l fra bad-gimp-rgb.png stdout
La mort a eu lieu
il y a 8 à 12 heures.

tesseract -l fra 242-bad-gimp-gray.png stdout
La mort a eu lieu
il y a 8 à 12 heures.

@amitdo
Copy link
Collaborator

amitdo commented Mar 3, 2016

Oh, I missed this part:
pngtopnm nak.png | tesseract -l fra stdin stdout

@amitdo
Copy link
Collaborator

amitdo commented Mar 3, 2016

tesseract -l fra 242-bad.png stdout -psm 11
La mort a eu lieu

il y a 8 à 12 heures.

https://github.com/tesseract-ocr/tesseract/blob/master/api/tesseractmain.cpp#L97

@bruvi
Copy link
Author

bruvi commented Mar 3, 2016

Well, I did some digging in the image path before character recognition.

I think it all comes from the dark-blue background.

If the image is converted to gray-levels before thresholding (16 color palette image), the threshold is determined via gray-histogram study, and is set to 98
If the image is thresholded as a RGB picture, there are 3 histograms, giving 3 thresholds: 64(R), 64(G) and 132(B)
When the image is binarized (?), if it is an RGB picture, the pixels R,G,B values are tested against these thresholds, in that order. As soon as one of the 3 value is above its threshold, the pixel is considered part of a character.
In this pictures, the characters are outlined in dark gray (31,31,31) the light gray (64,64,64).
The dark-gray is always below thresholds, so it is considered background.
The light-gray is equal to the red threshold, so it is considered part of the character when red is tested in first -> thick letters in RGB mode
But it is below the gray-level threshold (98), so it is considered part of the background -> thin letters in gray-mode.

@bruvi
Copy link
Author

bruvi commented Mar 3, 2016

I'm not sure if it is a good idea to set a priority in the RGB thresholds so that red is tested before green, and then only blue. This priority seems only chosen by the classical RGB color order:
(from ccmain/thresholder.cpp)

      bool white_result = true;
      for (int ch = 0; ch < num_channels; ++ch) {
        int pixel = GET_DATA_BYTE(const_cast<void*>(
                                  reinterpret_cast<const void *>(linedata)),
                                  (x + rect_left_) * num_channels + ch);
        if (hi_values[ch] >= 0 &&
            (pixel > thresholds[ch]) == (hi_values[ch] == 0)) {
          white_result = false;
          break;
        }
      }

In fact the algorithm even calculates the alpha channel histogram, and if the result was not null, it would test the alpha value of the pixel against this threshold.
I think that this is just a waste of time.

Anyway, as for now, tesseract seems to do arbitrary choices in the process of color-images binarization:

  • if the image has a small palette it is first converted to gray
  • this gray conversion seems to use a more or less random formula: gray = 0.25 R + 0.5 G + 0.25 B
  • the first plane which claims to have detected a character wins the test (there could be a vote between the 3)
  • alpha gets treated in this process. The only sensible way to treat alpha would be IMHO to treat the pixel as background if alpha is small, so it would have to be tested first, and if small, prevent testing of other planes.

Maybe the sensible advice to users experiencing bad results with color images should be to do the gray conversion by themselves. There are many possible choices in this process: GIMP has tree options in the 'desaturate' dialog; my old scanner was just using the green channel when scanning in grey (as do many copiers). Maybe one of CMYK layers could be good in some cases...

In the case of these subtitles where, who knows why, ProjectX added a dark-blue background, selecting the red or green plane gives the best contrast between text and background, and leads to good results (which happens by chance on 'ok.png', would'nt work if ProjectX had chosen red background)

@amitdo
Copy link
Collaborator

amitdo commented Mar 4, 2016

Nice analysis!

https://github.com/tesseract-ocr/tesseract/blob/master/README.md
The README includes this sentence:

You should note that in many cases, in order to get better OCR results, you'll need to improve the quality of the image you are giving Tesseract.

So maybe the ImproveQuality wiki page is the right place to add your advice. You can edit the wiki yourself. The explanation should be as clear and short as possible.

@jbreiden
Copy link
Contributor

jbreiden commented Mar 9, 2016

The 0.25 R + 0.5 G + 0.25 B calculation designed to give fast results, not accurate ones. Sounds like someone may be calling pixConvertRGBToGrayFast() instead of pixConvertRGBToGray(). This seems like a bad idea if it is affecting results. And I also don't understand why we would ever threshold on R,G,B individually. The normal thing to do is calculate luminance, then threshold on luminance.

Leptonica is really, really good at image binarization. We should be making use of it.

@jbreiden
Copy link
Contributor

jbreiden commented Mar 9, 2016

@bruvi where exactly is the 0.25 R + 0.5 G + 0.25 B conversion happening?

@bruvi
Copy link
Author

bruvi commented Mar 9, 2016

It is in fact in leptonica.
In ccmain/thresholder.cpp, Tesseract calls leptonica's pixConvertTo8 when the input image has less than 8 bpp (or a colormap with few colors, typically 16):

  if (depth > 1 && depth < 8) {
    pix_ = pixConvertTo8(src, false);

For indexed-color images, pixConvertTo8 calls pixRemoveColormap (see [https://tpgit.github.io/Leptonica/pixconv_8c_source.html] ) with type = REMOVE_CMAP_TO_GRAYSCALE
In that case, the colormap is converted to gray with this code:

for (i = 0; i < pixcmapGetCount(cmap); i++) {
      graymap[i] = (rmap[i] + 2 * gmap[i] + bmap[i]) / 4;
}

So it is leptonica which uses this quick formula, although speed should not really be a concern here as it is applied just to the colormap, not the full image.

Sorry, I don't have much time this week to follow this thread.
Thanks for the interest.

@bruvi bruvi closed this as completed Mar 9, 2016
@bruvi bruvi reopened this Mar 9, 2016
@jbreiden
Copy link
Contributor

If we use sRGB perceptual weightings then it works.

graymap[i] = 0.2126 * rmap[i] + 0.7152 * gmap[i] + 0.0722 * bmap[i];

tesseract -l fra /tmp/nak.png -
La mort a eu lieu
il y a 8 à 12 heures.

If we use "Leptonica" perceptual weightings, then it fails.

graymap[i] = 0.3 * rmap[i] + 0.5 * gmap[i] + 0.2 * bmap[i];

tesseract -l fra /tmp/nak.png -
La morte eu lieu
il y a 8 à 12 heures.

If I remove the colormap before anything else, then it works. I think this is the right solution because it guarantees consistency (e.g. colormap vs. non-colormap will not change results if image is otherwise identical). Leptonica should switch to some sort of perceptual weighting for the next release because that is a no-brainer, but will not have any effect on Tesseract if we remove the colormap first.

--- tesseract/ccmain/thresholder.cpp    2014-07-11 11:28:02.000000000 -0700
+++ tesseract/ccmain/thresholder.cpp    2016-03-11 10:01:36.000000000 -0800
@@ -149,17 +149,23 @@
   if (pix_ != NULL)
     pixDestroy(&pix_);
   Pix* src = const_cast<Pix*>(pix);
-  int depth;
-  pixGetDimensions(src, &image_width_, &image_height_, &depth);
   // Convert the image as necessary so it is one of binary, plain RGB, or
   // 8 bit with no colormap.
+  Pix *tmp;
+  if (pixGetColormap(src)) {
+    tmp = pixRemoveColormap(src, REMOVE_CMAP_BASED_ON_SRC);
+  } else {
+    tmp = pixClone(src);
+  }
+  int depth;
+  pixGetDimensions(tmp, &image_width_, &image_height_, &depth);
+
   if (depth > 1 && depth < 8) {
-    pix_ = pixConvertTo8(src, false);
-  } else if (pixGetColormap(src)) {
-    pix_ = pixRemoveColormap(src, REMOVE_CMAP_BASED_ON_SRC);
+    pix_ = pixConvertTo8(tmp, false);
   } else {
-    pix_ = pixClone(src);
+    pix_ = pixClone(tmp);
   }
+  pixDestroy(&tmp);
   depth = pixGetDepth(pix_);
   pix_channels_ = depth / 8;
   pix_wpl_ = pixGetWpl(pix_);
tesseract -l fra /tmp/nak.png -
La mort a eu lieu
il y a 8 à 12 heures.

@zdenop
Copy link
Contributor

zdenop commented Mar 11, 2016

@jbreiden: will you make a PR or should I commit patch from above?

@jbreiden
Copy link
Contributor

Ray is going to write the official change. Best to coordinate with him. He is doing the same thing, but with a different way of writing the code.

@zdenop
Copy link
Contributor

zdenop commented Nov 24, 2016

patch committed with c1c1e42

@zdenop zdenop closed this as completed Nov 24, 2016
zdenop added a commit that referenced this issue Nov 25, 2016
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
…same image is lossless-encoded at different bpp
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
…same image is lossless-encoded at different bpp
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
…same image is lossless-encoded at different bpp
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
…same image is lossless-encoded at different bpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants