Allow l_generateCIDataForPdf() to work with jpeg files without
transcoding and without requiring reading jpeg data from memory,
or writing temp files.  (Request for tesseract 3.x on android
that doesn't have fmemopen() and can't write temp files.)
DanBloomberg committed Mar 3, 2021
1 parent 7b453b5 commit aedf815
Showing 3 changed files with 66 additions and 24 deletions.
28 changes: 16 additions & 12 deletions src/adaptmap.c
@@ -274,12 +274,12 @@ pixBackgroundNormSimple(PIX *pixs,
* Notes:
* (1) This is a top-level interface for normalizing the image intensity
* by mapping the image so that the background is near the input
- * value 'bgval'.
+ * value %bgval.
* (2) The input image is either grayscale or rgb.
* (3) For each component in the input image, the background value
* in each tile is estimated using the values in the tile that
* are not part of the foreground, where the foreground is
- * determined by the input 'thresh' argument.
+ * determined by %thresh.
* (4) An optional binary mask can be specified, with the foreground
* pixels typically over image regions. The resulting background
* map values will be determined by surrounding pixels that are
@@ -293,25 +293,29 @@ pixBackgroundNormSimple(PIX *pixs,
* grayscale version can be used elsewhere. If the input is RGB
* and this is not supplied, it is made internally using only
* the green component, and destroyed after use.
- * (6) The dimensions of the pixel tile (sx, sy) give the amount by
+ * (6) The dimensions of the pixel tile (%sx, %sy) give the amount by
* by which the map is reduced in size from the input image.
- * (7) The threshold is used to binarize the input image, in order to
+ * (7) The input image is binarized using %thresh, in order to
* locate the foreground components. If this is set too low,
* some actual foreground may be used to determine the maps;
* if set too high, there may not be enough background
- * to determine the map values accurately. Typically, it's
+ * to determine the map values accurately. Typically, it is
* better to err by setting the threshold too high.
- * (8) A 'mincount' threshold is a minimum count of pixels in a
+ * (8) A %mincount threshold is a minimum count of pixels in a
* tile for which a background reading is made, in order for that
* pixel in the map to be valid. This number should perhaps be
* at least 1/3 the size of the tile.
- * (9) A 'bgval' target background value for the normalized image. This
+ * (9) A %bgval target background value for the normalized image. This
* should be at least 128. If set too close to 255, some
- * clipping will occur in the result.
- * (10) Two factors, 'smoothx' and 'smoothy', are input for smoothing
- * the map. Each low-pass filter kernel dimension is
- * is 2 * (smoothing factor) + 1, so a
- * value of 0 means no smoothing. A value of 1 or 2 is recommended.
+ * clipping will occur in the result. It is recommended to use
+ * %bgval = 200.
+ * (10) Two factors, %smoothx and %smoothy, are input for smoothing
+ * the map. Each low-pass filter kernel dimension is
+ * is 2 * (smoothing factor) + 1, so a
+ * value of 0 means no smoothing. A value of 1 or 2 is recommended.
+ * (11) See pixCleanBackgroundToWhite(). The recommended value for %bgval
+ * is 200. As done there, pixBackgroundNorm() is typically followed
+ * by pixGammaTRC(), where the maxval must not not exceed %bgval.
* </pre>
*/
PIX *
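The notes in this hunk describe two small pieces of arithmetic: the smoothing kernel dimension of 2 * (smoothing factor) + 1, and the idea of scaling each pixel so that the local background lands on %bgval, with clipping if the result exceeds 255. The sketch below illustrates both in isolation. It is not Leptonica code: `smoothing_kernel_dim()` and `normalize_sample()` are hypothetical helper names, and the per-sample scaling with rounding is an assumption about the general idea, not the library's actual map-based implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Kernel dimension for a given smoothing factor, per note (10):
 * 2 * factor + 1, so a factor of 0 gives a 1-pixel kernel (no smoothing). */
int smoothing_kernel_dim(int factor)
{
    assert(factor >= 0);
    return 2 * factor + 1;
}

/* Hypothetical helper, not a Leptonica function.
 * Scales one 8-bit sample so that the local background estimate 'bg'
 * is mapped to the target 'bgval'. Results above 255 are clipped,
 * which is why a target very close to 255 loses highlight detail. */
uint8_t normalize_sample(uint8_t val, uint8_t bg, uint8_t bgval)
{
    assert(bg > 0);  /* a zero background estimate cannot be scaled */
    unsigned scaled = ((unsigned)val * bgval + bg / 2) / bg;  /* rounded */
    return (uint8_t)(scaled > 255 ? 255 : scaled);
}
```

With a background estimate of 100 and a target of 200, a sample at the background level maps to 200, a darker sample of 50 maps to 100, and a sample of 200 would scale to 400 and is clipped to 255.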
59 changes: 49 additions & 10 deletions src/pdfio2.c
@@ -808,43 +808,82 @@ PIXCMAP *cmap = NULL;
* ~ 0 for binary data (not permitted in PostScript)
* ~ 1 for ascii85 (5 for 4) encoded binary data
* (not permitted in pdf)
- * (2) Do not free the data. l_generateJpegDataMem() will free
- * the data if the data is invalid, or if it does not use
- * ascii encoding.
+ * (2) Most of this function is repeated in l_generateJpegMemData(),
+ * which is required in pixacompFastConvertToPdfData().
* </pre>
*/
L_COMP_DATA *
l_generateJpegData(const char *fname,
l_int32 ascii85flag)
{
+ char *data85 = NULL; /* ascii85 encoded jpeg compressed file */
l_uint8 *data = NULL;
- size_t nbytes;
+ l_int32 w, h, xres, yres, bps, spp;
+ size_t nbytes, nbytes85;
+ L_COMP_DATA *cid;
+ FILE *fp;

PROCNAME("l_generateJpegData");

if (!fname)
return (L_COMP_DATA *)ERROR_PTR("fname not defined", procName, NULL);

- /* The returned jpeg data in memory is the entire jpeg file,
- * which starts with ffd8 and ends with ffd9 */
+ /* Read the metadata */
+ if (readHeaderJpeg(fname, &w, &h, &spp, NULL, NULL))
+ return (L_COMP_DATA *)ERROR_PTR("bad jpeg metadata", procName, NULL);
+ bps = 8;
+ if ((fp = fopenReadStream(fname)) == NULL)
+ return (L_COMP_DATA *)ERROR_PTR("stream not opened", procName, NULL);
+ fgetJpegResolution(fp, &xres, &yres);
+ fclose(fp);
+
+ /* Read the entire jpeg file. The returned jpeg data in memory
+ * starts with ffd8 and ends with ffd9 */
if ((data = l_binaryRead(fname, &nbytes)) == NULL)
return (L_COMP_DATA *)ERROR_PTR("data not extracted", procName, NULL);

- return l_generateJpegDataMem(data, nbytes, ascii85flag);
+ /* Optionally, encode the compressed data */
+ if (ascii85flag == 1) {
+ data85 = encodeAscii85(data, nbytes, &nbytes85);
+ LEPT_FREE(data);
+ if (!data85)
+ return (L_COMP_DATA *)ERROR_PTR("data85 not made", procName, NULL);
+ else
+ data85[nbytes85 - 1] = '\0'; /* remove the newline */
+ }
+
+ cid = (L_COMP_DATA *)LEPT_CALLOC(1, sizeof(L_COMP_DATA));
+ if (ascii85flag == 0) {
+ cid->datacomp = data;
+ } else { /* ascii85 */
+ cid->data85 = data85;
+ cid->nbytes85 = nbytes85;
+ }
+ cid->type = L_JPEG_ENCODE;
+ cid->nbytescomp = nbytes;
+ cid->w = w;
+ cid->h = h;
+ cid->bps = bps;
+ cid->spp = spp;
+ cid->res = xres;
+ return cid;
}


/*!
* \brief l_generateJpegDataMem()
*
- * \param[in] data of jpeg file
- * \param[in] nbytes of jpeg file
+ * \param[in] data of jpeg-encoded file
+ * \param[in] nbytes size of jpeg-encoded file
* \param[in] ascii85flag 0 for jpeg; 1 for ascii85-encoded jpeg
* \return cid containing jpeg data, or NULL on error
*
* <pre>
* Notes:
- * (1) See l_generateJpegData().
+ * (1) Set ascii85flag:
+ * ~ 0 for binary data (not permitted in PostScript)
+ * ~ 1 for ascii85 (5 for 4) encoded binary data
+ * (not permitted in pdf)
* </pre>
*/
L_COMP_DATA *
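Two details in the pdfio2.c comments above are easy to illustrate on their own: a whole jpeg file in memory starts with ffd8 and ends with ffd9, and ascii85 is a "5 for 4" encoding that turns each 4-byte group into 5 printable characters. The sketch below shows both under stated assumptions. `looks_like_whole_jpeg()` and `ascii85_encode_group()` are hypothetical helpers written for this illustration, not part of Leptonica, and the encoder handles only one full 4-byte group (no 'z' shortcut for all-zero groups, no partial trailing group, no line breaks), so it is not a substitute for encodeAscii85().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the buffer begins with the
 * start-of-image marker (ff d8) and ends with the end-of-image
 * marker (ff d9), 0 otherwise. This checks the framing only; it
 * does not validate the jpeg stream in between. */
int looks_like_whole_jpeg(const uint8_t *data, size_t nbytes)
{
    if (data == NULL || nbytes < 4)
        return 0;
    return data[0] == 0xff && data[1] == 0xd8 &&
           data[nbytes - 2] == 0xff && data[nbytes - 1] == 0xd9;
}

/* Hypothetical helper: encodes exactly 4 input bytes as 5 ascii85
 * characters plus a terminating NUL, so 'out' must hold 6 chars.
 * The 4 bytes are read as a big-endian 32-bit value and written as
 * 5 base-85 digits, each offset by '!' (ASCII 33). */
void ascii85_encode_group(const uint8_t *in, char *out)
{
    assert(in != NULL && out != NULL);
    uint32_t word = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                    ((uint32_t)in[2] << 8) | (uint32_t)in[3];
    for (int i = 4; i >= 0; i--) {   /* least significant digit last */
        out[i] = (char)('!' + word % 85);
        word /= 85;
    }
    out[5] = '\0';
}
```

For instance, the four bytes of "Man " (with the trailing space) encode to "9jqo^", and four zero bytes encode to "!!!!!" in this simplified form.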
3 changes: 1 addition & 2 deletions src/pix.h
@@ -423,8 +423,7 @@ enum {
*
* (6) The version numbers (below) are used in the serialization
* of these data structures. They are placed in the files,
- * and rarely (if ever) change. Provision is currently made for
- * backward compatibility in reading from boxaa version 2.
+ * and rarely (if ever) change.
*
* (7) The serialization dependencies are as follows:
* pixaa : pixa : boxa
