[Android] PdfRenderer can't include JPG files to PDFs produced on Android (PNG files works) #3317

Robyer · 2021-03-02T20:12:41Z

I received a report from a user (adaptech-cz/Tesseract4Android#31) that the Tesseract4Android library (which uses Tesseract 4.x/5.x) produces corrupted PDFs for JPG files (PNG files work), while the tess-two library (which uses Tesseract 3.x.x) handles JPG files correctly.

After some debugging I found that the difference is probably caused by commit f794571.

If I let the code always go through the first branch (sad = pixGenerateCIData(pix, L_FLATE_ENCODE, 0, 0, &cid);) for both PNG and JPG, it produces PDF files correctly.
But if the code goes through the second branch (sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);), that call always fails.

The second branch fails because Leptonica tries to load the file via fmemopen, which is not available on Android. As a work-around it first writes the data to a temporary file created with tmpfile() and then processes that file. But on Android the /tmp directory is not available, so every call to tmpfile() fails. As a result Leptonica can never process the file, and the PDF is produced without the image data.

(Note: fmemopen support was added to the Android NDK starting with API 23, but I'd like to keep supporting older API levels, as the original libraries do, for as long as possible.)

A quick fix would be to always use the first branch (in the code mentioned above) on Android via a preprocessor macro. But I'm not familiar with the PDF format and I don't know exactly what the L_FLATE_ENCODE method does.
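A rough sketch of what such a preprocessor guard could look like, written as self-contained C rather than as a patch to the renderer. All names here (pdf_image_path, input_format, PDF_FORCE_FLATE, choose_image_path) are hypothetical, and the sketch assumes the first branch is normally taken for PNG input, as the report implies.

```c
/* Hypothetical stand-ins for the two branches named in the issue:
 * ENCODE_FLATE     ~ pixGenerateCIData(pix, L_FLATE_ENCODE, ...)
 * ENCODE_FROM_FILE ~ l_generateCIDataForPdf(filename, pix, ...) */
typedef enum { ENCODE_FLATE, ENCODE_FROM_FILE } pdf_image_path;
typedef enum { INPUT_PNG, INPUT_JPEG, INPUT_OTHER } input_format;

/* Assumed condition: old Android API levels where the second branch
 * cannot work. Everywhere else the flag is 0 and behavior is unchanged. */
#if defined(__ANDROID__) && defined(__ANDROID_API__) && __ANDROID_API__ < 23
#define PDF_FORCE_FLATE 1
#else
#define PDF_FORCE_FLATE 0
#endif

/* Pick the branch; force_flate is a parameter so both settings can be
 * exercised on any platform. */
pdf_image_path choose_image_path(input_format fmt, int force_flate)
{
    if (force_flate || fmt == INPUT_PNG)
        return ENCODE_FLATE;
    return ENCODE_FROM_FILE;
}
```

A caller would pass PDF_FORCE_FLATE as the second argument. The trade-off is that the first branch re-encodes the decoded pixels instead of taking the second branch, so JPEG input may produce larger PDFs.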

Should I provide a PR for this change, or is such a work-around a bad idea?


zdenop · 2021-03-03T07:37:03Z

It is a bad idea:

If you have no clue what you are doing, you usually cause other problems.
If the problem is in leptonica, it should be fixed there. Maybe @DanBloomberg can suggest a solution for old Android versions.
L_FLATE_ENCODE indicates compression of pdf images/streams.

So if I understand it right: you take a JPEG as input and you wish to store it in a PNG-like image format in the PDF. If this is the desired workaround for you, you can do it in your app instead of modifying tesseract.

Robyer · 2021-03-03T13:25:38Z

> It is a bad idea:
>
> If you have no clue what you are doing, you usually cause other problems.

Of course, but in this case I just used code that had existed in this repository for some time before it was changed by commit f794571.

The reason for that commit was to fix the issue named "jpg input files result in much bigger pdf". So I concluded that probably the worst thing that could happen by reverting it (only for Android) is producing bigger PDF files, but at least it would work again (on Android). That's why I mentioned it in this issue as a possible solution/work-around that I have found. I just don't know if it's the correct one.

> So if I understand it right: you take a JPEG as input and you wish to store it in a PNG-like image format in the PDF. If this is the desired workaround for you, you can do it in your app instead of modifying tesseract.

I want to take a JPEG file as input and produce a PDF. But currently the produced PDF doesn't contain the JPEG image at all (and the PDF reader shows an error when opening the file).
When I take a PNG file as input instead, it produces the PDF file correctly.
When I revert the above-mentioned commit (f794571), it starts producing PDF files correctly for JPEG files.

DanBloomberg · 2021-03-03T22:08:48Z

I made this comment on the commit from 2 years ago -- it should have been here:

This is a leptonica issue only to the extent that leptonica does not allow a direct to memory jpeg encoding of pix raster images, but instead requires encoding into a temporary file.

Because that implementation is not likely to happen anytime soon, and as fmemopen() is now available on android, there is not much incentive.

The simplest work-around without changing code is, as @zdenop mentioned, to convert the images to png before making the pdf.

You can do this in leptonica, using either:

pixWriteMemPng() to write the pix to a buffer in png compressed format, or
pixWritePng() to write the pix to a file in png format

However:
Looking at the code for l_generateJpegData(), I can refactor this to eliminate l_generateJpegDataMem(),
and thereby avoid the problematic fopenReadFromMemory() call.

This leaves me with one problem: in pixcompFastConvertToPdfData(), we avoid transcoding because
pixc is in compressed jpeg form in memory. For that, l_generateJpegDataMem() is the needed function.

I have implemented the change, and it is now pushed.
Let me know if this fixes your problem.

Dan

amitdo · 2021-03-04T00:33:54Z

Dan, I see you use _open().

https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/open-wopen?view=msvc-160

Opens a file. These functions are deprecated because more-secure versions are available; see _sopen_s, _wsopen_s.

Robyer · 2021-03-04T20:57:24Z

@DanBloomberg Thank you! Your change fixes the issue.

See discussion at tesseract-ocr/tesseract#3317

Robyer closed this as completed Mar 4, 2021

Robyer added a commit to adaptech-cz/Tesseract4Android that referenced this issue Mar 4, 2021

Correct fix for issue #31

757271e

See discussion at tesseract-ocr/tesseract#3317

amitdo added leptonica PDF labels Mar 21, 2021
