Leptonica 1.83.0 breaks tesseract, which in turn breaks pdfsandwich #659

Closed
swsch opened this issue Jan 24, 2023 · 10 comments

Comments

@swsch

swsch commented Jan 24, 2023

Greetings.

After updating a Gentoo box to leptonica 1.83.0, pdfsandwich stopped working. Some experimenting let me trace the problem to leptonica, as you can see in the bug report I filed as #891833 in Gentoo's Bugzilla.

In short: the same install of pdfsandwich and tesseract fails with leptonica 1.83.0 while it works with 1.82.0.

The relevant parts of pdfsandwich's verbose output:

# pdfsandwich -lang deu -gray -verbose -o 'test.pdf' 20230116_095121_3.pdf
pdfsandwich version 0.1.7
Version: ImageMagick 7.1.0-48 Q16 x86_64 20449 https://imagemagick.org/
Compiler: gcc (12.2)
unpaper 7.0.0
tesseract 5.3.0
 leptonica-1.83.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.3) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libopenjp2 2.5.0
 Found OpenMP 201511
 Found libarchive 3.6.1 zlib/1.2.13 liblzma/5.2.9 bz2lib/1.0.8
 Found libcurl/7.87.0 OpenSSL/1.1.1s zlib/1.2.13 libidn2/2.3.4 nghttp2/1.51.0
GPL Ghostscript 10.00.0 (2022-09-21)
pdfinfo version 23.01.0
pdfunite version 23.01.0
...
Input file: "20230116_095121_3.pdf"
Output file: "test.pdf"
Number of pages in inputfile: 1
More threads than pages. Using 1 threads instead.
Processing page 1.
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmp7852e4/pdfsandwich_inputfileb08248.pdf[0]"
convert -units PixelsPerInch  -colorspace gray -depth 8 -background white -flatten -alpha Off -density 300x300  "/tmp/pdfsandwich_tmp7852e4/pdfsandwich_inputfileb08248.pdf[0]" /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm
Processing sheet #1: /tmp/pdfsandwich_tmp7852e4/pdfsandwich214f21.pgm -> /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm
[pgm_pipe @ 0x55b31216f9c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55b31216f9c0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55b31216f9c0] Encoder did not produce proper pts, making some up.
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmp7852e4/pdfsandwich20e86d_unpaper.pgm /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif /tmp/pdfsandwich_tmp7852e4/pdfsandwich60ad08  -l deu pdf

Error in l_generateCIDataForPdf: cid not made from file
Error during processing.
ERROR: Command "OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp7852e4/pdfsandwich2d5f64.tif /tmp/pdfsandwich_tmp7852e4/pdfsandwich60ad08  -l deu pdf " failed.
Terminating pdfsandwich. All temporary files are kept.

After replacing 1.83.0 with 1.82.0, the same file is handled as expected:

# pdfsandwich -lang deu -gray -verbose -o 'test.pdf' 20230116_095121_3.pdf
pdfsandwich version 0.1.7
...
tesseract 5.3.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.3) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libopenjp2 2.5.0
...
Processing page 1.
identify -format "%w\n%h\n"  "/tmp/pdfsandwich_tmp93c02c/pdfsandwich_inputfile09c03e.pdf[0]"
convert -units PixelsPerInch  -colorspace gray -depth 8 -background white -flatten -alpha Off -density 300x300  "/tmp/pdfsandwich_tmp93c02c/pdfsandwich_inputfile09c03e.pdf[0]" /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm
unpaper --overwrite  --no-grayfilter --layout none /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm
Processing sheet #1: /tmp/pdfsandwich_tmp93c02c/pdfsandwich09012e.pgm -> /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm
[pgm_pipe @ 0x562b4dcf59c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x562b4dcf59c0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x562b4dcf59c0] Encoder did not produce proper pts, making some up.
convert -units PixelsPerInch -density 300x300 /tmp/pdfsandwich_tmp93c02c/pdfsandwich29d08d_unpaper.pgm /tmp/pdfsandwich_tmp93c02c/pdfsandwich038143.tif
OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmp93c02c/pdfsandwich038143.tif /tmp/pdfsandwich_tmp93c02c/pdfsandwich91968e  -l deu pdf
OCR pdf generated. Renaming output file to /tmp/pdfsandwich_tmp93c02c/pdfsandwich42b301.pdf

OCR done. Writing "test.pdf"
mv "/tmp/pdfsandwich_tmp93c02c/pdfsandwich42b301.pdf" "test.pdf"

test.pdf generated.

Done.

@DanBloomberg
Owner

DanBloomberg commented Jan 24, 2023

I believe the problem is in pdfio2.c, lines 569-570.

        if (!cid)
            return ERROR_INT("cid not made from file", __func__, 1);

Please remove those two lines and see if the test succeeds.
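
For readers wondering why deleting an error check can be a fix: the following is a minimal, self-contained sketch of one control flow that would produce this symptom. It is an assumption about the shape of the problem, not the actual leptonica source. All names here (Format, CompData, data_from_file, data_from_image, generate) are hypothetical. The idea is that a direct-from-file path exists only for some formats, a fallback works from the decoded image, and an early return between the two makes the fallback unreachable for inputs such as the .tif file in the log above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; none of these names are leptonica's. */
typedef enum { FMT_JPEG, FMT_PNG, FMT_TIFF } Format;
typedef struct { const char *origin; } CompData;

/* Direct path: assumed to be available only for some file formats. */
static CompData *data_from_file(Format f)
{
    if (f != FMT_JPEG && f != FMT_PNG)
        return NULL;                      /* e.g. TIFF: no shortcut exists */
    CompData *cid = malloc(sizeof *cid);
    if (cid)
        cid->origin = "file";
    return cid;
}

/* Fallback path: assumed to work from the decoded image for any format. */
static CompData *data_from_image(void)
{
    CompData *cid = malloc(sizeof *cid);
    if (cid)
        cid->origin = "image";
    return cid;
}

/* Returns 0 on success, 1 on error. A nonzero early_return models the
 * two-line check that was removed; zero models the patched behaviour. */
int generate(Format f, int early_return, const char **origin)
{
    CompData *cid = data_from_file(f);

    if (!cid && early_return) {
        fprintf(stderr, "Error in generate: cid not made from file\n");
        return 1;                         /* the fallback below is never reached */
    }

    if (!cid)
        cid = data_from_image();          /* fallback for formats without a shortcut */
    if (!cid)
        return 1;

    *origin = cid->origin;                /* points at a string literal, safe after free */
    free(cid);
    return 0;
}
```

In this model, generate(FMT_TIFF, 1, &origin) returns 1 with the error message, while generate(FMT_TIFF, 0, &origin) returns 0 and reports that the data came from the image. Formats that have a direct path behave the same with or without the check.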

@swsch
Author

swsch commented Jan 24, 2023

Removing these lines allows processing of similar files without error, so the patch should be good.

Many thanks for the quick response.

@DanBloomberg
Owner

Excellent. The fix is now in.

@swsch
Author

swsch commented Jan 24, 2023

Will there be a point release including the patch? If not, I'll suggest adding the patch to the gentoo package, so that 1.83.0 will be working there, too.

@DanBloomberg
Owner

@stweil

It's a bit of work to make a patch release. I'll follow the advice of the tesseract maintainers, which is why I left this issue open for now.

@stweil
Collaborator

stweil commented Jan 25, 2023

Are you referring to a patch release 1.83.1? As the latest code is already prepared for 1.84.0, a patch release would need a branch 1.83 (I can add that if you want) and a list of patches which should be added.

Which commits after 1.83.0 should be included in the patch release, too? I'd suggest these commits:

Are there others?

@DanBloomberg
Owner

That's a nice offer, Stefan.

I can also do it without a branch, modifying 1.84.0 --> 1.83.1 and including all existing commits.
Then wait a few days before changing 1.83.1 --> 1.84.0.

@DanBloomberg
Owner

But on second thought, it might be easier for you. Those two commits are the only important ones.

@stweil
Collaborator

stweil commented Jan 25, 2023

See pull request #660 which adds the required changes for 1.83.1 to the new branch 1.83.

@DanBloomberg
Owner

Much thanks, Stefan. Except for a patch on 1.81, this is the only patch that has been required for 5 years, since 1.75.

Closing this issue.
