Output PDF has loss of data #1607

Closed
zain910128 opened this issue Feb 4, 2023 · 6 comments
Labels: is-bug (From a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF)

Comments

zain910128 commented Feb 4, 2023

I am using the following code to resize pages in a PDF:

from pypdf import PdfReader, PdfWriter, Transformation, PageObject, PaperSize
from pypdf.generic import RectangleObject

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    A4_w = PaperSize.A4.width
    A4_h = PaperSize.A4.height

    # resize page to fit *inside* A4
    h = float(page.mediabox.height)
    w = float(page.mediabox.width)
    scale_factor = min(A4_h / h, A4_w / w)

    transform = (
        Transformation()
        .scale(scale_factor, scale_factor)
        .translate(0, A4_h / 2 - h * scale_factor / 2)
    )
    page.add_transformation(transform)

    page.cropbox = RectangleObject((0, 0, A4_w, A4_h))

    # merge the pages to fit inside A4

    # prepare A4 blank page
    page_A4 = PageObject.create_blank_page(width=A4_w, height=A4_h)
    page.mediabox = page_A4.mediabox
    page_A4.merge_page(page)

    writer.add_page(page_A4)
writer.write("output.pdf")

Source: https://stackoverflow.com/a/75274841/11501160
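
To make the arithmetic in the snippet above concrete, here is a minimal sketch of the scale and offset computation in plain Python, without pypdf. The helper name `fit_inside` and the horizontal offset `tx` are additions for illustration (the original code only centres vertically), and the page sizes are assumptions: US Letter at 612 x 792 points and A4 at 595 x 842 points.

```python
def fit_inside(page_w, page_h, target_w, target_h):
    """Return (scale, tx, ty) that fits a page inside a target box, centred.

    Hypothetical helper: mirrors the min(...) scale factor and the
    vertical centring from the snippet above, and adds horizontal centring.
    """
    # Use the smaller ratio so both dimensions fit inside the target.
    scale = min(target_w / page_w, target_h / page_h)
    # Centre the scaled page inside the target box.
    tx = (target_w - page_w * scale) / 2
    ty = (target_h - page_h * scale) / 2
    return scale, tx, ty


# Assumed sizes in points: US Letter page, A4 target.
scale, tx, ty = fit_inside(612, 792, 595, 842)
print(scale, tx, ty)
```

The scale factor here is 595/612, which is not an integer. Fractional factors like this are the normal case for such a resize, so the transformed page content ends up containing non-integer numbers.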

The resizing itself works, and most input files produce correct output, but some do not.

I am providing download links to the input.pdf and output.pdf files for testing and review. The output file is completely different from the input file: the images are missing, the background colour is different, and even the plain text on the first page shows only its first line.

Interestingly, these differences appear only when I open the output PDF in Adobe Acrobat or look at the physically printed pages. The PDF looks perfect when I open it in Preview (on macOS) or in the Chrome browser.

[Screenshot of input file]

[Screenshot of output file]

I created the input PDF in Preview (on macOS) by mixing pages from different PDFs and dragging image files into the thumbnails, as per these instructions: https://support.apple.com/en-ca/HT202945. I have never had a problem with PDFs made this way, and even Adobe Acrobat reads the input PDF properly. Only the output PDF is problematic in Acrobat and on printers.

Is this a bug in pypdf, or am I doing something wrong? How can I get the output PDF to display properly in Adobe Acrobat and on printers?

mrknwk commented Feb 5, 2023

This is indeed a pypdf bug, present since version 2.10.9. It will be fixed with tomorrow's update to version 3.4.0 (PR #1563).
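
As a rough way to tell whether an installation falls in the range mentioned above, the following sketch compares the installed version against the two bounds from this comment (first affected 2.10.9, first fixed 3.4.0). The helper names are hypothetical, the parsing assumes a plain `X.Y.Z` version string, and the distribution name `pypdf` is an assumption (releases before 3.0 were published under a different package name).

```python
from importlib.metadata import PackageNotFoundError, version

# Bounds taken from the comment above.
FIRST_AFFECTED = (2, 10, 9)
FIRST_FIXED = (3, 4, 0)


def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of ints (assumes purely numeric parts)."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_affected(text):
    """True if the version lies in [FIRST_AFFECTED, FIRST_FIXED)."""
    return FIRST_AFFECTED <= parse_version(text) < FIRST_FIXED


try:
    installed = version("pypdf")  # assumed distribution name
    print(installed, "affected" if is_affected(installed) else "not affected")
except PackageNotFoundError:
    print("pypdf is not installed")
except ValueError:
    print("version string is not in plain X.Y.Z form")
```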

When PDF pages are transformed with non-integer values, the result contains floating-point numbers with a precision of more than 19 decimal places. Unlike other PDF viewers, Acrobat cannot handle this. See GitHub issue #1376 for details.
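
The following is a minimal sketch of the general idea, not pypdf's actual fix: a non-integer value can expand to a very long decimal string, and writing it with a bounded number of decimal places keeps it short. The function name `format_pdf_number` and the limit of 5 decimal places are assumptions chosen for illustration.

```python
from decimal import Decimal


def format_pdf_number(value, max_decimals=5):
    """Format a number with at most max_decimals fractional digits.

    Hypothetical helper: trailing zeros and a dangling decimal point
    are dropped so that integers are written without a fraction.
    """
    text = f"{value:.{max_decimals}f}"
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    # Normalise an empty result or negative zero to plain "0".
    return "0" if text in ("", "-0") else text


# The exact decimal expansion of a binary float can be very long.
raw = str(Decimal(0.1))
print(len(raw), raw)

# Bounded formatting keeps the written value short.
print(format_pdf_number(0.1))
print(format_pdf_number(595 / 612))
```

Rounding to a few decimal places changes coordinates by far less than anything visible on a page, which is why limiting the written precision is a reasonable trade-off here.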

MartinThoma added the is-bug label Feb 5, 2023
@pubpub-zz
Collaborator

@zain910128
Can you confirm this PR can now be closed?

@pubpub-zz
Collaborator

I am closing this issue as fixed. @zain910128, feel free to provide more information if you want to reopen it.

@zain910128
Author

Sorry, I didn't get a chance to check earlier.
I have checked now, and the issue is partially resolved.

The original input file that I provided is now converted properly and looks fine in all PDF viewers.

Then I tried a new input file, which is very similar but has one extra page. It is attached here for reference:
input.pdf

The output for this file has the wrong colours in macOS Preview, Google Drive, and the browser's PDF viewer, but looks fine in Adobe Acrobat.

So I think we have to reopen this issue.

This may be related to my other issue here:
#1615

MartinThoma reopened this Feb 13, 2023
@pubpub-zz
Collaborator

The latest inputs look like a duplicate of #1615.
@MartinThoma I propose to close this issue and keep tracking it in #1615.

@MartinThoma
Member

That's fine for me :-) I also had the feeling the issue was duplicated :-)
