Output PDF has loss of data #1607

zain910128 · 2023-02-04T23:22:15Z

I am using the following code to resize pages in a PDF:

from pypdf import PdfReader, PdfWriter, Transformation, PageObject, PaperSize
from pypdf.generic import RectangleObject

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    A4_w = PaperSize.A4.width
    A4_h = PaperSize.A4.height

    # resize page to fit *inside* A4
    h = float(page.mediabox.height)
    w = float(page.mediabox.width)
    scale_factor = min(A4_h / h, A4_w / w)

    transform = (
        Transformation()
        .scale(scale_factor, scale_factor)
        .translate(0, A4_h / 2 - h * scale_factor / 2)
    )
    page.add_transformation(transform)

    page.cropbox = RectangleObject((0, 0, A4_w, A4_h))

    # merge the pages to fit inside A4

    # prepare A4 blank page
    page_A4 = PageObject.create_blank_page(width=A4_w, height=A4_h)
    page.mediabox = page_A4.mediabox
    page_A4.merge_page(page)

    writer.add_page(page_A4)
writer.write("output.pdf")

Source: https://stackoverflow.com/a/75274841/11501160
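For illustration, the scaling arithmetic from the loop above can be isolated into a small helper. This is only a sketch: `fit_inside` is a hypothetical name, and the page sizes (A4 as 595 x 842 points, US Letter as 612 x 792 points) are assumed values rather than something read from the attached files.

```python
def fit_inside(w, h, target_w=595.0, target_h=842.0):
    """Return (scale, dx, dy) that fits a w x h page inside the target size.

    Hypothetical helper; the defaults assume A4 in points (595 x 842).
    """
    # Use the smaller ratio so both dimensions fit inside the target.
    scale = min(target_w / w, target_h / h)
    # Offsets that would center the scaled page. The snippet above applies
    # only the vertical offset and leaves the horizontal one at 0.
    dx = (target_w - w * scale) / 2
    dy = (target_h - h * scale) / 2
    return scale, dx, dy


# US Letter (612 x 792 points, assumed) scaled to fit inside A4.
scale, dx, dy = fit_inside(612.0, 792.0)
print(scale, dx, dy)
```

In this example the scale factor is 595/612, which is not an integer, so every transformed coordinate becomes a fractional value.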

The resizing itself works, and most input files convert fine, but some do not.

I am providing download links to the input.pdf and output.pdf files for testing and review. The output file looks completely different from the input file: the images are missing, the background colour is different, and even the plain text on the first page shows only its first line.

Interestingly, these differences only appear when I open the output PDF in Adobe Acrobat or look at the physically printed pages. The PDF looks perfect when I open it in Preview (on macOS) or in my Chrome browser.


I created the input PDF in Preview (on macOS) by mixing pages from different PDFs and dragging image files into the thumbnails, as per these instructions: https://support.apple.com/en-ca/HT202945. I've never had a problem making PDFs this way before, and even Adobe Acrobat reads the input PDF properly. Only the output PDF is problematic in Acrobat and on printers.

Is this a bug in pypdf, or am I doing something wrong? How can I get the output PDF to render properly in Adobe Acrobat, on printers, etc.?

mrknwk · 2023-02-05T00:18:54Z

This is indeed a pypdf bug, which has been present since version 2.10.9. The problem will be fixed with tomorrow's update to version 3.4.0 (PR #1563).
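As a rough aid, the version range reported in this comment can be turned into a small check. This is a sketch under assumptions: `parse_version` and `in_reported_range` are hypothetical helpers, they only handle plain dotted numeric versions, and the bounds are simply the two versions named above. The installed version string could be taken, for example, from `pypdf.__version__`.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.4.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_reported_range(version, first_bad="2.10.9", first_fixed="3.4.0"):
    """Return True if `version` lies in the range reported as affected.

    The default bounds come from this comment, not from a changelog.
    """
    return parse_version(first_bad) <= parse_version(version) < parse_version(first_fixed)


print(in_reported_range("3.3.0"))
print(in_reported_range("3.4.0"))
```

Comparing tuples of integers avoids the string-comparison trap where "2.10.9" would sort before "2.9.0".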

When PDF pages are transformed with non-integer values, the results are floating-point numbers with more than 19 decimal places. Unlike other PDF viewers, Acrobat cannot handle this. See GitHub issue #1376 for details.
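To illustrate the kind of value involved, the sketch below computes a non-integer scale factor with `decimal.Decimal` and then writes it with a bounded number of decimal places. It is not pypdf's actual fix: `format_pdf_number` and the five-digit limit are assumptions chosen for illustration, as is the use of `Decimal` arithmetic.

```python
from decimal import Decimal


def format_pdf_number(value, max_decimals=5):
    """Format a number as fixed-point text with at most `max_decimals` decimals.

    Hypothetical helper, not pypdf's implementation.
    """
    text = f"{value:.{max_decimals}f}"
    # Drop trailing zeros and a dangling decimal point, e.g. "36.00000" -> "36".
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


# A non-integer scale factor (A4 width over US Letter width, assumed sizes).
scale = Decimal(595) / Decimal(612)
print(str(scale))                # long run of decimal places
print(format_pdf_number(scale))  # bounded fixed-point text
```

Limiting the number of decimals trades a negligible amount of positional accuracy for output that stricter viewers are more likely to parse.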

pubpub-zz · 2023-02-07T21:04:04Z

@zain910128
Can you confirm this issue can now be closed?

pubpub-zz · 2023-02-12T11:46:46Z

I'm closing this issue as fixed. @zain910128, feel free to provide more info if you want to reopen it.

zain910128 · 2023-02-13T03:02:29Z

Sorry, I didn't get the chance to check earlier.
But I checked it now, and the issue is partially resolved.

The original input file that I provided is now converted properly and looks fine in all PDF viewers.

Then I tried a new input file, which is very similar and has one extra page. It is attached here for reference:
input.pdf

The output for this file has all the colours wrong in macOS Preview, Google Drive, and the browser's PDF viewer, but looks fine in Adobe Acrobat.

So I think we have to reopen this issue.

This may be related to my other issue here:
#1615

pubpub-zz · 2023-02-13T17:10:58Z

The latest inputs look like a duplicate of #1615
@MartinThoma I propose closing it here and tracking it only in #1615.

MartinThoma · 2023-02-13T21:27:57Z

That's fine for me :-) I also had the feeling the issue was duplicated :-)

MartinThoma added the is-bug label (From a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) on Feb 5, 2023

pubpub-zz closed this as completed Feb 12, 2023

MartinThoma reopened this Feb 13, 2023

pubpub-zz mentioned this issue Feb 13, 2023

Output pdf has wrong colour, incorrect translation of markup, and wrong scaling #1615

Closed

MartinThoma closed this as completed Feb 13, 2023
