File Size Increase from pypdf==3.8.1 to pypdf==3.9.0 when watermarking #1897

Closed
MartinThoma opened this issue Jun 19, 2023 · 8 comments · Fixed by #1906
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@MartinThoma
Member

MartinThoma commented Jun 19, 2023

I've noticed that the watermarking process in my job suddenly produces way bigger PDFs. In a specific example with this PDF file I've noticed:

  • Base is 3.5 MB
  • Watermarking with pypdf==3.8.1 produces a 3.7 MB file
  • Watermarking with pypdf==3.9.0 produces a 5.2 MB file

I'm not sure if the issue is with 3.8.1 or with 3.9.0.
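To put the sizes above in perspective, here is a rough calculation of the relative growth for each version. It uses only the rounded MB figures reported in this comment; `growth_percent` is a hypothetical helper name, not part of pypdf.

```python
def growth_percent(base_mb: float, output_mb: float) -> float:
    """Return how much larger the output is than the base, in percent."""
    return (output_mb - base_mb) / base_mb * 100


base = 3.5  # MB, size of the input PDF as reported above
reported = {"3.8.1": 3.7, "3.9.0": 5.2}  # MB, watermarked output per pypdf version

for version, size in reported.items():
    print(f"pypdf=={version}: {size} MB (+{growth_percent(base, size):.1f}%)")
```

With these figures the overhead goes from roughly 6% to almost 50% for the same overlay.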

Code

from pypdf import PdfReader, PdfWriter
import pypdf
from io import BytesIO
from fpdf import FPDF  # pip install fpdf2


def create_stamp_pdf() -> BytesIO:
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("helvetica", "B", 16)
    pdf.cell(40, 10, "Hello World!")
    byte_string = pdf.output()
    return BytesIO(byte_string)


template = PdfReader(create_stamp_pdf())
template_page = template.pages[0]
reader = PdfReader("template.pdf")

writer = PdfWriter()
for page in reader.pages:
    page.merge_page(template_page)
    writer.add_page(page)

for page in writer.pages:
    page.compress_content_streams()

with open(f"out-{pypdf.__version__}.pdf", "wb") as fp:
    writer.write(fp)
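
Because the script names its output after the installed pypdf version, running it once per version (for example in two virtual environments) leaves files that can be compared directly. A minimal sketch using only the standard library; `report_sizes` is a hypothetical helper, and it assumes the outputs were written to the same directory:

```python
from pathlib import Path


def report_sizes(directory: str = ".") -> dict[str, int]:
    """Map each out-<version>.pdf found in `directory` to its size in bytes."""
    sizes = {}
    for path in sorted(Path(directory).glob("out-*.pdf")):
        sizes[path.name] = path.stat().st_size
    return sizes


if __name__ == "__main__":
    for name, size in report_sizes().items():
        print(f"{name}: {size / 1e6:.1f} MB")
```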

Issue summary

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF needs-example-code The issue needs a minimal and complete (e.g. all imports) example showing the problem labels Jun 19, 2023
@pubpub-zz
Collaborator

@MartinThoma
Can you please provide the test code?

@martin-thoma

martin-thoma commented Jun 19, 2023

@Lucas-C For the create_stamp_pdf function above, I get this warning:

DeprecationWarning: "dest" parameter is deprecated, unused and will soon be removed

How should I write to a BytesIO stream instead? Would it maybe be possible to point users directly to what they should use instead?

@MartinThoma MartinThoma added Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests and removed needs-example-code The issue needs a minimal and complete (e.g. all imports) example showing the problem labels Jun 19, 2023
@py-pdf py-pdf deleted a comment from martin-thoma Jun 19, 2023
@py-pdf py-pdf deleted a comment from martin-thoma Jun 19, 2023
@martin-thoma

@pubpub-zz I've added the minimal example at the top :-)

@Lucas-C
Member

Lucas-C commented Jun 20, 2023

> How should I write to a BytesIO stream instead? Would it maybe be possible to point users directly to what they should use instead?

Calling .output() is fine; it's just the dest parameter that is not necessary. You can simply get rid of it:
https://pyfpdf.github.io/fpdf2/fpdf/fpdf.html#fpdf.fpdf.FPDF.output

@MartinThoma
Member Author

I just tried to find other files that show the same issue:

  • sample-files/009-pdflatex-geotopo/GeoTopo.pdf: 5.3 MB
  • After adding the overlay: 10.9 MB

@pubpub-zz
Collaborator

pubpub-zz commented Jun 20, 2023

> It seems to be 2d67c15 that causes it. So pypdf>=3.9.0.

I confirm your analysis.
The core problem has been identified; a fix is in progress.

@MartinThoma
Member Author

MartinThoma commented Jun 20, 2023

This might be a useful test:

from io import BytesIO

import pytest

from pypdf import PdfReader, PdfWriter

# SAMPLE_ROOT is assumed to be a pathlib.Path pointing at the sample-files
# checkout, defined elsewhere in the test module.


@pytest.mark.samples()
@pytest.mark.slow()
# @pytest.mark.xfail(reason="Issue 1897")
def test_compression():
    def create_stamp_pdf() -> BytesIO:
        from fpdf import FPDF  # pip install fpdf2

        pdf = FPDF()
        pdf.add_page()
        pdf.set_font("helvetica", "B", 16)
        pdf.cell(40, 10, "Hello World!")
        byte_string = pdf.output()
        return BytesIO(byte_string)

    template = PdfReader(create_stamp_pdf())
    template_page = template.pages[0]
    reader = PdfReader(SAMPLE_ROOT / "009-pdflatex-geotopo/GeoTopo.pdf")

    # Stamp every page with the same overlay page
    writer = PdfWriter()
    for page in reader.pages:
        page.merge_page(template_page)
        writer.add_page(page)

    for page in writer.pages:
        page.compress_content_streams()

    # Serialize to memory and check the resulting size in bytes
    b = BytesIO()
    writer.write(b)
    b.seek(0)
    output_data = len(b.read())
    assert output_data < 6 * 10**6

With pypdf==3.8.1, output_data is 5784167 bytes (about 5.8 MB).
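
The size measurement at the end of the test could also be factored into a small helper so other size checks can reuse it. This is only a sketch: `written_size` is a hypothetical name, and it assumes nothing beyond the writer exposing a `write(stream)` method, as `PdfWriter` does in the test above.

```python
from io import BytesIO


def written_size(writer) -> int:
    """Serialize `writer` into memory and return the number of bytes produced.

    Works with any object exposing write(stream), such as the PdfWriter
    instance used in the test above.
    """
    buffer = BytesIO()
    writer.write(buffer)
    return buffer.getbuffer().nbytes
```

In the test, the last four lines would then reduce to `assert written_size(writer) < 6 * 10**6`.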

@MartinThoma
Member Author

@pubpub-zz I think I will make a release with this fix today. I've just updated the first post with an "Issue Summary". I think I will do the same for all new issues. This way it is easier for people to see whether and for how long they were affected.
