I recently had a PDF that took hours to be processed by PyPDF2. The reason is that this PDF had multiple large inline images (up to 15 MB uncompressed) and `ContentStream._readInlineImage` is really inefficient:
The last while-loop only reads one byte at a time.
In each iteration this single byte is appended to `data`. Since `data` is an immutable bytes object, every append creates a complete copy of it in memory.
So when the inline image is several megabytes in size, a multi-MB `data` buffer has to be copied in memory millions of times. This takes ages.
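To see why this blows up, here is a small standalone sketch (not PyPDF2's code; `SIZE` is just a stand-in for the image size) comparing byte-wise concatenation onto an immutable `bytes` object with appending to a mutable `bytearray`:

```python
import io
import time

SIZE = 256 * 1024  # 256 KB stand-in; the real images were up to 15 MB

# Quadratic: bytes is immutable, so each += copies everything read so far.
stream = io.BytesIO(b"x" * SIZE)
start = time.time()
data = b""
while True:
    tok = stream.read(1)   # one byte per iteration, as in _readInlineImage
    if not tok:
        break
    data += tok            # copies len(data) bytes on every iteration
print("bytes:     %.2f s" % (time.time() - start))

# Linear: a mutable bytearray grows in place with amortized O(1) appends.
stream.seek(0)
start = time.time()
buf = bytearray()
while True:
    tok = stream.read(1)
    if not tok:
        break
    buf += tok
print("bytearray: %.2f s" % (time.time() - start))
```

The `bytes` variant performs on the order of n²/2 byte copies, so it slows down quadratically with the image size, while the `bytearray` variant stays linear.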
You can easily create such a PDF with Pillow and reportlab with a large PNG like this one:
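For example (a minimal sketch; the file names `large.png` and `inline_image.pdf` are placeholders):

```python
from PIL import Image
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

img = Image.open("large.png")  # the large PNG, opened with Pillow

c = canvas.Canvas("inline_image.pdf", pagesize=A4)
# drawInlineImage embeds the raw pixel data directly in the page's
# content stream (a BI ... ID ... EI operator sequence) instead of
# storing it as a separate image XObject.
c.drawInlineImage(img, 0, 0, width=A4[0], height=A4[1])
c.showPage()
c.save()
```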
Then try to load the inline image:
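For example (a minimal sketch; any operation that parses the page's content stream will do, `extractText()` is just the simplest trigger):

```python
from PyPDF2 import PdfFileReader

reader = PdfFileReader(open("inline_image.pdf", "rb"))
page = reader.getPage(0)
# extractText() builds a ContentStream for the page; when the parser hits
# the BI operator it calls _readInlineImage, which then crawls the
# multi-MB image data one byte at a time.
page.extractText()
```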
I will soon prepare a pull request that fixes this issue.