I recently had a PDF that took hours to be processed by PyPDF2. The reason is that this PDF had multiple large inline images (up to 15 MB uncompressed) and `ContentStream._readInlineImage` is really inefficient:
The last while-loop only reads one byte at a time.
In each iteration this single byte is appended to `data`. Since `data` is an immutable bytes object, every append creates a complete copy of it in memory.
So when the inline image is several megabytes in size, a multi-MB `data` buffer has to be copied in memory millions of times. This takes ages.
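To see why this blows up, here is a small standalone sketch (not PyPDF2's code; `SIZE` is just a stand-in for the image size) comparing byte-wise concatenation onto an immutable `bytes` object with appending to a mutable `bytearray`:

```python
import io
import time

SIZE = 256 * 1024  # 256 KB stand-in; the real images were up to 15 MB

# Quadratic: bytes is immutable, so each += copies everything read so far.
stream = io.BytesIO(b"x" * SIZE)
start = time.time()
data = b""
while True:
    tok = stream.read(1)   # one byte per iteration, as in _readInlineImage
    if not tok:
        break
    data += tok            # copies len(data) bytes on every iteration
print("bytes:     %.2f s" % (time.time() - start))

# Linear: a mutable bytearray grows in place with amortized O(1) appends.
stream.seek(0)
start = time.time()
buf = bytearray()
while True:
    tok = stream.read(1)
    if not tok:
        break
    buf += tok
print("bytearray: %.2f s" % (time.time() - start))
```

The `bytes` variant performs on the order of n²/2 byte copies, so it slows down quadratically with the image size, while the `bytearray` variant stays linear.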
You can easily create such a PDF with Pillow and reportlab with a large PNG like this one:
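For example (a minimal sketch; the file names `large.png` and `inline_image.pdf` are placeholders):

```python
from PIL import Image
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

img = Image.open("large.png")  # the large PNG, opened with Pillow

c = canvas.Canvas("inline_image.pdf", pagesize=A4)
# drawInlineImage embeds the raw pixel data directly in the page's
# content stream (a BI ... ID ... EI operator sequence) instead of
# storing it as a separate image XObject.
c.drawInlineImage(img, 0, 0, width=A4[0], height=A4[1])
c.showPage()
c.save()
```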
Then try to load the inline image:
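For example (a minimal sketch; any operation that parses the page's content stream will do, `extractText()` is just the simplest trigger):

```python
from PyPDF2 import PdfFileReader

reader = PdfFileReader(open("inline_image.pdf", "rb"))
page = reader.getPage(0)
# extractText() builds a ContentStream for the page; when the parser hits
# the BI operator it calls _readInlineImage, which then crawls the
# multi-MB image data one byte at a time.
page.extractText()
```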
I will soon prepare a pull request that fixes this issue.