Cannot decrypt PDF missing 'ID' in trailer #608

richardmillson · 2021-03-06T13:43:02Z

Bug report

Some PDFs (e.g. encrypted_doc_no_id.pdf) are encrypted but do not contain an 'ID' value in their trailer, causing decryption to fail. This also affects pdfminer.six, where I've also opened an issue.

Steps to reproduce

from PyPDF2 import PdfFileReader

with open('encrypted_doc_no_id.pdf', 'rb') as fp:
    reader = PdfFileReader(fp)
    reader.decrypt('')

raises a KeyError: '/ID'.

Solution

As Apache PDFBox does, if no 'ID' is specified in the trailer then supply an array with two empty byte strings in its place.

from PyPDF2 import PdfFileReader
from PyPDF2.generic import ArrayObject, ByteStringObject, NameObject

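with open('encrypted_doc_no_id.pdf', 'rb') as fp:
    reader = PdfFileReader(fp)
    print(reader.trailer)  # trailer as parsed: no '/ID' entry
    # Workaround: supply an '/ID' of two empty byte strings, as PDFBox does.
    reader.trailer[NameObject('/ID')] = ArrayObject([ByteStringObject(b''), ByteStringObject(b'')])
    print(reader.trailer)  # trailer now contains '/ID': [b'', b'']
    reader.decrypt('')  # empty user password; no longer raises KeyError
    print(reader.getDocumentInfo())
    page = reader.getPage(1)  # pages are zero-indexed, so this is the second page
    print(page.extractText())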

produces

{'/Size': 16, '/Root': IndirectObject(9, 0), '/Info': IndirectObject(8, 0), '/Encrypt': IndirectObject(10, 0)}
{'/Size': 16, '/Root': IndirectObject(9, 0), '/Info': IndirectObject(8, 0), '/Encrypt': IndirectObject(10, 0), '/ID': [b'', b'']}
{'/Producer': 'European Patent Office'}

and successfully decrypts the PDF.
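The workaround above overwrites '/ID' unconditionally. A minimal sketch of a guarded version, which only inserts the placeholder when the key is missing, might look like the following. The helper name ensure_trailer_id and its make_key / make_empty_id parameters are hypothetical and not part of PyPDF2; a plain dict stands in for reader.trailer so the snippet runs without the library.

```python
def ensure_trailer_id(trailer, make_key=str, make_empty_id=lambda: [b"", b""]):
    """Insert a placeholder '/ID' into `trailer` only if it has none.

    Hypothetical helper, not a PyPDF2 API. `trailer` is any dict-like
    object; `make_key` and `make_empty_id` let the caller supply
    library-specific wrapper types for the key and the value.
    Returns True if a placeholder was inserted, False otherwise.
    """
    if "/ID" in trailer:
        # A real ID is present: leave it untouched so decryption uses it.
        return False
    # Fallback described in the issue: two empty byte strings.
    trailer[make_key("/ID")] = make_empty_id()
    return True


# Plain-dict stand-in for a trailer that lacks '/ID'.
trailer = {"/Size": 16}
inserted = ensure_trailer_id(trailer)
print(inserted, trailer)
```

With PyPDF2 itself, one would presumably pass make_key=NameObject and a make_empty_id that returns ArrayObject([ByteStringObject(b''), ByteStringObject(b'')]), as in the snippet above, and call the helper on reader.trailer before reader.decrypt('').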

Next steps

If this project is still actively maintained, I can open a PR. Otherwise, I'll leave this issue here for other users who encounter the same KeyError: '/ID' and wonder how to fix it.


If no '/ID' key is present in self.trailer, an array of two empty bytestrings is used in place of an '/ID'. This is how Apache PDFBox handles this case. This makes PyPDF2 more robust to malformed PDFs.

Closes py-pdf#608
Closes py-pdf#610

Full credit for this one to Richard Millson; Martin Thoma only fixed a merge conflict.

Co-authored-by: Richard Millson <[email protected]>

richardmillson mentioned this issue Mar 6, 2021

Fix 608 use null ID when encrypted but no ID given #610

Closed

MartinThoma added the is-bug label (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) Apr 7, 2022

MartinThoma added the workflow-encryption (from a user's perspective, encryption is the affected feature/workflow) and is-robustness-issue (from a user's perspective, this is about robustness) labels Apr 22, 2022

MartinThoma closed this as completed in 663ca98 Apr 24, 2022
