Cannot decrypt PDF missing 'ID' in trailer #608

Closed
richardmillson opened this issue Mar 6, 2021 · 0 comments
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-robustness-issue From a users perspective, this is about robustness workflow-encryption From a users perspective, encryption is the affected feature/workflow

richardmillson commented Mar 6, 2021

Bug report

Some PDFs (e.g. encrypted_doc_no_id.pdf) are encrypted but do not contain an 'ID' value in their trailer, which causes decryption to fail. This also affects pdfminer.six, where I've opened a corresponding issue.

Steps to reproduce

from PyPDF2 import PdfFileReader

with open('encrypted_doc_no_id.pdf', 'rb') as fp:
    reader = PdfFileReader(fp)
    reader.decrypt('')

raises a KeyError: '/ID'.

Solution

As Apache PDFBox does, if no 'ID' is specified in the trailer then supply an array with two empty byte strings in its place.

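from PyPDF2 import PdfFileReader
from PyPDF2.generic import ArrayObject, ByteStringObject, NameObject

with open('encrypted_doc_no_id.pdf', 'rb') as fp:
    reader = PdfFileReader(fp)
    print(reader.trailer)  # trailer as parsed: no '/ID' entry
    # Supply a placeholder '/ID' of two empty byte strings before decrypting.
    reader.trailer[NameObject('/ID')] = ArrayObject([ByteStringObject(b''), ByteStringObject(b'')])
    print(reader.trailer)  # trailer now includes the placeholder '/ID'
    reader.decrypt('')
    print(reader.getDocumentInfo())
    # getPage is zero-indexed, so this is the second page.
    page = reader.getPage(1)
    print(page.extractText())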

produces

{'/Size': 16, '/Root': IndirectObject(9, 0), '/Info': IndirectObject(8, 0), '/Encrypt': IndirectObject(10, 0)}
{'/Size': 16, '/Root': IndirectObject(9, 0), '/Info': IndirectObject(8, 0), '/Encrypt': IndirectObject(10, 0), '/ID': [b'', b'']}
{'/Producer': 'European Patent Office'}

and successfully decrypts the PDF.
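
A possible explanation for why the placeholder is harmless: in the standard security handler's key derivation (Algorithm 2 in the PDF specification), the first '/ID' element is only ever fed into an MD5 hash, so an empty byte string contributes nothing to it. The sketch below is a simplified illustration for revisions 2 and 3 only, not PyPDF2's actual code. The name derive_key and the sample /O and /P values are hypothetical, and the revision 4 metadata step is left out.

```python
import hashlib
import struct

# Standard 32-byte password padding string from the PDF specification.
PAD = bytes.fromhex(
    "28BF4E5E4E758A4164004E56FFFA0108"
    "2E2E00B6D0683E802F0CA9FE6453697A"
)


def derive_key(password, o_entry, p_entry, first_id, rev=2, key_len=5):
    """Simplified Algorithm 2 (hypothetical helper, revisions 2 and 3 only)."""
    digest = hashlib.md5()
    digest.update((password + PAD)[:32])       # password padded/truncated to 32 bytes
    digest.update(o_entry)                     # /O entry of the encryption dictionary
    digest.update(struct.pack("<i", p_entry))  # /P as 4 bytes, low-order byte first
    digest.update(first_id)                    # first element of the trailer '/ID'
    key = digest.digest()
    if rev >= 3:
        # Revision 3 re-hashes the first key_len bytes 50 times.
        for _ in range(50):
            key = hashlib.md5(key[:key_len]).digest()
    return key[:key_len]


# Made-up /O and /P values, used only to exercise the function.
sample_o = bytes(32)
sample_p = -44
key_with_empty_id = derive_key(b"", sample_o, sample_p, b"")
```

Under this reading, the workaround can only succeed if the file was also encrypted without an identifier; a file whose '/ID' was lost after encryption would still fail to decrypt.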

Next steps

If this project is still actively maintained, I can open a PR. Otherwise, I'll leave this issue here for other users who may encounter the same KeyError: '/ID' and wonder how to fix it.

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Apr 7, 2022
@MartinThoma MartinThoma added workflow-encryption From a users perspective, encryption is the affected feature/workflow is-robustness-issue From a users perspective, this is about robustness labels Apr 22, 2022
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this issue Apr 29, 2022
If no '/ID' key is present in self.trailer an array of two empty bytestrings is used in place of an '/ID'. This is how Apache PDFBox handles this case.

This makes PyPDF2 more robust to malformed PDFs.
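
The fallback described above could be sketched as a read-time default, roughly as follows. This is an illustration on a plain dict rather than the code in the referenced commit. The name trailer_id_or_default is hypothetical, and a real implementation would use PyPDF2's ArrayObject and ByteStringObject types.

```python
def trailer_id_or_default(trailer):
    """Return the trailer's '/ID' array, or two empty byte strings if absent.

    Hypothetical helper: `trailer` is any dict-like mapping and is not modified.
    """
    if "/ID" in trailer:
        return trailer["/ID"]
    return [b"", b""]


# Simplified stand-ins for a trailer with and without an '/ID' entry.
no_id = {"/Size": 16, "/Encrypt": "10 0 R"}
with_id = {"/Size": 16, "/Encrypt": "10 0 R", "/ID": [b"\x01\x02", b"\x03\x04"]}

fallback_id = trailer_id_or_default(no_id)
existing_id = trailer_id_or_default(with_id)
```

Returning a default instead of writing to the trailer leaves the parsed document untouched, so an existing '/ID' is never overwritten.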

Closes py-pdf#608
Closes py-pdf#610

Full credit for this one to Richard Millson - Martin Thoma only fixed a merge conflict

Co-authored-by: Richard Millson <[email protected]>