Unexpected xml.parsers.expat.ExpatError on malformed PDF #585

Closed
Google-Autofuzz opened this issue Nov 13, 2020 · 7 comments · Fixed by #1030
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
is-robustness-issue: From a users perspective, this is about robustness

Comments

Google-Autofuzz commented Nov 13, 2020

Running the following code with the latest PyPI version of PyPDF2 on the attached input raises an unexpected xml.parsers.expat.ExpatError:

MCVE: Code + PDF

Example document: test.pdf

from PyPDF2 import PdfReader

reader = PdfReader("test.pdf")
reader.xmp_metadata

Traceback

Traceback (most recent call last):
  File "foo.py", line 5, in <module>
    reader.xmp_metadata
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 318, in xmp_metadata
    return self.trailer[TK.ROOT].xmp_metadata  # type: ignore
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 671, in xmp_metadata
    metadata = XmpInformation(metadata)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/xmp.py", line 206, in __init__
    doc_root: Document = parseString(self.stream.get_data())
  File "/home/moose/.pyenv/versions/3.6.15/lib/python3.6/xml/dom/minidom.py", line 1968, in parseString
    return expatbuilder.parseString(string)
  File "/home/moose/.pyenv/versions/3.6.15/lib/python3.6/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "/home/moose/.pyenv/versions/3.6.15/lib/python3.6/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 53, column 15
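The last frames of the traceback are in the standard library, so the same class of error can be reproduced without a PDF at all. The following is a minimal sketch, assuming the XMP stream contains XML that is not well-formed; the sample bytes and the helper name `xml_parse_error` are made up for illustration and are not taken from test.pdf.

```python
from typing import Optional
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def xml_parse_error(data: bytes) -> Optional[str]:
    """Return the Expat error message for `data`, or None if it parses cleanly."""
    try:
        parseString(data)
    except ExpatError as exc:
        return str(exc)
    return None


# Hypothetical stand-in for a broken XMP packet: a bare '&' is not valid XML.
broken = b"<rdf>title & subtitle</rdf>"
well_formed = b"<rdf>title &amp; subtitle</rdf>"

print(xml_parse_error(broken))       # an Expat "not well-formed" message
print(xml_parse_error(well_formed))  # None
```

This only shows that minidom surfaces the raw ExpatError to its caller; it does not say what exactly is wrong on line 53 of the XML in the attached file.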

Environment

$ python -c "import PyPDF2; print(PyPDF2.__version__)"
2.3.1-dev

@guillaume-uH57J9

I have encountered a similar error with a real-world PDF when calling getXmpMetadata().
My PDF cannot be shared as it contains personal information, but this issue already has a test PDF.

PyPDF version 1.26.0-4
python version 3.9.2-3
Debian version 11.0 (testing)

Error message:

ExpatError('not well-formed (invalid token): line 5, column 87')

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-robustness-issue From a users perspective, this is about robustness labels Apr 7, 2022
@MartinThoma
Member

@guillaume-uH57J9 Could you please share a full traceback? Do you have an example PDF?

@MartinThoma MartinThoma added the needs-pdf The issue needs a PDF file to show the problem label Jun 26, 2022
@guillaume-uH57J9

@MartinThoma You will find an example PDF and a call stack in the first comment from @Google-Autofuzz.

@MartinThoma MartinThoma added Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests and removed needs-pdf The issue needs a PDF file to show the problem labels Jun 26, 2022
@MartinThoma
Member

@guillaume-uH57J9 Thank you 🙏 I've completely missed that 😅

Sadly, the issue still occurs with the latest version of PyPDF2. It looks to me as if the XML embedded in the PDF document is broken. We might never be able to read the content, but we should raise a warning or a more explicit PyPDF2 exception.
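A rough sketch of what "a more explicit exception" could look like at the point where the XML is parsed. This is not the actual PyPDF2 code and not necessarily what the eventual fix does; `XmpParseError` and `parse_xmp` are hypothetical names chosen for this example.

```python
from xml.dom.minidom import Document, parseString
from xml.parsers.expat import ExpatError


class XmpParseError(Exception):
    """Hypothetical library-level error for XMP metadata that is not valid XML."""


def parse_xmp(data: bytes) -> Document:
    """Parse raw XMP bytes, translating Expat failures into XmpParseError."""
    try:
        return parseString(data)
    except ExpatError as exc:
        # Chain the original error so the line/column details stay available.
        raise XmpParseError(f"XMP metadata is not valid XML: {exc}") from exc


try:
    parse_xmp(b"<rdf>title & subtitle</rdf>")  # made-up malformed input
except XmpParseError as err:
    print(err)
    print(type(err.__cause__).__name__)  # ExpatError
```

With something along these lines, callers would only need to know about the library's own exception type, while the Expat details remain reachable through `__cause__`.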

guillaume-uH57J9 commented Jun 26, 2022

@MartinThoma Yes, it would be better to either have a warning and return None, or throw an exception at the PyPDF2 level.

Expat is an implementation detail, so exposing Expat exceptions is not ideal. At the moment, if you want to use PyPDF2 safely, you have to import expat just to catch that specific exception.

As an aside, help(PyPDF2.PdfFileReader.xmpMetadata) does not mention any exception at the moment, so you wouldn't know to catch any exception until you stumble upon this. If an exception is raised, it would be better to document it.
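For anyone hitting this in the meantime, the workaround described above might look roughly like the sketch below. `safe_xmp_metadata` is a hypothetical helper, not part of PyPDF2; it only assumes the reader object exposes an `xmp_metadata` attribute, as in the MCVE. It takes the "warn and return None" route and documents that behavior in its docstring.

```python
import warnings
from xml.parsers.expat import ExpatError


def safe_xmp_metadata(reader):
    """Return reader.xmp_metadata, or None if the embedded XML is malformed.

    Does not raise on malformed XMP; a warning with the Expat message is
    emitted instead. Other exceptions are not caught.
    """
    try:
        return reader.xmp_metadata
    except ExpatError as exc:
        warnings.warn(f"XMP metadata could not be parsed: {exc}")
        return None


# Usage, assuming PyPDF2 is installed and test.pdf is the file from this issue:
#
#     from PyPDF2 import PdfReader
#     metadata = safe_xmp_metadata(PdfReader("test.pdf"))
#     if metadata is None:
#         ...  # carry on without XMP metadata
```

The downside is exactly the one mentioned above: the calling code still has to know that Expat is involved.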

@MartinThoma
Member

@guillaume-uH57J9 What do you think about #1030 ?

@guillaume-uH57J9

I replied in #1030
