Unexpected xml.parsers.expat.ExpatError on malformed PDF #585

Google-Autofuzz · 2020-11-13T13:30:15Z

Running the following code with the latest PyPI version of PyPDF2 on the attached input results in an unexpected xml.parsers.expat.ExpatError:

MCVE: Code + PDF

Example document: test.pdf

from PyPDF2 import PdfReader

reader = PdfReader("test.pdf")
reader.xmp_metadata

Traceback

Traceback (most recent call last):
  File "foo.py", line 5, in <module>
    reader.xmp_metadata
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 318, in xmp_metadata
    return self.trailer[TK.ROOT].xmp_metadata  # type: ignore
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 671, in xmp_metadata
    metadata = XmpInformation(metadata)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/xmp.py", line 206, in __init__
    doc_root: Document = parseString(self.stream.get_data())
  File "/home/moose/.pyenv/versions/3.6.15/lib/python3.6/xml/dom/minidom.py", line 1968, in parseString
    return expatbuilder.parseString(string)
  File "/home/moose/.pyenv/versions/3.6.15/lib/python3.6/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "/home/moose/.pyenv/versions/3.6.15/lib/python3.6/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 53, column 15
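
The traceback bottoms out in xml.dom.minidom.parseString, so the same exception class can be reproduced without a PDF. The following is a minimal sketch using only the standard library; the byte string is a made-up stand-in for a broken XMP stream, not the content of test.pdf.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical malformed payload (not taken from test.pdf):
# a raw control character is not legal in XML 1.0.
broken_xml = b"<x:xmpmeta xmlns:x='adobe:ns:meta/'>\x01</x:xmpmeta>"

try:
    parseString(broken_xml)
    error = None
except ExpatError as exc:
    # minidom delegates to expat, so the parser error surfaces unchanged.
    error = exc

print(type(error).__name__, error)
```

Any caller that hands unvalidated bytes to parseString will see this exception unless something in between catches it.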

Environment

$ python -c "import PyPDF2; print(PyPDF2.__version__)"
2.3.1-dev


guillaume-uH57J9 · 2021-05-24T09:40:15Z

I have encountered a similar error with a real-world PDF when calling getXmpMetadata().
My PDF cannot be shared as it contains personal information, but this issue already has a test PDF.

PyPDF version 1.26.0-4
python version 3.9.2-3
Debian version 11.0 (testing)

Error message:

ExpatError('not well-formed (invalid token): line 5, column 87')

MartinThoma · 2022-06-26T09:26:10Z

@guillaume-uH57J9 Could you please share a full traceback? Do you have an example PDF?

guillaume-uH57J9 · 2022-06-26T10:11:04Z

@MartinThoma You will find an example PDF and a callstack in the first comment from @Google-Autofuzz

MartinThoma · 2022-06-26T10:25:56Z

@guillaume-uH57J9 Thank you 🙏 I've completely missed that 😅

Sadly, the issue still occurs with the latest version of PyPDF2. It looks to me as if the XML embedded in the PDF document is broken. We might never be able to read the content, but we should raise a warning or a more explicit PyPDF2 exception.

guillaume-uH57J9 · 2022-06-26T10:49:15Z

@MartinThoma Yes, it would be better to either have a warning and return None, or throw an exception at the PyPDF2 level.

Expat is kind of an implementation detail, so exposing expat exceptions is not ideal. At the moment, if you want to safely use PyPDF2, you have to import expat in order to catch that specific exception.

As an aside, help(PyPDF2.PdfFileReader.xmpMetadata) does not mention any exception at the moment, so you wouldn't know to catch any exception until you stumble upon this. If an exception is raised, it would be better to document it.
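
The two options discussed above (warn and return None, or raise a library-level exception) could look roughly like the sketch below. It uses only the standard library. XmpParseError, parse_xmp and the strict flag are hypothetical names chosen for illustration; they are not PyPDF2 APIs and do not describe what #1030 implements.

```python
import warnings
from typing import Optional
from xml.dom.minidom import Document, parseString
from xml.parsers.expat import ExpatError


class XmpParseError(Exception):
    """Hypothetical library-level error; not an actual PyPDF2 class."""


def parse_xmp(data: bytes, strict: bool = True) -> Optional[Document]:
    """Parse XMP bytes without leaking ExpatError to the caller."""
    try:
        return parseString(data)
    except ExpatError as exc:
        if strict:
            # Option 1: re-raise as a library-level exception, keeping the cause.
            raise XmpParseError(f"invalid XMP metadata XML: {exc}") from exc
        # Option 2: warn and report "no metadata".
        warnings.warn(f"ignoring invalid XMP metadata XML: {exc}")
        return None
```

With either variant, callers no longer need to import expat themselves, and the exception (or the None return) can be documented on the public method.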

MartinThoma · 2022-06-26T11:09:11Z

@guillaume-uH57J9 What do you think about #1030 ?

guillaume-uH57J9 · 2022-06-27T19:56:08Z

I replied in #1030

Closes #585

MartinThoma added the is-bug (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) and is-robustness-issue (from a user's perspective, this is about robustness) labels on Apr 7, 2022

MartinThoma added the needs-pdf label (the issue needs a PDF file to show the problem) on Jun 26, 2022

MartinThoma added the Has MCVE label (a minimal, complete and verifiable example helps a lot to debug / understand feature requests) and removed the needs-pdf label on Jun 26, 2022

MartinThoma mentioned this issue Jun 26, 2022

MAINT: Handle XML error when reading XmpInformation #1030

Merged

MartinThoma closed this as completed in #1030 Jun 30, 2022

MartinThoma added a commit that referenced this issue Jun 30, 2022

MAINT: Handle XML error when reading XmpInformation (#1030)

97f36bd

Closes #585
