#5 Using PdfReader causes a crash #2866

Avgor46 · 2024-09-23T09:49:08Z

Hi!

Another crash, similar to the previous ones. The PDF and the stderr output can be found below.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.0-56-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.0.0, crypt_provider=('cryptography', '3.1'), PIL=none

commit 6cfa0c4

Code + PDF

This is a minimal, complete example that shows the issue:

#!/usr/bin/env python3

import pypdf
from pypdf.errors import EmptyFileError, PdfReadError, PdfStreamError
import sys

def TestOneInput(fname):
    try:
        pdf_reader = pypdf.PdfReader(fname)
        for page in pdf_reader.pages:
            page.extract_text()
    except (EmptyFileError, PdfReadError, PdfStreamError):
        pass  # expected parse failures are ignored; anything else is a crash

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit(1)
    TestOneInput(sys.argv[1])

PoC

crash-e108c4f677040b61e12fa9f1cfde025d704c9b0d.pdf

Traceback

This is the complete stderr I see:

PdfReadError("Invalid Elementary Object starting with b'M' @276: b'ASGAA+Arial,Unicode MS\\n/DescendantFonts [ 56 0 R ]\\n/Encoding /Identity-H\\n/Subtyp'")
Traceback (most recent call last):
  File "/fuzz/./poc.py", line 18, in <module>
    TestOneInput(sys.argv[1])
  File "/fuzz/./poc.py", line 11, in TestOneInput
    page.extract_text()
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_page.py", line 2266, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_page.py", line 1761, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_cmap.py", line 32, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/usr/local/lib/python3.9/dist-packages/pypdf/_cmap.py", line 53, in build_char_map_from_dict
    font_type: str = cast(str, ft["/Subtype"])
  File "/usr/local/lib/python3.9/dist-packages/pypdf/generic/_data_structures.py", line 441, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/Subtype'

pubpub-zz · 2024-09-26T05:20:13Z

Your file does not respect the PDF specification: a name object contains a space, which should be encoded as #20.
Currently this stops the parsing of the object, so the fields that follow are never created (hence the missing /Subtype).
A PR is in progress so that processing continues after such an error.
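
For illustration, here is a minimal sketch of the escaping rule mentioned above. The helper `escape_pdf_name` is hypothetical and not part of pypdf; it assumes the usual PDF name rule that whitespace, delimiters, the `#` character, and bytes outside the printable ASCII range are written as `#` followed by two hexadecimal digits. The input string is the name fragment visible in the error message of this report.

```python
# Hypothetical helper (not a pypdf API): escape a PDF name so that
# whitespace, delimiters, '#', and non-printable bytes become "#xx".
DELIMITERS = b"()<>[]{}/%"


def escape_pdf_name(name: str) -> str:
    out = ["/"]
    for byte in name.encode("utf-8"):
        needs_escape = (
            byte < 0x21          # control characters and space
            or byte > 0x7E       # outside printable ASCII
            or byte == 0x23      # '#' itself
            or byte in DELIMITERS
        )
        out.append(f"#{byte:02X}" if needs_escape else chr(byte))
    return "".join(out)


print(escape_pdf_name("ASGAA+Arial,Unicode MS"))
# /ASGAA+Arial,Unicode#20MS
```

With the space written as `#20`, the name is a single token; written with a literal space, a parser sees the name end at `Unicode` and then meets a stray `MS`, which matches the "Invalid Elementary Object starting with b'M'" message above.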

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 26, 2024

ROB: continue parsing dictionnary object when error is detected

e3e6756

closes py-pdf#2866

pubpub-zz mentioned this issue Sep 26, 2024

ROB: continue parsing dictionnary object when error is detected #2872

Merged

stefan6419846 closed this as completed in #2872 Sep 27, 2024

stefan6419846 closed this as completed in 762fc1f Sep 27, 2024
