Text extraction throws IndexError on some PDFs #2290

sescobar99 · 2023-11-10T21:50:48Z

I recently ran into a PDF file from which I cannot extract text: page.extract_text() raises an IndexError.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.22621-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
print(text)

Sample PDF file can be found here:
example.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
    File "...\prueba_pdf\test.py", line 6, in <module>
        text = page.extract_text()
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 2284, in extract_text
        return self._extract_text(
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 1903, in _extract_text
        cmaps[f] = build_char_map(f, space_width, obj)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 29, in build_char_map
        font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 54, in build_char_map_from_dict
        map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 224, in parse_to_unicode
        return type1_alternative(ft, map_dict, space_code, int_entry)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 481, in type1_alternative
        if words[3] != b"put":
IndexError: list index out of range
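
Until a fixed release is available, one possible workaround is to catch the IndexError per page so that a single problematic font does not abort the whole run. This is only a sketch: safe_extract_text is a hypothetical helper, not part of pypdf, and it assumes that returning None for a failing page is acceptable to the caller.

```python
from typing import Optional


def safe_extract_text(page) -> Optional[str]:
    """Return the page text, or None if extraction raises IndexError.

    `page` is any object with an `extract_text()` method, such as a
    page taken from `PdfReader(...).pages`. Hypothetical helper, not
    part of pypdf.
    """
    try:
        return page.extract_text()
    except IndexError:
        # Same exception type as in the traceback above: skip the page
        # instead of crashing the whole extraction.
        return None
```

With the reader from the example above, this could be called as safe_extract_text(reader.pages[0]). Note that it hides the failure rather than fixing it, so the text of the affected page is still lost.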

sescobar99 · 2023-11-10T22:00:31Z

As a matter of fact, I can successfully execute the same code using PyPDF2 (v 3.0.1) and get the proper text extracted.

Takher · 2023-11-25T16:54:11Z

Looks like a simple fix. I have opened a PR for it, ready for review.
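
For context, the traceback ends at an unguarded words[3] access, and the linked PR title mentions checking the length of words. A minimal sketch of that kind of guard follows. It is not pypdf's actual code: the function name parse_dup_entries and the assumed line format "dup <code> /<glyph> put" are illustrative assumptions only.

```python
from typing import Dict, Iterable


def parse_dup_entries(lines: Iterable[bytes]) -> Dict[int, str]:
    """Collect `dup <code> /<glyph> put` entries, skipping malformed lines.

    Hypothetical illustration of a length check before indexing;
    not taken from pypdf.
    """
    entries: Dict[int, str] = {}
    for line in lines:
        words = line.split()
        # Guard: a well-formed entry has at least four tokens. Without
        # the length check, words[3] raises IndexError on short lines.
        if len(words) < 4 or words[0] != b"dup" or words[3] != b"put":
            continue
        try:
            code = int(words[1])
        except ValueError:
            # Character code is not a plain integer: skip the line.
            continue
        entries[code] = words[2].decode("latin-1").lstrip("/")
    return entries
```

The design choice here is to skip lines that do not match the expected shape rather than raise, so one truncated entry does not prevent the remaining entries from being read.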

sescobar99 changed the title ~~Text extraction does not work on some PDFs~~ Text extraction throws IndexError on some PDFs Nov 23, 2023

This was referenced Nov 25, 2023

BUG: check words length in _cmap #2309

Closed

BUG: check words length in _cmap type1_alternative function #2310

Merged

MartinThoma closed this as completed in 13a640d Nov 26, 2023
