Text extraction throws IndexError on some PDFs #2290

Closed
sescobar99 opened this issue Nov 10, 2023 · 2 comments


sescobar99 commented Nov 10, 2023

I recently ran into a particular kind of PDF file from which I cannot extract text: calling page.extract_text() raises an IndexError inside the library.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.22621-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
print(text)

Sample PDF file can be found here:
example.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
    File "...\prueba_pdf\test.py", line 6, in <module>
        text = page.extract_text()
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 2284, in extract_text
        return self._extract_text(
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 1903, in _extract_text
        cmaps[f] = build_char_map(f, space_width, obj)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 29, in build_char_map
        font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 54, in build_char_map_from_dict
        map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 224, in parse_to_unicode
        return type1_alternative(ft, map_dict, space_code, int_entry)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 481, in type1_alternative
        if words[3] != b"put":
IndexError: list index out of range
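
Until the underlying bug is fixed, one way to keep a batch job running is to catch the exception per page. The following is only a minimal sketch of that workaround: `safe_extract_text` is a hypothetical helper (not part of pypdf), and it assumes the only failure of interest is the IndexError shown in the traceback above.

```python
from typing import Optional, Protocol


class SupportsExtractText(Protocol):
    """Anything with an extract_text() method, e.g. a pypdf page object."""

    def extract_text(self) -> str: ...


def safe_extract_text(page: SupportsExtractText) -> Optional[str]:
    """Return the page text, or None if extraction raises IndexError.

    Hypothetical helper for illustration; it hides the error rather
    than fixing it, so the affected page yields no text at all.
    """
    try:
        return page.extract_text()
    except IndexError:
        return None


# Possible usage with the reader from the example above:
#   texts = [safe_extract_text(p) for p in reader.pages]
#   failed = [i for i, t in enumerate(texts) if t is None]
```

This only skips the affected pages; it does not recover their text.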
@sescobar99
Author

For comparison, the same code runs successfully with PyPDF2 (v3.0.1) and extracts the text correctly.

sescobar99 changed the title from "Text extraction does not work on some PDFs" to "Text extraction throws IndexError on some PDFs" on Nov 23, 2023
Contributor

Takher commented Nov 25, 2023

Looks like a simple fix. Have created the PR here, ready for review.
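
For readers wondering what kind of fix this might be: the traceback shows words[3] being read from a list that has fewer than four elements. The sketch below illustrates a length check before indexing. It is not the code from the PR or from pypdf; `parse_encoding_line` is a hypothetical function, and the line format `dup <code> /<name> put` is an assumption about what the parser expects.

```python
from typing import Optional, Tuple


def parse_encoding_line(line: bytes) -> Optional[Tuple[int, str]]:
    """Parse an entry assumed to look like b"dup 65 /A put".

    Returns (code, glyph_name), or None when the line is not a
    well-formed entry. Hypothetical illustration, not pypdf's code.
    """
    words = line.split()
    # Check the length first: indexing words[3] on a shorter list is
    # exactly what raises "IndexError: list index out of range".
    if len(words) < 4 or words[0] != b"dup" or words[3] != b"put":
        return None
    try:
        code = int(words[1])
    except ValueError:
        # The code field was not a decimal integer.
        return None
    glyph_name = words[2].lstrip(b"/").decode("latin-1")
    return code, glyph_name
```

With this guard, a truncated or malformed line is skipped instead of aborting text extraction for the whole page.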
