IndexError: list index out of range #1278

Closed
DL6ER opened this issue Aug 26, 2022 · 6 comments · Fixed by #1281
Labels
is-robustness-issue: From a user's perspective, this is about robustness
PdfReader: The PdfReader component is affected

Comments

@DL6ER

DL6ER commented Aug 26, 2022

See #1269 for further details.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfReader

with open("Work Flow From Check to QA.pdf", "rb") as f:
  reader = PdfReader(f, strict=False)
  content = " ".join([page.extract_text() for page in reader.pages])

PDF used above: Work Flow From Check to QA.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    content = " ".join([page.extract_text() for page in reader.pages])
  File "test.py", line 4, in <listcomp>
    content = " ".join([page.extract_text() for page in reader.pages])
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1510, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1444, in _extract_text
    process_operation(operator, operands)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1258, in process_operation
    float(operands[5]),
IndexError: list index out of range
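
Until a fixed release is installed, one possible workaround is to catch the error per page so that a single bad page does not abort the whole extraction. The following is only a sketch: `extract_text_tolerant` is a hypothetical helper, not part of PyPDF2, and it assumes only that each page object has an `extract_text()` method, as in the example above.

```python
def extract_text_tolerant(pages):
    """Join the text of all pages, skipping pages whose extraction fails.

    Hypothetical helper (not a PyPDF2 API). `pages` is any iterable of
    objects exposing extract_text(), e.g. reader.pages.
    Returns the joined text and the 1-based numbers of the skipped pages.
    """
    texts = []
    failed = []
    for number, page in enumerate(pages, start=1):
        try:
            texts.append(page.extract_text())
        except IndexError:
            # Same exception type as in the traceback above; remember the
            # page number so the caller can report or retry it later.
            failed.append(number)
    return " ".join(texts), failed
```

With the reader from the example above, this would be called as `content, failed = extract_text_tolerant(reader.pages)`. Text from the failing pages is lost, so this hides the symptom rather than fixing the cause.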
@pubpub-zz
Collaborator

for local ref
Work Flow From Check to QA.pdf

@pubpub-zz
Collaborator

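The page's contents are stored as an array of content streams. These have to be reassembled with a line break added between them, otherwise the last token of one stream can run into the first token of the next. PR #1281 will fix this issue.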
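
To illustrate why the separator matters, here is a minimal sketch. It is not PyPDF2's actual code: the two byte strings are made-up stream contents, and splitting on whitespace is a deliberately simplified stand-in for a real content-stream tokenizer. It assumes a stream boundary that falls between the six operands of a `Tm` operator and the operator itself.

```python
# Made-up contents of two consecutive content streams of one page.
# The boundary falls between the last operand (720) and the operator (Tm).
streams = [b"BT 1 0 0 1 72 720", b"Tm (Hello) Tj ET"]


def tokenize(data):
    # Simplified stand-in for a real tokenizer: split on whitespace only.
    return data.split()


# Joining without a separator fuses "720" and "Tm" into one token,
# so the operator is never seen and its operands cannot be matched up.
naive_tokens = tokenize(b"".join(streams))

# Joining with a line break keeps the boundary as a token boundary.
fixed_tokens = tokenize(b"\n".join(streams))

tm_index = fixed_tokens.index(b"Tm")
tm_operands = fixed_tokens[tm_index - 6:tm_index]

print(b"Tm" in naive_tokens)   # False
print(b"Tm" in fixed_tokens)   # True
print(len(tm_operands))        # 6
```

In the naive join the parser ends up with an operator and operand list that do not match, which is consistent with the `operands[5]` lookup in the traceback failing.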

@MartinThoma
Member

Thank you for reporting the issue, @DL6ER! I'll release a PyPDF2 version with the fix to PyPI on Sunday.

@MartinThoma
Member

We value good error reports @DL6ER! I can add you to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html if you want :-)

MartinThoma added the is-robustness-issue and PdfReader labels on Aug 27, 2022
@DL6ER
Author

DL6ER commented Aug 27, 2022

@MartinThoma Sure, you can add me if you want. I will "contribute" some more issues over the next few days ;-)

@MartinThoma
Member

You're added :-) (It might take ~5 minutes until the docs refresh)

3 participants