'IndexError: list index out of range' when extracting text #1091
Comments
The same file gives a
fwiw, i'm seeing a similar error with dump.pdf which is generated during the test suite of xml2rfc.
The code within the if block assumes that lst has index 0 and index 1, so the predicate should depend on lst having at least two elements. This resolves the error I described at py-pdf#1091 (comment). (I'm not sure that it would resolve the other issue raised by @MartinThoma.)
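To illustrate the kind of guard described above, here is a minimal sketch. The helper name `first_two` is hypothetical and this is not the actual PyPDF2 code; it only shows why the check should be on the length rather than on truthiness.

```python
def first_two(lst):
    """Return (lst[0], lst[1]) when both exist, otherwise None.

    Hypothetical helper for illustration only; not taken from PyPDF2.
    """
    # A plain `if lst:` is not enough: a one-element list is truthy,
    # yet lst[1] would still raise IndexError.
    if len(lst) >= 2:
        return lst[0], lst[1]
    return None
```

With this guard, an empty or single-element list falls through to the `None` branch instead of raising.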
I'm not sure that bb2d1db resolves this issue. looking at 966635.pdf (from the original report), and working from bb2d1db, when i do:
r = PdfReader('966635.pdf')
p = r.pages[10].extract_text()
I get this crash (ipython3 backtrace):
sorry for having commented here just because i also got an
I think this report should be re-opened.
Thank you for letting me know 🤗
The code within the if block assumes that `lst` has index 0 and index 1. Fixes py-pdf#1091. Related to py-pdf#1111.
Trying it via https://www.pdf-online.com/osa/validate.aspx:
Validating file "non-compliant.pdf" for conformance level pdf1.3
Similar exception (v3.0.1):
at post-mortem
(unfortunately I cannot publish the PDF)
@kxrob
Here is the stripped page - it causes the same error with
@kxrob
First part of the fix for py-pdf#1091 (late). Analysis of 'Hungarian' py-pdf#1533 still in progress.
@kxrob a PR has been issued. Could you confirm whether it fixes your issue too?
@kxrob
I've got an IndexError when extracting text. The file opens fine in Chrome.
Environment
$ python -m platform
Linux-5.4.0-121-generic-x86_64-with-glibc2.31
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.2
Code + PDF
The file:
pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf
It's
print(reader.pages[10].extract_text())
to be exact.
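For anyone trying to narrow down which pages trigger the error, a small sketch along these lines may help. The function name `find_failing_pages` is made up for illustration; it assumes only that each page object exposes an `extract_text()` method, as `reader.pages[10]` does in the report above.

```python
def find_failing_pages(pages):
    """Return the indices of pages whose extract_text() raises IndexError.

    `pages` is any iterable of objects with an extract_text() method,
    for example the `pages` attribute of a PyPDF2 PdfReader.
    """
    failing = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except IndexError:
            # Record the page index and keep scanning the rest of the file.
            failing.append(index)
    return failing
```

Called as `find_failing_pages(reader.pages)`, it would list every affected page rather than stopping at the first crash, which can make it easier to strip a PDF down to a shareable test case.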