PyPDF2 throws exception during extract_text() #1533

Closed
lenemeth opened this issue Jan 6, 2023 · 13 comments · Fixed by #1544
Labels
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

@lenemeth
lenemeth commented Jan 6, 2023

I'm working on a script that parses PDF invoices, and I get an exception while reading the PDF. This happens only with one specific type of PDF, issued by a tap water utility provider. However, every PDF from them fails to parse with the same error.

Environment

Windows 10

c:\>python --version
Python 3.11.1

c:\>pip show pyPdf2
Name: PyPDF2
Version: 3.0.1
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page:
Author:
Author-email: Mathieu Fenniak <[email protected]>
License:
Location: C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages
Requires:
Required-by:

Code + PDF

from PyPDF2 import PdfReader

reader = PdfReader(filePath)  # filePath: path to one of the affected invoices

for page in reader.pages:
    text = page.extract_text()  # raises the IndexError shown in the traceback

I can share the PDF by email, since it contains personal data (it is an invoice). Let me know where to send it.

Traceback

Traceback (most recent call last):
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\EstateManager.py", line 63, in <module>
    em.parse_invoices()
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\EstateManager.py", line 22, in parse_invoices
    self.ip.parse_invoices(self.config['input_data']['invoices']['directory_path'])
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\InvoiceParser.py", line 47, in parse_invoices
    self.extract_pdf(os.path.join(directory, file))
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\InvoiceParser.py", line 63, in extract_pdf
    text = page.extract_text()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_page.py", line 1342, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 28, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 196, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
                                             ^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 264, in process_cm_line
    multiline_rg = parse_bfrange(l, map_dict, int_entry, multiline_rg)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 278, in parse_bfrange
    nbi = max(len(lst[0]), len(lst[1]))
                               ~~~^^^
IndexError: list index out of range
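
Until a fixed release is available, a batch script can be kept running by catching the error per page. This is only a minimal sketch of a workaround, not a fix for the underlying CMap parsing problem; `extract_text_per_page` is a hypothetical helper name, and it assumes each page object exposes `extract_text()` as in the snippet above.

```python
def extract_text_per_page(pages):
    """Return a list of (page_number, text, error) tuples, one per page.

    `pages` is any iterable of objects with an extract_text() method,
    for example PdfReader(path).pages. Pages that raise IndexError are
    recorded with text=None instead of aborting the whole run.
    """
    results = []
    for number, page in enumerate(pages, start=1):
        try:
            results.append((number, page.extract_text(), None))
        except IndexError as exc:
            # Same exception type as in the traceback above.
            results.append((number, None, exc))
    return results
```

Usage would look like `extract_text_per_page(PdfReader(filePath).pages)`; the failing pages can then be logged and skipped.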
@pubpub-zz
Collaborator

pubpub-zz commented Jan 6, 2023

At first glance, this looks like a duplicate of #1091.
A PR with a fix has been proposed. Can you try it?

@lenemeth
Author

lenemeth commented Jan 6, 2023

Thanks! I've tried it and it seems to work. However, there is now another issue: the character set of the returned text seems to be slightly off, because Hungarian letters (ISO-8859-2 / "Latin-2") are unreadable:

I got this: sz♥mlakibocs♥t♦hoz t♣rt☺n☻ regisztr♥ci♦
Should look like this: számlakibocsátóhoz történő regisztráció

Not sure if it's because of this particular PDF type, but the rest of the invoices, which use a similar alphabet, look fine :)
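
A quick way to flag this kind of output automatically is to scan the extracted text for characters that should never appear in an invoice. The sketch below is an illustration only: `suspicious_chars` is a hypothetical helper, and it rests on the assumption that the ♥ ♦ ♣ ☺ ☻ symbols are how a Windows console (code page 437) displays the control characters U+0003, U+0004, U+0005, U+0001 and U+0002, i.e. that raw character codes leaked through instead of being mapped to letters.

```python
import unicodedata

# Glyphs that code page 437 shows for control codes 0x01-0x06 (assumption:
# the garbled output above was displayed in such a console).
CP437_CONTROL_GLYPHS = set("☺☻♥♦♣♠")
# Control characters that are normal in extracted text.
ALLOWED_CONTROLS = set("\n\r\t")


def suspicious_chars(text):
    """Return the set of characters that hint at a broken character map."""
    return {
        ch
        for ch in text
        if ch in CP437_CONTROL_GLYPHS
        or (ch not in ALLOWED_CONTROLS and unicodedata.category(ch) == "Cc")
    }
```

An empty result for a page means nothing obviously wrong was found; a non-empty one is a hint that the font's ToUnicode mapping was not applied correctly.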

@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Jan 6, 2023
@pubpub-zz
Collaborator

@lenemeth can you provide your pdf please for review

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jan 8, 2023
First Part fixing py-pdf#1091 (late)
Analysis of  'Hungarian' py-pdf#1533 still in progress
@lenemeth
Author

lenemeth commented Jan 8, 2023

@lenemeth can you provide your pdf please for review

@pubpub-zz please provide an email address so that I can send it. It contains personal data (invoice) so I don't want to publicly share it. Thanks for your understanding.

@MartinThoma
Member

@lenemeth I know that @pubpub-zz values privacy and I could imagine that he wants to keep his email address private. If you want, you can send it to me and I can forward it: [email protected]

@lenemeth
Author

lenemeth commented Jan 9, 2023

@MartinThoma sent via email. Please share with @pubpub-zz privately.

@MartinThoma
Member

I did. Thanks for sharing :-)

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jan 9, 2023
error with multiple lines
@pubpub-zz
Collaborator

@lenemeth,
thanks for your contribution. The extraction was buggy with Unicode CMaps in which ranges were spread over multiple lines.

Can you check that the PR now works for you? I will add a test for coverage.
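
To illustrate the failure mode: in a ToUnicode CMap, a `bfrange` entry can list its destinations as an array, and nothing prevents that array from continuing on the next line. A parser that handles each line separately then sees an incomplete entry. The following is a minimal sketch of a tokenizer-based approach that ignores line breaks; it is not the code used in PyPDF2 or in the PR, `parse_bfrange_block` is a hypothetical name, and it assumes well-formed input with destinations encoded as UTF-16BE hex strings.

```python
import re


def parse_bfrange_block(block):
    """Parse the text between beginbfrange and endbfrange into {code: str}."""
    # Tokenize on hex strings and brackets only, so newlines do not matter.
    tokens = re.findall(r"<[0-9A-Fa-f]+>|[\[\]]", block)
    mapping = {}
    i = 0
    while i < len(tokens):
        lo = int(tokens[i][1:-1], 16)
        hi = int(tokens[i + 1][1:-1], 16)
        i += 2
        if tokens[i] == "[":
            # Array form: one destination per source code, possibly
            # spread over several lines.
            i += 1
            code = lo
            while tokens[i] != "]":
                mapping[code] = bytes.fromhex(tokens[i][1:-1]).decode("utf-16-be")
                code += 1
                i += 1
            i += 1  # skip the closing bracket
        else:
            # Range form: a single start value, incremented per source code.
            hex_str = tokens[i][1:-1]
            width = len(hex_str) // 2
            base = int(hex_str, 16)
            for offset in range(hi - lo + 1):
                value = (base + offset).to_bytes(width, "big")
                mapping[lo + offset] = value.decode("utf-16-be")
            i += 1
    return mapping
```

For example, the array in `<0001> <0003> [<00E1> <00F3>` followed on the next line by `<00E9>]` is read as a single entry covering codes 1 to 3.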

@pubpub-zz
Collaborator

test file for test coverage
iss1533.pdf

@lenemeth
Author

lenemeth commented Jan 9, 2023

@pubpub-zz I've checked with all of my invoice types and it works well. Thanks for the correction!

@lenemeth lenemeth closed this as completed Jan 9, 2023
@MartinThoma MartinThoma reopened this Jan 9, 2023
@MartinThoma
Member

Thank you for confirming that it works and thank you for sharing the PDF for investigation. We will close this issue once the PR is merged :-) I guess we will have a fixed version on PyPI on Sunday.

@pubpub-zz Thank you so much for taking care of this again 🙏

MartinThoma pushed a commit that referenced this issue Jan 21, 2023
@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Mar 25, 2023
@gzeng11
gzeng11 commented Jul 29, 2023

I have tried to use PyPDF2 to chat with PDFs using OpenAI and LangChain. For any PDF file whose text cannot be copied, it throws "IndexError: list index out of range".

If I run the following code:

from PyPDF2 import PdfReader

reader = PdfReader(filePath)

for page in reader.pages:
    text = page.extract_text()
    print(text)

For this type of PDF file, it prints nothing.

Thanks.

Guoping

@MartinThoma
Member

PyPDF2 is deprecated. Use pypdf.
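
For the snippet above, switching should mostly come down to the import, since pypdf provides the same `PdfReader`, `reader.pages` and `page.extract_text()` names. A minimal sketch, assuming pypdf is installed (`pip install pypdf`); `read_pdf_text` is a hypothetical helper name:

```python
def read_pdf_text(file_path):
    """Return the extracted text of every page in the PDF at file_path."""
    # Imported here so the sketch can be loaded even before pypdf is installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    return [page.extract_text() for page in reader.pages]
```

Pages that come back as empty strings may simply have no text layer (scanned images, for instance), in which case no text extraction library can return anything without OCR.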
