PyPDF2 throws exception during extract_text() #1533
Comments
At first glance, this looks like a duplicate of #1091.
Thanks! I've tried this one and it seems to be working. However, there is now another issue: the character set of the returned text seems to be slightly broken, as Hungarian letters (iso-8859-2 / "Latin-2") are unreadable. I got this: sz♥mlakibocs♥t♦hoz t♣rt☺n☻ regisztr♥ci♦ I'm not sure whether it's because of this particular PDF type, but the rest of the invoices using a similar alphabet look fine :)
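A check like the one below can help triage which invoices are affected by this kind of output. It is only a heuristic sketch, not part of PyPDF2 or pypdf: the names `suspicious_chars` and `looks_garbled` and the 2% threshold are assumptions made for illustration. It flags text that contains control characters or symbol glyphs such as ♥ and ☺, which rarely appear in running prose.

```python
import unicodedata


def suspicious_chars(text: str) -> list:
    """Return characters that rarely belong in ordinary prose.

    Hypothetical helper: flags control characters (category Cc) and
    "other symbols" (category So), ignoring normal line breaks and tabs.
    """
    found = []
    for ch in text:
        if ch in "\n\r\t":
            continue
        if unicodedata.category(ch) in ("Cc", "So"):
            found.append(ch)
    return found


def looks_garbled(text: str, threshold: float = 0.02) -> bool:
    """Guess whether extracted text is mis-decoded.

    The threshold (share of suspicious characters) is an arbitrary
    assumption and may need tuning for real documents.
    """
    if not text:
        return False
    return len(suspicious_chars(text)) / len(text) >= threshold


sample = "sz♥mlakibocs♥t♦hoz t♣rt☺n☻ regisztr♥ci♦"
print(looks_garbled(sample), suspicious_chars(sample))
```

Text with correctly decoded accented letters passes this check, because those letters are ordinary lowercase letters in Unicode, so the heuristic only reacts to the replacement glyphs.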
@lenemeth Can you please provide your PDF for review?
First Part fixing py-pdf#1091 (late) Analysis of 'Hungarian' py-pdf#1533 still in progress
@pubpub-zz Please provide an email address so that I can send it. It contains personal data (invoice), so I don't want to share it publicly. Thanks for your understanding.
@lenemeth I know that @pubpub-zz values privacy and I could imagine that he wants to keep his email address private. If you want, you can send it to me and I can forward it: [email protected]
@MartinThoma Sent via email. Please share with @pubpub-zz privately.
I did. Thanks for sharing :-)
error with multiple lines
@lenemeth, can you check that the PR now works for you? I will add a test for coverage.
test file for test coverage
@pubpub-zz I've checked with all of my invoice types and it works well. Thanks for the correction!
Thank you for confirming that it works and thank you for sharing the PDF for investigation. We will close this issue once the PR is merged :-) I guess we will have a fixed version on PyPI on Sunday. @pubpub-zz Thank you so much for taking care of this again 🙏
I have tried to use PyPDF2 to chat with PDFs using OpenAI and LangChain. For any PDF file whose content cannot be copied, it throws "IndexError: list index out of range". If I run the following code:
from PyPDF2 import PdfReader
reader = PdfReader(filePath)
for page in reader.pages:
For this type of PDF file, it prints nothing. Thanks. Guoping
PyPDF2 is deprecated. Use pypdf.
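For reference, the loop quoted above could be ported to pypdf roughly as follows. This is a minimal sketch, not a fix for the underlying error: the helper names `extract_pages` and `extract_file` are hypothetical, and the per-page `try`/`except` is simply one way to keep a single failing page (such as the reported `IndexError`) from aborting the whole document.

```python
from typing import Iterable, List, Optional


def extract_pages(pages: Iterable) -> List[Optional[str]]:
    """Return the text of each page, or None where extraction raised.

    Works with any objects exposing extract_text(), such as pypdf pages.
    """
    results: List[Optional[str]] = []
    for number, page in enumerate(pages, start=1):
        try:
            results.append(page.extract_text())
        except Exception as exc:  # e.g. "IndexError: list index out of range"
            print(f"page {number}: extraction failed ({exc!r})")
            results.append(None)
    return results


def extract_file(path: str) -> List[Optional[str]]:
    """Open a PDF with pypdf and extract text page by page."""
    # Imported lazily so extract_pages() stays usable without pypdf installed.
    from pypdf import PdfReader

    return extract_pages(PdfReader(path).pages)
```

If a page yields an empty string instead of raising, one likely cause is that it has no text layer (for example a scanned image). pypdf does not perform OCR, so such pages would need a separate OCR step.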
I'm working on a script that parses PDF invoices, and I'm getting an exception while reading a PDF. This happens only with a specific type of PDF coming from a tap water utility company. However, all PDFs from them fail to parse with the same error.
Environment
Windows 10
Code + PDF
I can share the PDF by email, as it contains personal data (invoice). Let me know where to send it.
Traceback