KeyError in images.items() if the PDF text content has " BI " present in it. #2456
Comments
Thanks for reporting. Do you have a suitable fix in mind which would resolve such issues? The only way I could think of directly would be to search for the triple
The issue is more likely one of identifying whether the BI is within a string or not. If the BI is outside any string, the number of "(" and ")" before it should be the same; if it is within a string, there is an unclosed "(" before it. To confirm this and improve the test, I need a test file with the following text: @snanda85, can you produce it?
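The parenthesis-counting idea from this comment can be sketched roughly as follows. This is only an illustration, not pypdf's implementation: the function name `inside_literal_string` is hypothetical, and the sketch assumes the content stream uses only PDF literal strings delimited by "(" and ")" with backslash escapes. It ignores hex strings, comments, and binary inline-image data.

```python
def inside_literal_string(content: bytes, pos: int) -> bool:
    """Return True if byte offset `pos` lies inside an unclosed "(...)" string.

    Hypothetical helper: scans the content stream from the start and tracks
    the nesting depth of unescaped parentheses.
    """
    depth = 0
    i = 0
    while i < pos:
        byte = content[i:i + 1]
        if byte == b"\\" and depth > 0:
            # A backslash inside a string escapes the next byte, e.g. "\)".
            i += 2
            continue
        if byte == b"(":
            depth += 1
        elif byte == b")" and depth > 0:
            depth -= 1
        i += 1
    return depth > 0


# Made-up content stream: the first " BI " is plain text, the second is an operator.
stream = b"BT (text with BI inside) Tj ET BI /W 1 /H 1 ID x EI"
first = stream.index(b" BI ")
second = stream.index(b" BI ", first + 1)
print(inside_literal_string(stream, first), inside_literal_string(stream, second))
```

A check like this would only be a pre-filter before treating a BI token as the start of an inline image, and it rescans the stream from the beginning on every call.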
@stefan6419846 I have not implemented a fix yet. There are multiple lines of thinking.
Solution 1. Refill the
Solution 2. Fix the regexes to handle all scenarios where the text BI can appear in content. This does not seem foolproof.
Solution 3. Simply catch the KeyError when reading images. This ensures the code doesn't break and adds no processing overhead in valid scenarios. For now, this is what I have implemented in my wrapper code.
Please share your thoughts, and I will raise a PR with the implementation.
@pubpub-zz Here you go. An improved test file with more combinations.
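A wrapper along the lines of Solution 3 might look like the sketch below. The actual wrapper code was not posted in the thread, so the function name `read_images_safely` is hypothetical. The only assumption about the page object is that it exposes `page.images.items()`, as in the report.

```python
def read_images_safely(page):
    """Return the page's (name, image) pairs, or [] if the lookup fails.

    Hypothetical wrapper: `page` is any object exposing `images.items()`.
    """
    try:
        return list(page.images.items())
    except KeyError:
        # A stray " BI " in the text content led to a lookup for an image
        # that does not exist; treat the page as having no readable images.
        return []
```

The trade-off is that a single bad lookup discards every image on that page, including valid ones, so this works as a stopgap in calling code rather than a fix for the underlying detection.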
I've completed your test file by adding an inline image:
Calling page.images.items() and looping over the images breaks if the text content of the PDF contains ' BI ' (BI surrounded by whitespace).
Environment
Code + PDF
This is a minimal, complete example that shows the issue:
Attached is a sample PDF I created that can reproduce this error.
BI_test.pdf
The PDF can be added to the tests as well.
Traceback
This is the complete output and traceback I see: