KeyError in images.items() if the PDF text content has " BI " present in it. #2456

Closed
snanda85 opened this issue Feb 15, 2024 · 4 comments · Fixed by #2459
Labels
workflow-images From a users perspective, image handling is the affected feature/workflow

Comments

@snanda85
Contributor

page.images.items() and looping over page.images raise a KeyError if the text content of the PDF contains ' BI ' (the letters BI surrounded by whitespace).

Environment

$ python -m platform
Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader("BI_test.pdf")

for page in reader.pages:
    print(f"Image Keys: {page.images.keys()}")
    print(page.images.items())

Attached is a sample PDF I created that can reproduce this error.
BI_test.pdf

The PDF can be added to the tests as well.

Traceback

This is the complete output and traceback I see:

$ python BI_test.py
Image Keys: ['~0~']
Traceback (most recent call last):
  File "/home/ubuntu/repos/permute/policy-pdf-parser/BI_test.py", line 7, in <module>
    print(page.images.items())
  File "/home/ubuntu/repos/permute/policy-pdf-parser/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2392, in items
    return [(x, self[x]) for x in self.ids_function()]
  File "/home/ubuntu/repos/permute/policy-pdf-parser/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2392, in <listcomp>
    return [(x, self[x]) for x in self.ids_function()]
  File "/home/ubuntu/repos/permute/policy-pdf-parser/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2412, in __getitem__
    return self.get_function(index)
  File "/home/ubuntu/repos/permute/policy-pdf-parser/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 494, in _get_image
    return self.inline_images[id]
KeyError: '~0~'
stefan6419846 added the workflow-images label Feb 15, 2024
@stefan6419846
Collaborator

Thanks for reporting. Do you have a suitable fix in mind which would resolve such issues? The only approach I can think of right away would be to search for the triple BI ... ID ... EI, but this could be tricked as well.
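To illustrate why searching for the triple could be tricked, here is a minimal Python sketch. The regular expression and the helper name looks_like_inline_image are assumptions made for illustration only; they are not taken from pypdf.

```python
import re

# Hypothetical pattern: a BI token, later an ID token, later an EI token.
# This is an illustrative assumption, not the regex pypdf uses.
TRIPLE = re.compile(rb"\bBI\b.*?\bID\b.*?\bEI\b", re.DOTALL)


def looks_like_inline_image(content: bytes) -> bool:
    """Return True if the content stream contains a BI ... ID ... EI sequence."""
    return TRIPLE.search(content) is not None


# A real inline image is detected.
real_image = b"q BI /W 1 /H 1 ID \x00 EI Q"
# Plain text containing only BI is no longer misdetected.
text_with_bi = b"BT (text with BI in it) Tj ET"
# Text that happens to contain all three tokens still fools the search.
tricky_text = b"BT (BI report) Tj (ID card) Tj (EI value) Tj ET"

for sample in (real_image, text_with_bi, tricky_text):
    print(looks_like_inline_image(sample))
```

The third sample contains no image at all, yet the three tokens appear in order inside text strings, so a pure token search cannot tell it apart from a real inline image.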

@pubpub-zz
Collaborator

> Thanks for reporting. Do you have a suitable fix in mind which would resolve such issues? The only way I could think of directly would be to search for the triple BI ... ID ... EI, but this could be tricked as well.

The real problem is more likely to be determining whether the BI is within a string or not. If the BI is a genuine operator rather than part of a string, the number of unescaped "(" and ")" before it should be the same. To confirm this and improve the test, I need a test file with the following text:
This is test also with BI in the text
I put first ) and then an other BI

@snanda85, can you produce it?
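The parenthesis check described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the function name find_bi_operators is hypothetical, and the scan ignores hex strings, comments, and the binary data of inline images themselves, all of which a real parser would have to handle.

```python
def find_bi_operators(content: bytes) -> list[int]:
    """Return offsets of standalone BI tokens that lie outside literal strings."""
    offsets = []
    depth = 0  # nesting depth of literal strings "( ... )"
    i = 0
    n = len(content)
    while i < n:
        c = content[i:i + 1]
        if c == b"\\" and depth > 0:
            # Inside a string, a backslash escapes the next byte, e.g. "\)".
            i += 2
            continue
        if c == b"(":
            depth += 1
        elif c == b")" and depth > 0:
            depth -= 1
        elif depth == 0 and content[i:i + 2] == b"BI":
            # Treat the start and end of the stream as whitespace.
            before = content[i - 1:i] if i > 0 else b" "
            after = content[i + 2:i + 3] or b" "
            if before.isspace() and after.isspace():
                offsets.append(i)
                i += 2
                continue
        i += 1
    return offsets


# Content stream using the two requested test sentences plus one real inline image.
stream = (
    b"BT (This is test also with BI in the text) Tj "
    b"(I put first \\) and then an other BI ) Tj ET\n"
    b"BI /W 1 /H 1 ID x EI"
)
print(find_bi_operators(stream))
```

Only the BI after ET is reported, because both occurrences inside the text strings are seen at a non-zero parenthesis depth, including the one that follows the escaped ")".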

@snanda85
Contributor Author

snanda85 commented Feb 16, 2024

@stefan6419846 I have not implemented a fix yet. There are several possible approaches.

Solution 1. Refill inline_images_keys while reading the inline images in _get_inline_images, instead of relying solely on _get_ids_image.

Solution 2. Fix the regexes to handle every scenario in which the text BI can appear in the content. This does not seem foolproof.

Solution 3. Simply catch the KeyError when reading images. This keeps the code from breaking and adds no processing overhead in valid scenarios. For now, this is what I have implemented in my wrapper code.
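A minimal sketch of what such a wrapper for Solution 3 could look like is shown below. The helper name safe_image_items is hypothetical and this is not the actual wrapper code; it relies only on page.images.keys() and indexing page.images by key, as seen in the example and traceback above.

```python
def safe_image_items(page):
    """Collect (key, image) pairs, skipping keys that cannot be resolved."""
    items = []
    for key in page.images.keys():
        try:
            items.append((key, page.images[key]))
        except KeyError:
            # The key was reported by the id scan, but no matching
            # inline image exists (e.g. ' BI ' found in plain text).
            continue
    return items


# Possible usage with pypdf (assumes BI_test.pdf is available locally):
#
#   from pypdf import PdfReader
#   for page in PdfReader("BI_test.pdf").pages:
#       print(safe_image_items(page))
```

This works around the symptom only: the phantom key still appears in page.images.keys(), it is just skipped when the image is fetched.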

Please share your thoughts, and I will raise a PR with the implementation.

@pubpub-zz Here you go. An improved test file with more combinations.
BI_test_2.pdf
BI_test_2_qdf.pdf

@pubpub-zz
Collaborator

I've extended your test file by adding an inline image:
BI_text_with_one_image.pdf
