KeyError in images.items() if the PDF text content has " BI " present in it. #2456

snanda85 · 2024-02-15T09:47:58Z

Calling page.images.items() or looping over page.images raises a KeyError if the text content of the PDF contains ' BI ' (BI surrounded by whitespace).

Environment

$ python -m platform
Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader("BI_test.pdf")

for page in reader.pages:
    print(f"Image Keys: {page.images.keys()}")
    print(page.images.items())

Attached is a sample PDF I created that can reproduce this error.
BI_test.pdf

The PDF can be added to the tests as well.

Traceback

This is the complete output and traceback I see:

$ python BI_test.py
Image Keys: ['~0~']
Traceback (most recent call last):
  File "/home/ubuntu/repos/permute/policy-pdf-parser/BI_test.py", line 7, in <module>
    print(page.images.items())
  File "/home/ubuntu/repos/permute/policy-pdf-parser/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2392, in items
    return [(x, self[x]) for x in self.ids_function()]
  File "/home/ubuntu/repos/permute/policy-pdf-parser/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2392, in <listcomp>
    return [(x, self[x]) for x in self.ids_function()]
  File "/home/ubuntu/repos/permute/policy-pdf-parser/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2412, in __getitem__
    return self.get_function(index)
  File "/home/ubuntu/repos/permute/policy-pdf-parser/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 494, in _get_image
    return self.inline_images[id]
KeyError: '~0~'

stefan6419846 · 2024-02-15T11:01:02Z

Thanks for reporting. Do you have a suitable fix in mind which would resolve such issues? The only way I could think of directly would be to search for the triple BI ... ID ... EI, but this could be tricked as well.
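
A minimal sketch of the BI ... ID ... EI triple search mentioned above, and of how it can be tricked. The pattern, the function name and the sample content streams are illustrative assumptions, not pypdf's actual implementation.

```python
import re

# Hypothetical pattern: the tokens BI, ID and EI, each delimited by whitespace,
# appearing in that order. This is an assumption for illustration only.
INLINE_IMAGE = re.compile(rb"(?<!\S)BI\s.*?\sID\s.*?\sEI(?!\S)", re.DOTALL)


def count_inline_image_candidates(content: bytes) -> int:
    """Count spans in a content stream that look like BI ... ID ... EI."""
    return len(INLINE_IMAGE.findall(content))


# Made-up content streams for the three interesting cases.
text_only = b"BT (This has BI in the text) Tj ET"
real_image = b"q BI /W 1 /H 1 /BPC 8 /CS /G ID \x00 EI Q"
tricky = b"BT (a BI then ID then EI b) Tj ET"

print(count_inline_image_candidates(text_only))   # 0: a lone BI is no longer a match
print(count_inline_image_candidates(real_image))  # 1: a genuine inline image
print(count_inline_image_candidates(tricky))      # 1: text containing all three tokens still fools it
```

The triple search rejects text that only contains BI, but text containing all three tokens in order is still misidentified.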

pubpub-zz · 2024-02-15T18:45:02Z

The real difficulty is identifying whether the BI is within a string or not. If BI is outside a string, the number of "(" and ")" before the BI should be the same. To confirm this and improve the test, I need a test file with the following text:
This is test also with BI in the text
I put first ) and then an other BI

@snanda85, can you produce it?
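
One way to read the parenthesis-counting idea above is to track the string nesting depth while scanning the content stream, and to accept a BI token only at depth 0. The sketch below is a simplified assumption of how that could look: the function name is hypothetical, and it ignores hex strings, comments and the binary data of inline images, which a real parser would have to handle.

```python
def find_bi_outside_strings(content: bytes) -> list:
    """Return byte offsets of whitespace-delimited BI tokens outside (...) strings."""
    offsets = []
    depth = 0  # current nesting depth of literal strings
    i = 0
    while i < len(content):
        char = content[i:i + 1]
        if char == b"\\" and depth > 0:
            i += 2  # skip the escaped character, e.g. \) inside a string
            continue
        if char == b"(":
            depth += 1
        elif char == b")" and depth > 0:
            depth -= 1
        elif depth == 0 and content[i:i + 2] == b"BI":
            # Treat the start and end of the stream as whitespace.
            before = content[i - 1:i] if i > 0 else b" "
            after = content[i + 2:i + 3] or b" "
            if before.isspace() and after.isspace():
                offsets.append(i)
                i += 2
                continue
        i += 1
    return offsets


# The requested test text, written as PDF literal strings (the lone ")" must be escaped).
text_stream = (
    b"BT (This is test also with BI in the text) Tj "
    b"(I put first \\) and then an other BI) Tj ET"
)
image_stream = b"q BI /W 1 /H 1 /BPC 8 /CS /G ID \x00 EI Q"

print(find_bi_outside_strings(text_stream))   # []
print(find_bi_outside_strings(image_stream))  # [2]
```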

snanda85 · 2024-02-16T06:39:56Z

@stefan6419846 I have not implemented a fix yet. I see several possible approaches.

Solution 1. Refill the inline_images_keys while reading the inline images in _get_inline_images, instead of relying solely on _get_ids_image.

Solution 2. Fix the regexes to handle all scenarios where BI text can appear in content. This does not seem foolproof.

Solution 3. Simply catch the KeyError when reading images. This ensures the code doesn't break and does not add any processing overhead in valid scenarios. For now, this is what I have implemented in my wrapper code.
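
Solution 3 could be sketched in wrapper code roughly as follows. The helper name safe_image_items is hypothetical; it only assumes that page.images supports keys() and indexing by key, as the reproduction script and traceback above show.

```python
def safe_image_items(page):
    """Return (key, image) pairs for a page, skipping keys that raise KeyError."""
    items = []
    for key in page.images.keys():
        try:
            items.append((key, page.images[key]))
        except KeyError:
            # The key was listed but no matching image exists,
            # e.g. the phantom '~0~' key from the traceback above.
            continue
    return items


# Possible usage with the reproduction file (requires pypdf and BI_test.pdf):
#
#   from pypdf import PdfReader
#   reader = PdfReader("BI_test.pdf")
#   for page in reader.pages:
#       print(safe_image_items(page))
```

This hides the symptom rather than fixing the key detection, so any genuine image whose lookup fails for another reason would also be skipped silently.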

Please share your thoughts, and I will raise a PR with the implementation.

@pubpub-zz Here you go. An improved test file with more combinations.
BI_test_2.pdf
BI_test_2_qdf.pdf

pubpub-zz · 2024-02-18T14:35:45Z

I've extended your test file by adding an inline image:
BI_text_with_one_image.pdf


stefan6419846 added the workflow-images label Feb 15, 2024

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Feb 18, 2024

BUG: BI in text content identified as image tag

afb162c

fixes py-pdf#2456

pubpub-zz mentioned this issue Feb 18, 2024

BUG: BI in text content identified as image tag #2459

Merged

stefan6419846 closed this as completed in #2459 Feb 20, 2024

stefan6419846 pushed a commit that referenced this issue Feb 20, 2024

BUG: BI in text content identified as image tag (#2459)

9245c6a

Fixes #2456
