OSError when accessing images #1801

Closed
elisabethzinck opened this issue Apr 17, 2023 · 8 comments

elisabethzinck commented Apr 17, 2023

I am trying to extract text from a PDF. I first try extracting the text using extract_text(). If that returns no text, I want to get the page's image(s) so that I can extract the text using OCR.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-13.2.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf.__version__)"
3.8.0

I have also installed Pillow==9.4.0.

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

file_path = 'test_img.pdf'
reader = PdfReader(file_path)

page0 = reader.pages[0]
text = page0.extract_text()

if not text.strip():
    # No text layer was found; accessing .images raises the OSError below.
    page0.images

The pdf used in the example is included below:
test_img.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/PIL/PngImagePlugin.py", line 1286, in _save
    rawmode, mode = _OUTMODES[mode]
                    ~~~~~~~~~^^^^^^
KeyError: 'PA'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_page.py", line 444, in images
    extension, byte_stream = _xobj_to_image(x_object[obj])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/filters.py", line 690, in _xobj_to_image
    img.save(img_byte_arr, format="PNG")
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/PIL/Image.py", line 2432, in save
    save_handler(self, fp, filename)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/PIL/PngImagePlugin.py", line 1289, in _save
    raise OSError(msg) from e
OSError: cannot write mode PA as PNG

Is there any way the format can be determined in pypdf so we don't get the error from PIL? Or is there a way around the error?
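
One possible stopgap for the question above is to catch the OSError when accessing the images, so that a single problematic image does not abort the whole run. This is only a sketch: `text_or_images` is a hypothetical helper, not part of pypdf, and it assumes only that the page object offers `extract_text()` and `images` as used in the example above. It does not recover the image; it only reports the failure.

```python
from typing import Any, List, Optional, Tuple


def text_or_images(page: Any) -> Tuple[str, List[Any], Optional[str]]:
    """Return (text, images, error) for one page.

    `page` is assumed to behave like the pypdf page used above: it needs
    an `extract_text()` method and an `images` attribute.
    """
    text = page.extract_text() or ""
    if text.strip():
        # The page has a text layer, so OCR is not needed.
        return text, [], None
    try:
        # Accessing `images` is the step that raised OSError in the traceback.
        images = list(page.images)
    except OSError as exc:
        # The image could not be re-encoded; report the error instead of crashing.
        return "", [], str(exc)
    return "", images, None
```

With the reader from the example, this could be called as `text, images, error = text_or_images(reader.pages[0])`; a non-empty `error` would mean the page still needs different handling, for example a pypdf version that includes a fix.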

MartinThoma added a commit that referenced this issue Apr 17, 2023
The PDF contained an image in PA mode:
* P: 8-bit pixels, mapped to any other mode using a color palette
* PA: P with alpha

See #1801
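
To make the two mode descriptions concrete, here is a pure-Python sketch of what "P with alpha" means: each pixel stores a palette index plus its own alpha value, and expanding it to RGBA is a palette lookup followed by appending the alpha. The function name and the data layout are illustrative assumptions; this is not how Pillow stores PA images internally, and it is not the code of the fix.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def expand_pa_to_rgba(
    indices: Sequence[int],
    alphas: Sequence[int],
    palette: Sequence[RGB],
) -> List[RGBA]:
    """Expand palette indices with per-pixel alpha into RGBA tuples."""
    if len(indices) != len(alphas):
        raise ValueError("indices and alphas must have the same length")
    # Look up each index in the palette, then append that pixel's alpha.
    return [tuple(palette[i]) + (a,) for i, a in zip(indices, alphas)]


# Two-colour palette: index 0 is black, index 1 is white.
palette = [(0, 0, 0), (255, 255, 255)]
pixels = expand_pa_to_rgba([0, 1, 1], [255, 0, 128], palette)
```

In Pillow, `Image.convert("RGBA")` should perform a comparable expansion, and RGBA is one of the modes the PNG writer accepts.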
@MartinThoma
Member

Thank you for adding such a good issue description :-)

I think I found and fixed the problem in #1802. Would you mind checking whether that solves the issue for you?

@MartinThoma
Member

I'm a bit confused, as the extracted image does not look like the image in the PDF, so my fix is likely flawed. Does anybody have an idea what the issue is?

@elisabethzinck
Author

Glad you liked my issue description - it's the first issue I've opened :)

Yes, after installing the version from the issue-1801 branch, I no longer get the error. But you're right: in the output image, the text "Test" is drawn with some strange horizontal lines.

@MartinThoma
Member

MartinThoma commented Apr 18, 2023

I think I made it work :-) Here is the image: img-Im1

Is it OK if I add test_img.pdf to https://github.com/py-pdf/sample-files?

@elisabethzinck
Author

Awesome - it works perfectly now! Thank you for your help.

Sure! :)

MartinThoma added a commit that referenced this issue Apr 19, 2023
The PDF contained an image in PA mode:
* P: 8-bit pixels, mapped to any other mode using a color palette
* PA: P with alpha

See #1801
@MartinThoma
Member

The fix was just merged into main and will be released on Sunday in the next pypdf version after 3.8.0.

@MartinThoma
Member

Thank you for your help!

If you want, I can add you as a contributor: https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

@elisabethzinck
Author

Great - looking forward to using the new version.

Thanks for the offer, but no thanks. :)
