bug/PIL.UnidentifiedImageError: cannot identify image file #3102

Closed
udit-pandey-1 opened this issue May 26, 2024 · 26 comments
Assignees
Labels
awaiting-response bug Something isn't working pdf

Comments

@udit-pandey-1

Describe the bug
I am getting the following error when extracting text and images from pdf:
PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpjy0tjjjd/2c2e244f-8f8e-46de-a7bc-2ecfbaa254ea-566.ppm'
image

To Reproduce
The way I am using unstructured is:
image

Expected behavior
Ideally, all the images in the pdf must be extracted.
If a failure does occur, image extraction should not abort the whole document (in my case, the PDF has 800 pages and it fails after going through 600 pages). For layouts where image extraction failed, the metadata could carry a flag indicating the failure along with the reason for it. A flag passed when calling partition() should let us get the elements even when such failures occur.

Environment Info
image

Any kind of quickfix to get elements even in case of failure would also be appreciated.
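
One possible stopgap, sketched under assumptions: if the large PDF is first split into smaller files with a separate tool, each piece can be partitioned independently so that one failing piece does not discard the rest. The helper name `partition_many` and its `partition_fn` parameter are hypothetical and not part of unstructured; the partitioning call is passed in by the caller.

```python
from typing import Any, Callable, Iterable


def partition_many(
    paths: Iterable[str],
    partition_fn: Callable[[str], list[Any]],
) -> tuple[list[Any], dict[str, str]]:
    """Partition each path on its own, recording failures instead of raising.

    Hypothetical helper: returns (all elements that could be produced,
    {path: "ExceptionType: message"} for every path that failed).
    """
    elements: list[Any] = []
    failures: dict[str, str] = {}
    for path in paths:
        try:
            elements.extend(partition_fn(path))
        except Exception as exc:  # deliberately broad: keep going on any error
            failures[path] = f"{type(exc).__name__}: {exc}"
    return elements, failures
```

With unstructured installed, this could be called as `partition_many(chunk_paths, lambda p: partition_pdf(filename=p, strategy="hi_res"))`, where `chunk_paths` is the assumed list of pre-split files. The `failures` mapping then shows which pieces could not be processed and why.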

@udit-pandey-1 udit-pandey-1 added the bug Something isn't working label May 26, 2024
@udit-pandey-1 udit-pandey-1 changed the title bug/RuntimeError: Some images were not loaded. bug/PIL.UnidentifiedImageError: cannot identify image file May 26, 2024
@scanny scanny added the pdf label May 27, 2024
@MthwRobinson
Contributor

Hi @udit-pandey-1 - could you provide a URL that we could use to reproduce? I'd also give our SaaS API a try. Our unstructured-python-client library will split the PDF up and distribute it across multiple workers, which should give you faster processing times.

@vegetableman

Hi @MthwRobinson, I got the above error in this file: https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf
Appreciate your efforts.

@christinestraub
Collaborator

christinestraub commented May 30, 2024

Hi @vegetableman, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get those errors with those versions.

$ pip install unstructured -U
$ pip install unstructured-inference -U

from unstructured.partition.auto import partition

elements = partition(
    url="https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)
print("\n\n".join([str(el) for el in elements]))

@vegetableman

The latest versions worked for me 👍... I was using the specific versions mentioned here: #2566 (comment)
Thank you, Christine!

However, partition_pdf does not support loading PDF files through a url parameter, unless I am mistaken. I had to use the filename parameter instead.

@christinestraub
Collaborator

Yes, as of now, partition_pdf does not support loading pdf files through a url parameter. Do we plan to do this? @MthwRobinson

@MthwRobinson
Contributor

We don't plan to add that in partition_pdf as of now, though I believe that works in partition and will detect the MIME type from the HTTP response.

@vegetableman

@MthwRobinson that worked 👍 . My bad. Missed the module auto. Thank you!

@udit-pandey-1
Author

udit-pandey-1 commented May 31, 2024

@christinestraub the issue is still occurring for me after upgrading the mentioned packages.

We are seeing this issue on Ubuntu 20.04.


@christinestraub
Collaborator

christinestraub commented Jun 19, 2024

@udit-pandey-1, I tried to partition the reference PDF file on both macOS and Ubuntu (22.04). It worked as expected and I couldn't reproduce the error. Can you please try again?

Environment:

unstructured==0.14.6
unstructured-inference==0.7.35

Code:

from unstructured.partition.auto import partition

elements = partition(
    url="https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)

print("\n\n".join([str(el) for el in elements]))

@udit-pandey-1
Author

still the same @christinestraub

unstructured==0.14.6
unstructured-inference==0.7.36
image

@christinestraub
Collaborator

@udit-pandey-1 Can you confirm that you have installed the following system dependencies?

  • libmagic-dev (filetype detection)
  • poppler-utils (images and PDFs)
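
A quick way to confirm that the poppler command-line tools are visible to the Python process is to look them up on PATH. This is only a sketch: the function name `find_system_tools` is made up for illustration, and it checks executables only, so it says nothing about whether the libmagic shared library is installed.

```python
import shutil
from typing import Dict, Optional, Sequence


def find_system_tools(
    names: Sequence[str] = ("pdftoppm", "pdftocairo"),
) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH.

    pdftoppm and pdftocairo are both shipped in poppler-utils.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_system_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

Running this inside the same container or virtual environment that runs the partitioning code matters, since PATH can differ between an interactive shell and a deployed process.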

@udit-pandey-1
Author

libmagic-dev wasn't there. I installed it and then used the same code as above. It still failed with the same error.

@sanyamjain0315

Has there been any progress on this issue? I am facing the same problem, even after having tried everything.

@tpakeman

tpakeman commented Sep 3, 2024

Hi there I'm having the same issue:
Python 3.10.12

unstructured                     0.14.6
unstructured-client              0.25.6
unstructured-inference           0.7.35
unstructured.pytesseract         0.3.13

Unfortunately I can't share the documents as they contain proprietary information.

This is happening for every PDF in a folder of 50. All were generated from HTML files by downloading with Chrome and saving as PDF.

Stacktrace:

---------------------------------------------------------------------------
UnidentifiedImageError                    Traceback (most recent call last)
<ipython-input-21-a26b75af5795> in <cell line: 4>()
      4 for k in data.keys():
      5   fpath = f"/path/to/file/{k}"
----> 6   els = partition_pdf(filename=fpath, 
      7                       max_partition=1500,
      8                       chunking_strategy='by_title',

10 frames
/usr/local/lib/python3.10/dist-packages/unstructured/documents/elements.py in wrapper(*args, **kwargs)
    603             unique_element_ids: bool = call_args.get("unique_element_ids", False)
    604             if unique_element_ids is False:
--> 605                 elements = assign_and_map_hash_ids(elements)
    606 
    607             return elements

/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py in wrapper(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py in wrapper(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/unstructured/chunking/dispatch.py in wrapper(*args, **kwargs)
     72 
     73         # -- call the partitioning function to get the elements --
---> 74         elements = func(*args, **kwargs)
     75 
     76         # -- look for a chunking-strategy argument --

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    208         form_extraction_skip_tables=form_extraction_skip_tables,
    209         **kwargs,
--> 210     )
    211 
    212 

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, languages, metadata_last_modified, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    344     if isinstance(file, bytes):
    345         file = io.BytesIO(file)
--> 346     return _partition_pdf_with_pdfminer(
    347         filename=filename,
    348         file=file,

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in _partition_pdf_or_image_with_ocr(filename, file, include_page_breaks, languages, ocr_languages, is_image, metadata_last_modified, starting_page_number, **kwargs)
    894             tmp_element = element
    895             tmp_text = element.text
--> 896             tmp_coords = element.metadata.coordinates
    897         elif tmp_element and check_coords_within_boundary(
    898             coordinates=element.metadata.coordinates,

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf_image/pdf_image_utils.py in convert_pdf_to_images(filename, file, chunk_size)
    414     date_from_file_object: bool = False,
    415 ) -> str | None:
--> 416     last_modification_date = None
    417     if not file and filename:
    418         last_modification_date = get_last_modified_date(filename=filename)

/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in convert_from_path(pdf_path, dpi, output_folder, first_page, last_page, fmt, jpegopt, thread_count, userpw, ownerpw, use_cropbox, strict, transparent, single_file, output_file, poppler_path, grayscale, size, paths_only, use_pdftocairo, timeout, hide_annotations)
    267                 )
    268             else:
--> 269                 images += parse_buffer_func(data)
    270     finally:
    271         if auto_temp_dir:

/usr/local/lib/python3.10/dist-packages/pdf2image/parsers.py in parse_buffer_to_ppm(data)
     26         size_x, size_y = tuple(size.split(b" "))
     27         file_size = len(code) + len(size) + len(rgb) + 3 + int(size_x) * int(size_y) * 3
---> 28         images.append(Image.open(BytesIO(data[index : index + file_size])))
     29         index += file_size
     30 

/usr/local/lib/python3.10/dist-packages/PIL/Image.py in open(fp, mode, formats)
   3281             raise TypeError(msg) from e
   3282     else:
-> 3283         rawmode = mode
   3284     if mode in ["1", "L", "I", "P", "F"]:
   3285         ndmax = 2

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7e086492d030>

@sidatcd

sidatcd commented Dec 11, 2024

Same error:
Certain files with a .ppm extension throw UnidentifiedImageError with the hi_res strategy.

Packages

python 3.12
unstructured==0.16.10
unstructured-client==0.28.1
unstructured-inference==0.8.1
pytesseract==0.3.13
pillow==11.0.0
unstructured.pytesseract==0.3.13

@scanny
Collaborator

scanny commented Dec 16, 2024

Closing as inactive. Cannot reproduce, assumed resolved. If you're still seeing this and can provide a file that reproduces the error I'll take another look.

@scanny scanny closed this as completed Dec 16, 2024
@sidatcd

sidatcd commented Dec 17, 2024

Same error
python 3.12

Packages

unstructured==0.16.10
unstructured-client==0.28.1
unstructured-inference==0.8.1
pytesseract==0.3.13
pillow==11.0.0
unstructured.pytesseract==0.3.13

CODE:

from PIL import Image as PILImage
from PIL import ImageFile
from unstructured.partition.pdf import partition_pdf

# Ask Pillow to tolerate truncated image data while loading
ImageFile.LOAD_TRUNCATED_IMAGES = True

partitions = partition_pdf(
    url=None,
    filename=filename,  # path to the PDF being processed
    strategy="hi_res",
    extract_images_in_pdf=True,
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,
    max_partition=None,
    unique_element_ids=True,
    extract_image_block_output_dir="/tmp",  # Temporary directory to store images
)

Not for all files, but at least 30% of the files fail with the same error message:
PIL.UnidentifiedImageError: cannot identify image file <temporary ppm file>

Can't share the files as they contain confidential data.

@scanny
Collaborator

scanny commented Dec 17, 2024

Okay, good at least you are able to still reproduce it. I have an idea where to look.

@scanny scanny reopened this Dec 17, 2024
@scanny scanny self-assigned this Dec 17, 2024
@scanny
Collaborator

scanny commented Dec 17, 2024

@sidatcd can you provide me a fresh stack-trace? I can't make any sense of the one earlier in the thread, possibly because of its age.

Also, do you have reason to believe that the problematic PDF files on your side contain PPM images? That is a pretty old format, 1980s-era, but it seems to be the format it is complaining about.

@sidatcd

sidatcd commented Dec 17, 2024

@scanny
Ha,
My initial thought was the same.
But saw the same error on fairly recent Pdfs as well.

relevant trace

line 48, in partition_document
    partitions = partition_pdf(
  File "/var/task/unstructured/documents/elements.py", line 581, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/file_utils/filetype.py", line 725, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/file_utils/filetype.py", line 683, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/partition/pdf.py", line 209, in partition_pdf
    return partition_pdf_or_image(
  File "/var/task/unstructured/partition/pdf.py", line 305, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/var/task/unstructured/utils.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "/var/task/unstructured/partition/pdf.py", line 588, in _partition_pdf_or_image_local
    inferred_document_layout = process_file_with_model(
  File "/var/task/unstructured_inference/inference/layout.py", line 376, in process_file_with_model
    else DocumentLayout.from_file(
  File "/var/task/unstructured_inference/inference/layout.py", line 74, in from_file
    with Image.open(image_path) as image:
  File "/var/task/PIL/Image.py", line 3536, in open
    raise UnidentifiedImageError(msg)

Personally I would be happy if ppm files are not identified.

@scanny
Collaborator

scanny commented Dec 17, 2024

Okay, looks like PPMs are coming from pdftoppm (part of poppler) as part of the process, so that explains the ppm bit anyway.

@scanny
Collaborator

scanny commented Dec 17, 2024

@sidatcd Unfortunately I am unable to reproduce this with the PDF earlier in the thread. I'll have to close it for now because it's not actionable. If you are able to find a shareable document that produces the error we can reopen and I'll have another look.

Maybe you can narrow it down somehow, like maybe capturing the file that's causing the error, perhaps by printing out the path and copying the offending file to a new location where you can inspect it and/or post it, probably around this location in your local install:
https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layout.py#L74

Knowing the type of that file, its size, whether it can be inspected or whether it's possibly corrupted or something, all those could be useful hints. Also whether it happens late in the file (when more memory has been consumed) or earlier.
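
As a sketch of what such an inspection could look like, the helper below reads the header of a binary PGM/PPM file and compares the size implied by the header with the size on disk. The name `inspect_pnm` and its status strings are invented for this example. It assumes the simple header layout without `#` comment lines, so a file with comments in its header would be reported as `bad-header`.

```python
import os
import re

# "P5"/"P6", width, height, maxval, then exactly one whitespace byte before the raster
_HEADER = re.compile(rb"^(P[56])\s+(\d+)\s+(\d+)\s+(\d+)\s")


def inspect_pnm(path: str) -> dict:
    """Classify a binary PGM/PPM file as ok, empty, bad-header, or truncated."""
    size = os.path.getsize(path)
    info = {"path": path, "size": size, "status": "ok"}
    if size == 0:
        info["status"] = "empty"
        return info
    with open(path, "rb") as f:
        head = f.read(64)
    match = _HEADER.match(head)
    if match is None:
        info["status"] = "bad-header"
        info["first_bytes"] = head[:16]
        return info
    magic = match.group(1)
    width, height, maxval = (int(match.group(i)) for i in (2, 3, 4))
    channels = 3 if magic == b"P6" else 1
    bytes_per_sample = 1 if maxval < 256 else 2
    expected = match.end() + width * height * channels * bytes_per_sample
    info.update(width=width, height=height, expected_size=expected)
    if size < expected:
        info["status"] = "truncated"
    return info
```

Calling `inspect_pnm(image_path)` on the file named in the error message, before it is cleaned up, would show whether the page image is empty, has an unreadable header, or stops short of the size its header declares.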

Another idea is catching the exception at that location and just skipping the file and seeing what happens. It looks like that would skip whole pages, but which pages get skipped could also be interesting insight.

Also, if the machine you're running on is memory constrained and perhaps the files where this happens contain many or very big images, the PPM image format is not compressed, so it does potentially consume a lot of memory. If you can check it on another machine with a different amount of memory and see if it gets better or gets worse, that would also be an interesting observation.
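
To put rough numbers on that, an uncompressed 8-bit RGB page costs width x height x 3 bytes. The sketch below assumes a US Letter page rendered at 200 dpi; both the page size and the dpi are assumptions for illustration, and `ppm_page_bytes` is a made-up helper name.

```python
def ppm_page_bytes(width_in: float, height_in: float, dpi: int = 200) -> int:
    """Raster size in bytes of one 8-bit RGB page, ignoring the short header."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3


letter = ppm_page_bytes(8.5, 11)
print(f"one US Letter page at 200 dpi: {letter / 1e6:.1f} MB")   # 11.2 MB
print(f"800 such pages held at once:   {800 * letter / 1e9:.1f} GB")  # 9.0 GB
```

The 800-page figure is an upper bound for the case where every page image exists at the same time; pages processed in chunks or written to disk would need less memory but a comparable amount of temporary disk space.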

@scanny scanny closed this as completed Dec 17, 2024
@sidatcd

sidatcd commented Dec 17, 2024

@scanny
Is there a way not to extract selected image formats?

@sidatcd

sidatcd commented Dec 17, 2024

One thing I noticed was that I couldn't replicate this on Mac, only in Linux containers or custom Python containers. Could it be a specific version of poppler-utils?

@scanny
Collaborator

scanny commented Dec 17, 2024

@sidatcd Regarding the images, that threw me at first too. But what's happening in this step is the entire PDF document is being rendered to a series of "page" images in preparation for "vision" processing by the layout/object-detection model; not "extracting" embedded images per se.

poppler is used for this job and, possibly because of when it was originally written, it uses the now-uncommon PPM format for rendering those pages. PPM does have the advantage that it is uncompressed (so it is faster, with no expensive compression step). And, it turns out, it is supported by Pillow (PIL). In any case, all those page images are going to be in PPM format, so we can't just filter out PPMs.

The code that does this page rendering is here, and it uses pdf2image (a Python wrapper around poppler's command-line tools) for the job:
https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layout.py#L400


Regarding the Linux/Mac discrepancy:

  • That could be why I can't reproduce it, because I only have a Mac handy.
  • Versions are absolutely worth checking, I'd say poppler-utils, pdf2image (Python package), and Pillow (PIL) are all worth checking.
  • Definitely check differences in available memory. I've seen mention that poppler may just fail to render and not throw an error if it runs out of memory; it could be running out of memory mid-page and writing a truncated (and thereby corrupted) PPM file.
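
The version checks in the list above could be gathered in one go on each machine, along these lines. This is a sketch: `collect_versions` is a hypothetical helper, the package names are the PyPI distribution names, and the poppler version is taken from the first line that `pdftoppm -v` prints.

```python
import shutil
import subprocess
from importlib import metadata


def collect_versions(
    packages=("pillow", "pdf2image", "unstructured", "unstructured-inference"),
) -> dict:
    """Return installed versions of the given packages plus the pdftoppm banner."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    exe = shutil.which("pdftoppm")
    if exe is None:
        versions["pdftoppm"] = "not found on PATH"
    else:
        proc = subprocess.run([exe, "-v"], capture_output=True, text=True)
        # poppler tools print their version banner to stderr
        banner = (proc.stderr or proc.stdout).strip().splitlines()
        versions["pdftoppm"] = banner[0] if banner else "unknown"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Comparing this output between the Mac where the error does not occur and the Linux container where it does would narrow down which component differs.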

You could also try running the pdftoppm command-line program (part of poppler) on Linux against a problematic file and see what you get, possibly check hashes against what is produced on the Mac for the same file. I'd say that's definitely a good avenue to pursue.
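
A minimal sketch of that comparison follows. The function names `render_pages` and `hash_dir` are invented for this example, and 200 dpi is an assumed resolution. `pdftoppm` writes one `.ppm` file per page using the given output prefix, and `-r` sets the resolution.

```python
import hashlib
import subprocess
from pathlib import Path


def render_pages(pdf_path: str, out_dir: str, dpi: int = 200) -> None:
    """Render every page of pdf_path to out_dir/page-N.ppm using pdftoppm."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    prefix = str(Path(out_dir) / "page")
    subprocess.run(["pdftoppm", "-r", str(dpi), pdf_path, prefix], check=True)


def hash_dir(out_dir: str, pattern: str = "*.ppm") -> dict:
    """Map each matching file name in out_dir to its SHA-256 hex digest."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(out_dir).glob(pattern))
    }
```

Running `render_pages(...)` followed by `hash_dir(...)` on both machines gives two mappings to compare. Differing hashes alone do not prove corruption, since different poppler versions may render slightly different pixels; a missing page, a zero-byte file, or a file much smaller than its neighbours is the stronger signal.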


8 participants