bug/PIL.UnidentifiedImageError: cannot identify image file #3102

udit-pandey-1 · 2024-05-26T10:11:51Z

Describe the bug
I am getting the following error when extracting text and images from a PDF:
PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpjy0tjjjd/2c2e244f-8f8e-46de-a7bc-2ecfbaa254ea-566.ppm'

To Reproduce
The way I am using unstructured is:

Expected behavior
Ideally, all the images in the PDF should be extracted.
If a failure does occur, image extraction should not abort the complete document (in my case, the PDF has 800 pages and it fails after going through 600 pages). For the layouts where image extraction failed, the metadata could carry a flag indicating the failure together with the reason for it. It should also be possible to get the elements despite such failures, through a flag passed when calling partition().

Environment Info

Any kind of quickfix to get elements even in case of failure would also be appreciated.
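
To illustrate the kind of quickfix meant here, below is a minimal sketch, not something unstructured provides. It assumes the caller supplies a hypothetical `partition_chunk(first_page, last_page)` function that partitions one page range (for example by splitting the PDF first and calling partition() on each part). The names `PartitionReport` and `partition_in_chunks` are made up for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class PartitionReport:
    """Elements that were extracted, plus a record of the page ranges that failed."""
    elements: List[Any] = field(default_factory=list)
    failures: List[Tuple[int, int, str]] = field(default_factory=list)


def partition_in_chunks(
    total_pages: int,
    chunk_size: int,
    partition_chunk: Callable[[int, int], List[Any]],  # hypothetical, supplied by the caller
) -> PartitionReport:
    """Call partition_chunk(first, last) per page range (1-based, inclusive).

    A failing range is recorded with its error message instead of aborting the run.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    report = PartitionReport()
    for first in range(1, total_pages + 1, chunk_size):
        last = min(first + chunk_size - 1, total_pages)
        try:
            report.elements.extend(partition_chunk(first, last))
        except Exception as exc:  # deliberately broad: a failure should only cost this chunk
            report.failures.append((first, last, f"{type(exc).__name__}: {exc}"))
    return report
```

With something like this, an 800-page document that fails around page 600 would still yield the elements of every other chunk, and `report.failures` would list which page ranges were lost and why.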

MthwRobinson · 2024-05-28T12:09:26Z

Hi @udit-pandey-1 - could you provide a URL that we could use to reproduce? I'd also give our SaaS API a try. Our unstructured-python-client library will split the PDF up and distribute it across multiple workers, which should give you faster processing times.

vegetableman · 2024-05-30T02:37:39Z

Hi @MthwRobinson, I got the above error in this file: https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf
Appreciate your efforts.

christinestraub · 2024-05-30T13:08:34Z

Hi @vegetableman, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get those errors with those versions.

$ pip install unstructured -U
$ pip install unstructured-inference -U

from unstructured.partition.auto import partition

elements = partition(
    url="https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)
print("\n\n".join([str(el) for el in elements]))

vegetableman · 2024-05-30T16:54:30Z

The latest versions worked for me 👍... I was using the specific versions mentioned here: #2566 (comment)
Thank you, Christine!

However, partition_pdf does not support loading PDF files through a url parameter, unless I am mistaken. I had to use the filename parameter instead.

christinestraub · 2024-05-30T17:23:23Z

Yes, as of now, partition_pdf does not support loading pdf files through a url parameter. Do we plan to do this? @MthwRobinson

MthwRobinson · 2024-05-30T17:33:06Z

We don't plan to add that in partition_pdf as of now, though I believe that works in partition and will detect the MIME type from the HTTP response.

vegetableman · 2024-05-31T03:23:29Z

@MthwRobinson that worked 👍. My bad, I missed the auto module. Thank you!

udit-pandey-1 · 2024-05-31T04:26:19Z

@christinestraub the issue is still occurring for me after upgrading the mentioned packages.

We are seeing this issue on Ubuntu 20.04.

udit-pandey-1 · 2024-05-31T07:24:27Z

here is a reference pdf file for it:
https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf

christinestraub · 2024-06-19T19:13:33Z

@udit-pandey-1, I tried to partition the reference PDF file on both macOS and Ubuntu (22.04). It worked as expected and I couldn't reproduce the error. Can you please try again?

Environment:

unstructured==0.14.6
unstructured-inference==0.7.35

Code:

from unstructured.partition.auto import partition

elements = partition(
    url="https://docs.oracle.com/en/database/other-databases/essbase/21/essdm/database-administrators-guide-oracle-essbase.pdf",
    include_page_breaks=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
    skip_infer_table_types=[],
)

print("\n\n".join([str(el) for el in elements]))

udit-pandey-1 · 2024-06-28T10:04:33Z

still the same @christinestraub

unstructured==0.14.6
unstructured-inference==0.7.36

christinestraub · 2024-06-28T16:50:53Z

@udit-pandey-1 Can you confirm that you installed the following system dependencies?

libmagic-dev (filetype detection)
poppler-utils (images and PDFs)

udit-pandey-1 · 2024-07-01T10:19:11Z

libmagic-dev wasn't there. I installed it and then used the same code as above. It still failed with the same error.

sanyamjain0315 · 2024-08-12T14:27:13Z

Has there been any progress on this issue? I am facing the same problem, even after having tried everything.

tpakeman · 2024-09-03T15:39:18Z

Hi there, I'm having the same issue:
Python 3.10.12

unstructured                     0.14.6
unstructured-client              0.25.6
unstructured-inference           0.7.35
unstructured.pytesseract         0.3.13

Unfortunately I can't share the documents as they contain proprietary information.

This is happening for every PDF in a folder of 50. All were generated from HTML files by downloading with Chrome and saving as PDF.

Stacktrace:

---------------------------------------------------------------------------
UnidentifiedImageError                    Traceback (most recent call last)
<ipython-input-21-a26b75af5795> in <cell line: 4>()
      4 for k in data.keys():
      5   fpath = f"/path/to/file/{k}"
----> 6   els = partition_pdf(filename=fpath, 
      7                       max_partition=1500,
      8                       chunking_strategy='by_title',

10 frames
/usr/local/lib/python3.10/dist-packages/unstructured/documents/elements.py in wrapper(*args, **kwargs)
    603             unique_element_ids: bool = call_args.get("unique_element_ids", False)
    604             if unique_element_ids is False:
--> 605                 elements = assign_and_map_hash_ids(elements)
    606 
    607             return elements

/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py in wrapper(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py in wrapper(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/unstructured/chunking/dispatch.py in wrapper(*args, **kwargs)
     72 
     73         # -- call the partitioning function to get the elements --
---> 74         elements = func(*args, **kwargs)
     75 
     76         # -- look for a chunking-strategy argument --

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    208         form_extraction_skip_tables=form_extraction_skip_tables,
    209         **kwargs,
--> 210     )
    211 
    212 

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, languages, metadata_last_modified, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    344     if isinstance(file, bytes):
    345         file = io.BytesIO(file)
--> 346     return _partition_pdf_with_pdfminer(
    347         filename=filename,
    348         file=file,

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in _partition_pdf_or_image_with_ocr(filename, file, include_page_breaks, languages, ocr_languages, is_image, metadata_last_modified, starting_page_number, **kwargs)
    894             tmp_element = element
    895             tmp_text = element.text
--> 896             tmp_coords = element.metadata.coordinates
    897         elif tmp_element and check_coords_within_boundary(
    898             coordinates=element.metadata.coordinates,

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf_image/pdf_image_utils.py in convert_pdf_to_images(filename, file, chunk_size)
    414     date_from_file_object: bool = False,
    415 ) -> str | None:
--> 416     last_modification_date = None
    417     if not file and filename:
    418         last_modification_date = get_last_modified_date(filename=filename)

/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in convert_from_path(pdf_path, dpi, output_folder, first_page, last_page, fmt, jpegopt, thread_count, userpw, ownerpw, use_cropbox, strict, transparent, single_file, output_file, poppler_path, grayscale, size, paths_only, use_pdftocairo, timeout, hide_annotations)
    267                 )
    268             else:
--> 269                 images += parse_buffer_func(data)
    270     finally:
    271         if auto_temp_dir:

/usr/local/lib/python3.10/dist-packages/pdf2image/parsers.py in parse_buffer_to_ppm(data)
     26         size_x, size_y = tuple(size.split(b" "))
     27         file_size = len(code) + len(size) + len(rgb) + 3 + int(size_x) * int(size_y) * 3
---> 28         images.append(Image.open(BytesIO(data[index : index + file_size])))
     29         index += file_size
     30 

/usr/local/lib/python3.10/dist-packages/PIL/Image.py in open(fp, mode, formats)
   3281             raise TypeError(msg) from e
   3282     else:
-> 3283         rawmode = mode
   3284     if mode in ["1", "L", "I", "P", "F"]:
   3285         ndmax = 2

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7e086492d030>

sidatcd · 2024-12-11T01:10:03Z

Same error:
Certain files with a .ppm extension throw an UnidentifiedImageError with the hi_res strategy.

Packages

python 3.12
unstructured==0.16.10
unstructured-client==0.28.1
unstructured-inference==0.8.1
pytesseract==0.3.13
pillow==11.0.0
unstructured.pytesseract==0.3.13

scanny · 2024-12-16T21:18:40Z

Closing as inactive. Cannot reproduce, assumed resolved. If you're still seeing this and can provide a file that reproduces the error I'll take another look.

sidatcd · 2024-12-17T03:56:46Z

Same error
python 3.12

Packages

unstructured==0.16.10
unstructured-client==0.28.1
unstructured-inference==0.8.1
pytesseract==0.3.13
pillow==11.0.0
unstructured.pytesseract==0.3.13

CODE:

from PIL import Image as PILImage
from PIL import ImageFile
from unstructured.partition.pdf import partition_pdf

# Allow Pillow to load image files whose data is truncated
ImageFile.LOAD_TRUNCATED_IMAGES = True

# filename is the path to the PDF being processed
partitions = partition_pdf(
    url=None,
    filename=filename,
    strategy="hi_res",
    extract_images_in_pdf=True,
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,
    max_partition=None,
    unique_element_ids=True,
    extract_image_block_output_dir="/tmp",  # Temporary directory to store images
)

Not for all files, but for at least 30% of them,
I get the same error message:
PIL.UnidentifiedImageError: cannot identify image file <temporary ppm file>

I can't share the files as they contain confidential data.

scanny · 2024-12-17T04:48:52Z

Okay, good at least you are able to still reproduce it. I have an idea where to look.

scanny · 2024-12-17T05:08:20Z

@sidatcd can you provide me a fresh stack-trace? I can't make any sense of the one earlier in the thread, possibly because of its age.

Also, do you have reason to believe the problematic PDF files on your side contain PPM images? Those are a pretty old format, like 1980s era, but they seem to be the format it is complaining about.

sidatcd · 2024-12-17T06:00:16Z

@scanny
Ha,
My initial thought was the same.
But I saw the same error on fairly recent PDFs as well.

relevant trace

line 48, in partition_document
    partitions = partition_pdf(
  File "/var/task/unstructured/documents/elements.py", line 581, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/file_utils/filetype.py", line 725, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/file_utils/filetype.py", line 683, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/var/task/unstructured/partition/pdf.py", line 209, in partition_pdf
    return partition_pdf_or_image(
  File "/var/task/unstructured/partition/pdf.py", line 305, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/var/task/unstructured/utils.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "/var/task/unstructured/partition/pdf.py", line 588, in _partition_pdf_or_image_local
    inferred_document_layout = process_file_with_model(
  File "/var/task/unstructured_inference/inference/layout.py", line 376, in process_file_with_model
    else DocumentLayout.from_file(
  File "/var/task/unstructured_inference/inference/layout.py", line 74, in from_file
    with Image.open(image_path) as image:
  File "/var/task/PIL/Image.py", line 3536, in open
    raise UnidentifiedImageError(msg)

Personally I would be happy if ppm files are not identified.

scanny · 2024-12-17T06:01:16Z

Okay, looks like PPMs are coming from pdftoppm (part of poppler) as part of the process, so that explains the ppm bit anyway.

scanny · 2024-12-17T19:44:47Z

@sidatcd Unfortunately I am unable to reproduce this with the PDF earlier in the thread. I'll have to close it for now because it's not actionable. If you are able to find a shareable document that produces the error we can reopen and I'll have another look.

Maybe you can narrow it down somehow, for example by capturing the file that's causing the error: print out its path and copy the offending file to a new location where you can inspect it and/or post it. The place to do that in your local install is probably around this location:
https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layout.py#L74
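
As a rough sketch of that capture step: the helper below wraps whatever call opens the file and, if the call raises, copies the file somewhere safe before re-raising. The names `open_or_keep_copy`, `opener`, and `keep_dir` are made up for illustration and are not part of unstructured-inference. It assumes the temporary file still exists at the moment the exception is raised, which is why it would have to sit right at the failing call.

```python
import shutil
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def open_or_keep_copy(image_path: str, opener: Callable[[str], T], keep_dir: str) -> T:
    """Call opener(image_path); if it raises, copy the file to keep_dir and re-raise."""
    try:
        return opener(image_path)
    except Exception:
        src = Path(image_path)
        if src.is_file():
            dest_dir = Path(keep_dir)
            dest_dir.mkdir(parents=True, exist_ok=True)
            dest = dest_dir / src.name
            shutil.copy2(src, dest)  # keep the offending file for later inspection
            print(f"kept failing file: {dest} ({dest.stat().st_size} bytes)")
        raise
```

In a local experiment, `opener` would be the image-opening call at the location linked above, and `keep_dir` any directory that outlives the temporary one.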

Knowing the type of that file, its size, whether it can be inspected or whether it's possibly corrupted or something, all those could be useful hints. Also whether it happens late in the file (when more memory has been consumed) or earlier.
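
One way to check a captured file for truncation, sketched under the assumption that it is a binary PPM (magic number "P6"): parse the header and compare the number of pixel bytes it promises with the number actually present. `check_ppm` and `PpmCheck` are illustrative names, not an existing API.

```python
from typing import NamedTuple


class PpmCheck(NamedTuple):
    width: int
    height: int
    maxval: int
    expected_bytes: int  # pixel bytes the header promises
    actual_bytes: int    # pixel bytes actually present
    truncated: bool


def check_ppm(data: bytes) -> PpmCheck:
    """Parse a binary PPM ("P6") header and compare promised vs. actual pixel bytes."""
    if data[:2] != b"P6":
        raise ValueError("not a binary PPM: missing 'P6' magic number")
    pos = 2
    values = []
    while len(values) < 3:  # width, height, maxval
        # skip whitespace and '#' comments between header tokens
        while pos < len(data) and (data[pos:pos + 1].isspace() or data[pos:pos + 1] == b"#"):
            if data[pos:pos + 1] == b"#":
                while pos < len(data) and data[pos:pos + 1] != b"\n":
                    pos += 1
            else:
                pos += 1
        start = pos
        while pos < len(data) and data[pos:pos + 1].isdigit():
            pos += 1
        if start == pos:
            raise ValueError("PPM header is incomplete or malformed")
        values.append(int(data[start:pos]))
    width, height, maxval = values
    pos += 1  # exactly one whitespace byte separates the header from the pixel data
    bytes_per_sample = 1 if maxval < 256 else 2
    expected = width * height * 3 * bytes_per_sample
    actual = max(len(data) - pos, 0)
    return PpmCheck(width, height, maxval, expected, actual, actual < expected)
```

Calling `check_ppm(Path(captured_file).read_bytes())` on a captured file would either raise (empty or non-PPM file) or report whether the pixel data is shorter than the header says it should be.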

Another idea is catching the exception at that location and just skipping the file and seeing what happens. It looks like that would skip whole pages, but which pages get skipped could also be interesting insight.

Also, if the machine you're running on is memory constrained and perhaps the files where this happens contain many or very big images, the PPM image format is not compressed, so it does potentially consume a lot of memory. If you can check it on another machine with a different amount of memory and see if it gets better or gets worse, that would also be an interesting observation.
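
To put a rough number on that, here is a small back-of-the-envelope sketch. It assumes 8-bit RGB (3 bytes per pixel) and uses 200 dpi and a US Letter page purely as illustrative values; the dpi actually used in a given setup may differ.

```python
def ppm_page_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Approximate size of one uncompressed 8-bit RGB page raster, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3  # 3 bytes per pixel, no compression


# US Letter page (8.5 x 11 in) at an assumed 200 dpi
per_page = ppm_page_bytes(8.5, 11, 200)
print(f"{per_page / 1e6:.1f} MB per page")  # 11.2 MB per page
```

At that rate, every hundred pages held at once would be on the order of a gigabyte, which is why available memory is worth comparing between machines.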

sidatcd · 2024-12-17T19:47:36Z

@scanny
Is there a way not to extract selected image formats?

sidatcd · 2024-12-17T19:52:19Z

One thing I noticed was that I couldn't replicate this on Mac, only on Linux containers or custom Python containers. Could it be a specific version of poppler-utils?

scanny · 2024-12-17T20:16:11Z

@sidatcd Regarding the images, that threw me at first too. But what's happening in this step is the entire PDF document is being rendered to a series of "page" images in preparation for "vision" processing by the layout/object-detection model; not "extracting" embedded images per se.

poppler is used for this job and, possibly because of when it was originally written, it uses the now-uncommon PPM format for rendering those pages. PPM does have the advantage that it is uncompressed (so faster, because there is no expensive compression step). And, it turns out, it is supported by Pillow (PIL). In any case, all those page images are going to be in PPM format, so we can't just filter out PPMs.

The code that does this page rendering is here, and it uses pdf2image (which is a wrapper around poppler's command-line tools) for the job:
https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layout.py#L400

Regarding the Linux/Mac discrepancy:

That could be why I can't reproduce it, because I only have a Mac handy.
Versions are absolutely worth checking, I'd say poppler-utils, pdf2image (Python package), and Pillow (PIL) are all worth checking.
Definitely check differences in available memory. I've seen mention that poppler may just fail to render and not throw an error if it runs out of memory; it could be running out of memory mid-page and writing a truncated (and thereby corrupted) PPM file.
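
For the version comparison, a small sketch that could be run on both the Mac and the Linux container so the outputs can be diffed. It only uses the standard library; `collect_versions` is an illustrative name, and it assumes `pdftoppm` is on the PATH when poppler-utils is installed.

```python
import shutil
import subprocess
from importlib import metadata
from typing import Dict, Optional


def collect_versions() -> Dict[str, Optional[str]]:
    """Report installed versions of the relevant Python packages and of pdftoppm."""
    versions: Dict[str, Optional[str]] = {}
    for dist in ("unstructured", "unstructured-inference", "pdf2image", "pillow"):
        try:
            versions[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            versions[dist] = None  # package not installed in this environment
    exe = shutil.which("pdftoppm")
    if exe is None:
        versions["pdftoppm"] = None  # poppler-utils not found on PATH
    else:
        # pdftoppm -v normally writes its version banner to stderr
        proc = subprocess.run([exe, "-v"], capture_output=True, text=True)
        banner = (proc.stderr or proc.stdout).strip()
        versions["pdftoppm"] = banner.splitlines()[0] if banner else None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```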

You could also try running the pdftoppm command-line program (part of poppler) on Linux against a problematic file and see what you get, possibly check hashes against what is produced on the Mac for the same file. I'd say that's definitely a good avenue to pursue.
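
A sketch of that comparison, assuming `pdftoppm` from poppler-utils is on the PATH: render the pages into a directory, then hash each output file so the two machines' results can be compared. `render_pages` and `hash_files` are illustrative names, and the 200 dpi default is an assumption. Note that different poppler versions may legitimately render slightly different pixels, so differing hashes are a hint rather than proof of corruption; a file that is much smaller than its siblings is the stronger signal.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path
from typing import Dict


def render_pages(pdf_path: str, out_dir: str, dpi: int = 200) -> None:
    """Render every page of pdf_path to PPM files named page-*.ppm in out_dir."""
    exe = shutil.which("pdftoppm")
    if exe is None:
        raise RuntimeError("pdftoppm not found on PATH (install poppler-utils)")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # pdftoppm [options] PDF-file PPM-root; -r sets the resolution in dpi
    subprocess.run([exe, "-r", str(dpi), pdf_path, str(Path(out_dir) / "page")], check=True)


def hash_files(out_dir: str, pattern: str = "*.ppm") -> Dict[str, str]:
    """Map each matching file name in out_dir to the SHA-256 digest of its contents."""
    digests = {}
    for path in sorted(Path(out_dir).glob(pattern)):
        digests[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests
```

Running `render_pages("problem.pdf", "out")` followed by `hash_files("out")` on each machine gives two dictionaries that can be diffed page by page.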

udit-pandey-1 added the bug label May 26, 2024

udit-pandey-1 changed the title ~~bug/RuntimeError: Some images were not loaded.~~ bug/PIL.UnidentifiedImageError: cannot identify image file May 26, 2024

udit-pandey-1 mentioned this issue May 26, 2024

RuntimeError: Some images were not loaded. Unstructured-IO/unstructured-inference#354

Closed

scanny added the pdf label May 27, 2024

MthwRobinson added the awaiting-response label May 30, 2024

MthwRobinson added needs follow up and removed awaiting-response labels Jun 10, 2024

christinestraub added the awaiting-response label Jun 21, 2024

scanny closed this as completed Dec 16, 2024

scanny reopened this Dec 17, 2024

scanny self-assigned this Dec 17, 2024

scanny mentioned this issue Dec 17, 2024

UnidentifiedImageError: cannot identify image file '/tmp/tmpa3o9dj66/b5d7995b-82db-4257-bdcb-20795a00c72b-01.ppm' #3474

Closed

scanny closed this as completed Dec 17, 2024
