bug/PIL.UnidentifiedImageError: cannot identify image file #3102
Comments
Hi @udit-pandey-1 - could you provide a URL that we could use to reproduce? I'd also give our SaaS API a try. Our |
Hi @MthwRobinson, I got the above error in this file: https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf |
Hi @vegetableman, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get those errors with those versions.
|
The latest versions worked for me 👍... I was using the specific versions mentioned here: #2566 (comment) However, |
Yes, as of now, |
We don't plan to add that in |
@MthwRobinson that worked 👍 . My bad. Missed the module |
@christinestraub the issue is still occurring for me after upgrading the mentioned packages. We are seeing this issue on Ubuntu 20.04. |
here is a reference pdf file for it: |
@udit-pandey-1, I tried to partition the reference PDF file on both macOS and Ubuntu (22.04). It worked as expected and I couldn't reproduce the error. Can you please try again? Environment:
Code:
|
still the same @christinestraub
|
@udit-pandey-1 I was wondering if you are sure that you installed the following system dependencies?
|
libmagic-dev wasn't there. I installed it and then used the same code as above. It still failed with the same error. |
Has there been any progress on this issue? I am facing the same problem, even after having tried everything. |
Hi there I'm having the same issue:
Unfortunately I can't share the documents as they contain proprietary information. This is happening for every PDF in a folder of 50. All were generated from HTML files by downloading with Chrome and saving with PDF. Stacktrace:
|
Same error: Packages
|
Closing as inactive. Cannot reproduce, assumed resolved. If you're still seeing this and can provide a file that reproduces the error I'll take another look. |
Same error Packages
CODE:
Not for all files, but at least 30% of them. I can't share the files as they contain confidential data. |
Okay, good at least you are able to still reproduce it. I have an idea where to look. |
@sidatcd can you provide me a fresh stack trace? I can't make any sense of the one earlier in the thread, possibly because of its age. Also, do you have reason to believe the problematic PDF files on your side contain PPM images? Those are a pretty old format, 1980s-era, but seem to be the format it is complaining about. |
@scanny relevant trace line 48, in partition_document
partitions = partition_pdf(
File "/var/task/unstructured/documents/elements.py", line 581, in wrapper
elements = func(*args, **kwargs)
File "/var/task/unstructured/file_utils/filetype.py", line 725, in wrapper
elements = func(*args, **kwargs)
File "/var/task/unstructured/file_utils/filetype.py", line 683, in wrapper
elements = func(*args, **kwargs)
File "/var/task/unstructured/chunking/dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
File "/var/task/unstructured/partition/pdf.py", line 209, in partition_pdf
return partition_pdf_or_image(
File "/var/task/unstructured/partition/pdf.py", line 305, in partition_pdf_or_image
elements = _partition_pdf_or_image_local(
File "/var/task/unstructured/utils.py", line 216, in wrapper
return func(*args, **kwargs)
File "/var/task/unstructured/partition/pdf.py", line 588, in _partition_pdf_or_image_local
inferred_document_layout = process_file_with_model(
File "/var/task/unstructured_inference/inference/layout.py", line 376, in process_file_with_model
else DocumentLayout.from_file(
File "/var/task/unstructured_inference/inference/layout.py", line 74, in from_file
with Image.open(image_path) as image:
File "/var/task/PIL/Image.py", line 3536, in open
raise UnidentifiedImageError(msg)
Personally I would be happy if ppm files are not identified. |
Okay, looks like PPMs are coming from |
@sidatcd Unfortunately I am unable to reproduce this with the PDF earlier in the thread. I'll have to close it for now because it's not actionable. If you are able to find a shareable document that produces the error we can reopen and I'll have another look.
Maybe you can narrow it down somehow, for example by capturing the file that's causing the error: print out the path and copy the offending file to a new location where you can inspect it and/or post it, probably around this location in your local install:
Knowing the type of that file, its size, and whether it can be inspected or is possibly corrupted could all be useful hints. So could whether it happens late in the file (when more memory has been consumed) or earlier.
Another idea is catching the exception at that location, skipping the file, and seeing what happens. It looks like that would skip whole pages, but which pages get skipped could also be an interesting insight.
Also, if the machine you're running on is memory constrained and the files where this happens contain many or very big images, note that the PPM image format is not compressed, so it can consume a lot of memory. If you can check it on another machine with a different amount of memory and see whether it gets better or worse, that would also be an interesting observation. |
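For anyone trying the "capture the offending file" suggestion, here is a rough sketch of what that could look like. `inspect_suspect_file` and the quarantine directory are hypothetical names, not part of unstructured or Pillow; the idea is only to record the file's size and first bytes and keep a copy before the temporary directory is cleaned up. The one format fact it relies on is that Netpbm files (PPM included) start with a two-byte magic number, `P1` through `P6`.

```python
import shutil
from pathlib import Path

# Magic numbers of the Netpbm family; "P6" is binary PPM, "P3" is ASCII PPM.
NETPBM_MAGIC = {b"P1", b"P2", b"P3", b"P4", b"P5", b"P6"}


def inspect_suspect_file(path, quarantine_dir):
    """Collect basic facts about a file and keep a copy for later inspection.

    Hypothetical helper for debugging; not an unstructured or Pillow API.
    """
    src = Path(path)
    size = src.stat().st_size
    with src.open("rb") as f:
        header = f.read(16)  # enough to see the magic number and start of header

    dest_dir = Path(quarantine_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copy_path = dest_dir / src.name
    shutil.copy2(src, copy_path)  # preserve the file before temp cleanup

    return {
        "path": str(src),
        "size_bytes": size,
        "header": header,
        "looks_like_netpbm": header[:2] in NETPBM_MAGIC,
        "is_empty": size == 0,
        "copy": str(copy_path),
    }
```

One way to use it would be to call it from an `except` block around the failing `Image.open(image_path)` call in a local install and print the returned dict. An empty or truncated file would point toward the rendering step rather than Pillow.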
@scanny |
One thing I noticed was that I couldn't replicate this on Mac, only in Linux containers or custom Python containers. Could it be a specific version of poppler-utils? |
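To compare poppler versions between the Mac and the Linux containers, something like the sketch below might help. It assumes the page images are produced by poppler's `pdftoppm` tool, which this thread does not confirm; `pdftoppm_version` is a hypothetical helper name. `pdftoppm -v` normally prints its version banner to stderr, so the sketch reads stderr first and falls back to stdout.

```python
import shutil
import subprocess


def pdftoppm_version():
    """Return the first line of `pdftoppm -v`, or None if it is not on PATH.

    Hypothetical helper; assumes poppler-utils provides the `pdftoppm` binary.
    """
    exe = shutil.which("pdftoppm")
    if exe is None:
        return None
    proc = subprocess.run(
        [exe, "-v"], capture_output=True, text=True, check=False, timeout=10
    )
    # The version banner usually goes to stderr; fall back to stdout.
    output = (proc.stderr or proc.stdout).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    print(pdftoppm_version())
```

Running this in both environments and posting the two outputs would show whether the failing containers ship a different poppler build.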
@sidatcd Regarding the images, that threw me at first too. But what's happening in this step is the entire PDF document is being rendered to a series of "page" images in preparation for "vision" processing by the layout/object-detection model; not "extracting" embedded images per se.
The code that does this page rendering is here, and it uses Regarding the Linux/Mac discrepancy:
You could also try running the |
Describe the bug
I am getting the following error when extracting text and images from pdf:
PIL.UnidentifiedImageError: cannot identify image file '/tmp/tmpjy0tjjjd/2c2e244f-8f8e-46de-a7bc-2ecfbaa254ea-566.ppm'
To Reproduce
The way I am using unstructured is:
Expected behavior
Ideally, all the images in the pdf must be extracted.
If there is a failure, image extraction should not abort abruptly for the complete document (in my case, the PDF has 800 pages and it fails after going through 600 pages). For the layouts where image extraction failed, a flag could be added to the metadata that conveys that the image extraction failed, along with the reason for it. We should be able to get elements even in case of failures through a flag that is passed when calling partition().
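To make the request concrete, here is a minimal sketch of the behavior I have in mind. None of these names exist in unstructured: `PageResult`, `process_pages`, and the `skip_failures` flag are all hypothetical and only illustrate recording a per-page failure flag and reason instead of aborting the whole document.

```python
from dataclasses import dataclass, field


@dataclass
class PageResult:
    """Outcome for one page; `failed` and `failure_reason` act as metadata flags."""
    page_number: int
    elements: list = field(default_factory=list)
    failed: bool = False
    failure_reason: str = ""


def process_pages(page_paths, process_page, skip_failures=True):
    """Run `process_page` on each path, optionally tolerating per-page errors.

    Hypothetical sketch; `process_page` stands in for whatever does the real work.
    """
    results = []
    for page_number, page_path in enumerate(page_paths, start=1):
        try:
            elements = list(process_page(page_path))
            results.append(PageResult(page_number, elements))
        except Exception as exc:
            if not skip_failures:
                raise  # current behavior: one bad page fails the document
            reason = f"{type(exc).__name__}: {exc}"
            results.append(
                PageResult(page_number, failed=True, failure_reason=reason)
            )
    return results
```

With `skip_failures=True`, the elements from the 600 pages that did work would still be returned, and the failed pages would be identifiable from their flags.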
Environment Info
Any kind of quickfix to get elements even in case of failure would also be appreciated.