Using .DOCX format in cloud - suggestion on the below error? #410

Closed
acsankar opened this issue Nov 22, 2024 · 6 comments · Fixed by #432 or #650
Labels: docx (issue related to docx backend), question (Further information is requested)


@acsankar

I am trying to use this in the cloud and just want to convert a document to Markdown without images. I assume the error below occurs when the document contains images. Any suggestions on how to fix this?

doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.DOCX],
    format_options={
        InputFormat.DOCX: WordFormatOption(pipeline_cls=SimplePipeline),
    },
)

I am getting the error below:
---> 30 result = doc_converter.convert(temp_file.name)

18 frames
/usr/local/lib/python3.10/dist-packages/PIL/ImageFile.py in load(self)
375 if loader is None:
376 msg = f"cannot find loader for this {self.format} file"
--> 377 raise OSError(msg)
378 image = loader.load(self)
379 assert image is not None

OSError: cannot find loader for this WMF file

acsankar added the question label (Further information is requested) on Nov 22, 2024
@PeterStaar-IBM
Contributor

@acsankar I want to help you here, but I think we need a bit more context. Can you give us the full stacktrace?

@acsankar
Author

> @acsankar I want to help you here, but I think we need a bit more context. Can you give us the full stacktrace?

Thanks for the reply. The full stack trace is below.
I am using Colab and trying to access documents in a Google Cloud Storage bucket. As with python-docx, I tried passing an in-memory byte stream (io.BytesIO), but Docling expects a .docx file, so I write the data to a temp file and load that into Docling. This works for a few other documents but fails for one 200-page document, so I suspect the pictures in that document are the cause. Is there a way to skip images for the .docx format, the way Docling can suppress image reading while converting to a Docling document? Please let me know if this helps.

<tempfile._TemporaryFileWrapper object at 0x7bfe9c3b00d0>
/tmp/tmpxytgsj20.docx

OSError Traceback (most recent call last)
in <cell line: 46>()
44
45
---> 46 result = doc_converter.convert(temp_file.name)
47 markdown_file_np = result.document.export_to_markdown()
48 print(markdown_file_np)

18 frames
/usr/local/lib/python3.10/dist-packages/PIL/ImageFile.py in load(self)
375 if loader is None:
376 msg = f"cannot find loader for this {self.format} file"
--> 377 raise OSError(msg)
378 image = loader.load(self)
379 assert image is not None

OSError: cannot find loader for this WMF file
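Editor's sketch: one way to check the suspicion that embedded pictures are the cause, without running the conversion. A .docx or .pptx file is a ZIP archive whose embedded images normally live under a `media/` folder, so listing the members with a `.wmf` or `.emf` suffix shows whether the document contains metafiles. The function name `find_metafile_media` and the suffix set are assumptions made for illustration; they are not part of Docling, and a suffix check will miss a metafile stored under a different extension.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumption: the vector metafile formats that Pillow usually cannot
# rasterize outside Windows.
UNSUPPORTED_SUFFIXES = {".wmf", ".emf"}


def find_metafile_media(source):
    """Return the archive member names of embedded WMF/EMF images.

    `source` is a path or a binary file-like object for a .docx or .pptx file.
    """
    with zipfile.ZipFile(source) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Office packages keep embedded images under word/media/ or ppt/media/.
            if "/media/" in name
            and PurePosixPath(name).suffix.lower() in UNSUPPORTED_SUFFIXES
        )
```

Calling `find_metafile_media(temp_file.name)` on the failing file should return a non-empty list if WMF or EMF pictures are present.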

Below is the code:
# Convert to Markdown
import tempfile

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, WordFormatOption
from docling.pipeline.simple_pipeline import SimplePipeline
from google.cloud import storage

storage_client = storage.Client()

# file_path, bucket_name, source_folder and file_name are assumed to be defined earlier
gcs_path = file_path
blob_name = f"{source_folder}/{file_name}"

bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)

print(bucket)
print(blob_name)

# Docling is given a file path here, so download the blob to a temporary .docx file
temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.docx')
print(temp_file)
blob.download_to_filename(temp_file.name)

doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.DOCX],
    format_options={
        InputFormat.DOCX: WordFormatOption(pipeline_cls=SimplePipeline),
    },
)

result = doc_converter.convert(temp_file.name)
markdown_file_np = result.document.export_to_markdown()
print(markdown_file_np)
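Editor's sketch: the code above creates the temporary file with `delete=False` and never removes it, which can accumulate files in a long-running cloud job. A minimal way to scope the temp file, assuming the document bytes are already in memory; the helper name `temp_docx` is hypothetical and uses only the standard library.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_docx(data: bytes):
    """Write `data` to a temporary .docx file, yield its path, then delete it."""
    handle = tempfile.NamedTemporaryFile(delete=False, suffix=".docx")
    try:
        handle.write(data)
        # Close to flush the bytes and release the handle before others open the path.
        handle.close()
        yield handle.name
    finally:
        handle.close()  # no-op if already closed
        os.unlink(handle.name)
```

With the google-cloud-storage client this could be used as `with temp_docx(blob.download_as_bytes()) as path: result = doc_converter.convert(path)`, so the file is removed even when the conversion raises.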

@maxmnemonic
Contributor

I think the most likely problem here is that the Word file includes an image whose blob can't be loaded by the PIL library.
The error should trigger even if the file were fully local.

@acsankar, any chance you could make an example file for this error?
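Editor's sketch: a general pattern for tolerating pictures that the image library cannot load, so that one bad image does not abort the whole conversion. This is not necessarily what the fix in the linked PR does. The names `decode_pictures` and `fake_decode` are hypothetical, and `fake_decode` merely stands in for a real loader such as PIL, whose load errors (including the one in the traceback) are `OSError` instances.

```python
def decode_pictures(blobs, decode):
    """Decode (name, bytes) pairs, skipping pictures whose decoder raises OSError.

    `decode` is a hypothetical callable standing in for the real image loader.
    Returns (decoded, skipped), where skipped records the name and error text.
    """
    decoded, skipped = [], []
    for name, blob in blobs:
        try:
            decoded.append((name, decode(blob)))
        except OSError as exc:
            # Record the failure and continue with the remaining pictures.
            skipped.append((name, str(exc)))
    return decoded, skipped


def fake_decode(blob):
    # Stand-in decoder: rejects blobs starting with the placeable-WMF magic bytes.
    if blob.startswith(b"\xd7\xcd\xc6\x9a"):
        raise OSError("cannot find loader for this WMF file")
    return len(blob)
```

Keeping the skipped list, rather than silently dropping pictures, makes it possible to report which images were left out of the converted document.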

@maxmnemonic maxmnemonic mentioned this issue Nov 25, 2024
3 tasks
@maxmnemonic
Contributor

Draft of the PR that should fix the error: #432

@maxmnemonic
Contributor

@acsankar The fresh release, Docling 2.7.1, includes the fixes!

@Tendo33
Contributor

Tendo33 commented Dec 10, 2024

This error also occurs with PPTX files. I am currently using these versions:

docling                   2.10.0
docling-core              2.9.0
docling-ibm-models        2.0.7
docling-parse             3.0.0

The error message is:

    conv_res = pptx_converter.convert(pptx_file_path)
  File "/opt/mfapi/lib/python3.10/site-packages/pydantic/validate_call_decorator.py", line 59, in wrapper_function
    return validate_call_wrapper(*args, **kwargs)
  File "/opt/mfapi/lib/python3.10/site-packages/pydantic/_internal/_validate_call.py", line 81, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
  File "/opt/mfapi/lib/python3.10/site-packages/docling/document_converter.py", line 172, in convert
    return next(all_res)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/document_converter.py", line 193, in convert_all
    for conv_res in conv_res_iter:
  File "/opt/mfapi/lib/python3.10/site-packages/docling/document_converter.py", line 228, in _convert
    for item in map(
  File "/opt/mfapi/lib/python3.10/site-packages/docling/document_converter.py", line 269, in _process_document
    conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/document_converter.py", line 292, in _execute_pipeline
    conv_res = pipeline.execute(in_doc, raises_on_error=raises_on_error)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/pipeline/base_pipeline.py", line 52, in execute
    raise e
  File "/opt/mfapi/lib/python3.10/site-packages/docling/pipeline/base_pipeline.py", line 44, in execute
    conv_res = self._build_document(conv_res)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/pipeline/simple_pipeline.py", line 41, in _build_document
    conv_res.document = conv_res.input._backend.convert()
  File "/opt/mfapi/lib/python3.10/site-packages/docling/backend/mspowerpoint_backend.py", line 97, in convert
    doc = self.walk_linear(self.pptx_obj, doc)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/backend/mspowerpoint_backend.py", line 406, in walk_linear
    handle_shapes(shape, parent_slide, slide_ind, doc)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/backend/mspowerpoint_backend.py", line 384, in handle_shapes
    self.handle_pictures(shape, parent_slide, slide_ind, doc)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/backend/mspowerpoint_backend.py", line 285, in handle_pictures
    image=ImageRef.from_pil(image=pil_image, dpi=im_dpi),
  File "/opt/mfapi/lib/python3.10/site-packages/docling_core/types/doc/document.py", line 484, in from_pil
    image.save(buffered, format="PNG")
  File "/opt/mfapi/lib/python3.10/site-packages/PIL/Image.py", line 2528, in save
    self._ensure_mutable()
  File "/opt/mfapi/lib/python3.10/site-packages/PIL/Image.py", line 639, in _ensure_mutable
    self._copy()
  File "/opt/mfapi/lib/python3.10/site-packages/PIL/Image.py", line 632, in _copy
    self.load()
  File "/opt/mfapi/lib/python3.10/site-packages/PIL/WmfImagePlugin.py", line 163, in load
    return super().load()
  File "/opt/mfapi/lib/python3.10/site-packages/PIL/ImageFile.py", line 377, in load
    raise OSError(msg)
OSError: cannot find loader for this WMF file

I believe you might need to make such considerations for files of all formats.🤣
@maxmnemonic
