Using .DOCX format in cloud - suggestion on the below error? #410

acsankar · 2024-11-22T00:58:18Z

I am running Docling in the cloud and only want to convert a document to markdown, without images. I assume the error below occurs when the document contains images. Any suggestions on how to fix this?

doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.DOCX],
    format_options={
        InputFormat.DOCX: WordFormatOption(pipeline_cls=SimplePipeline),
    },
)

I am getting the error below:
---> 30 result = doc_converter.convert(temp_file.name)

18 frames
/usr/local/lib/python3.10/dist-packages/PIL/ImageFile.py in load(self)
375 if loader is None:
376 msg = f"cannot find loader for this {self.format} file"
--> 377 raise OSError(msg)
378 image = loader.load(self)
379 assert image is not None

OSError: cannot find loader for this WMF file

PeterStaar-IBM · 2024-11-22T05:11:05Z

@acsankar I want to help you here, but I think we need a bit more context. Can you give us the full stacktrace?

acsankar · 2024-11-22T06:48:20Z

> @acsankar I want to help you here, but I think we need a bit more context. Can you give us the full stacktrace?

Thanks for the reply. The full stack trace is below.
I am using Colab and reading the documents from a Google Cloud Storage bucket. As with python-docx, I first tried passing an in-memory byte stream (io.BytesIO), but Docling expects a .docx file, so I download the blob to a temporary file and load that into Docling. This works for a few other documents but fails for one 200-page document, so I suspect the pictures in that document are the cause. Is there any way to skip images for the .docx format, similar to how Docling can suppress image reading when converting to a Docling document? Please let me know if this helps.

<tempfile._TemporaryFileWrapper object at 0x7bfe9c3b00d0>
/tmp/tmpxytgsj20.docx

OSError Traceback (most recent call last)
in <cell line: 46>()
44
45
---> 46 result = doc_converter.convert(temp_file.name)
47 markdown_file_np = result.document.export_to_markdown()
48 print(markdown_file_np)

18 frames
/usr/local/lib/python3.10/dist-packages/PIL/ImageFile.py in load(self)
375 if loader is None:
376 msg = f"cannot find loader for this {self.format} file"
--> 377 raise OSError(msg)
378 image = loader.load(self)
379 assert image is not None

OSError: cannot find loader for this WMF file

Below is the code:
# Convert to markdown
import tempfile

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, WordFormatOption
from docling.pipeline.simple_pipeline import SimplePipeline
from google.cloud import storage

storage_client = storage.Client()

# file_path, bucket_name, source_folder and file_name are assumed to be
# defined earlier in the notebook.
gcs_path = file_path
blob_name = f"{source_folder}/{file_name}"

bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)

print(bucket)
print(blob_name)

# Download the blob to a temporary .docx file so Docling can read it from disk.
temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".docx")
print(temp_file)
blob.download_to_filename(temp_file.name)

doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.DOCX],
    format_options={
        InputFormat.DOCX: WordFormatOption(pipeline_cls=SimplePipeline),
    },
)

result = doc_converter.convert(temp_file.name)
markdown_file_np = result.document.export_to_markdown()
print(markdown_file_np)

maxmnemonic · 2024-11-25T16:13:07Z

I think the most likely cause is that the Word file includes an image whose blob cannot be loaded by the PIL library.
The error should occur even if the file is fully local.

@acsankar, any chance you could share an example file that reproduces this error?
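
One way to check whether a given document contains such an image, before converting it, is to look inside the file itself. A .docx (like a .pptx) is a ZIP archive, and embedded pictures normally sit under a media/ folder. The sketch below is only a diagnostic built on that assumption: `find_metafile_media` and `METAFILE_SUFFIXES` are hypothetical names, not part of Docling, and treating .emf the same way as .wmf is also an assumption.

```python
import zipfile
from pathlib import PurePosixPath
from typing import List

# Assumption: Windows metafiles are the images PIL cannot load here.
METAFILE_SUFFIXES = {".wmf", ".emf"}


def find_metafile_media(source) -> List[str]:
    """Return archive paths of WMF/EMF images embedded in a .docx or .pptx.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Embedded pictures live under word/media/ or ppt/media/.
            if "/media/" in name
            and PurePosixPath(name).suffix.lower() in METAFILE_SUFFIXES
        )
```

Calling `find_metafile_media(temp_file.name)` on the failing document would list any metafile images it contains; an empty list suggests the cause lies elsewhere.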

maxmnemonic · 2024-11-25T19:24:26Z

Draft of the PR that should fix the error: #432

maxmnemonic · 2024-11-26T15:21:29Z

@acsankar The fresh release, Docling 2.7.1, includes the fix!

Tendo33 · 2024-12-10T09:51:18Z

This error also occurs with PPTX files. I am currently using these versions:

docling                   2.10.0
docling-core              2.9.0
docling-ibm-models        2.0.7
docling-parse             3.0.0

The error message is:

    conv_res = pptx_converter.convert(pptx_file_path)
  File "/opt/mfapi/lib/python3.10/site-packages/pydantic/validate_call_decorator.py", line 59, in wrapper_function
    return validate_call_wrapper(*args, **kwargs)
  File "/opt/mfapi/lib/python3.10/site-packages/pydantic/_internal/_validate_call.py", line 81, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
  File "/opt/mfapi/lib/python3.10/site-packages/docling/document_converter.py", line 172, in convert
    return next(all_res)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/document_converter.py", line 193, in convert_all
    for conv_res in conv_res_iter:
  File "/opt/mfapi/lib/python3.10/site-packages/docling/document_converter.py", line 228, in _convert
    for item in map(
  File "/opt/mfapi/lib/python3.10/site-packages/docling/document_converter.py", line 269, in _process_document
    conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/document_converter.py", line 292, in _execute_pipeline
    conv_res = pipeline.execute(in_doc, raises_on_error=raises_on_error)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/pipeline/base_pipeline.py", line 52, in execute
    raise e
  File "/opt/mfapi/lib/python3.10/site-packages/docling/pipeline/base_pipeline.py", line 44, in execute
    conv_res = self._build_document(conv_res)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/pipeline/simple_pipeline.py", line 41, in _build_document
    conv_res.document = conv_res.input._backend.convert()
  File "/opt/mfapi/lib/python3.10/site-packages/docling/backend/mspowerpoint_backend.py", line 97, in convert
    doc = self.walk_linear(self.pptx_obj, doc)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/backend/mspowerpoint_backend.py", line 406, in walk_linear
    handle_shapes(shape, parent_slide, slide_ind, doc)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/backend/mspowerpoint_backend.py", line 384, in handle_shapes
    self.handle_pictures(shape, parent_slide, slide_ind, doc)
  File "/opt/mfapi/lib/python3.10/site-packages/docling/backend/mspowerpoint_backend.py", line 285, in handle_pictures
    image=ImageRef.from_pil(image=pil_image, dpi=im_dpi),
  File "/opt/mfapi/lib/python3.10/site-packages/docling_core/types/doc/document.py", line 484, in from_pil
    image.save(buffered, format="PNG")
  File "/opt/mfapi/lib/python3.10/site-packages/PIL/Image.py", line 2528, in save
    self._ensure_mutable()
  File "/opt/mfapi/lib/python3.10/site-packages/PIL/Image.py", line 639, in _ensure_mutable
    self._copy()
  File "/opt/mfapi/lib/python3.10/site-packages/PIL/Image.py", line 632, in _copy
    self.load()
  File "/opt/mfapi/lib/python3.10/site-packages/PIL/WmfImagePlugin.py", line 163, in load
    return super().load()
  File "/opt/mfapi/lib/python3.10/site-packages/PIL/ImageFile.py", line 377, in load
    raise OSError(msg)
OSError: cannot find loader for this WMF file

I believe this kind of handling may be needed for files of all formats. 🤣
@maxmnemonic
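
A general way to make any backend tolerant of such images is to guard the decode step and skip pictures that Pillow cannot load. The following is a minimal sketch of that idea, not the code from any Docling PR; `to_png_bytes_or_none` is a hypothetical helper. It relies on Pillow raising `OSError` for this case, as the traceback above shows, and on `UnidentifiedImageError` being a subclass of `OSError`.

```python
import io
from typing import Optional

from PIL import Image


def to_png_bytes_or_none(blob: bytes) -> Optional[bytes]:
    """Re-encode an embedded image blob as PNG.

    Returns None when Pillow cannot decode or save the image, so the
    caller can skip the picture instead of failing the whole conversion.
    """
    try:
        with Image.open(io.BytesIO(blob)) as image:
            buffered = io.BytesIO()
            # Saving forces a load, which is where the WMF error surfaces.
            image.save(buffered, format="PNG")
            return buffered.getvalue()
    except OSError:
        # Covers UnidentifiedImageError and
        # "cannot find loader for this WMF file".
        return None
```

A caller would then add the picture only when the result is not None, and could log a warning otherwise so that skipped images remain visible.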

acsankar added the "question" label (Further information is requested) on Nov 22, 2024

dolfim-ibm added the "docx" label (issue related to docx backend) on Nov 25, 2024

dolfim-ibm assigned maxmnemonic on Nov 25, 2024

cau-git mentioned this issue on Nov 25, 2024: "using docx format in cloud" DS4SD/docling-core#72 (Closed)

maxmnemonic mentioned this issue on Nov 25, 2024: "fix: Fixes for wordx" #432 (Merged)

maxmnemonic closed this as completed in #432 on Nov 26, 2024

Tendo33 mentioned this issue on Dec 24, 2024: "fix(mspowerpoint): handle invalid images in PowerPoint slides" #650 (Merged)
