dataprep: Fix issue in uploading docx with embedding image
Fix issue opea-project#407

Signed-off-by: Lianhao Lu <[email protected]>
lianhao committed Aug 29, 2024
1 parent 2360e5a commit d668956
Showing 8 changed files with 8 additions and 9 deletions.
2 changes: 1 addition & 1 deletion comps/dataprep/milvus/requirements.txt
@@ -25,5 +25,5 @@ python-pptx
 sentence_transformers
 shortuuid
 tiktoken
-unstructured[all-docs]==0.11.5
+unstructured[all-docs]==0.15.7
 uvicorn
2 changes: 1 addition & 1 deletion comps/dataprep/pgvector/langchain/requirements.txt
@@ -26,6 +26,6 @@ python-pptx
 sentence_transformers
 shortuuid
 tiktoken
-unstructured[all-docs]==0.11.5
+unstructured[all-docs]==0.15.7
 uvicorn

2 changes: 1 addition & 1 deletion comps/dataprep/pinecone/requirements.txt
@@ -26,5 +26,5 @@ python-docx
 python-pptx
 sentence_transformers
 shortuuid
-unstructured[all-docs]==0.11.5
+unstructured[all-docs]==0.15.7
 uvicorn
2 changes: 1 addition & 1 deletion comps/dataprep/qdrant/requirements.txt
@@ -23,5 +23,5 @@ python-pptx
 qdrant-client
 sentence_transformers
 shortuuid
-unstructured[all-docs]==0.11.5
+unstructured[all-docs]==0.15.7
 uvicorn
Original file line number Diff line number Diff line change
@@ -32,8 +32,6 @@ services:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
REDIS_HOST: ${REDIS_HOST}
REDIS_PORT: ${REDIS_PORT}
REDIS_URL: ${REDIS_URL}
INDEX_NAME: ${INDEX_NAME}
TEI_ENDPOINT: ${TEI_ENDPOINT}
2 changes: 1 addition & 1 deletion comps/dataprep/redis/langchain/requirements.txt
@@ -25,5 +25,5 @@ python-pptx
 redis
 sentence_transformers
 shortuuid
-unstructured[all-docs]==0.11.5
+unstructured[all-docs]==0.15.7
 uvicorn
1 change: 1 addition & 0 deletions comps/dataprep/redis/langchain_ray/requirements.txt
@@ -24,5 +24,6 @@ ray
 redis
 sentence_transformers
 shortuuid
+unstructured[all-docs]==0.15.7
 uvicorn
 virtualenv
4 changes: 2 additions & 2 deletions comps/dataprep/utils.py
@@ -12,6 +12,7 @@
 import shutil
 import signal
 import subprocess
+import tempfile
 import timeit
 import unicodedata
 import urllib.parse
@@ -187,8 +188,7 @@ def load_docx(docx_path):
 if isinstance(r._target, docx.parts.image.ImagePart):
 rid2img[r.rId] = os.path.basename(r._target.partname)
 if rid2img:
-save_path = "./imgs/"
-os.makedirs(save_path, exist_ok=True)
+save_path = tempfile.mkdtemp()
 docx2txt.process(docx_path, save_path)
 for paragraph in doc.paragraphs:
 if hasattr(paragraph, "text"):
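Note on the `utils.py` change: `tempfile.mkdtemp()` creates a private, uniquely named directory, but it does not delete it afterwards. The visible hunk does not show whether `load_docx` removes the directory later. The following is a minimal sketch, not the code from this commit, of one way to pair the call with cleanup. The names `scratch_dir` and `extract_images` are hypothetical, and `extract` stands in for any callable with the `(source_path, out_dir)` shape that `docx2txt.process` is called with above.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_dir():
    """Yield a fresh private directory and delete it on exit (hypothetical helper)."""
    path = tempfile.mkdtemp()  # same call the commit introduces
    try:
        yield path
    finally:
        # mkdtemp() leaves cleanup to the caller, so remove the tree here.
        shutil.rmtree(path, ignore_errors=True)


def extract_images(extract, source_path):
    """Run extract(source_path, out_dir) and return {file name: bytes} (hypothetical helper).

    The files are read into memory before the scratch directory is removed,
    so nothing is left on disk once this function returns.
    """
    with scratch_dir() as out_dir:
        extract(source_path, out_dir)
        images = {}
        for name in sorted(os.listdir(out_dir)):
            with open(os.path.join(out_dir, name), "rb") as fh:
                images[name] = fh.read()
        return images
```

Under these assumptions a caller could write `extract_images(docx2txt.process, docx_path)`. Whether reading the images into memory suits `load_docx` depends on how the rest of the function consumes them, which this diff does not show.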
