bug/parsing pdf error - new_cells as str has no "copy" #3119
Comments
Hi @mpierangeli-q99, are you able to provide an example document we could use to reproduce the error?
testing_brochure_1.pdf
Hi @mpierangeli-q99, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get those errors with those versions.
Hi @christinestraub, I think I mixed up the files, because that one is working.
Hi @mpierangeli-q99, I created a PR with a quick fix: #3130. The error occurred because the table is not recognized in the open-source version. I recommend using the API for improved table extraction performance.
Closes #3119.

### Testing

Parsing the provided PDF should be successful.

[testing_brochure_2.pdf](https://github.com/user-attachments/files/15518094/testing_brochure_2.pdf)

```
from unstructured.partition.pdf import partition_pdf

filename = "testing_brochure_2.pdf"

# Partition the PDF with table structure inference and by-title chunking enabled.
with open(filename, "rb") as pdf_content:
    elements = partition_pdf(
        file=pdf_content,
        infer_table_structure=True,
        extract_image_block_types=["Image", "Table"],
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
    )

print("\n\n".join([str(el) for el in elements]))
```
Ty @christinestraub!
Bug Description
After successfully parsing hundreds of similar PDFs, I hit an AttributeError on one of them (not particularly different, just another company product brochure).
Error Output
File "/usr/local/lib/python3.11/site-packages/unstructured/documents/elements.py", line 570, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 622, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 83, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 221, in partition_pdf
return partition_pdf_or_image(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 312, in partition_pdf_or_image
elements = _partition_pdf_or_image_local(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 516, in partition_pdf_or_image_local
final_document_layout = process_data_with_ocr(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 85, in process_data_with_ocr
merged_layouts = process_file_with_ocr(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 181, in process_file_with_ocr
raise e
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 169, in process_file_with_ocr
merged_page_layout = supplement_page_layout_with_ocr(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 247, in supplement_page_layout_with_ocr
page_layout.elements[:] = supplement_element_with_table_extraction(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 297, in supplement_element_with_table_extraction
text_as_html = cells_to_html(tatr_cells)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured_inference/models/tables.py", line 675, in cells_to_html
cells = sorted(fill_cells(cells), key=lambda k: (min(k["row_nums"]), min(k["column_nums"])))
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured_inference/models/tables.py", line 663, in fill_cells
new_cells = cells.copy()
^^^^^^^^^^
AttributeError: 'str' object has no attribute 'copy'
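The last two frames suggest that `fill_cells` received a string where it expected a list of cell dicts. Below is a minimal sketch of that failure mode together with a defensive guard. The names `fill_cells_sketch` and `guarded_fill_cells` are hypothetical stand-ins written for illustration; they are not the actual `unstructured_inference` code, and the guard is only one possible workaround.

```python
from typing import Any


def fill_cells_sketch(cells: Any) -> list:
    """Hypothetical stand-in: like the failing frame, it begins by copying its input."""
    new_cells = cells.copy()  # lists have .copy(); str does not
    return new_cells


def guarded_fill_cells(cells: Any) -> list:
    """Hypothetical guard: treat any non-list input as 'no table cells found'."""
    if not isinstance(cells, list):
        return []
    return fill_cells_sketch(cells)


def reproduce() -> str:
    """Trigger the same kind of AttributeError by passing a str instead of a list."""
    try:
        fill_cells_sketch("")
    except AttributeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(reproduce())
    print(guarded_fill_cells(""))
```

Checking the type before copying keeps one unrecognized table from aborting the whole document, at the cost of silently producing no cells for that table.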
Code Snippet
Environment Info
I'm running this on Python 3.11
onnx==1.16.1
pdf2image==1.17.0
pdfplumber==0.11.0
pdfminer.six==20231228
pillow_heif==0.16.0
pikepdf==8.15.1
opencv-python==4.9.0.80
unstructured-client==0.22.0
unstructured-inference==0.7.29
unstructured.pytesseract==0.3.12
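The list above pins unstructured-inference at 0.7.29, while the maintainer's comment reports no error with unstructured 0.14.3 and unstructured-inference 0.7.34. A small sketch for comparing installed versions against those numbers follows. The helper names are hypothetical, and the version parsing is deliberately naive (it assumes purely numeric, dot-separated versions).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Versions mentioned in the maintainer's comment; treated here as reference points only.
REFERENCE_VERSIONS = {
    "unstructured": "0.14.3",
    "unstructured-inference": "0.7.34",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_tuple(v: str) -> Tuple[int, ...]:
    """Naive parser: keeps only the numeric dot-separated parts."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())


def report() -> None:
    """Print each package's installed version next to the reference version."""
    for package, reference in REFERENCE_VERSIONS.items():
        found = installed_version(package)
        if found is None:
            print(f"{package}: not installed")
        elif version_tuple(found) < version_tuple(reference):
            print(f"{package}: {found} is older than {reference}")
        else:
            print(f"{package}: {found}")


if __name__ == "__main__":
    report()
```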