bug/parsing pdf error - new_cells as str has no "copy" #3119

Closed
mpierangeli-q99 opened this issue May 30, 2024 · 6 comments · Fixed by #3130
Labels
awaiting-response bug Something isn't working pdf

Comments

@mpierangeli-q99

Bug Description

After parsing hundreds of similar PDFs successfully, an AttributeError was raised on one of them (not particularly different from the rest, just another brochure for a company product).

Error Output

File "/usr/local/lib/python3.11/site-packages/unstructured/documents/elements.py", line 570, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 622, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 83, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 221, in partition_pdf
return partition_pdf_or_image(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 312, in partition_pdf_or_image
elements = _partition_pdf_or_image_local(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 516, in partition_pdf_or_image_local
final_document_layout = process_data_with_ocr(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 85, in process_data_with_ocr
merged_layouts = process_file_with_ocr(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 181, in process_file_with_ocr
raise e
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 169, in process_file_with_ocr
merged_page_layout = supplement_page_layout_with_ocr(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 247, in supplement_page_layout_with_ocr
page_layout.elements[:] = supplement_element_with_table_extraction(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/utils.py", line 220, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py", line 297, in supplement_element_with_table_extraction
text_as_html = cells_to_html(tatr_cells)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured_inference/models/tables.py", line 675, in cells_to_html
cells = sorted(fill_cells(cells), key=lambda k: (min(k["row_nums"]), min(k["column_nums"])))
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/unstructured_inference/models/tables.py", line 663, in fill_cells
new_cells = cells.copy()
^^^^^^^^^^
AttributeError: 'str' object has no attribute 'copy'

Code Snippet

raw_pdf_elements = partition_pdf(
    file=pdf_content,
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=CHUNK_LENGTH,
    new_after_n_chars=CHUNK_LENGTH * 3,
    combine_text_under_n_chars=CHUNK_LENGTH,
    extract_image_block_output_dir=temp_path,
)

Environment Info
I'm running this on Python 3.11
onnx==1.16.1
pdf2image==1.17.0
pdfplumber==0.11.0
pdfminer.six==20231228
pillow_heif==0.16.0
pikepdf==8.15.1
opencv-python==4.9.0.80
unstructured-client==0.22.0
unstructured-inference==0.7.29
unstructured.pytesseract==0.3.12

@mpierangeli-q99 mpierangeli-q99 added the bug Something isn't working label May 30, 2024
@MthwRobinson
Contributor

Hi @mpierangeli-q99 - Are you able to provide an example document we could use to reproduce the error?

@mpierangeli-q99
Author

mpierangeli-q99 commented May 30, 2024

testing_brochure_1.pdf
Hi @MthwRobinson this is the pdf in question. Ty (edit: wrong file)

@scanny scanny added the pdf label May 30, 2024
@christinestraub
Collaborator

Hi @mpierangeli-q99, are you using the latest versions of the unstructured (0.14.3) and unstructured-inference (0.7.34) libraries? I did not get this error with those versions.

$ pip install unstructured -U
$ pip install unstructured-inference -U
from unstructured.partition.pdf import partition_pdf

filename = "testing_brochure_1.pdf"  # assumed: the PDF attached above
with open(filename, "rb") as pdf_content:
    elements = partition_pdf(
        file=pdf_content,
        extract_images_in_pdf=True,
        infer_table_structure=True,
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
        extract_image_block_output_dir=".",
    )

print("\n\n".join([str(el) for el in elements]))

@mpierangeli-q99
Author

Hi @christinestraub, I think I mixed up the files, because that one works.
testing_brochure_2.pdf
This one I'm sure doesn't work.
FYI
unstructured==0.13.6
unstructured-inference==0.7.29
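
Since the thread turns on which versions are installed, a quick programmatic check can help when comparing environments. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of unstructured.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


# Report the two packages discussed in this issue.
for name, ver in installed_versions(["unstructured", "unstructured-inference"]).items():
    print(f"{name}=={ver}" if ver else f"{name} is not installed")
```

The output can be pasted into an issue as-is, in the same `name==version` form used above.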

@christinestraub
Collaborator

Hi @mpierangeli-q99, I created a PR for a quick fix - #3130. The error occurred because the table is not recognized in the open-source version. I recommend using the API for improved table extraction performance.
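
The failure mode in the traceback can be reproduced without the library: a `str` reaches a function that expects a list of cells and calls `.copy()` on it. The sketch below is an illustration under assumptions, not the change made in #3130. `fill_cells_stub` is a hypothetical stand-in that mimics only the `cells.copy()` call shown in the traceback, and `safe_cells_to_html` is a hypothetical guard showing one way a caller might handle a non-list result when no table is recognized.

```python
from typing import Any, Callable, List


def fill_cells_stub(cells: List[dict]) -> List[dict]:
    """Hypothetical stand-in: mimics only the `cells.copy()` call from the traceback."""
    return cells.copy()


def safe_cells_to_html(cells: Any, to_html: Callable[[List[dict]], str]) -> str:
    """Hypothetical guard: return empty HTML unless `cells` is a non-empty list."""
    if not isinstance(cells, list) or not cells:
        return ""
    return to_html(cells)


# Passing a str in place of the cell list reproduces the reported error.
try:
    fill_cells_stub("")  # type: ignore[arg-type]
except AttributeError as exc:
    error_message = str(exc)

# With the guard, the same input yields an empty result instead of raising.
guarded = safe_cells_to_html(
    "", lambda cells: f"<table>{len(fill_cells_stub(cells))} cells</table>"
)
```

Until a fixed release is installed, a similar check around table extraction in calling code is one possible workaround; whether it suits a given pipeline depends on how empty tables should be treated downstream.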

github-merge-queue bot pushed a commit that referenced this issue Jun 3, 2024
Closes #3119.

### Testing
Parsing the provided PDF should be successful.


[testing_brochure_2.pdf](https://github.com/user-attachments/files/15518094/testing_brochure_2.pdf)
```
from unstructured.partition.pdf import partition_pdf

filename = "testing_brochure_2.pdf"
with open(filename, "rb") as pdf_content:
    elements = partition_pdf(
        file=pdf_content,
        infer_table_structure=True,
        extract_image_block_types=["Image", "Table"],
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
    )
print("\n\n".join([str(el) for el in elements]))
```
@mpierangeli-q99
Author

Ty @christinestraub !
