Refactor: remove `pdfminer` related code #294

christinestraub · 2023-11-21T21:48:16Z

Summary

This PR is the first part of the pdfminer refactor, which moves pdfminer-based text extraction from the unstructured-inference repo to the unstructured repo. It removes all pdfminer-related code from unstructured-inference and works together with the companion unstructured refactor PR, Unstructured-IO/unstructured#2158.

Note

The ingest test won't pass until we merge the unstructured refactor PR - Unstructured-IO/unstructured#2158.

TODO

Refactor image extraction to move it from the unstructured-inference repo to the unstructured repo.

benjats07

LGTM

…2158)

### Summary

This PR is the second part of the `pdfminer` refactor to move it from the `unstructured-inference` repo to the `unstructured` repo; the first part is done in Unstructured-IO/unstructured-inference#294. This PR adds logic to merge the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:

* pass the document (as data/filename) to the `inference` repo to get `inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the document (as data/filename) to the `pdfminer_processing` module, which first opens the document (creating a temp file/dir as needed) and splits the document by pages
  * if is_image is `True`, return the passed `inferred_layout` (DocumentLayout)
  * if is_image is `False`:
    * get `extracted_layout` (TextRegions) from the passed document (data/filename) by pdfminer
    * merge `extracted_layout` (TextRegions) with the passed `inferred_layout` (DocumentLayout)
    * return the `inferred_layout` (DocumentLayout) with updated elements (all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass `merged_layout` and the document (as data/filename) to the `OCR` module, which first opens the document (creating a temp file/dir as needed) and splits the document by pages (converting PDF pages to image pages for a PDF file)

### Note

This PR also fixes issue #2164 by using functionality similar to the one implemented in the `fast` strategy workflow when extracting elements by `pdfminer`.

### TODO

* image extraction refactor to move it from the `unstructured-inference` repo to the `unstructured` repo
* improving natural reading order by applying the current default `xycut` sorting to the elements extracted by `pdfminer`
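The `pdfminer_processing` step of the `hi_res` workflow described above can be sketched roughly as follows. This is a minimal illustration, not the code from either repository: `Region`, `Page`, `Document`, `overlap_ratio`, `merge_page`, and `process_with_pdfminer` are hypothetical stand-ins for `TextRegion`/`LayoutElement`/`DocumentLayout` and the real merge logic, and the overlap-threshold rule used to decide which extracted text belongs to which inferred element is an assumption made only for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    """Hypothetical stand-in for a TextRegion / LayoutElement."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: Optional[str] = None
    source: str = "inferred"


@dataclass
class Page:
    elements: List[Region] = field(default_factory=list)


@dataclass
class Document:
    """Hypothetical stand-in for DocumentLayout."""
    pages: List[Page] = field(default_factory=list)


def overlap_ratio(inner: Region, outer: Region) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    w = min(inner.x2, outer.x2) - max(inner.x1, outer.x1)
    h = min(inner.y2, outer.y2) - max(inner.y1, outer.y1)
    if w <= 0 or h <= 0:
        return 0.0
    area = (inner.x2 - inner.x1) * (inner.y2 - inner.y1)
    return (w * h) / area if area > 0 else 0.0


def merge_page(inferred: List[Region], extracted: List[Region],
               threshold: float = 0.5) -> List[Region]:
    """Assumed merge rule: attach extracted text to the first inferred
    element that mostly contains it; otherwise keep it as its own element."""
    merged = list(inferred)
    for ext in extracted:
        target = next(
            (m for m in merged if overlap_ratio(ext, m) >= threshold), None
        )
        if target is None:
            merged.append(ext)  # text the layout model did not cover
        elif target.text:
            target.text = f"{target.text} {ext.text}"
        else:
            target.text = ext.text
    return merged


def process_with_pdfminer(inferred_layout: Document,
                          extracted_pages: List[List[Region]],
                          is_image: bool) -> Document:
    """Return the inferred layout untouched for images; otherwise update
    each page's elements with the merged result and return the layout."""
    if is_image:
        return inferred_layout
    for page, extracted in zip(inferred_layout.pages, extracted_pages):
        page.elements = merge_page(page.elements, extracted)
    return inferred_layout
```

In this sketch the merged layout is the same `Document` object with updated elements, mirroring the "return the `inferred_layout` with updated elements" step; the result would then be handed to the OCR stage.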

feat: update unstructured-inference library to return only inferred…

92fb519

… layout elements

christinestraub changed the title ~~feat: update unstructured-inference library to return only inferred…~~ Refactor: remove text extraction (pdfminer) related code Nov 21, 2023

christinestraub added 13 commits November 24, 2023 13:26

refactor: remove load_pdf() and related test functions

9f73f0e

refactor: remove constant Source.PDFMINER

5633ff3

chore: update changelog & version

cad6815

* refactor: remove use of PageLayout.layout

c3f1ceb

* test: fix lint errors

test: fix lint errors

3525db8

test: fix unit test errors

23c5186

refactor: remove get_images_from_pdf_element()

f944763

refactor: inferred layout ordering

70e623e

refactor: remove analysis flag, which was used for debugging/layout…

033ca2b

… visualization purposes

test: fix unit test errors

87cb3f9

Merge branch 'main' into refactor/remove_pdfminer_code

7cd59ba

# Conflicts: # CHANGELOG.md # unstructured_inference/__version__.py

refactor: remove code related to pdfminer patch

10266cb

test: add a test function for `merge_inferred_layout_with_extracted_l…

2f9a432

…ayout()`

christinestraub marked this pull request as ready for review November 30, 2023 06:40

christinestraub requested review from qued, yuming-long, benjats07 and cragwolfe November 30, 2023 06:41

christinestraub mentioned this pull request Nov 30, 2023

Refactor: support merging extracted layout with inferred layout Unstructured-IO/unstructured#2158

Merged

chore: update changelog & version

b1a8feb

benjats07 approved these changes Dec 1, 2023

cragwolfe merged commit 2b29254 into main Dec 1, 2023
5 of 8 checks passed

cragwolfe deleted the refactor/remove_pdfminer_code branch December 1, 2023 05:37

christinestraub changed the title ~~Refactor: remove text extraction (pdfminer) related code~~ Refactor: remove pdfminer related code Dec 1, 2023
