Refactor: remove pdfminer related code #294

Merged: 15 commits, Dec 1, 2023


@christinestraub christinestraub commented Nov 21, 2023

Summary

This PR is the first part of the pdfminer refactor, which moves pdfminer-related code from the unstructured-inference repo to the unstructured repo. It removes all pdfminer-related code from unstructured-inference and works together with the companion unstructured refactor PR, Unstructured-IO/unstructured#2158.

Note

The ingest test will not pass until the unstructured refactor PR, Unstructured-IO/unstructured#2158, is merged.

TODO

  • Refactor image extraction to move it from the unstructured-inference repo to the unstructured repo

@christinestraub christinestraub changed the title feat: update unstructured-inference library to return only inferred… Refactor: remove text extraction (pdfminer) related code Nov 21, 2023

@benjats07 benjats07 left a comment

LGTM

@cragwolfe cragwolfe merged commit 2b29254 into main Dec 1, 2023
5 of 8 checks passed
@cragwolfe cragwolfe deleted the refactor/remove_pdfminer_code branch December 1, 2023 05:37
github-merge-queue bot pushed a commit to Unstructured-IO/unstructured that referenced this pull request Dec 1, 2023
…2158)

### Summary
This PR is the second part of the `pdfminer` refactor, which moves
`pdfminer`-related code from the `unstructured-inference` repo to the
`unstructured` repo. The first part is done in
Unstructured-IO/unstructured-inference#294. This PR adds logic to merge
the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
  `inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
  document (as data/filename) to the `pdfminer_processing` module, which
  first opens the document (create temp file/dir as needed), and splits
  the document by pages
  * if `is_image` is `True`, return the passed `inferred_layout`
    (DocumentLayout)
  * if `is_image` is `False`:
    * get `extracted_layout` (TextRegions) from the passed document
      (data/filename) by `pdfminer`
    * merge `extracted_layout` (TextRegions) with the passed
      `inferred_layout` (DocumentLayout)
    * return the `inferred_layout` (DocumentLayout) with updated elements
      (all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass `merged_layout` and the document (as data/filename) to the `OCR`
  module, which first opens the document (create temp file/dir as needed),
  and splits the document by pages (convert PDF pages to image pages for
  PDF file)
### Note
This PR also fixes issue #2164 by using functionality similar to that
implemented in the `fast` strategy workflow when extracting elements with
`pdfminer`.

### TODO
* refactor image extraction to move it from the `unstructured-inference`
repo to the `unstructured` repo
* improve natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
@christinestraub christinestraub changed the title Refactor: remove text extraction (pdfminer) related code Refactor: remove pdfminer related code Dec 1, 2023
3 participants