Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix IndexError when partioning a pdf with starting_page_number #3246

Merged
merged 8 commits into from
Jun 19, 2024

Conversation

awalker4
Copy link
Contributor

The Issue:

When extracting images from pdfs, we use the metadata page number to index into a list of the images. However, the metadata page number can now be changed via starting_page_number. To get the true page index, we need to subtract this value.

Testing:

Run this snippet in a python shell. Before the fix, this throws an IndexError. On this branch, it will return the elements.

from unstructured.partition.auto import partition
filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)

The Issue:

When extracting images from pdfs, we use the metadata page number to index into a list of the
images. However, the metadata page number can now be changed via `starting_page_number`. To get the
true page index, we need to subtract this value.

Testing:

Run this snippet in a python shell. Before the fix, this throws an IndexError. On this branch, it
will return the elements.
```
from unstructured.partition.auto import partition
filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)
```
@awalker4 awalker4 requested a review from christinestraub June 19, 2024 00:29
Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just requesting one additional test assertion.

@@ -1223,6 +1223,8 @@ def test_partition_pdf_element_extraction(
if file_mode == "filename":
elements = pdf.partition_pdf(
filename=filename,
# Image extraction shouldn't break by setting this
starting_page_number=20,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add a test that asserts that the resulting page number in metadata is correct?

Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MthwRobinson MthwRobinson enabled auto-merge June 19, 2024 17:21
Copy link
Collaborator

@christinestraub christinestraub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@MthwRobinson MthwRobinson changed the title fix/fix IndexError when partioning a pdf with starting_page_number fix: fix IndexError when partioning a pdf with starting_page_number Jun 19, 2024
@MthwRobinson MthwRobinson added this pull request to the merge queue Jun 19, 2024
Merged via the queue into main with commit 0b73978 Jun 19, 2024
50 checks passed
@MthwRobinson MthwRobinson deleted the fix/extract-images-page-number branch June 19, 2024 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants