fix: fix `IndexError` when partitioning a pdf with `starting_page_number` #3246

awalker4 · 2024-06-19T00:29:49Z

The Issue:

When extracting images from PDFs, we use the metadata page number to index into a list of the page images. However, the metadata page number can now be shifted via `starting_page_number`. To get the true page index, we need to subtract this value.
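A minimal sketch of the index arithmetic described above, not the library's actual code: `page_index_for` and `page_images` are hypothetical names, and the sketch assumes the first page's metadata page number equals `starting_page_number`.

```python
def page_index_for(metadata_page_number: int, starting_page_number: int = 1) -> int:
    """Map a (possibly shifted) metadata page number back to a 0-based list index."""
    index = metadata_page_number - starting_page_number
    if index < 0:
        raise ValueError("metadata page number is smaller than starting_page_number")
    return index


# Stand-in for the per-page images of a three-page document.
page_images = ["image-of-page-1", "image-of-page-2", "image-of-page-3"]
starting_page_number = 20

# With the offset, the metadata numbers the pages 20, 21, 22.
metadata_page_number = 21

# A naive lookup (page number minus one) would be page_images[20]: an IndexError.
# Subtracting starting_page_number recovers the true index instead.
image = page_images[page_index_for(metadata_page_number, starting_page_number)]
```

With the default offset of 1 in this sketch, the helper reduces to the usual "page number minus one" lookup, so unshifted documents behave as before.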

Testing:

Run this snippet in a Python shell. Before the fix, it raises an `IndexError`. On this branch, it returns the elements.

```
from unstructured.partition.auto import partition

filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)
```

MthwRobinson

Looks good! Just requesting one additional test assertion.

MthwRobinson · 2024-06-19T16:58:28Z

test_unstructured/partition/pdf_image/test_pdf.py

@@ -1223,6 +1223,8 @@ def test_partition_pdf_element_extraction(
        if file_mode == "filename":
            elements = pdf.partition_pdf(
                filename=filename,
+                # Image extraction shouldn't break by setting this
+                starting_page_number=20,


Could we also add a test that asserts that the resulting page number in metadata is correct?
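One way such an assertion could look, sketched with stand-in objects instead of real partition output. `FakeMetadata`, `FakeElement`, and `assert_page_numbers_offset` are hypothetical names for illustration, and the check assumes the first page yields at least one element.

```python
from dataclasses import dataclass


@dataclass
class FakeMetadata:
    page_number: int


@dataclass
class FakeElement:
    metadata: FakeMetadata


def assert_page_numbers_offset(elements, starting_page_number, num_pages):
    """Check that all page numbers fall in the shifted range and start at the offset."""
    page_numbers = {el.metadata.page_number for el in elements}
    expected = set(range(starting_page_number, starting_page_number + num_pages))
    assert page_numbers <= expected, f"unexpected page numbers: {sorted(page_numbers - expected)}"
    # Assumes the first page produced at least one element.
    assert min(page_numbers) == starting_page_number


# Stand-in elements for a two-page document partitioned with starting_page_number=20.
elements = [
    FakeElement(FakeMetadata(20)),
    FakeElement(FakeMetadata(20)),
    FakeElement(FakeMetadata(21)),
]
assert_page_numbers_offset(elements, starting_page_number=20, num_pages=2)
```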

MthwRobinson

LGTM

christinestraub

LGTM!

awalker4 requested a review from christinestraub June 19, 2024 00:29

awalker4 and others added 3 commits June 18, 2024 21:26

linter fix

9161749

Merge branch 'main' into fix/extract-images-page-number

1165cab

Sync version

6aad58d

awalker4 temporarily deployed to ci June 19, 2024 11:50 — with GitHub Actions Inactive

MthwRobinson suggested changes Jun 19, 2024

add addition test assertion

3f4d75c

MthwRobinson approved these changes Jun 19, 2024

MthwRobinson enabled auto-merge June 19, 2024 17:21

fix test

ef349c8

christinestraub approved these changes Jun 19, 2024

fix broken test

e2ac8dd

MthwRobinson temporarily deployed to ci June 19, 2024 17:35 — with GitHub Actions Inactive

refactor addition test assertion

c7cf2dc

christinestraub temporarily deployed to ci June 19, 2024 17:47 — with GitHub Actions Inactive

christinestraub temporarily deployed to ci June 19, 2024 17:48 — with GitHub Actions Inactive

MthwRobinson changed the title ~~fix/fix IndexError when partioning a pdf with starting_page_number~~ fix: fix IndexError when partioning a pdf with starting_page_number Jun 19, 2024

MthwRobinson added this pull request to the merge queue Jun 19, 2024

Merged via the queue into main with commit 0b73978 Jun 19, 2024
50 checks passed

MthwRobinson deleted the fix/extract-images-page-number branch June 19, 2024 19:01

awalker4 mentioned this pull request Jun 24, 2024

IndexError: list index out of range while extracting images from pdf? Unstructured-IO/unstructured-api#432

Closed
