Feat/chipper repetitions #295

Merged
merged 22 commits into from
Dec 20, 2023

Conversation

ajjimeno
Contributor

In some cases Chipper repeats elements. This PR adds mechanisms to detect these repetitions during decoding, and filters out repetitions that cannot be identified at decoding time.

Repetition detection:

  • Tables benefit from a beam search size of 3. When a table is detected while decoding with beam search size 1, generation restarts with beam search size 3.
  • To avoid flagging genuine repetitions, a context window has been defined, so repeated text is searched for only within that window rather than across all the generated elements.
  • A specific mechanism has been added to detect repetitions inside tables, which were not detected before.
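
As a rough illustration of the first two points, here is a minimal sketch and not the code in this PR. The names `has_repetition` and `generate_with_fallback`, the n-gram size, and the window length are all assumptions made for the example; `generate` and `is_table` stand in for whatever decoding and table-detection calls the model wrapper actually exposes.

```python
from typing import Callable, List, Sequence


def has_repetition(tokens: Sequence[int], ngram: int = 3, window: int = 12) -> bool:
    """Return True if the last `ngram` tokens already occur earlier within
    the last `window` tokens.

    Tokens outside the window are ignored, so content that legitimately
    recurs far apart in the output is not flagged. (Hypothetical helper.)
    """
    if len(tokens) < 2 * ngram:
        return False
    context = list(tokens[-window:])
    tail = context[-ngram:]
    # Compare the tail against every earlier, non-overlapping n-gram in the window.
    for start in range(len(context) - 2 * ngram + 1):
        if context[start:start + ngram] == tail:
            return True
    return False


def generate_with_fallback(
    generate: Callable[[int], List[str]],
    is_table: Callable[[List[str]], bool],
) -> List[str]:
    """Decode with beam size 1 and decode again with beam size 3 if the
    first pass produced a table. Both callables are assumptions of this sketch."""
    output = generate(1)
    if is_table(output):
        output = generate(3)
    return output
```

The window keeps the check local: a phrase that appears once near the top of a page and again near the bottom falls outside the window and is left alone.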

Additional filtering:

  • Remove empty tables
  • Remove Picture elements that get repeated
  • Remove repeated texts. This uses the bounding boxes and text matching on the elements to identify repetitions.
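
The filtering steps above could look roughly like the sketch below. This is an illustration under stated assumptions, not the implementation in this PR: the `Element` class, the use of intersection over union (IoU) to compare bounding boxes, the 0.9 threshold, and treating a Picture as repeated when its box matches an earlier Picture are all choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Element:
    """Simplified stand-in for a layout element (hypothetical)."""
    type: str
    text: str
    bbox: BBox


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_elements(elements: List[Element], iou_threshold: float = 0.9) -> List[Element]:
    """Drop empty tables, repeated pictures, and repeated texts."""
    kept: List[Element] = []
    for el in elements:
        # Remove empty tables.
        if el.type == "Table" and not el.text.strip():
            continue
        # An element is a repetition if an already kept element of the same
        # type covers nearly the same box and, for non-pictures, has the same text.
        duplicate = any(
            k.type == el.type
            and iou(k.bbox, el.bbox) >= iou_threshold
            and (el.type == "Picture" or k.text.strip() == el.text.strip())
            for k in kept
        )
        if not duplicate:
            kept.append(el)
    return kept
```

Requiring both a box match and a text match means the same sentence printed in two different places on the page is kept, which matches the intent of not removing genuine repetitions.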

Repetitions are not easy to reproduce. For additional testing, I suggest selecting documents that are neither part of the Odetta annotation nor available in this repository. Testing in different environments (e.g. Linux, macOS) is also recommended.

@ajjimeno ajjimeno marked this pull request as ready for review November 26, 2023 22:29
@cragwolfe
Contributor

Please add test instructions. Maybe just calling partition(model_name="chipperv2") on a particular image where an improvement is visible?

@cragwolfe cragwolfe requested a review from mengdih November 27, 2023 05:25
@ajjimeno
Contributor Author

Here are two example images. One is processed as a table because it was modelled that way in the ground truth; the problem is that the table is never finished, and Chipper generates an unlimited number of "" token pairs. The second has some repetitions that may be caused by the images in the document. With this PR, these repetitions disappear.

The PDF document made Chipper generate repetitions, but only under Linux.

Example code for the images:

from unstructured_inference.inference.layout import DocumentLayout
from unstructured_inference.models.base import get_model

image_file_name = "path/to/image.png"  # replace with the image path

model = get_model("chipper")
doc = DocumentLayout.from_image_file(
    image_file_name,
    detection_model=model,
)

print(*[element.__dict__ for element in doc.pages[0].elements])

print(*[element for element in doc.pages[0].elements], sep="\n")

Example code for the PDF file:

from unstructured_inference.inference.layout import DocumentLayout
from unstructured_inference.models.base import get_model

pdf_file_name = "path/to/document.pdf"  # replace with the PDF path

model = get_model("chipper")
doc = DocumentLayout.from_file(
    pdf_file_name,
    detection_model=model,
    pdf_image_dpi=300,
)

print(*[element.__dict__ for element in doc.pages[0].elements])

print(*[element for element in doc.pages[0].elements], sep="\n")

RAND_RRA2977-1.pdf

46
42044487_0fb229dd-commoncrawl_blogannettepehrssonsezenit-b-volume-thirml_2

CHANGELOG.md Outdated
@@ -1,3 +1,7 @@
## 0.7.21-dev
Contributor

This should probably be a release version again so that the change gets into unstructured.

@mengdih

mengdih commented Dec 20, 2023

Tested with chipperv2; the repetition problem still exists. The issue could come from the new changes in the main branch. I have discussed this with @ajjimeno.

@ajjimeno
Contributor Author

Fixed the merge issue and added additional test cases. @mengdih please check again. Thanks!

@ajjimeno ajjimeno merged commit 54e3e46 into main Dec 20, 2023
5 of 8 checks passed
@ajjimeno ajjimeno deleted the feat/chipper-repetitions branch December 20, 2023 22:05
ajjimeno added a commit that referenced this pull request Dec 20, 2023
ajjimeno added a commit that referenced this pull request Dec 20, 2023
@ajjimeno ajjimeno restored the feat/chipper-repetitions branch January 2, 2024 02:19
4 participants