Feat/chipper repetitions #295
please add test instructions. maybe just calling
Here are two example images. The first is processed as a table, because that is how it was modelled in the ground truth. The problem is that the table is never finished: Chipper generates an unlimited number of "" token pairs. The second shows repetitions that may be caused by the images in the document. With the proposed PR, these repetitions disappear. The PDF document made Chipper generate repetitions as well, but only under Linux. Example code for the images:
Example code for the PDF file:
CHANGELOG.md
@@ -1,3 +1,7 @@
## 0.7.21-dev
This should probably be a release again so that it gets into unstructured.
Tested with chipperv2; the repetition problem still exists. The issue could come from the new changes in the main branch. I have discussed this with @ajjimeno.
Fixed the issue with the merging and added additional test cases. @mengdih please check again. Thanks!
In some cases Chipper repeats elements. This PR adds mechanisms to detect these repetitions during decoding and to filter out repetitions that cannot be identified at that stage.
Repetition detection:
Additional filtering:
Repetitions are not easy to reproduce. For additional testing, I would suggest selecting documents that are neither part of the Odetta annotation nor available in this repository. Testing in different environments (e.g. Linux, macOS) is also recommended.
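For readers who want a feel for the two ideas in the description (detecting a repetition while decoding, and filtering repeated elements afterwards), here is a minimal sketch. It is not the code from this PR: the function names `has_trailing_repetition` and `drop_consecutive_duplicates`, the parameters `max_period` and `min_repeats`, and the use of plain token ids and strings are all assumptions made for illustration.

```python
from typing import List, Sequence


def has_trailing_repetition(
    tokens: Sequence[int], max_period: int = 4, min_repeats: int = 3
) -> bool:
    """Return True if `tokens` ends with a short pattern repeated back to back.

    Hypothetical helper: a pattern of length 1..max_period counts as a
    repetition when it occurs at least `min_repeats` times at the end.
    """
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(tokens) < needed:
            continue
        tail = list(tokens[-needed:])
        pattern = tail[:period]
        # Every position in the tail must match the pattern cyclically.
        if all(tail[i] == pattern[i % period] for i in range(needed)):
            return True
    return False


def drop_consecutive_duplicates(elements: List[str]) -> List[str]:
    """Keep only the first item of each run of identical consecutive texts.

    Hypothetical post-processing filter for repetitions that were not
    caught while decoding.
    """
    kept: List[str] = []
    for text in elements:
        if kept and kept[-1] == text:
            continue
        kept.append(text)
    return kept
```

With these assumed helpers, `has_trailing_repetition([5, 1, 2, 1, 2, 1, 2])` returns `True` because the pair `1, 2` closes the sequence three times, and `drop_consecutive_duplicates(["Title", "Row", "Row", "Row", "Footer"])` returns `["Title", "Row", "Footer"]`. A real check would operate on the model's own token ids and element objects, and would need thresholds tuned so that legitimately repeated content (for example, identical table cells) is not removed.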