Duplicate strings assigned to the same cell #15
Comments
I believe this occurs when bold characters are produced by drawing duplicate characters instead of widening the character. I've noticed it often creates 4 copies of each character, although in your example it is 2. I think this happens at the PDF level, because these bold characters don't differ from the rest in font or other characteristics.
In addition, this is made worse by the fact that in some duplicates the LTHorizontal object splits the line in two, while in other duplicates the line is not split.
Yep, facing the same issue.
There's a relatively easy fix that probably works most of the time (I haven't seen a counterexample, but assume there might be some): simply eliminate any fully overlapping LTHorizontals.
Can you please guide me on how I would do that?
You need to change the source code, so this isn't a great task if you're not comfortable with programming. Wherever you see horizontals = get_text_objects(ltype=LThorizontal), you can use the following code to delete the overlapping horizontals.
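A minimal sketch of that idea, not necessarily the exact snippet from this comment and not the code that was later merged. It assumes each text object exposes pdfminer-style x0, y0, x1, y1 coordinates and a get_text() method; TextLine, covers, drop_overlapping, tol and match_text are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """Hypothetical stand-in for a pdfminer horizontal text object."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str

    def get_text(self) -> str:
        return self.text


def covers(outer, inner, tol=0.5, match_text=True):
    """True if outer's bounding box fully contains inner's, within tol points.

    With match_text=True (an extra guard, assumed here), inner's text must
    also appear inside outer's text, so that different strings stacked in
    the same place on purpose are not treated as duplicates.
    """
    inside = (
        outer.x0 - tol <= inner.x0
        and outer.y0 - tol <= inner.y0
        and outer.x1 + tol >= inner.x1
        and outer.y1 + tol >= inner.y1
    )
    if not inside:
        return False
    if match_text:
        return inner.get_text().strip() in outer.get_text()
    return True


def drop_overlapping(horizontals, tol=0.5, match_text=True):
    """Keep one object out of every group whose boxes fully overlap."""
    kept = []
    for obj in horizontals:
        if any(covers(k, obj, tol, match_text) for k in kept):
            continue  # obj lies entirely inside an object we already kept
        # obj may itself cover smaller pieces kept earlier (a split line)
        kept = [k for k in kept if not covers(obj, k, tol, match_text)]
        kept.append(obj)
    return kept
```

Under these assumptions it would be applied as horizontals = drop_overlapping(horizontals) right after the list is built. Because a whole line covers both halves of its split duplicate, the halves are discarded whichever order the objects arrive in.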
If anyone notices cases that this does not cover, please let me know.
Thanks, I'll try this out and get back to you!
Sometimes text is stacked on top of other text intentionally; this doesn't account for that.
Yes! Let me see if I can get this into the library. Would you like to raise a PR with a corresponding test with the example PDF?
Yes.
Hi guys, I sent a PR with a working solution to the issue. I added a unit test with the PDF file mentioned in the first comment.
[MRG] Fix #15 extraction of cell data discarding overlapping text boxes
@edugonza Thank you for fixing this! The PR looked good! Thank you for adding a test too 👍 I'll start working on a release soon.
Can't wait. Any idea when it will be released?
Check out this birdisland.pdf output here.