BUG: layout mode text extraction ZeroDivisionError #2417
For fonts without an explicitly defined width for the " " character, it's still possible to trigger a ZeroDivisionError when compiling TextStateParams objects in `_fixed_width_page.recurs_to_target_op()` if the font size or the Tz parameter has been set to 0. Discovered while processing a "pre-OCR'd" image PDF having `{"/BaseFont": "/GlyphLessFont"}`.

Also: remove the duplicate docstring for `layout_mode_strip_rotated`.
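To illustrate the failure mode described above, here is a minimal sketch of how a zero font size or a zero Tz (horizontal scaling) value can zero out a computed space width and then blow up a later division. The function names, the 250-unit fallback glyph width, and the arithmetic are assumptions made for illustration only; this is not pypdf's actual code.

```python
def space_width(font_size: float, tz_percent: float, glyph_width: float = 250.0) -> float:
    """Hypothetical helper: width of a space in text-space units.

    glyph_width is an assumed fallback in 1/1000 em for fonts that
    do not define a width for " ".
    """
    return glyph_width / 1000.0 * font_size * (tz_percent / 100.0)


def spaces_for_gap(gap: float, font_size: float, tz_percent: float) -> int:
    """Hypothetical helper: number of space characters that fill a horizontal gap."""
    width = space_width(font_size, tz_percent)
    if not width:
        # Font size or Tz of 0 gives a zero width; skip the division
        # instead of raising ZeroDivisionError.
        return 0
    return round(gap / width)
```

Without the `if not width` guard, a call such as `spaces_for_gap(10.0, 0, 100)` would divide by zero.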
Sorry for the quick patch, @MartinThoma, but we picked up a new client with "pre-OCR'd" image PDFs that contained a lot of handwritten text and this error popped up. Nothing urgent so feel free to sit on it for a bit. Just wanted to get it out there while it was top of mind.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

| Coverage Diff | main | #2417 | +/- |
|---|---|---|---|
| Coverage | 94.42% | 94.43% | |
| Files | 49 | 49 | |
| Lines | 8007 | 8008 | +1 |
| Branches | 1616 | 1616 | |
| Hits | 7561 | 7562 | +1 |
| Misses | 276 | 276 | |
| Partials | 170 | 170 | |
A float `0.0` is already falsy, and only a true zero float results in the ZeroDivisionError. In other words, the `int()` conversion isn't needed and will likely cause more harm than good.
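A small sketch of the point above, using two hypothetical helper names: a plain truthiness check already catches every float that would cause a division by zero, while wrapping the value in `int()` also discards small non-zero values that divide without error.

```python
def is_zero_by_truthiness(value: float) -> bool:
    """True only for an exact zero (0, 0.0 or -0.0)."""
    return not value


def is_zero_by_int(value: float) -> bool:
    """True for any value that truncates to 0, e.g. 0.4 or -0.9."""
    return not int(value)


# A small but non-zero font size is a valid divisor ...
small = 0.4
ratio = 1.0 / small  # no ZeroDivisionError here

# ... yet the int() variant would treat it as zero and skip it.
treated_as_zero = is_zero_by_int(small)
```

Under these assumptions, the truthiness check alone is both sufficient and less lossy.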
I guess you tested the change with a private PDF that has this property?
No worries, I will never complain about any contribution that improves pypdf 😄
Yes, sorry. The offenders currently at my disposal all contain protected health information. I'll see if I can get our client to scan something over that doesn't. If so, I'll add a test case, but I'd put the odds of them getting back to me on that at ~50/50.
Thank you! I've merged the change as it provides value and I trust that you have tested it. It will be released next Sunday at the latest. Adding a test (to the sample-files repository) will ensure that we don't re-introduce this issue.
Thanks! Sounds good.
## What's new
### Bug Fixes (BUG)
- layout mode text extraction ZeroDivisionError (#2417) by @shartzog
### Testing (TST)
- Skip tests using fpdf2 if it's not installed (#2419) by @MartinThoma

[Full Changelog](4.0.0...4.0.1)