BUG: layout mode text extraction ZeroDivisionError #2417

shartzog · 2024-01-20T04:59:24Z

For fonts without an explicitly defined width for the " " character, it's still possible to generate a ZeroDivisionError when compiling TextStateParams objects in _fixed_width_page.recurs_to_target_op() if the font size or the Tz parameter has been set to 0.

Discovered during processing of a "pre-OCR'd" image PDF having {"/BaseFont": "/GlyphLessFont"}.

Remove duplicate docstring for layout_mode_strip_rotated
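
To make the failure mode above concrete, here is a minimal sketch; it is not the actual pypdf code. The function names, the 250-unit default space width, and the `fallback` value are all hypothetical. The displacement formula is a simplified form of the one in the PDF specification, with character and word spacing assumed to be 0.

```python
def space_tx(space_width: float, font_size: float, tz: float) -> float:
    """Horizontal displacement of one space (simplified: Tc and Tw taken as 0)."""
    return (space_width / 1000.0) * font_size * (tz / 100.0)


def columns_spanned(
    displaced_tx: float,
    font_size: float,
    tz: float,
    space_width: float = 250.0,  # hypothetical default glyph width for " "
    fallback: float = 1.0,       # hypothetical substitute for a zero width
) -> float:
    """Express a horizontal displacement in units of one space width.

    A font size or Tz of 0 yields a space width of 0.0. Because 0.0 is falsy,
    `or` swaps in `fallback` before the division takes place.
    """
    width = space_tx(space_width, font_size, tz) or fallback
    return displaced_tx / width
```

Without the `or fallback` guard, `10.0 / space_tx(250.0, 0.0, 100.0)` raises `ZeroDivisionError`, which is the kind of error this PR addresses.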

shartzog · 2024-01-20T05:02:15Z

Sorry for the quick patch, @MartinThoma, but we picked up a new client with "pre-OCR'd" image PDFs that contained a lot of handwritten text and this error popped up. Nothing urgent so feel free to sit on it for a bit. Just wanted to get it out there while it was top of mind.

codecov · 2024-01-20T05:05:08Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (26b9a97) 94.42% compared to head (b460bd9) 94.43%.
Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2417   +/-   ##
=======================================
  Coverage   94.42%   94.43%           
=======================================
  Files          49       49           
  Lines        8007     8008    +1     
  Branches     1616     1616           
=======================================
+ Hits         7561     7562    +1     
  Misses        276      276           
  Partials      170      170

MartinThoma · 2024-01-20T07:00:15Z

I guess you tested the change with a private PDF that has this property?

> Sorry for the quick patch

No worries, I will never complain about any contribution that improves pypdf 😄

shartzog · 2024-01-20T19:00:57Z

> I guess you tested the change with a private PDF that has this property?

Yes, sorry. The offenders currently at my disposal all contain protected health information. I'll see if I can get our client to scan something over that doesn't. If so, I'll add a test case, but I'd put the odds of them getting back to me on that at ~50/50.

MartinThoma · 2024-01-21T10:57:29Z

Thank you!

I've merged the change as it provides value and I trust that you have tested it. It will be released next Sunday at the latest.

Adding a test (to the sample-files repository) will ensure that we don't re-introduce this issue.

shartzog · 2024-01-23T00:37:24Z

> Thank you!
>
> I've merged the change as it provides value and I trust that you have tested it. It will be released next Sunday at the latest.
>
> Adding a test (to the sample-files repository) will ensure that we don't re-introduce this issue.

Thanks! Sounds good.

## What's new

### Bug Fixes (BUG)
- layout mode text extraction ZeroDivisionError (#2417) by @shartzog

### Testing (TST)
- Skip tests using fpdf2 if it's not installed (#2419) by @MartinThoma

[Full Changelog](4.0.0...4.0.1)

Remove int wrapper

b460bd9

float 0.0 is already `falsy` and only a "true zero" float results in the ZeroDivisionError. I.e. the int() conversion isn't needed and will likely cause more harm than good.
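
A short sketch of the point made in this commit message. The helper names are hypothetical, and it assumes the removed wrapper had the form `int(x) or default`, which the message does not spell out. A plain `or` already replaces an exact 0.0, while wrapping the value in `int()` first also throws away legitimate small values and truncates larger ones.

```python
def guard_plain(value: float, default: float = 1.0) -> float:
    """Replace only an exact zero; every non-zero float passes through unchanged."""
    return value or default


def guard_with_int(value: float, default: float = 1.0) -> float:
    """Hypothetical int()-wrapped variant: int() truncates toward zero first,
    so any value strictly between -1 and 1 is also replaced by `default`."""
    return int(value) or default
```

For example, `guard_plain(0.5)` keeps `0.5`, whereas `guard_with_int(0.5)` silently returns the default, and `guard_with_int(12.7)` returns `12` rather than `12.7`.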

MartinThoma approved these changes Jan 21, 2024

MartinThoma merged commit 9e494c6 into py-pdf:main Jan 21, 2024
15 checks passed

MartinThoma deleted the layout-mode-zero-div-patch branch January 21, 2024 10:56
