-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Extract Text Enhancement (whitespaces) #1084
Conversation
2 types of fix : a) ignore comments - starting with % b) extend utf16 for min of 4 characters testfile to be added checking https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf. checking pages [0, 5, 8, 9, 12, 13, 14, 15, 17, 18, 20, 23, 24, 26, 27, 29, 30, 35, 38, 50, 52, 60, 61, 72, 87, 88, 90, 93, 94, 96, 99, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 118, 123, 124, 125, 126, 128, 129, 132, 133, 156, 160, 162, 178, 189, 195, 198, 235, 254, 255, 256, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 275, 276, 277, 278, 280, 281, 282, 283, 285, 287, 290, 292, 293, 295, 296, 297, 298, 299, 301, 302, 306, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 326, 327, 329, 330, 331, 332, 333, 336, 338, 339, 340, 341, 342, 344, 345, 346, 367, 368, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 381, 382, 383, 384, 385, 386, 387, 388, 404, 405, 406, 414, 415, 416, 417, 419, 420]
Codecov Report
@@ Coverage Diff @@
## main #1084 +/- ##
==========================================
+ Coverage 91.64% 91.94% +0.30%
==========================================
Files 24 24
Lines 4559 4642 +83
Branches 932 957 +25
==========================================
+ Hits 4178 4268 +90
+ Misses 241 229 -12
- Partials 140 145 +5
Continue to review full report at Codecov.
|
Running the benchmark, we have one outlier which is way worse and the rest being similar:
|
Hence PyPDF2 now extracts a lot of |
@pubpub-zz Should I test again? I'm always excited when I see a PR from you 😄 |
You can : just normally to fix the "underscores" |
The latest results:
|
@pubpub-zz To me, this looks good 👍 If you think it's ready, I would merge it :-) |
Let me check tonight it covers most/all opened issues remaining |
Please have a look at your first comment - I've edited it to contain what I would use as a squash commit message. Feel free to adjust it :-) |
Let me check tonight it covers most/all opened issues remaining |
Interestingly, the multiplication error didn't make a noticable difference
|
Don't think I will be able to go further for the moment. |
Nice! Thank you so much! I'll merge tomorrow morning :-) |
@pubpub-zz Your PR is now part of Thank you for improving the text extraction once again 🤗 |
* ENH : extract width from CIDFontType0/2 * ENH : improve cr/lf and space extraction * BUG : fix error in decoding py-pdf#1075 * FIX: in ToUnicode ignore comments (starting with %) * FIX: extend utf16 for min of 4 characters Improves py-pdf#234 Improves py-pdf#957 Closes py-pdf#1003 Closes py-pdf#1019 Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing
New Features (ENH): - Add color and font_format to PdfReader.outlines[i] (#1104) - Extract Text Enhancement (whitespaces) (#1084) Bug Fixes (BUG): - Use `build_destination` for named destination outlines (#1128) - Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118) - Prevent deduplication of PageObject (#1105) - None-check in DictionaryObject.read_from_stream (#1113) - Avoid IndexError in _cmap.parse_to_unicode (#1110) Documentation (DOC): - Explanation for git submodule - Watermark and stamp (#1095) Maintenance (MAINT): - Text extraction improvements (#1126) - Destination.color returns ArrayObject instead of tuple as fallback (#1119) - Use add_bookmark_destination in add_bookmark (#1100) - Use add_bookmark_destination in add_bookmark_dict (#1099) Testing (TST): - Remove xfail from test_outline_title_issue_1121 - Add test for arab text (#1127) - Add xfail for decryption fail (#1125) - Add xfail test for IndexError when extracting text (#1124) - Add MCVE showing outline title issue (#1123) Code Style (STY): - Apply black and isort - Use IntFlag for permissions_flag / update_page_form_field_values (#1094) - Simplify code (#1101) Full Changelog: 2.5.0...2.6.0
if isinstance(operands[0], str): | ||
text += operands[0] | ||
else: | ||
t: str = "" | ||
tt: bytes = ( | ||
encode_pdfdocencoding(operands[0]) | ||
if isinstance(operands[0], str) | ||
else operands[0] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
second isinstance(operands[0], str)
branch looks unreachable here (since moved over here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Do you want to submit a corresponding PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stefan6419846 At least not immediately. Clearly there was or is something to that code that I do not understand enough to just delete it from a version that no longer contains the explanation - #2440 was merged much later!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@biredel
I'm confused :
you are looking at an obsolete branch (PyPDF2) instead of pypdf.
The code seems to be this one
https://github.com/py-pdf/pypdf/blob/a435eaaa08c71e3f66320edd06be24637ef32986/pypdf/_text_extraction/__init__.py#L225C18-L234C18
Codecov indicates some test coverage.
Improves #234
Improves #957
Closes #1003
Closes #1019
Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing