BUG: Issue in text extraction (spaces) (#1153) #2882
This reverts commit 5400f5a.
BUG: Missing spaces in extract_text() method (py-pdf#1328)
add test
…nt size comparison to ratio
Co-authored-by: Stefan <[email protected]>
…he assertion process
I forgot one more thing. I don't understand what the value at pypdf/pypdf/_text_extraction/__init__.py, line 136 in c8220c6, is for.
This was a long time ago, and I must admit that I cannot remember why this value was chosen. I will try to recall.
Sorry, I added the code for CIDFont, but it seems to need mapping to […]. Also, I am recalculating the font each time, which is very inefficient. Vertical writing does not seem to be supported, but since that limitation appears to come from the original code, I don't intend to address it in this PR.
CIDFont is now supported, based on the PDF from the issue "Spaces (that do not exist in the original PDF) appear in the output of extract_text()" (#2336). The following will remain, but the spaces are moving correctly: because the letters are shaved, the distance traveled is greater than the letter widths, so spaces appear.
Expected:
This is probably analogous to the following problem. Recalculating the font each time will be addressed later.
I don't know how to handle PdfObject values, and I would appreciate some help. Most of the old code does not take them into account, and I can fix the rest once I have one pattern to follow. Are there any rules I should follow when casting PdfObjects? I would be grateful for tips on how to get started.
I noticed that when building the combination of character map and widths, there are cases where the value after character-map conversion is not a single character. The current tests pass, but this is probably a bug. To fix it, we need to check the string length before conversion instead of after conversion, which is what this fix currently does. In conjunction with this, I think including the width in the contents of the cmap would speed up processing and save memory. We do not plan to include this in the current modification, along with the vertical character support (for […]).
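The before/after distinction can be illustrated with a minimal sketch. It does not use pypdf's internal structures: `CMAP`, `WIDTHS`, `DEFAULT_WIDTH` and both functions are hypothetical names, and the font data is invented for the example. Code `0x02` stands for a ligature that decodes to three characters.

```python
# Hypothetical font data, not taken from pypdf.
# CMAP maps a character code to its decoded text; WIDTHS maps a
# character code to its glyph width in 1/1000 text-space units.
CMAP = {0x01: "H", 0x02: "ffi", 0x03: "o"}  # 0x02 is a ligature
WIDTHS = {0x01: 722.0, 0x02: 833.0, 0x03: 500.0}
DEFAULT_WIDTH = 500.0


def width_after_conversion(codes):
    """Measure per decoded character (the problematic variant).

    Widths can only be keyed by single decoded characters, so the
    ligature is counted as three default-width glyphs.
    """
    per_char = {
        text: WIDTHS[code] for code, text in CMAP.items() if len(text) == 1
    }
    decoded = "".join(CMAP[code] for code in codes)
    return sum(per_char.get(ch, DEFAULT_WIDTH) for ch in decoded)


def width_before_conversion(codes):
    """Measure per character code, before cmap conversion."""
    return sum(WIDTHS.get(code, DEFAULT_WIDTH) for code in codes)


codes = [0x01, 0x02, 0x03]
print(width_after_conversion(codes))   # 2722.0: ligature overestimated
print(width_before_conversion(codes))  # 2055.0: one width per glyph
```

With this data the post-conversion measurement is too wide, which would shift the position where the next glyph is expected and could hide or invent a space.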
This is documented in the PDF reference and mapped to actual classes by pypdf. As a general rule of thumb (at least for new code): Besides the expected types, we might have additional […]
@stefan6419846 Thanks a lot!
@stefan6419846 I have completed the corrections you indicated.
Consider the case of a move for a new font: I have confirmed that the following five issues have been fixed in this bug fix.
Sorry, there was a coverage error, so I will delete the unnecessary lines.
I fixed it.
@stefan6419846 @pubpub-zz
How much of the existing code from this PR would stay the same? If it is just an extension of the code from this PR, I would propose to merge this PR first (which already is an improvement) and do the further improvements in a separate PR to make reviewing easier.
@stefan6419846 Thank you! I tried to test the changes as well, but there were too many of them. I will open a PR once the merge is complete.
It looks like #234 could be closed too.
Thanks for working on this. I am going to merge this as-is for now and defer possible updates to a later PR. For now, all tests pass and at first look the performance did not degrade substantially.
This is a fix for a problem introduced by the changes in #2882. The string length of the characters was checked after cmap conversion, but there are cases where the converted string is more than one character long, so it cannot be measured accurately. This matters, for example, when deciding whether to measure the distance from the ligature or from the base characters corresponding to the ligature when fixing #1351.
The change in handle_tj is needed because the function otherwise fails Ruff's check. Error: PLR0915 Too many statements (nnn > 176)
The following code is only used to get the character code for a space. However, I think it would be better to split out the code for obtaining the character code. Style changes will be handled in another PR.
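As a rough illustration of splitting out the space-code lookup, here is a minimal sketch. The helper `find_space_code` and the plain `dict` cmap are assumptions made for this example and are not pypdf's actual API or data layout.

```python
from typing import Dict


def find_space_code(cmap: Dict[int, str], default: int = 32) -> int:
    """Return the character code that decodes to a single space.

    Hypothetical helper: `cmap` maps character codes to decoded text.
    If no code decodes to " ", fall back to `default` (32, the ASCII
    space, is an assumption for this sketch).
    """
    for code, text in cmap.items():
        if text == " ":
            return code
    return default


# A font whose space lives at code 3 rather than 32.
print(find_space_code({3: " ", 36: "A"}))  # 3
# A font with no space glyph mapped at all.
print(find_space_code({36: "A"}))          # 32
```

Keeping the lookup in its own function also reduces the statement count of the caller, which is the kind of change a PLR0915 warning asks for.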
## What's new

### New Features (ENH)
- Add `layout_mode_font_height_weight` argument to `PageObject.extract_text()` (#2920) by @hpierre001

### Bug Fixes (BUG)
- Fix font specifier for FreeText annotation (#2893) by @ssjkamei
- Line breaks are not generated due to incorrect calculation of text leading (#2890) by @ssjkamei
- Improve handling of spaces in text extraction (#2882) by @ssjkamei

### Robustness (ROB)
- Soft failure for flate encode image mode 1 with wrong LUT size (#2900) by @stefan6419846

### Documentation (DOC)
- Use latest package versions (#2907) by @stefan6419846
- Correct example of reading FileAttachment annotation (#2906) by @j-t-1

### Developer Experience (DEV)
- Update pinned requirements (#2918) by @stefan6419846
- Make make_release.py compatible with Windows environment (#2894) by @pubpub-zz

### Maintenance (MAINT)
- Remove references to outdated Python versions (#2919) by @stefan6419846
- Generalize the method of obtaining space_code (#2891) by @ssjkamei
- Unnecessary character mapping process (#2888) by @ssjkamei
- New LZW decoding implementation (#2887) by @MartinThoma

### Testing (TST)
- Add LzwCodec for encoding (#2883) by @MartinThoma

### Code Style (STY)
- Capitalize error messages (#2903) by @j-t-1
- Modify error messages in PdfWriter (#2902) by @j-t-1

[Full Changelog](5.0.1...5.1.0)
Closes #1153
The change `if abs(moved_height) > 0.8 * f:` at line 129 of the `crlf_space_check` function makes the function look at both downward and upward misalignment. If you prefer not to change it here, I will commit the reverted code.
The test for hello-world.pdf fails.
I think a hidden bug has surfaced, but I have not read enough of the PDF 1.7 specification to determine how to address it.
I am judging by the position specified by `Td` and `Tm`, but perhaps I am treating both as absolute coordinates, which causes the misalignment. I don't know how to fix this properly.
Target String:
สวัสดีชาวโลก
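For reference, the PDF specification defines `Tm` as replacing the text line matrix outright, while `Td` translates relative to the current text line matrix. The sketch below models only that difference. The class `TextState` and the helpers `multiply` and `exceeds_line_threshold` are hypothetical names, not pypdf code, and the sketch ignores the CTM, font size scaling and text rise.

```python
# Matrices use the PDF six-number form (a, b, c, d, e, f), standing for
# [[a, b, 0], [c, d, 0], [e, f, 1]].
IDENTITY = (1, 0, 0, 1, 0, 0)


def multiply(m, n):
    """Return the matrix product m x n in six-number form."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


class TextState:
    """Tracks only the text line matrix (hypothetical, simplified)."""

    def __init__(self):
        self.tlm = IDENTITY

    def op_tm(self, a, b, c, d, e, f):
        # Tm is absolute: it replaces the matrix.
        self.tlm = (a, b, c, d, e, f)

    def op_td(self, tx, ty):
        # Td is relative: translate, then apply the current line matrix.
        self.tlm = multiply((1, 0, 0, 1, tx, ty), self.tlm)

    def origin(self):
        return self.tlm[4], self.tlm[5]


def exceeds_line_threshold(moved_height, font_height, ratio=0.8):
    """Same shape as the check quoted above; abs() catches both directions."""
    return abs(moved_height) > ratio * font_height


state = TextState()
state.op_tm(1, 0, 0, 1, 100, 700)
old_y = state.origin()[1]
state.op_td(0, -14)
print(state.origin())  # (100, 686); reading Td as absolute would give (0, -14)
print(exceeds_line_threshold(state.origin()[1] - old_y, 12))  # True
```

If the text matrix carries a scale, for example `2 0 0 2 100 700 Tm`, the same `0 -14 Td` moves the origin by 28 units, so a displacement computed from the raw `Td` operands would not match one computed from `Tm` positions.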