BUG: Issue in text extraction (spaces) (#1153) #2882

ssjkamei · 2024-09-28T10:21:44Z

The change to if abs(moved_height) > 0.8 * f: at line 129 of the crlf_space_check function makes the function consider both upward and downward misalignment. If you prefer not to change it here, I will commit the reverted code.

The test for hello-world.pdf fails.
I think a hidden bug has surfaced, but I have not read enough of the PDF 1.7 specification to determine how to address it.

I determine the position from the values given by Td and Tm, but I may be treating both as absolute coordinates, which would cause a misalignment. I do not know how to correct this properly.

Target String: สวัสดีชาวโลก
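For context on the Td/Tm question above: in the PDF text model, Tm replaces the text matrix outright, while Td translates relative to the current text line matrix, so its operands are offsets in text space and not page coordinates. The following is a minimal sketch of that rule, not pypdf's code; the helper names multiply, set_tm and move_td are made up for illustration, and matrices are written as [a, b, c, d, e, f].

```python
def multiply(m, n):
    """Multiply two PDF matrices stored as [a, b, c, d, e, f]."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


def set_tm(a, b, c, d, e, f):
    """Tm: the operands become the new text (line) matrix as-is."""
    return [a, b, c, d, e, f]


def move_td(line_matrix, tx, ty):
    """Td: translate by (tx, ty) relative to the current text line matrix."""
    return multiply([1, 0, 0, 1, tx, ty], line_matrix)


# Hypothetical operator sequence: "1 0 0 1 100 700 Tm", then "0 -14 Td".
tm = set_tm(1, 0, 0, 1, 100, 700)
tm = move_td(tm, 0, -14)
position = (tm[4], tm[5])  # (100, 686)

# With a scaled matrix, the Td offsets are scaled as well.
scaled = move_td(set_tm(2, 0, 0, 2, 100, 700), 10, -5)
scaled_position = (scaled[4], scaled[5])  # (120, 690)
```

Comparing a Td operand directly with a Tm translation therefore mixes a relative offset with an absolute position, which would produce the kind of misalignment described.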

ssjkamei · 2024-09-28T10:25:49Z

I forgot one more thing. I don't understand what the 15 specified here means after all.

pypdf/pypdf/_text_extraction/__init__.py

Line 136 in c8220c6

and abs(delta_x) > spacewidth * f * 15

pubpub-zz · 2024-09-28T11:22:08Z

I forgot one more thing. I don't understand what the 15 specified here means after all.

pypdf/pypdf/_text_extraction/__init__.py

Line 136 in c8220c6

and abs(delta_x) > spacewidth * f * 15

This was a long time ago, and I must admit that I cannot remember why this value was chosen. I will try to recall.

ssjkamei · 2024-09-28T15:10:16Z

Sorry, I added the code for CIDFont, but it seems to need mapping to cmap[1].

Also, I am calculating the font each time, which is very inefficient.

Vertical writing does not seem to be supported, but since that limitation comes from the original code, I do not intend to address it in this PR.

ssjkamei · 2024-09-28T21:02:11Z

CIDFont is supported based on the PDF in the following Issue.

Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336
I think we can close #2336 as well.

The following difference will remain, but the spaces are placed correctly: because some letters are dropped, the distance moved is greater than the width of the extracted letters, so spaces appear.

Expected: R. Ottokar Doerffel, 1489, Atiradores
Actual: R. O okar Doerﬀ el, 1489, Ar adores

This is probably analogous to the following problem.
Ligature issue when converting PDF to text #1351
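To illustrate why ligatures complicate this, here is a small sketch using only the Python standard library. It shows that a ligature such as U+FB00 is a single character standing for two letters, and that Unicode compatibility normalization can expand it after extraction. This is a post-processing illustration under that assumption, not something this PR does.

```python
import unicodedata

# "Doerffel" with the ff ligature (U+FB00) in place of the two letters.
extracted = "Doer\ufb00el"

# NFKC applies compatibility decomposition, which expands the ligature.
expanded = unicodedata.normalize("NFKC", extracted)

# One glyph in the PDF corresponds to two letters in the expanded text,
# so character counts before and after expansion differ.
length_before = len(extracted)
length_after = len(expanded)
```

Any width calculation that counts characters after such an expansion would attribute two glyph widths to what the PDF draws as one glyph.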

The repeated font calculation will be addressed later.

ssjkamei · 2024-10-01T14:16:40Z

I don't know how to handle PdfObject and would like some help. Most of the old code does not take it into account, and I can fix the rest once I have one pattern to follow.

Are there any rules to follow when casting PdfObjects?
If I refer to _data_structures.py, does it list all the types, so that I only need to check every type that can be cast?
Or is there a document that specifies which types /W and /Widths can have?

I would be grateful for tips on how to get started.

ssjkamei · 2024-10-02T04:00:10Z

I noticed that, when combining the character map with the widths, there are cases where the value after character map conversion is not a single character. The current tests pass, but this is probably a bug.

To fix this, we need to change from checking the string length after conversion, which is what this fix does, to checking the string length before conversion. In conjunction with this, I think including the width in the contents of the cmap will speed up the process and save memory.
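The idea of measuring before conversion can be sketched as follows. This is a simplified illustration, not the code in this PR: the function text_and_width and the dictionaries cmap and widths are hypothetical stand-ins, keyed by character code.

```python
def text_and_width(codes, cmap, widths, default_width=1000):
    """Decode character codes and sum glyph widths per code, not per decoded character."""
    decoded = "".join(cmap.get(code, "") for code in codes)
    total_width = sum(widths.get(code, default_width) for code in codes)
    return decoded, total_width


# Hypothetical font: code 0x02 is a single glyph that decodes to two characters.
cmap = {0x01: "R", 0x02: "ff", 0x03: "i"}
widths = {0x01: 600, 0x02: 550, 0x03: 250}

decoded, total_width = text_and_width([0x01, 0x02, 0x03], cmap, widths)
# Three glyphs were drawn, but the decoded string has four characters.
glyph_count, char_count = 3, len(decoded)
```

Because the widths are looked up by code, the multi-character mapping for 0x02 does not distort the measured distance.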

I do not plan to include this in the current change, nor the vertical writing support (/W2 and /DW2).

stefan6419846 · 2024-10-02T08:55:53Z

Or is there a document that can determine what types /W and /Widths can have?

This is documented inside the PDF reference and mapped into actual classes by pypdf. As a general rule of thumb (at least for new code): Besides the expected types, we might have additional IndirectObject references as well.
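As a rough sketch of that rule of thumb applied to a CIDFont /W array: the PDF reference defines two entry forms, c [w1 w2 ... wn] and c_first c_last w, and any element may be an indirect reference. The helper names resolve and parse_w_array below are hypothetical, and resolution is duck-typed on a get_object() method so the example runs without pypdf; it is an illustration, not the implementation in this PR.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def parse_w_array(w):
    """Turn a /W array into a {cid: width} dictionary."""
    items = [resolve(item) for item in resolve(w)]
    widths = {}
    i = 0
    while i + 1 < len(items):
        first = int(items[i])
        second = items[i + 1]
        if isinstance(second, (list, tuple)):
            # Form 1: c [w1 w2 ... wn] assigns widths to consecutive CIDs.
            for offset, width in enumerate(second):
                widths[first + offset] = float(resolve(width))
            i += 2
        else:
            # Form 2: c_first c_last w assigns one width to a CID range.
            if i + 2 >= len(items):
                break  # malformed trailing entry; ignore it
            width = float(items[i + 2])
            for cid in range(first, int(second) + 1):
                widths[cid] = width
            i += 3
    return widths


example = parse_w_array([1, [500, 600], 10, 12, 250])
```

With pypdf objects, one would still need to confirm which concrete classes appear for each entry; the sketch only assumes list-like arrays and number-like values.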

ssjkamei · 2024-10-02T11:48:33Z

@stefan6419846 Thanks a lot!

ssjkamei · 2024-10-02T14:23:26Z

@stefan6419846 I have completed the corrections you indicated.
Other omissions regarding single point heights have been added.

if abs(moved_height) > 0.8 * str_height * scale_prev_y:

Consider the case of a move for a new font: if abs(moved_height) > 0.8 * min(str_height * scale_prev_y, font_size * scale_y):
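The effect of taking the minimum of the two heights can be shown with a standalone sketch. The function is_line_break is a hypothetical wrapper around the condition quoted above, reusing its variable names; it is not the actual crlf_space_check code.

```python
def is_line_break(moved_height, str_height, scale_prev_y, font_size, scale_y, ratio=0.8):
    """Return True if the vertical move exceeds the threshold for either font."""
    # Use the smaller of the previous text height and the new font height,
    # so a switch to a smaller font is not hidden by a large previous font.
    threshold = ratio * min(str_height * scale_prev_y, font_size * scale_y)
    return abs(moved_height) > threshold


# Same font, full line advance: treated as a line break.
same_font_break = is_line_break(-12, 12, 1, 12, 1)

# Small shift such as a superscript: not a line break.
small_shift = is_line_break(3, 12, 1, 12, 1)

# Previous text was 24 units high, new font is 8 units: a move of 7 now counts,
# whereas 0.8 * 24 alone would have required more than 19.2.
font_change_break = is_line_break(7, 24, 1, 8, 1)
```

Using abs() on the move is what makes the check apply to both upward and downward shifts.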

I have confirmed that the following five issues have been fixed in this bug fix.
#1153, #1362, #1974, #2336, #2777

ssjkamei · 2024-10-02T14:26:55Z

Sorry, there was a coverage error, so I will delete the unnecessary lines.

ssjkamei · 2024-10-02T14:39:17Z

I fixed it.

ssjkamei · 2024-10-03T03:06:03Z

@stefan6419846 @pubpub-zz
I am sorry to bring this up after you have already reviewed the code, but I have found a way to handle the following without including the width in the cmap.
The code is easier to understand and the process is more efficient. Would it be better to include this as well?

To fix this, we need to change from checking the string length after conversion, which is what this fix does, to checking the string length before conversion. In conjunction with this, I think including the width in the contents of the cmap will speed up the process and save memory.

stefan6419846 · 2024-10-03T08:46:21Z

How much of the existing code from this PR would stay the same? If it is just an extension of the code from this PR, I would propose to merge this PR first (which already is an improvement) and do the further improvements in a separate PR to make reviewing easier.

ssjkamei · 2024-10-03T09:20:24Z

@stefan6419846 Thank you! I tried out the further changes, but there are too many of them. I will open a separate PR once this one is merged.

ssjkamei · 2024-10-03T12:26:30Z

It looked like #234 could be closed too.

stefan6419846

Thanks for working on this. I am going to merge this as-is for now and defer possible updates to a later PR. For now, all tests pass and at first look the performance did not degrade substantially.

This is a fix for a problem introduced by the changes in #2882. The string length of characters was checked after conversion by the cmap, but after conversion there are cases where the result is more than one character long, so it cannot be measured accurately. This matters, for example, when deciding whether to measure the distance from the ligature or from the base characters corresponding to the ligature when fixing #1351.

The change in handle_tj is needed because the function otherwise does not pass Ruff's check:

Error: PLR0915 Too many statements (nnn > 176)

The following code is only used to get the character code for a space. However, I think it would be better to split the code for obtaining the character code into separate parts. Style changes will be handled in another PR.

@hpierre001

## What's new

### New Features (ENH)
- Add `layout_mode_font_height_weight` argument to `PageObject.extract_text()` (#2920) by @hpierre001

### Bug Fixes (BUG)
- Fix font specificier for FreeText annotation (#2893) by @ssjkamei
- Line breaks are not generated due to incorrect calculation of text leading (#2890) by @ssjkamei
- Improve handling of spaces in text extraction (#2882) by @ssjkamei

### Robustness (ROB)
- Soft failure for flate encode image mode 1 with wrong LUT size (#2900) by @stefan6419846

### Documentation (DOC)
- Use latest package versions (#2907) by @stefan6419846
- Correct example of reading FileAttachment annotation (#2906) by @j-t-1

### Developer Experience (DEV)
- Update pinned requirements (#2918) by @stefan6419846
- Make make_release.py compatible with Windows environment (#2894) by @pubpub-zz

### Maintenance (MAINT)
- Remove references to outdated Python versions (#2919) by @stefan6419846
- Generalize the method of obtaining space_code (#2891) by @ssjkamei
- Unnecessary character mapping process (#2888) by @ssjkamei
- New LZW decoding implementation (#2887) by @MartinThoma

### Testing (TST)
- Add LzwCodec for encoding (#2883) by @MartinThoma

### Code Style (STY)
- Capitalize error messages (#2903) by @j-t-1
- Modify error messages in PdfWriter (#2902) by @j-t-1

[Full Changelog](5.0.1...5.1.0)

ssjkamei and others added 13 commits September 24, 2024 13:07

BUG: Missing spaces in extract_text() method (py-pdf#1328)

5400f5a

Revert "BUG: Missing spaces in extract_text() method (py-pdf#1328)"

aac0436

This reverts commit 5400f5a.

BUG: Missing spaces in extract_text() method (py-pdf#1328)

64b1c92

BUG: Missing spaces in extract_text() method (py-pdf#1328) add test

70e9b38

Revert "BUG: Missing spaces in extract_text() method (py-pdf#1328)"

65224e1

This reverts commit 5400f5a. BUG: Missing spaces in extract_text() method (py-pdf#1328) BUG: Missing spaces in extract_text() method (py-pdf#1328) add test

Merge branch 'main' of https://github.com/ssjkamei/pypdf

788d56d

BUG: Missing spaces in extract_text() method (py-pdf#1328) Convert fo…

f6dcb43

…nt size comparison to ratio

Correction to new file URL.

fd1c489

Co-authored-by: Stefan <[email protected]>

BUG: Missing spaces in extract_text() method (py-pdf#1328) calculatio…

2873b9e

…n efficiency

BUG: Missing spaces in extract_text() method (py-pdf#1328) Simplify t…

7597704

…he assertion process

Merge branch 'py-pdf:main' into main

4a2afe9

BUG: Issue in text extraction (spaces) (py-pdf#1153)

fb4de41

BUG: Issue in text extraction (spaces) (py-pdf#1153) add test

373eaec

style: Correcting code style issues

066f594

ssjkamei added 2 commits September 28, 2024 21:18

Text position return support

d406e23

Add code for CIDFont

d338e18

Added horizontal CIDFont calculation code

f7c4236

ssjkamei added 9 commits September 29, 2024 06:36

Style: Correcting code style issues

a32fbc9

Integrate font width calculation and space width calculation

a237f2d

Font width map and space width acquisition process separation

e159e4d

Revert to original adjustment space width

a19a8f4

Supports diagonal travel distance

6dbda50

Font size defaults to twice the space

34efe52

Get the default space width from the argument

52aa7ac

fix self-made bugs

7a028bb

Style: Correcting code style issues

f02fa23

Update pypdf/_text_extraction/__init__.py

20a6883

Co-authored-by: Stefan <[email protected]>

Exception code omitted

e6132fa

ssjkamei added 2 commits October 2, 2024 20:42

Style: Correcting code style issues

9a82eb8

Style: Correcting code style issues

d4f1835

ssjkamei added 3 commits October 2, 2024 21:22

fix self-made bugs

96fcf7c

fix self-made bugs

780a632

Insufficient height consideration for front and rear fonts

ce11d0d

style: Correcting code style issues

03eb1cb

ssjkamei mentioned this pull request Oct 2, 2024

Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336

Closed

ssjkamei mentioned this pull request Oct 3, 2024

Random whitespaces are inserted when using page.extract_text() #1507

Closed

stefan6419846 approved these changes Oct 3, 2024

stefan6419846 merged commit d5233a0 into py-pdf:main Oct 3, 2024
16 checks passed

This was referenced Oct 3, 2024

Space regression by PR 1172 #1362

Closed

New line character missing and URLs adding periods and space #1974

Closed

New lines no longer included in extract_text() on 4.3 for a specific PDF file #2777

Closed

BUG: Added line-breaks at dashes #234

Closed

ssjkamei mentioned this pull request Oct 4, 2024

MAINT: Unnecessary character mapping process #2888

Merged
