Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Extract Text Enhancement (whitespaces) #1084

Merged
merged 11 commits into from
Jul 13, 2022
Merged

Conversation

pubpub-zz
Copy link
Collaborator

@pubpub-zz pubpub-zz commented Jul 9, 2022

  • ENH : extract width from CIDFontType0/2
  • ENH : improve cr/lf and space extraction
  • BUG : fix error in decoding error in decoding #1075
  • FIX: in ToUnicode ignore comments (starting with %)
  • FIX: extend utf16 for min of 4 characters

Improves #234
Improves #957
Closes #1003
Closes #1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing

2 types of fix :
a) ignore comments  - starting with %
b) extend utf16 for min of 4 characters

testfile to be added checking https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf.
checking pages [0, 5, 8, 9, 12, 13, 14, 15, 17, 18, 20, 23, 24, 26, 27, 29, 30, 35, 38, 50, 52, 60, 61, 72, 87, 88, 90, 93, 94, 96, 99, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 118, 123, 124, 125, 126, 128, 129, 132, 133, 156, 160, 162, 178, 189, 195, 198, 235, 254, 255, 256, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 275, 276, 277, 278, 280, 281, 282, 283, 285, 287, 290, 292, 293, 295, 296, 297, 298, 299, 301, 302, 306, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 326, 327, 329, 330, 331, 332, 333, 336, 338, 339, 340, 341, 342, 344, 345, 346, 367, 368, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 381, 382, 383, 384, 385, 386, 387, 388, 404, 405, 406, 414, 415, 416, 417, 419, 420]
PyPDF2/_page.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jul 9, 2022

Codecov Report

Merging #1084 (5715653) into main (af5a0c3) will increase coverage by 0.30%.
The diff coverage is 92.74%.

@@            Coverage Diff             @@
##             main    #1084      +/-   ##
==========================================
+ Coverage   91.64%   91.94%   +0.30%     
==========================================
  Files          24       24              
  Lines        4559     4642      +83     
  Branches      932      957      +25     
==========================================
+ Hits         4178     4268      +90     
+ Misses        241      229      -12     
- Partials      140      145       +5     
Impacted Files Coverage Δ
PyPDF2/_page.py 92.13% <91.48%> (-0.61%) ⬇️
PyPDF2/_cmap.py 93.54% <96.66%> (+9.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update af5a0c3...5715653. Read the comment docs.

@MartinThoma MartinThoma changed the title Extract Text Enhancement ENH: Extract Text Enhancement Jul 10, 2022
@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Jul 10, 2022
@MartinThoma
Copy link
Member

Running the benchmark, we have one outlier which is way worse and the rest being similar:

    "1601.03642": 98.8% -> 98.9%  +0.1
    "1602.06541": 98.0% -> 97.8%  -0.2
    "1707.09725": 94.0% -> 94.1%  +0.1
    "2201.00021": 96.7% -> 96.5%  -0.2
    "2201.00022": 97.2% -> 97.3%  +0.1
    "2201.00029": 97.6% -> 96.7%  -0.9 ---
    "2201.00037": 93.8% -> 94.0%  +0.2
    "2201.00069": 96.3% -> 86.7%  -9.6 !!!!
    "2201.00151": 93.3% -> 93.8%  +0.5
    "2201.00178": 92.9% -> 93.1%  +0.2
    "2201.00200": 97.4% -> 97.4%  ----
    "2201.00201": 98.2% -> 98.2%  -----
    "2201.00214": 97.2% -> 97.4%  +0.2
  "GeoTopo-book": 86.5% -> 87.1%  +0.7

@MartinThoma
Copy link
Member

Changes where it improved:

image

@MartinThoma
Copy link
Member

Where it became so much worse:

image

@MartinThoma
Copy link
Member

Hence PyPDF2 now extracts a lot of characters.

@MartinThoma
Copy link
Member

@pubpub-zz Should I test again? I'm always excited when I see a PR from you 😄

@pubpub-zz
Copy link
Collaborator Author

You can : just normally to fix the "underscores"
However some commits may have been undone

@MartinThoma
Copy link
Member

The latest results:

            "1601.03642": 98.8% -> 98.9% +0.1
            "1602.06541": 98.0% -> 97.8% -0.2
            "1707.09725": 94.0% -> 94.1% +0.1
            "2201.00021": 96.7% -> 96.5% -0.2
            "2201.00022": 97.2% -> 97.3% +0.1
            "2201.00029": 97.6% -> 96.7% -0.9
            "2201.00037": 93.8% -> 94.0% +0.2
            "2201.00069": 96.3% -> 96.4% +0.1
            "2201.00151": 93.3% -> 93.8% +0.5
            "2201.00178": 92.9% -> 93.1% +0.2
            "2201.00200": 97.4% -> 97.4% ---
            "2201.00201": 98.2% -> 98.2% ----
            "2201.00214": 97.2% -> 97.4% +0.2
          "GeoTopo-book": 86.5% -> 87.1% +0.6

@MartinThoma
Copy link
Member

The worst:

image

The best: Having a space between text and math content (formulas and variables)

image

@MartinThoma
Copy link
Member

@pubpub-zz To me, this looks good 👍 If you think it's ready, I would merge it :-)

@pubpub-zz
Copy link
Collaborator Author

Let me check tonight it covers most/all opened issues remaining
Now the list is limited it should be quick 😊

@MartinThoma
Copy link
Member

Please have a look at your first comment - I've edited it to contain what I would use as a squash commit message. Feel free to adjust it :-)

@MartinThoma MartinThoma changed the title ENH: Extract Text Enhancement ENH: Extract Text Enhancement (whitespaces) Jul 11, 2022
@pubpub-zz
Copy link
Collaborator Author

Let me check tonight it covers most/all opened issues remaining
Now the list is limited it should be quick 😊

@MartinThoma
Copy link
Member

MartinThoma commented Jul 12, 2022

Interestingly, the multiplication error didn't make a noticable difference

            "1601.03642": 98.8 -> 98.9% +0.1
            "1602.06541": 98.0 -> 97.8% -0.2
            "1707.09725": 94.0 -> 94.1% +0.1
            "2201.00021": 96.7 -> 96.5% -0.2
            "2201.00022": 97.2 -> 97.3% +0.1
            "2201.00029": 97.6 -> 96.7% -0.9
            "2201.00037": 93.8 -> 94.0% +0.2
            "2201.00069": 96.3 -> 96.4% +0.1
            "2201.00151": 93.3 -> 93.8% +0.5
            "2201.00178": 92.9 -> 93.1% +0.2
            "2201.00200": 97.4 -> 97.4% ---
            "2201.00201": 98.2 -> 98.2% ---
            "2201.00214": 97.2 -> 97.4% +0.2
          "GeoTopo-book": 86.5 -> 87.1% +0.6

@pubpub-zz
Copy link
Collaborator Author

Don't think I will be able to go further for the moment.
I've having a look on #1091. out of the "Float Object" I'm getting poor results (eg on page 1) but I do not understand how this file can be interpreted correctly.
For me I'm good to merge

@MartinThoma
Copy link
Member

Nice! Thank you so much! I'll merge tomorrow morning :-)

@MartinThoma MartinThoma merged commit 682eff9 into py-pdf:main Jul 13, 2022
@MartinThoma
Copy link
Member

@pubpub-zz Your PR is now part of main and will be in PyPDF2==2.5.1. I will likely release on Sunday the 17.07.2022.

Thank you for improving the text extraction once again 🤗

mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this pull request Jul 15, 2022
* ENH : extract width from CIDFontType0/2
* ENH  : improve cr/lf and space extraction
* BUG : fix error in decoding py-pdf#1075
* FIX: in ToUnicode  ignore comments (starting with %)
* FIX: extend utf16 for min of 4 characters

Improves py-pdf#234
Improves py-pdf#957
Closes py-pdf#1003
Closes py-pdf#1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing
MartinThoma added a commit that referenced this pull request Jul 17, 2022
New Features (ENH):
-  Add color and font_format to PdfReader.outlines[i] (#1104)
-  Extract Text Enhancement (whitespaces) (#1084)

Bug Fixes (BUG):
-  Use `build_destination` for named destination outlines (#1128)
-  Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)
-  Prevent deduplication of PageObject (#1105)
-  None-check in DictionaryObject.read_from_stream (#1113)
-  Avoid IndexError in _cmap.parse_to_unicode (#1110)

Documentation (DOC):
-  Explanation for git submodule
-  Watermark and stamp (#1095)

Maintenance (MAINT):
-  Text extraction improvements (#1126)
-  Destination.color returns ArrayObject instead of tuple as fallback (#1119)
-  Use add_bookmark_destination in add_bookmark (#1100)
-  Use add_bookmark_destination in add_bookmark_dict (#1099)

Testing (TST):
-  Remove xfail from test_outline_title_issue_1121
-  Add test for arab text (#1127)
-  Add xfail for decryption fail (#1125)
-  Add xfail test for IndexError when extracting text (#1124)
-  Add MCVE showing outline title issue (#1123)

Code Style (STY):
-  Apply black and isort
-  Use IntFlag for permissions_flag / update_page_form_field_values (#1094)
-  Simplify code (#1101)

Full Changelog: 2.5.0...2.6.0
@pubpub-zz pubpub-zz deleted the main branch July 19, 2022 19:51
@MartinThoma MartinThoma added the whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. label Jan 14, 2023
Comment on lines +1279 to 1287
if isinstance(operands[0], str):
text += operands[0]
else:
t: str = ""
tt: bytes = (
encode_pdfdocencoding(operands[0])
if isinstance(operands[0], str)
else operands[0]
)
Copy link

@biredel biredel May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

second isinstance(operands[0], str) branch looks unreachable here (since moved over here)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Do you want to submit a corresponding PR?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefan6419846 At least not immediately. Clearly there was or is something to that code that I do not understand enough to just delete it from a version that no longer contains the explanation - #2440 was merged much later!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@biredel
I'm confused :
you are looking at an obsolete branch (PyPDF2) instead of pypdf.
The code seems to be this one
https://github.com/py-pdf/pypdf/blob/a435eaaa08c71e3f66320edd06be24637ef32986/pypdf/_text_extraction/__init__.py#L225C18-L234C18

Codecov indicates some test coverage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inter/Intra Word width issue with block-formatted text compute_space_width: Is 'if "/W" in ft' unreachable?
4 participants