MAINT: Generalize the method of obtaining space_code #2891

ssjkamei · 2024-10-05T13:05:56Z

For clarity, I separated the part that obtains the space code from the part that analyzes it. I hope you like it.

I have a question about how space_code is determined for type1_alternative.
I suspect the if statement below never takes effect, because the comparison itself looks incorrect.

            if words[2].decode() == b" ":
                space_code = i
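
To illustrate the concern, here is a minimal sketch. It assumes that words[2] is a bytes token; the helper name compare_space and the sample token are hypothetical and not part of pypdf. In Python 3, a decoded str never compares equal to a bytes literal, so a condition of this shape would always be false.

```python
from typing import Dict


def compare_space(token: bytes) -> Dict[str, bool]:
    """Compare a token against a space in three ways (hypothetical helper)."""
    decoded = token.decode()  # bytes -> str
    return {
        # Same shape as the quoted condition: str compared with bytes.
        "str_vs_bytes": decoded == b" ",
        # Comparing like with like behaves as expected.
        "str_vs_str": decoded == " ",
        "bytes_vs_bytes": token == b" ",
    }


print(compare_space(b" "))
# {'str_vs_bytes': False, 'str_vs_str': True, 'bytes_vs_bytes': True}
```

If the intent was to detect a space, comparing either the decoded str with " " or the raw bytes with b" " would presumably be needed.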

codecov · 2024-10-05T13:38:15Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.43%. Comparing base (fcb103a) to head (d6a6346).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2891      +/-   ##
==========================================
+ Coverage   96.36%   96.43%   +0.07%     
==========================================
  Files          52       52              
  Lines        8739     8724      -15     
  Branches     1727     1721       -6     
==========================================
- Hits         8421     8413       -8     
+ Misses        186      182       -4     
+ Partials      132      129       -3

ssjkamei · 2024-10-05T15:48:24Z

At the point where space_code should be returned, there is logic that incorrectly returns a width instead. Once that return is corrected, the code quoted at the end of this comment is no longer entered.
I was not sure whether I should delete that code, but I did, because leaving it in would cause the code coverage check to fail.
I don't plan to put the no-encoding handling back into the code.

The part that was wrong is this return statement: return encoding, _default_fonts_space_width[cast(str, ft["/BaseFont"])]

def parse_encoding(
    ft: DictionaryObject, space_code: int
) -> Tuple[Union[str, Dict[int, str]], int]:
    encoding: Union[str, List[str], Dict[int, str]] = []
    if "/Encoding" not in ft:
        try:
            if "/BaseFont" in ft and cast(str, ft["/BaseFont"]) in charset_encoding:
                encoding = dict(
                    zip(range(256), charset_encoding[cast(str, ft["/BaseFont"])])
                )
            else:
                encoding = "charmap"
            return encoding, _default_fonts_space_width[cast(str, ft["/BaseFont"])]
        except Exception:
            if cast(str, ft["/Subtype"]) == "/Type1":
                return "charmap", space_code
            else:
                return "", space_code
.....

Code that was removed:

    # encoding can be either a string for decode
    # (on 1,2 or a variable number of bytes) of a char table (for 1 byte only for me)
    # if empty string, it means it is than encoding field is not present and
    # we have to select the good encoding from cmap input data
    if encoding == "":
        if -1 not in map_dict or map_dict[-1] == 1:
            # I have not been able to find any rule for no /Encoding nor /ToUnicode
            # One example shows /Symbol,bold I consider 8 bits encoding default
            encoding = "charmap"
        else:
            encoding = "utf-16-be"
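
A simplified model may help show why the removed block stops being reached. This is only a sketch under assumptions: the two tables are hypothetical stand-ins for charset_encoding and _default_fonts_space_width, fonts are plain dicts rather than DictionaryObject, and fixed_branch is one assumed shape of the corrected code, not the merged implementation.

```python
from typing import Dict, List, Tuple, Union

Encoding = Union[str, Dict[int, str]]

# Hypothetical stand-ins; the real pypdf tables are larger.
CHARSET_ENCODING: Dict[str, List[str]] = {"/Symbol": [chr(i) for i in range(256)]}
DEFAULT_SPACE_WIDTH: Dict[str, int] = {"/Helvetica": 278}


def old_branch(font: Dict[str, str], space_code: int) -> Tuple[Encoding, int]:
    """Model of the quoted branch for fonts without /Encoding."""
    encoding: Encoding
    try:
        if "/BaseFont" in font and font["/BaseFont"] in CHARSET_ENCODING:
            encoding = dict(zip(range(256), CHARSET_ENCODING[font["/BaseFont"]]))
        else:
            encoding = "charmap"
        # Returns a width, and raises KeyError for fonts missing from the table.
        return encoding, DEFAULT_SPACE_WIDTH[font["/BaseFont"]]
    except Exception:
        if font["/Subtype"] == "/Type1":
            return "charmap", space_code
        return "", space_code


def fixed_branch(font: Dict[str, str], space_code: int) -> Tuple[Encoding, int]:
    """Assumed fix: return the space code and drop the width lookup."""
    encoding: Encoding
    base = font.get("/BaseFont")
    if base in CHARSET_ENCODING:
        encoding = dict(zip(range(256), CHARSET_ENCODING[base]))
    else:
        encoding = "charmap"
    return encoding, space_code


unknown_font = {"/BaseFont": "/Foo", "/Subtype": "/TrueType"}
print(old_branch(unknown_font, 32))    # ('', 32)
print(fixed_branch(unknown_font, 32))  # ('charmap', 32)
```

In this model the old branch yields an empty encoding only because the width lookup raises KeyError. Without that lookup, the empty string is never produced here, so an `if encoding == ""` check downstream would have nothing left to handle from this path.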

ssjkamei · 2024-10-05T15:55:54Z

The code checks have passed, so please review.

stefan6419846

Thanks for your contribution.

@hpierre001

## What's new

### New Features (ENH)
- Add `layout_mode_font_height_weight` argument to `PageObject.extract_text()` (#2920) by @hpierre001

### Bug Fixes (BUG)
- Fix font specificier for FreeText annotation (#2893) by @ssjkamei
- Line breaks are not generated due to incorrect calculation of text leading (#2890) by @ssjkamei
- Improve handling of spaces in text extraction (#2882) by @ssjkamei

### Robustness (ROB)
- Soft failure for flate encode image mode 1 with wrong LUT size (#2900) by @stefan6419846

### Documentation (DOC)
- Use latest package versions (#2907) by @stefan6419846
- Correct example of reading FileAttachment annotation (#2906) by @j-t-1

### Developer Experience (DEV)
- Update pinned requirements (#2918) by @stefan6419846
- Make make_release.py compatible with Windows environment (#2894) by @pubpub-zz

### Maintenance (MAINT)
- Remove references to outdated Python versions (#2919) by @stefan6419846
- Generalize the method of obtaining space_code (#2891) by @ssjkamei
- Unnecessary character mapping process (#2888) by @ssjkamei
- New LZW decoding implementation (#2887) by @MartinThoma

### Testing (TST)
- Add LzwCodec for encoding (#2883) by @MartinThoma

### Code Style (STY)
- Capitalize error messages (#2903) by @j-t-1
- Modify error messages in PdfWriter (#2902) by @j-t-1

[Full Changelog](5.0.1...5.1.0)

ssjkamei added 4 commits October 5, 2024 22:00

MAINT: Generalize the method of obtaining space_code

2869249

Return value correction omission

3ed2358

Style: Correcting code style issues

bc96d93

Style: Correcting code style issues

3273651

ssjkamei added 7 commits October 5, 2024 23:16

fix self-made bugs

e29a0ef

Style: Correcting code style issues

d794b08

fix self-made bugs

ad5a201

fix self-made bugs

92e6412

Style: Correcting code style issues

466b0d9

Deletion of unneeded codes

dd50d06

Deletion of unneeded codes

83d4899

stefan6419846 reviewed Oct 5, 2024

stefan6419846 reviewed Oct 5, 2024

ssjkamei and others added 2 commits October 6, 2024 13:14

Update pypdf/_cmap.py

e468a4d

Co-authored-by: Stefan <[email protected]>

Update pypdf/_cmap.py

d6a6346

Co-authored-by: Stefan <[email protected]>

stefan6419846 approved these changes Oct 6, 2024

stefan6419846 merged commit 96b46ad into py-pdf:main Oct 6, 2024
16 checks passed

ssjkamei deleted the MAINT--Generalize-the-method-of-obtaining-space_code branch October 6, 2024 11:25
