MAINT: Generalize the method of obtaining space_code #2891

@ssjkamei ssjkamei commented Oct 5, 2024

I separated the part that gets the space code from the part that analyzes it, for clarity. I hope you like it.

I was wondering about the space_code determination for type1_alternative.
The comparison in the if statement below may never take effect, as I think its condition is incorrect.

            if words[2].decode() == b" ":
                space_code = i
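
As a hedged illustration of why a condition of this shape might never hold: in Python 3, `bytes.decode()` returns a `str`, and a `str` never compares equal to a `bytes` object. The `words` list below is a hypothetical stand-in, not the actual data that pypdf parses, and the two alternative comparisons are only possibilities, not necessarily the intended fix.

```python
# Hypothetical stand-in for the parsed tokens; the real contents in pypdf may differ.
words = [b"dup", b"32", b" "]

decoded = words[2].decode()       # bytes.decode() returns a str
always_false = decoded == b" "    # str compared with bytes: never equal in Python 3
str_match = decoded == " "        # alternative 1: compare str with str
bytes_match = words[2] == b" "    # alternative 2: compare bytes with bytes

print(always_false, str_match, bytes_match)
```

If the branch is meant to run at all, both sides of the comparison need to be the same type.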

codecov bot commented Oct 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.43%. Comparing base (fcb103a) to head (d6a6346).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2891      +/-   ##
==========================================
+ Coverage   96.36%   96.43%   +0.07%     
==========================================
  Files          52       52              
  Lines        8739     8724      -15     
  Branches     1727     1721       -6     
==========================================
- Hits         8421     8413       -8     
+ Misses        186      182       -4     
+ Partials      132      129       -3     


ssjkamei commented Oct 5, 2024

At the point where space_code should be returned, the existing logic incorrectly returns a width instead. Once that part is removed, execution no longer enters the block listed at the end of this comment.
I was not sure whether I should delete that block, but I did, because leaving it in place would cause the code coverage check to fail.
I do not plan to put the no-encoding handling back into the code.

The part that was wrong: return encoding, _default_fonts_space_width[cast(str, ft["/BaseFont"])]

def parse_encoding(
    ft: DictionaryObject, space_code: int
) -> Tuple[Union[str, Dict[int, str]], int]:
    encoding: Union[str, List[str], Dict[int, str]] = []
    if "/Encoding" not in ft:
        try:
            if "/BaseFont" in ft and cast(str, ft["/BaseFont"]) in charset_encoding:
                encoding = dict(
                    zip(range(256), charset_encoding[cast(str, ft["/BaseFont"])])
                )
            else:
                encoding = "charmap"
            return encoding, _default_fonts_space_width[cast(str, ft["/BaseFont"])]
        except Exception:
            if cast(str, ft["/Subtype"]) == "/Type1":
                return "charmap", space_code
            else:
                return "", space_code
.....
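
To make the problem easier to see, here is a minimal sketch that is not pypdf's code. `CHARSETS` and `WIDTHS` are hypothetical stand-ins for `charset_encoding` and `_default_fonts_space_width`, with invented values, and `encoding_fixed` is only one possible correction, not necessarily what this PR merged. The old shape returns a width where a space code is expected, and its dictionary lookup can raise `KeyError`, which the broad `except` then swallows.

```python
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical stand-ins with invented values, for illustration only.
CHARSETS: Dict[str, List[str]] = {"/Symbol": [chr(i) for i in range(256)]}
WIDTHS: Dict[str, int] = {"/Helvetica": 278}

Encoding = Union[str, Dict[int, str]]


def _lookup(base_font: Optional[str]) -> Encoding:
    # Use a known charset table if there is one, else fall back to "charmap".
    if base_font in CHARSETS:
        return dict(zip(range(256), CHARSETS[base_font]))
    return "charmap"


def encoding_old(font: Dict[str, str], space_code: int) -> Tuple[Encoding, int]:
    # Mirrors the shape of the quoted branch for fonts without /Encoding.
    try:
        encoding = _lookup(font.get("/BaseFont"))
        # Second element is a width, not a space code; may raise KeyError.
        return encoding, WIDTHS[font["/BaseFont"]]
    except Exception:
        if font.get("/Subtype") == "/Type1":
            return "charmap", space_code
        return "", space_code


def encoding_fixed(font: Dict[str, str], space_code: int) -> Tuple[Encoding, int]:
    # One possible correction: pass space_code through unchanged.
    return _lookup(font.get("/BaseFont")), space_code


unknown = {"/BaseFont": "/Unknown", "/Subtype": "/TrueType"}
print(encoding_old(unknown, 32))
print(encoding_fixed(unknown, 32))
```

In this sketch, the old shape returns an empty encoding only because the width lookup failed, which is consistent with the empty-encoding fallback becoming unreachable once the lookup is removed.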

Code that was erased:

    # encoding can be either a string for decode
    # (on 1, 2, or a variable number of bytes) or a char table (for 1 byte only for me)
    # an empty string means that the encoding field is not present and
    # we have to select the right encoding from the cmap input data
    if encoding == "":
        if -1 not in map_dict or map_dict[-1] == 1:
            # I have not been able to find any rule for no /Encoding nor /ToUnicode
            # One example shows /Symbol,bold I consider 8 bits encoding default
            encoding = "charmap"
        else:
            encoding = "utf-16-be"
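
For reference, the erased fallback can be restated as a small standalone function. This is only a sketch: the name `resolve_empty_encoding` is hypothetical, and it assumes that `map_dict[-1]` records the number of bytes per character code, which the excerpt above does not state.

```python
from typing import Any, Dict, Union

Encoding = Union[str, Dict[int, str]]


def resolve_empty_encoding(encoding: Encoding, map_dict: Dict[Any, Any]) -> Encoding:
    # Hypothetical helper restating the erased block; not part of pypdf.
    if encoding == "":
        # Assumption: key -1 holds the byte width of a character code.
        if -1 not in map_dict or map_dict[-1] == 1:
            return "charmap"    # treat as a single-byte encoding
        return "utf-16-be"      # otherwise treat as a two-byte encoding
    return encoding             # a non-empty encoding is left as-is


print(resolve_empty_encoding("", {}))
print(resolve_empty_encoding("", {-1: 2}))
```

The function only does anything when it receives an empty string, so it is dead code if no caller can produce one.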

ssjkamei commented Oct 5, 2024

The code checks have passed, so please review.

ssjkamei and others added 2 commits October 6, 2024 13:14
@stefan6419846 stefan6419846 left a comment

Thanks for your contribution.

@stefan6419846 stefan6419846 merged commit 96b46ad into py-pdf:main Oct 6, 2024
16 checks passed
@ssjkamei ssjkamei deleted the MAINT--Generalize-the-method-of-obtaining-space_code branch October 6, 2024 11:25
stefan6419846 added a commit that referenced this pull request Oct 27, 2024
## What's new

### New Features (ENH)
- Add `layout_mode_font_height_weight` argument to `PageObject.extract_text()` (#2920) by @hpierre001

### Bug Fixes (BUG)
- Fix font specifier for FreeText annotation (#2893) by @ssjkamei
- Line breaks are not generated due to incorrect calculation of text leading (#2890) by @ssjkamei
- Improve handling of spaces in text extraction (#2882) by @ssjkamei

### Robustness (ROB)
- Soft failure for flate encode image mode 1 with wrong LUT size (#2900) by @stefan6419846

### Documentation (DOC)
- Use latest package versions (#2907) by @stefan6419846
- Correct example of reading FileAttachment annotation (#2906) by @j-t-1

### Developer Experience (DEV)
- Update pinned requirements (#2918) by @stefan6419846
- Make make_release.py compatible with Windows environment (#2894) by @pubpub-zz

### Maintenance (MAINT)
- Remove references to outdated Python versions (#2919) by @stefan6419846
- Generalize the method of obtaining space_code (#2891) by @ssjkamei
- Unnecessary character mapping process (#2888) by @ssjkamei
- New LZW decoding implementation (#2887) by @MartinThoma

### Testing (TST)
- Add LzwCodec for encoding (#2883) by @MartinThoma

### Code Style (STY)
- Capitalize error messages (#2903) by @j-t-1
- Modify error messages in PdfWriter (#2902) by @j-t-1

[Full Changelog](5.0.1...5.1.0)