possible bug with error TypeError: 'IndirectObject' object cannot be interpreted as an integer #2137

Closed
rchen19 opened this issue Aug 31, 2023 · 5 comments · Fixed by #2154

Comments

@rchen19
Contributor

rchen19 commented Aug 31, 2023

See the description below. This looks like a bug to me. The error goes away with the following edit in the function compute_space_width in _cmap.py: st = w[0] -> st = w[0] if isinstance(w[0], int) else w[0].get_object(). The original statement is shown (commented out) on line 19 of the code below, which is line 412 of the original file. Since I am not familiar with the lower-level details of the PDF format, I am not sure whether this is really a bug or whether my fix makes sense:

def compute_space_width(
    ft: DictionaryObject, space_code: int, space_width: float
) -> float:
    sp_width: float = space_width * 2.0  # default value
    w = []
    w1 = {}
    st: int = 0
    if "/DescendantFonts" in ft:  # ft["/Subtype"].startswith("/CIDFontType"):
        ft1 = ft["/DescendantFonts"][0].get_object()  # type: ignore
        try:
            w1[-1] = cast(float, ft1["/DW"])
        except Exception:
            w1[-1] = 1000.0
        if "/W" in ft1:
            w = list(ft1["/W"])
        else:
            w = []
        while len(w) > 0:
            # st = w[0]
            # above commented out line is the original, below is my edit:
            st = w[0] if isinstance(w[0], int) else w[0].get_object()
            second = w[1].get_object()
            if isinstance(second, int):
                for x in range(st, second):
                    w1[x] = w[2]
                w = w[3:]
            elif isinstance(second, list):
                for y in second:
                    w1[st] = y
                    st += 1
                w = w[2:]
            else:
                logger_warning(
                    "unknown widths : \n" + (ft1["/W"]).__repr__(),
                    __name__,
                )
                break
        try:
            sp_width = w1[space_code]
        except Exception:
            sp_width = (
                w1[-1] / 2.0
            )  # if using default we consider space will be only half size
    elif "/Widths" in ft:
        w = list(ft["/Widths"])  # type: ignore
        try:
            st = cast(int, ft["/FirstChar"])
            en: int = cast(int, ft["/LastChar"])
            if st > space_code or en < space_code:
                raise Exception("Not in range")
            if w[space_code - st] == 0:
                raise Exception("null width")
            sp_width = w[space_code - st]
        except Exception:
            if "/FontDescriptor" in ft and "/MissingWidth" in cast(
                DictionaryObject, ft["/FontDescriptor"]
            ):
                sp_width = ft["/FontDescriptor"]["/MissingWidth"]  # type: ignore
            else:
                # will consider width of char as avg(width)/2
                m = 0
                cpt = 0
                for x in w:
                    if x > 0:
                        m += x
                        cpt += 1
                sp_width = m / max(1, cpt) / 2
    return sp_width
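To make the idea behind the edit easier to check in isolation, here is a minimal sketch that does not import pypdf. The names `FakeIndirect`, `resolve`, and `parse_cid_widths` are hypothetical and exist only for this illustration; `FakeIndirect` merely imitates an object that exposes `get_object()`. The sketch resolves every element of a /W-style array before using it. It treats the `c_first c_last w` range as inclusive, which is my reading of the /W format and differs from the `range(st, second)` call in the function above, so take that part as an assumption.

```python
from typing import Any, Dict, List


class FakeIndirect:
    """Hypothetical stand-in for an indirect reference; not a pypdf class."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def get_object(self) -> Any:
        return self._target


def resolve(obj: Any) -> Any:
    """Return the referenced object if obj has get_object(), else obj itself."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def parse_cid_widths(w: List[Any]) -> Dict[int, float]:
    """Expand a /W-style array into a {character code: width} mapping."""
    widths: Dict[int, float] = {}
    w = list(w)
    while w:
        start = resolve(w[0])
        second = resolve(w[1])
        if isinstance(second, list):
            # Form "c [w1 w2 ...]": consecutive codes starting at c.
            for offset, width in enumerate(second):
                widths[start + offset] = resolve(width)
            w = w[2:]
        elif isinstance(second, int):
            # Form "c_first c_last w": one width for the whole range
            # (assumed inclusive here).
            for code in range(start, second + 1):
                widths[code] = resolve(w[2])
            w = w[3:]
        else:
            raise ValueError(f"unexpected /W entry: {second!r}")
    return widths


# Mix direct values and stand-in references, as the failing PDF appears to do.
widths = parse_cid_widths(
    [FakeIndirect(1), [500, FakeIndirect(600)], 10, FakeIndirect(12), 250]
)
print(widths)
```

Resolving through one small helper keeps the loop free of repeated isinstance checks, at the cost of a duck-typed `hasattr` test.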

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-148-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.4, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

import pypdf

f_path = "data/Morris et al. - 2020 - TextAttack A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP.pdf"
with open(f_path, "rb") as pdf_file_obj:
    p = pypdf.PdfReader(pdf_file_obj).pages[0].extract_text()
    print(p)

The pdf file:
Morris et al. - 2020 - TextAttack A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "/export/home/***/try_parse_pdf.py", line 12, in <module>
    p = pypdf.PdfReader(pdf_file_obj).pages[0].extract_text()
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_page.py", line 2263, in extract_text
    return self._extract_text(
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_page.py", line 1908, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 89, in build_char_map_from_dict
    sp_width = compute_space_width(ft, sp, space_width)
  File "/export/home/cuda00042/***/***/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 415, in compute_space_width
    for x in range(st, second):
TypeError: 'IndirectObject' object cannot be interpreted as an integer
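
The last frame of the traceback can be reproduced without a PDF. The sketch below assumes a hypothetical class `IndirectStandIn` that only imitates an object offering `get_object()`; it is not pypdf's IndirectObject. It shows that `range()` rejects such an object and accepts the resolved integer.

```python
class IndirectStandIn:
    """Hypothetical stand-in for an indirect reference; not pypdf's class."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        return self._target


ref = IndirectStandIn(3)

try:
    # range() needs real integers, so passing the reference itself fails.
    list(range(ref, 5))
    message = ""
except TypeError as exc:
    message = str(exc)

print(message)

# Resolving the reference first gives range() the integer it expects.
resolved = list(range(ref.get_object(), 5))
print(resolved)
```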

@pubpub-zz
Collaborator

👍 Your change is perfectly valid: you should convert it into a PR. Your test extracting the text from page[0] will cover the validation.

@rchen19
Contributor Author

rchen19 commented Aug 31, 2023

> 👍 Your change is perfectly valid: you should convert it into a PR. Your test extracting the text from page[0] will cover the validation.

I am actually not a collaborator on pypdf and do not have the privileges to create branches or PRs. I'd be happy to do it if I can apply for those privileges, not sure how and where.

@stefan6419846
Collaborator

> I am actually not a collaborator on pypdf and do not have the privileges to create branches or PRs. I'd be happy to do it if I can apply for those privileges, not sure how and where.

Just use the general GitHub workflow: Fork the project into your own account, create a new branch with your changes including a corresponding test, then create a pull request against the upstream repository (if you have created your branch and committed some changes, you should see a dialog to create such a pull request on the upstream repository). (Upstream repository means https://github.com/py-pdf/pypdf/ in this case.)

@pubpub-zz
Collaborator

@rchen19 have you been able to build your fork?

@rchen19
Contributor Author

rchen19 commented Sep 5, 2023

> @rchen19 have you been able to build your fork?

Yes, should be able to make a PR later today. Thanks.

rchen19 added a commit to rchen19/pypdf that referenced this issue Sep 5, 2023
rchen19 added a commit to rchen19/pypdf that referenced this issue Sep 6, 2023
- a pdf file from arxiv is included
rchen19 added a commit to rchen19/pypdf that referenced this issue Sep 6, 2023
- URL too long

- file name too long

- variable declared but not used