possible bug with error `TypeError: 'IndirectObject' object cannot be interpreted as an integer` #2137

rchen19 · 2023-08-31T20:14:33Z

See the description below. This seems like a bug to me. I can work around it by editing the function compute_space_width in _cmap.py: the line `st = w[0]` (line 412 in the original file) becomes `st = w[0] if isinstance(w[0], int) else w[0].get_object()`. The edited line is marked with a comment in the code below. Since I am not familiar with the lower-level implementation of the PDF format, I am not sure whether this is really a bug or whether my fix makes sense:

def compute_space_width(
    ft: DictionaryObject, space_code: int, space_width: float
) -> float:
    sp_width: float = space_width * 2.0  # default value
    w = []
    w1 = {}
    st: int = 0
    if "/DescendantFonts" in ft:  # ft["/Subtype"].startswith("/CIDFontType"):
        ft1 = ft["/DescendantFonts"][0].get_object()  # type: ignore
        try:
            w1[-1] = cast(float, ft1["/DW"])
        except Exception:
            w1[-1] = 1000.0
        if "/W" in ft1:
            w = list(ft1["/W"])
        else:
            w = []
        while len(w) > 0:
            # st = w[0]
            # above commented out line is the original, below is my edit:
            st = w[0] if isinstance(w[0], int) else w[0].get_object()
            second = w[1].get_object()
            if isinstance(second, int):
                for x in range(st, second):
                    w1[x] = w[2]
                w = w[3:]
            elif isinstance(second, list):
                for y in second:
                    w1[st] = y
                    st += 1
                w = w[2:]
            else:
                logger_warning(
                    "unknown widths : \n" + (ft1["/W"]).__repr__(),
                    __name__,
                )
                break
        try:
            sp_width = w1[space_code]
        except Exception:
            sp_width = (
                w1[-1] / 2.0
            )  # if using default we consider space will be only half size
    elif "/Widths" in ft:
        w = list(ft["/Widths"])  # type: ignore
        try:
            st = cast(int, ft["/FirstChar"])
            en: int = cast(int, ft["/LastChar"])
            if st > space_code or en < space_code:
                raise Exception("Not in range")
            if w[space_code - st] == 0:
                raise Exception("null width")
            sp_width = w[space_code - st]
        except Exception:
            if "/FontDescriptor" in ft and "/MissingWidth" in cast(
                DictionaryObject, ft["/FontDescriptor"]
            ):
                sp_width = ft["/FontDescriptor"]["/MissingWidth"]  # type: ignore
            else:
                # will consider width of char as avg(width)/2
                m = 0
                cpt = 0
                for x in w:
                    if x > 0:
                        m += x
                        cpt += 1
                sp_width = m / max(1, cpt) / 2
    return sp_width
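
To illustrate why the one-line change above avoids the error, here is a minimal sketch that does not depend on pypdf. `FakeIndirect` and `resolve` are hypothetical names introduced only for this example. `FakeIndirect` stands in for an indirect reference that has to be dereferenced with `get_object()`, which is how the function above already treats `w[1]`. The sample `w` list is made up and is not taken from the attached PDF.

```python
class FakeIndirect:
    """Hypothetical stand-in for an indirect reference to another object."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        # Dereference: return the object this reference points to.
        return self._target


def resolve(obj):
    """Return the referenced object if obj is a reference, else obj itself."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


# A made-up widths entry whose first element is a reference, not a plain int.
w = [FakeIndirect(3), 6, 500]

# Without dereferencing, range() rejects the reference object.
try:
    list(range(w[0], w[1]))
    raised = False
except TypeError:
    raised = True

# After dereferencing, both bounds are plain ints and range() works.
cids = list(range(resolve(w[0]), resolve(w[1])))
print(raised, cids)
```

The `hasattr` check is a design choice for this sketch only, so that plain Python ints pass through unchanged. The fix proposed above uses an `isinstance(w[0], int)` check instead.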

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-148-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.4, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

import pypdf

f_path = "data/Morris et al. - 2020 - TextAttack A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP.pdf"
with open(f_path, "rb") as pdf_file_obj:
    p = pypdf.PdfReader(pdf_file_obj).pages[0].extract_text()
    print(p)

The pdf file:
Morris et al. - 2020 - TextAttack A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "/export/home/***/try_parse_pdf.py", line 12, in <module>
    p = pypdf.PdfReader(pdf_file_obj).pages[0].extract_text()
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_page.py", line 2263, in extract_text
    return self._extract_text(
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_page.py", line 1908, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 89, in build_char_map_from_dict
    sp_width = compute_space_width(ft, sp, space_width)
  File "/export/home/cuda00042/***/***/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 415, in compute_space_width
    for x in range(st, second):
TypeError: 'IndirectObject' object cannot be interpreted as an integer

pubpub-zz · 2023-08-31T20:34:52Z

👍 Your change is perfectly valid: you should convert it into a PR. Your test extracting the text from pages[0] will cover the validation.

rchen19 · 2023-08-31T21:52:52Z

> 👍 Your change is perfectly valid : you should convert it into a PR. Your test extracting the text from page[0] will cover the validation.

I am not actually a collaborator on pypdf and do not have the privileges to create branches or PRs. I'd be happy to do it if I can apply for those privileges, but I'm not sure how or where.

stefan6419846 · 2023-09-01T06:49:00Z

> I am actually not a collaborator on pypdf and do not have the privileges to create branches or PRs. I'd be happy to do it if I can apply for those privileges, not sure how and where.

Just use the general GitHub workflow: Fork the project into your own account, create a new branch with your changes including a corresponding test, then create a pull request against the upstream repository (if you have created your branch and committed some changes, you should see a dialog to create such a pull request on the upstream repository). (Upstream repository means https://github.com/py-pdf/pypdf/ in this case.)

pubpub-zz · 2023-09-05T18:08:13Z

@rchen19 have you been able to build your fork?

rchen19 · 2023-09-05T18:13:28Z

> @rchen19 have you been able to build your fork ?

Yes, should be able to make a PR later today. Thanks.

rchen19 added a commit to rchen19/pypdf that referenced this issue Sep 5, 2023

py-pdf#2137 catch the case where w[0] is not an int but an instances of `IndirectObject`

2cb4352

rchen19 mentioned this issue Sep 5, 2023

BUG: catch the case where w[0] is an IndirectObject instead of an int #2154

Merged

rchen19 added a commit to rchen19/pypdf that referenced this issue Sep 6, 2023

BUG: add test for bug fix for issue py-pdf#2137

f92244e

- a pdf file from arxiv is included

rchen19 added a commit to rchen19/pypdf that referenced this issue Sep 6, 2023

BUG: fix code stype errors in test for issue py-pdf#2137

6d9f1fe

- URL too long - file name too long - variable declared but not used

MartinThoma closed this as completed in #2154 Sep 10, 2023

MartinThoma pushed a commit that referenced this issue Sep 10, 2023

BUG: catch the case where w[0] is an IndirectObject instead of an int (#2154) Closes #2137

4657df5
