File causes loop method call between functions `extract_xform_text` and `_extract_text` #966

VBobCat · 2022-06-09T19:02:52Z

While reading a certain file, my program exits without any exception being raised.

I investigated the issue, and the cause seems to be that the functions `extract_xform_text` and `_extract_text` in _page.py call each other in a never-ending loop.

Environment

Which environment were you using when you encountered the problem?

Python 3.10.5
Windows-10-10.0.19044-SP0
PyPDF2 2.1.0

Code

This is my code that uses PyPDF2 and shows the issue:

import PyPDF2


def pdf_to_text(filename):
    with open(filename, 'rb') as pdf_file_object:
        # strict=False: correctable problems in the PDF are not treated as fatal
        reader = PyPDF2.PdfFileReader(pdf_file_object, strict=False)
        num_pages = reader.numPages
        if num_pages:
            page_texts = []
            for i in range(num_pages):
                page = reader.getPage(i)
                # the loop described above happens inside this call
                page_text = page.extract_text()
                page_texts.append(page_text)
            return ' '.join(page_texts)

I put a breakpoint in extract_xform_text and it receives these three parameters (self, xform, space_width):

{
    'self': {
        '/Contents': IndirectObject(2, 0),
        '/CropBox': [0, 0, 595.56, 842.04],
        '/Group': {
            '/CS': '/DeviceRGB',
            '/S': '/Transparency',
            '/Type': '/Group'
        },
        '/MediaBox': [0, 0, 595.56, 842.04],
        '/Parent': IndirectObject(34, 0),
        '/Resources': {
            '/ExtGState': {
                '/GS0': IndirectObject(14, 0)
            },
            '/Font': {
                '/TT0': IndirectObject(54, 0),
                '/TT1': IndirectObject(57, 0)
            },
            '/ProcSet': ['/PDF', '/Text'],
            '/XObject': {
                '/Fm0': IndirectObject(7, 0)
            }
        },
        '/Rotate': 0,
        '/StructParents': 1,
        '/Tabs': '/S',
        '/Type': '/Page'
    },
    'xform': {
        '/ADBE_FillSign': {
            '/Subtype': '/page',
            '/Type': '/FillSignData'
        },
        '/BBox': [195.809, 527.097, 407.639, 553.999], 
        '/FormType': 1, 
        '/Matrix': [1, 0, 0, 1, 0, 0],
        '/Resources': {
            '/XObject': {
                '/Fm0': IndirectObject(6, 0)
            }
        }, 
        '/Subtype': '/Form', 
        '/Type': '/XObject'
    },
    'space_width': 200.0
}

PDF

I am sorry, but I'm unable to share the PDF file itself because it contains sensitive information.


pubpub-zz · 2022-06-11T09:24:33Z

@VBobCat
To desensitize the file, could you try cutting it down to just the affected page? If that is still not possible, could you use _debug_for_extract() to extract the document structure? Removing the TJ/Tj operators will remove the text.

MartinThoma · 2022-06-11T13:27:25Z

I think https://corpora.tika.apache.org/base/docs/govdocs1/998/998167.pdf might hit the same issue:

  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1295, in extract_xform_text
    return self._extract_text(xform, self.pdf, space_width, None)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1256, in _extract_text
    text = self.extract_xform_text(xobj[operands[0]], space_width)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1295, in extract_xform_text
    return self._extract_text(xform, self.pdf, space_width, None)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1256, in _extract_text
    text = self.extract_xform_text(xobj[operands[0]], space_width)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1295, in extract_xform_text
    return self._extract_text(xform, self.pdf, space_width, None)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1256, in _extract_text
    text = self.extract_xform_text(xobj[operands[0]], space_width)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1295, in extract_xform_text
    return self._extract_text(xform, self.pdf, space_width, None)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1131, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1122, in __init__
    self.__parseContentStream(stream_bytes)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1153, in __parseContentStream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1259, in read_object
    elif idx == 2:
KeyboardInterrupt
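
The alternating frames above are what mutual recursion over a self-referencing structure looks like. One generic defensive measure, which is not necessarily what PyPDF2 does, is to remember which forms are currently being processed and to skip any form that shows up again. The sketch below is a toy model: the `extract_text` function, the `CYCLIC` data, and the `("text", ...)` / `("form", ...)` tuples are all hypothetical stand-ins, not PyPDF2 objects.

```python
def extract_text(obj_id, objects, active=None):
    """Concatenate text from a form and its nested forms, skipping cycles."""
    active = set() if active is None else active
    if obj_id in active:
        # This form is already being processed further up the call stack,
        # so following it again would loop forever.
        return ""
    active.add(obj_id)
    try:
        parts = []
        for kind, value in objects[obj_id]:
            if kind == "text":
                parts.append(value)
            else:  # kind == "form": value is the id of a nested form
                parts.append(extract_text(value, objects, active))
        return "".join(parts)
    finally:
        # Allow the same form to be drawn again later, outside this branch.
        active.discard(obj_id)


# Form 1 draws "a", then form 2; form 2 draws "b", then form 1 again.
CYCLIC = {1: [("text", "a"), ("form", 2)], 2: [("text", "b"), ("form", 1)]}
```

With this guard, `extract_text(1, CYCLIC)` terminates instead of recursing until interrupted. The guard only hides a cycle; it does not explain why one appears in the first place.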

pubpub-zz · 2022-06-11T18:14:13Z

https://corpora.tika.apache.org/base/docs/govdocs1/998/998167.pdf
is a good sample. I think the issue occurs because we look up an XObject inside another XObject that has the same name. I was not looking in the right place :)

The fix is in this PR because the test file needs the other fixes (which are not linked to the loop issue).
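
To make the "same name" problem concrete, here is a minimal sketch of one plausible reading of it, using plain dicts rather than PyPDF2 objects. In the parameter dump earlier in the thread, the page maps `/Fm0` to object 7, while the form's own resources map `/Fm0` to object 6. The sketch assumes the form passed in is object 7 and that the nested name was being resolved against the page's resources instead of the form's own. `OBJECTS`, `PAGE`, `extract`, `use_own_resources`, and the placeholder text are all hypothetical.

```python
# Toy model only: plain dicts stand in for PDF objects; this is not PyPDF2 code.
# Object 7 is the form the page calls /Fm0; object 6 is the form that
# object 7 itself calls /Fm0 (same name, different scope).
OBJECTS = {
    6: {"ops": [("Tj", "placeholder text")], "xobjects": {}},
    7: {"ops": [("Do", "/Fm0")], "xobjects": {"/Fm0": 6}},
}
PAGE = {"ops": [("Do", "/Fm0")], "xobjects": {"/Fm0": 7}}


def extract(obj, use_own_resources, depth=0, max_depth=50):
    """Collect text from obj, following Do operators into nested forms."""
    if depth > max_depth:
        raise RecursionError("form keeps resolving to itself")
    parts = []
    for operator, operand in obj["ops"]:
        if operator == "Tj":
            parts.append(operand)
        elif operator == "Do":
            # The scope used for the name lookup is the whole difference.
            scope = obj["xobjects"] if use_own_resources else PAGE["xobjects"]
            child = OBJECTS[scope[operand]]
            parts.append(extract(child, use_own_resources, depth + 1, max_depth))
    return "".join(parts)


def loops_with_page_scope():
    """Return True if resolving every name against the page never terminates."""
    try:
        extract(PAGE, use_own_resources=False)
    except RecursionError:
        return True
    return False
```

Resolving `/Fm0` against the page every time leads back to object 7 on each call, so the recursion only stops at the artificial depth limit. Resolving it against the current form's own resources reaches object 6 and terminates.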

pubpub-zz · 2022-06-13T20:01:01Z

This should be closed by #969 (https://github.com/py-pdf/PyPDF2/releases/tag/2.2.0).
@VBobCat, can you please confirm that your file now gets through?

MartinThoma · 2022-06-18T17:42:15Z

I'm rather certain that the issue was solved. Please let us know if that is not the case!

VBobCat added the is-bug label (From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF) Jun 9, 2022

VBobCat assigned MartinThoma Jun 9, 2022

MartinThoma mentioned this issue Jun 11, 2022

improved ExtractText(3) #969

Merged

MartinThoma added the nf-performance label (Non-functional change: Performance) Jun 11, 2022

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jun 11, 2022

Fix xform in xfoms inducing loop (py-pdf#966)

9768d5f

in this PR as the test file needs the other fixes (but not linked with loop issue)

MartinThoma closed this as completed Jun 18, 2022
