File causes loop method call between functions extract_xform_text and _extract_text #966

Closed
VBobCat opened this issue Jun 9, 2022 · 5 comments
Labels: is-bug (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF), nf-performance (non-functional change: performance)

Comments

VBobCat commented Jun 9, 2022

While reading a certain file, my program exits without raising any exception.

I investigated, and the cause seems to be that the functions extract_xform_text and _extract_text in _page.py call each other in a never-ending loop.
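
The call pattern described above can be sketched with a toy model. This is not PyPDF2's code: the two functions, the `objects` table, and `CyclicFormError` are hypothetical names invented for illustration. The sketch shows one generic way to turn a never-ending mutual recursion into a visible error, by remembering which forms are currently being processed.

```python
# Toy model only: these are NOT PyPDF2's functions, just two mutually
# recursive helpers that mimic the call pattern described in this issue.

class CyclicFormError(RuntimeError):
    """Raised when a form is reached again while it is still being processed."""


def extract_text(obj_id, objects, active=None):
    """Collect text from an object, descending into the forms it invokes."""
    active = set() if active is None else active
    parts = []
    for op, arg in objects[obj_id]:
        if op == "text":
            parts.append(arg)
        elif op == "do":  # invoke another form by object id
            parts.append(extract_form_text(arg, objects, active))
    return "".join(parts)


def extract_form_text(form_id, objects, active):
    """Guarded counterpart: refuse to re-enter a form that is still open."""
    if form_id in active:
        raise CyclicFormError(f"form {form_id} invokes itself")
    active.add(form_id)
    try:
        return extract_text(form_id, objects, active)
    finally:
        active.discard(form_id)


# Object 7 invokes itself (the looping case); object 1 is well formed.
objects = {
    1: [("text", "Hello "), ("do", 2)],
    2: [("text", "world")],
    7: [("text", "x"), ("do", 7)],
}

print(extract_text(1, objects))  # Hello world
try:
    extract_text(7, objects)
except CyclicFormError as exc:
    print("stopped:", exc)
```

The guard tracks only forms that are still open, so a form that is legitimately used twice in a row is not rejected.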

Environment

Which environment were you using when you encountered the problem?

Python 3.10.5
Windows-10-10.0.19044-SP0
PyPDF2 2.1.0

Code

This is a minimal example of my code, which uses PyPDF2, that shows the issue:

import PyPDF2

def pdf_to_text(filename):
    with open(filename, 'rb') as pdf_file_object:
        reader = PyPDF2.PdfFileReader(pdf_file_object, strict=False)
        num_pages = reader.numPages
        if num_pages:
            page_texts = []
            for i in range(num_pages):
                page = reader.getPage(i)
                # This is the call that fails for the affected file.
                page_text = page.extract_text()
                page_texts.append(page_text)
            return ' '.join(page_texts)

I put a breakpoint in extract_xform_text and it receives these three parameters (self, xform, space_width):

{
    'self': {
        '/Contents': IndirectObject(2, 0),
        '/CropBox': [0, 0, 595.56, 842.04],
        '/Group': {
            '/CS': '/DeviceRGB',
            '/S': '/Transparency',
            '/Type': '/Group'
        },
        '/MediaBox': [0, 0, 595.56, 842.04],
        '/Parent': IndirectObject(34, 0),
        '/Resources': {
            '/ExtGState': {
                '/GS0': IndirectObject(14, 0)
            },
            '/Font': {
                '/TT0': IndirectObject(54, 0),
                '/TT1': IndirectObject(57, 0)
            },
            '/ProcSet': ['/PDF', '/Text'],
            '/XObject': {
                '/Fm0': IndirectObject(7, 0)
            }
        },
        '/Rotate': 0,
        '/StructParents': 1,
        '/Tabs': '/S',
        '/Type': '/Page'
    },
    'xform': {
        '/ADBE_FillSign': {
            '/Subtype': '/page',
            '/Type': '/FillSignData'
        },
        '/BBox': [195.809, 527.097, 407.639, 553.999], 
        '/FormType': 1, 
        '/Matrix': [1, 0, 0, 1, 0, 0],
        '/Resources': {
            '/XObject': {
                '/Fm0': IndirectObject(6, 0)
            }
        }, 
        '/Subtype': '/Form', 
        '/Type': '/XObject'
    },
    'space_width': 200.0
}

PDF

I'm sorry, but I'm unable to share the PDF file itself because it contains sensitive information.

VBobCat added the is-bug label Jun 9, 2022

pubpub-zz (Collaborator) commented

@VBobCat
To strip the sensitive information, can you try cutting the document down to just the affected page? If that is still not possible, can you use _debug_for_extract() to extract the document structure? Removing the TJ/Tj operators will remove the text.

MartinThoma (Member) commented Jun 11, 2022

I think https://corpora.tika.apache.org/base/docs/govdocs1/998/998167.pdf might run into the same issue:

  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1295, in extract_xform_text
    return self._extract_text(xform, self.pdf, space_width, None)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1256, in _extract_text
    text = self.extract_xform_text(xobj[operands[0]], space_width)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1295, in extract_xform_text
    return self._extract_text(xform, self.pdf, space_width, None)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1256, in _extract_text
    text = self.extract_xform_text(xobj[operands[0]], space_width)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1295, in extract_xform_text
    return self._extract_text(xform, self.pdf, space_width, None)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1256, in _extract_text
    text = self.extract_xform_text(xobj[operands[0]], space_width)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1295, in extract_xform_text
    return self._extract_text(xform, self.pdf, space_width, None)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1131, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1122, in __init__
    self.__parseContentStream(stream_bytes)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1153, in __parseContentStream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py", line 1259, in read_object
    elif idx == 2:
KeyboardInterrupt

MartinThoma added the nf-performance label Jun 11, 2022

pubpub-zz (Collaborator) commented

https://corpora.tika.apache.org/base/docs/govdocs1/998/998167.pdf
is a good sample. I think the issue occurs because, from within an XObject, we are looking up another XObject that has the same name. I was not looking in the right place :)
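
One possible reading of this diagnosis, sketched with plain dictionaries rather than PyPDF2 objects: if a name such as /Fm0 is always resolved against the page's resources, a form that invokes /Fm0 from its own resources keeps landing on itself. The object ids 7 and 6 are taken from the dump earlier in this issue; the `walk` function, the `OBJECTS` table, and the content lists are assumptions made up for the illustration, and the thread does not state that the eventual fix works this way.

```python
# Illustrative model only: plain dicts stand in for PDF objects, and the
# function name is hypothetical, not part of PyPDF2.

# Object ids follow the dump in this issue: the page maps /Fm0 to object 7,
# whose own resources map /Fm0 to object 6. The content lists are invented.
OBJECTS = {
    "page": {"xobjects": {"/Fm0": 7}, "content": [("Do", "/Fm0")]},
    7: {"xobjects": {"/Fm0": 6}, "content": [("Do", "/Fm0")]},
    6: {"xobjects": {}, "content": [("Tj", "signed")]},
}


def walk(obj_id, use_own_resources, depth=0, max_depth=20):
    """Return the text reached from obj_id, or None if max_depth is exceeded."""
    if depth > max_depth:
        return None  # stand-in for the endless loop
    obj = OBJECTS[obj_id]
    # The only difference between the two variants is which name table is used.
    names = obj["xobjects"] if use_own_resources else OBJECTS["page"]["xobjects"]
    parts = []
    for operator, operand in obj["content"]:
        if operator == "Tj":
            parts.append(operand)
        elif operator == "Do":
            inner = walk(names[operand], use_own_resources, depth + 1, max_depth)
            if inner is None:
                return None
            parts.append(inner)
    return "".join(parts)


print(walk("page", use_own_resources=False))  # None: /Fm0 keeps resolving to 7
print(walk("page", use_own_resources=True))   # signed
```

Resolving each name in the resources of the object currently being processed lets the same name mean different things at different nesting levels, which is what the dump suggests this file relies on.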

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jun 11, 2022
in this PR as the test file needs the other fixes (but not linked with loop issue)

pubpub-zz (Collaborator) commented Jun 13, 2022

This should be closed by #969 (https://github.com/py-pdf/PyPDF2/releases/tag/2.2.0).
@VBobCat, can you please confirm that your file now gets through?

MartinThoma (Member) commented

I'm rather certain that the issue was solved. Please let us know if that is not the case!
