Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: clean_forms function cause infinite looping if elt["/Resources"] have circular relation #2477

Closed
wants to merge 6 commits into from
Closed

Conversation

syan-dev
Copy link

I have several PDF files from customers, and using remove_text can cause infinite looping. Upon investigation, I discovered a corner case where elt["/Resources"] has a circular relation, which can result in calling clean_forms(content, stack + [elt]) infinitely.

I proposed keeping a memory variable visited_resources to keep track of which elt["/Resources"] has been processed and to avoid infinite looping.

The corner case files are private and cannot be shared, but I believe many people would encounter the same problem.
Closes #2474

@stefan6419846
Copy link
Collaborator

Thanks for the PR. Are you able to provide a corresponding test file privately to @MartinThoma or strip one of the files down to further use it inside the tests and thus ensure that this keeps being fixed and coverage does not drop?

@syan-dev
Copy link
Author

Hi @stefan6419846, these PDF files are private; therefore, I cannot share them with you. However, if I encounter any files experiencing similar issues, I will send them to you later. I hope you can incorporate this code so that I can simply install PyPDF with pip install. Thank you for providing this excellent source code.

@stefan6419846
Copy link
Collaborator

Apparently this "just" exceeds the recursion depth instead of blocking completely? At least this is what my experimental code to reproduce this issue indicates. (My code still breaks with this PR, but as it just exploits this with custom Python-based datastructures and no real PDF file, I do not consider this essential for merging.)

Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf_upstream/pypdf/run.py", line 54, in <module>
    writer.remove_objects_from_page(page=page, to_delete=ObjectDeletionFlag.NONE)
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1836, in remove_objects_from_page
    images, forms = clean_forms(page, [], [])
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1811, in clean_forms
    clean_forms(content, stack + [elt], visited_objects)  # clean sub forms
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1811, in clean_forms
    clean_forms(content, stack + [elt], visited_objects)  # clean sub forms
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1811, in clean_forms
    clean_forms(content, stack + [elt], visited_objects)  # clean sub forms
  [Previous line repeated 984 more times]
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1803, in clean_forms
    content = ContentStream(o, self)
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/generic/_data_structures.py", line 941, in __init__
    if isinstance(stream, ArrayObject):
  File "/usr/lib/python3.10/typing.py", line 1503, in __instancecheck__
    issubclass(instance.__class__, cls)):
  File "/usr/lib/python3.10/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
RecursionError: maximum recursion depth exceeded in comparison

Unless @pubpub-zz states that your fix might have negative impacts, I plan on merging it tomorrow.

I still appreciate a corresponding unit/integration test nevertheless if you are able to provide one - even if its a separate PR later on.

@pubpub-zz
Copy link
Collaborator

If you look at the code, there is already a stack mecanism designed to prevent infinite loops. I therefore do not understand we need to 'duplicate' it with a new list.
In order to have a better understanding, can you at least provide a screen capture of the left list of pdfbox in 'debug':
pdfbox-app-3.0.0-alpha3.jar debug
pdfbox can be found here: https://pdfbox.apache.org/2.0/commandline.html

this should help us to understand what is inside.

Currently I do not recommend merging

@stefan6419846 stefan6419846 added the on-hold PR requests that need clarification before they can be merged.A comment must give details label Feb 28, 2024
@syan-dev
Copy link
Author

image
image
image
image
Here is the image of looping, I try my best and can not understand why this happens but it takes forever to load only one page with this kind of looping

@syan-dev
Copy link
Author

I think I have found the root cause for this, I have printed to check if elt in stack with elt value and stack value.
You can see that in the following output, elt is in stack but false to check elt in stack because we have included /Resources in the item of stack but not elt.

elt in stack: 
False
elt value:
{'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792]}
stack value:
[{'/Type': '/Page', '/Resources': {'/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC', '/ImageI'], '/XObject': {'/Im0': IndirectObject(5, 0, 140032982484208), '/Fm0': IndirectObject(6, 0, 140032982484208)}, '/Font': {'/F0': IndirectObject(8, 0, 140032982484208), '/F1': IndirectObject(9, 0, 140032982484208), '/F2': IndirectObject(10, 0, 140032982484208), '/F3': IndirectObject(36, 0, 140032982484208), '/F4': IndirectObject(41, 0, 140032982484208)}}, '/Contents': IndirectObject(46, 0, 140032982484208), '/MediaBox': [0, 0, 612, 792], '/CropBox': [0, 0, 612, 792], '/Rotate': 0, '/Parent': IndirectObject(1, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(7, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(12, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(14, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(16, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(18, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(20, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(22, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(24, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(26, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(34, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(28, 0, 140032982484208)}]

Therefore, my propose solution is to also include /Resources key to correct comparison:

if k1 not in ["/Length", "/Filter", "/DecodeParms", "/Resources"]

@syan-dev
Copy link
Author

syan-dev commented Mar 1, 2024

Hi @pubpub-zz , can you help me review the screenshot from pdfbox.

@pubpub-zz
Copy link
Collaborator

Review in progress

@pubpub-zz
Copy link
Collaborator

a) first I'm confused with your output : the infinite is within your PDF and is not normal.
b) I've made a small change and I've added some logs : can you run the program on your document replacing clean_forms with this:

        def clean_forms(
            elt: DictionaryObject, stack: List[DictionaryObject]
        ) -> Tuple[List[str], List[str]]:
            nonlocal to_delete
            try:
                print("elt : ",elt.indirect_reference)
            except Exception:
                print("elt  ---")
            print("stack ind: ", [s.indirect_reference for s in stack])
            print("stack native: ", stack)
            print("in? ",elt in stack)
            if elt in stack:
                # to prevent infinite looping
                return [], []  # pragma: no cover
            try:
                d = cast(
                    Dict[Any, Any],
                    cast(DictionaryObject, elt["/Resources"])["/XObject"],
                )
            except KeyError:
                d = {}
            images = []
            forms = []
            for k, v in d.items():
                print("k",k)
                o = v.get_object()
                try:
                    content: Any = None
                    if (
                        to_delete & ObjectDeletionFlag.XOBJECT_IMAGES
                        and o["/Subtype"] == "/Image"
                    ):
                        content = NullObject()  # to delete the image keeping the entry
                        images.append(k)
                        i
                    if o["/Subtype"] == "/Form":
                        forms.append(k)
                        if isinstance(o, ContentStream):
                            content = o
                        else:
                            content = ContentStream(o, self)
                            content.update(
                                {
                                    k1: v1
                                    for k1, v1 in o.items()
                                    if k1 not in ["/Length", "/Filter", "/DecodeParms"]
                                }
                            )
                        stack.append(elt)
                        print("call")
                        clean_forms(content, stack)  # clean sub forms
                    if content is not None:
                        if isinstance(v, IndirectObject):
                            self._objects[v.idnum - 1] = content
                        else:
                            # should only occur with pdf not respecting pdf spec
                            # where streams must be indirected.
                            d[k] = self._add_object(content)  # pragma: no cover
                except (TypeError, KeyError):
                    pass
            for im in images:
                del d[im]  # for clean-up
            if isinstance(elt, StreamObject):  # for /Form
                if not isinstance(elt, ContentStream):  # pragma: no cover
                    e = ContentStream(elt, self)
                    e.update(elt.items())
                    elt = e
                clean(elt, images, forms)  # clean the content
            return images, forms

        if not isinstance(page, PageObject):
            page = PageObject(self, page.indirect_reference)  # pragma: no cover
        if "/Contents" in page:
            content = cast(ContentStream, page.get_contents())

            images, forms = clean_forms(page, [])

            clean(content, images, forms)
            page.replace_contents(content)

@syan-dev
Copy link
Author

syan-dev commented Mar 5, 2024

Hi @pubpub-zz , here is the log from your code. I have to edit it a little to avoid print ContentStream error

print("stack ind: ", [s.indirect_reference for s in stack])
print("stack ind: ", [s.indirect_reference for s in stack if not isinstance(s, ContentStream)])

log.txt

@pubpub-zz
Copy link
Collaborator

a) I have some problem to understand your traces : your first line "elt : {'/Type': ..." should indicate an indirect_object
can you please provide your full test code to see how you are building your PdfWriter (please provide/use a minimal code)

b) the "elt ---" indicates that the object are not indirect_objects which means that they are newly generated that I do not understand but explains the infinite loop
can you please rerun with the following test code (I've seen your note and took that into account however not tested)

        def clean_forms(
            elt: DictionaryObject, stack: List[DictionaryObject]
        ) -> Tuple[List[str], List[str]]:
            nonlocal to_delete
            try:
                print("elt : ",type(elt.indirect_reference), elt.indirect_reference)
            except Exception:
                print("elt  ---",e.hash_value())
            print("stack ind: ", [(s.indirect_reference if "indirect_reference" in s.__dir__() else "---" )for s in stack ])
            print("stack native: ", stack)
            print("in? ",elt in stack)
            if elt in stack:
                # to prevent infinite looping
                return [], []  # pragma: no cover
            try:
                d = cast(
                    Dict[Any, Any],
                    cast(DictionaryObject, elt["/Resources"])["/XObject"],
                )
            except KeyError:
                d = {}
            images = []
            forms = []
            for k, v in d.items():
                print("k",k)
                o = v.get_object()
                try:
                    content: Any = None
                    if (
                        to_delete & ObjectDeletionFlag.XOBJECT_IMAGES
                        and o["/Subtype"] == "/Image"
                    ):
                        content = NullObject()  # to delete the image keeping the entry
                        images.append(k)
                    if o["/Subtype"] == "/Form":
                        forms.append(k)
                        print("hash(",k,"): ",o.hash_value()," in ",d.hash_value())
                        print("value :",type(v),v)
                        if isinstance(o, ContentStream):
                            content = o
                        else:
                            content = ContentStream(o, self)
                            content.update(
                                {
                                    k1: v1
                                    for k1, v1 in o.items()
                                    if k1 not in ["/Length", "/Filter", "/DecodeParms"]
                                }
                            )
                        stack.append(elt)
                        print("call")
                        clean_forms(content, stack)  # clean sub forms
                    if content is not None:
                        if isinstance(v, IndirectObject):
                            self._objects[v.idnum - 1] = content
                        else:
                            # should only occur with pdf not respecting pdf spec
                            # where streams must be indirected.
                            d[k] = self._add_object(content)  # pragma: no cover
                except (TypeError, KeyError):
                    pass
            for im in images:
                del d[im]  # for clean-up
            if isinstance(elt, StreamObject):  # for /Form
                if not isinstance(elt, ContentStream):  # pragma: no cover
                    e = ContentStream(elt, self)
                    e.update(elt.items())
                    elt = e
                clean(elt, images, forms)  # clean the content
            return images, forms

        if not isinstance(page, PageObject):
            page = PageObject(self, page.indirect_reference)  # pragma: no cover
        if "/Contents" in page:
            content = cast(ContentStream, page.get_contents())

            images, forms = clean_forms(page, [])

            clean(content, images, forms)
            page.replace_contents(content)

@syan-dev
Copy link
Author

syan-dev commented Mar 5, 2024

Hi @pubpub-zz

a, Here is the test code I am using:

import io
from pypdf import PdfReader, PdfWriter

pdf_content = open('file_name.pdf', 'rb').read()

pdf_writer = PdfWriter()
pdf_reader = PdfReader(io.BytesIO(pdf_content))

pdf_writer.add_page(pdf_reader.pages[0])
pdf_writer.remove_text()

b, Here is the log I got with the new code you provided

log2.txt

@pubpub-zz
Copy link
Collaborator

I start to understand.
can you retry with this code (may fix the issue):

        def clean_forms(
            elt: DictionaryObject, stack: List[DictionaryObject]
        ) -> Tuple[List[str], List[str]]:
            nonlocal to_delete
            try:
                print("elt : ",type(elt.indirect_reference), elt.indirect_reference.__repr__())
            except Exception:
                print("elt  ---",e.hash_value())
            print("stack ind: ", [(s.indirect_reference if "indirect_reference" in s.__dir__() else "---" )for s in stack ])
            print("stack native: ", stack)
            print("in? ",elt in stack)
            if hasattr(elt, "indirect_reference"):
                if any(( elt.indirect_reference == getattr(x,"indirect_reference",-1) for x in stack)):
                    return [], []
            if elt in stack:
                # to prevent infinite looping
                return [], []  # pragma: no cover
            try:
                d = cast(
                    Dict[Any, Any],
                    cast(DictionaryObject, elt["/Resources"])["/XObject"],
                )
            except KeyError:
                d = {}
            images = []
            forms = []
            for k, v in d.items():
                print("k",k)
                o = v.get_object()
                try:
                    content: Any = None
                    if (
                        to_delete & ObjectDeletionFlag.XOBJECT_IMAGES
                        and o["/Subtype"] == "/Image"
                    ):
                        content = NullObject()  # to delete the image keeping the entry
                        images.append(k)
                    if o["/Subtype"] == "/Form":
                        forms.append(k)
                        print("hash(",k,"): ",o.hash_value()," in ",d.hash_value())
                        print("value :",type(v),v)
                        if isinstance(o, ContentStream):
                            content = o
                        else:
                            content = ContentStream(o, self)
                            content.update(
                                {
                                    k1: v1
                                    for k1, v1 in o.items()
                                    if k1 not in ["/Length", "/Filter", "/DecodeParms"]
                                }
                            )
                            try:
                                content.indirect_reference = o.indirect_reference
                            except AttributeError:
                                pass
                        stack.append(elt)
                        print("call")
                        clean_forms(content, stack)  # clean sub forms
                    if content is not None:
                        if isinstance(v, IndirectObject):
                            self._objects[v.idnum - 1] = content
                        else:
                            # should only occur with pdf not respecting pdf spec
                            # where streams must be indirected.
                            d[k] = self._add_object(content)  # pragma: no cover
                except (TypeError, KeyError):
                    pass
            for im in images:
                del d[im]  # for clean-up
            if isinstance(elt, StreamObject):  # for /Form
                if not isinstance(elt, ContentStream):  # pragma: no cover
                    e = ContentStream(elt, self)
                    e.update(elt.items())
                    elt = e
                clean(elt, images, forms)  # clean the content
            return images, forms

        if not isinstance(page, PageObject):
            page = PageObject(self, page.indirect_reference)  # pragma: no cover
        if "/Contents" in page:
            content = cast(ContentStream, page.get_contents())

            images, forms = clean_forms(page, [])

            clean(content, images, forms)
            page.replace_contents(content)

@syan-dev
Copy link
Author

syan-dev commented Mar 5, 2024

Thank you very much. Somehow, using stack.append(elt) instead of stack + [elt] fixed the problem. Here is the log for your updated code.
log3.txt

@pubpub-zz
Copy link
Collaborator

Thank you very much. Somehow, using stack.append(elt) instead of stack + [elt] fixed the problem. Here is the log for your updated code. log3.txt

Can you confirm that it is only this last proposal that fixes the issue ?

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this pull request Mar 6, 2024
@pubpub-zz
Copy link
Collaborator

@syanng what would be great would be to have a test file. removing text and replacing all image with some dumb image(looping through images in a page and replacing it with a test image), could you try to produce a test file ?

@syan-dev
Copy link
Author

syan-dev commented Mar 7, 2024

Yes, the issue has been fixed. Thank you very much. For the test file, could you please guide me on how to replace text content or image content with dumb text and dumb images?

@stefan6419846
Copy link
Collaborator

For the test file, could you please guide me on how to replace text content or image content with dumb text and dumb images?

You might want to have a look at the remove_* methods of PdfWriter as in the docs: https://pypdf.readthedocs.io/en/latest/modules/PdfWriter.html#pypdf.PdfWriter.remove_links

@pubpub-zz
Copy link
Collaborator

fixed in #2505

@pubpub-zz pubpub-zz closed this Mar 16, 2024
stefan6419846 pushed a commit that referenced this pull request Mar 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
on-hold PR requests that need clarification before they can be merged.A comment must give details
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: clean_forms function cause infinite looping if elt["/Resources"] have circular relation
3 participants