Recursion error when using `clone_from` of PdfWriter on PDF 2.0 specification #2839

stefan6419846 · 2024-09-08T18:53:03Z

Environment

$ python -m platform
Linux-6.8.0-100039-tuxedo-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.3.0

This is effectively the latest code from main.

Code + PDF

This is a minimal, complete example that shows the issue:

>>> from pypdf import PdfWriter
>>> writer = PdfWriter(clone_from='ISO_32000-2-2020_sponsored.pdf')

Using PdfReader and iterating over the pages extracting the text does not fail.

I cannot share the document (1003 pages) here as it is the non-public copy of the PDF 2.0 specification available for free on https://pdfa.org/sponsored-standards/

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_writer.py", line 233, in __init__
    self.clone_document_from_reader(clone_from)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_writer.py", line 1150, in clone_document_from_reader
    self.clone_reader_document_root(reader)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_writer.py", line 1119, in clone_reader_document_root
    self._root_object = reader.root_object.clone(self)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 258, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 369, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 274, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 258, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 369, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 274, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 258, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 369, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
[...]
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 129, in clone
    arr.append(data.clone(pdf_dest, force_duplicate, ignore_fields))
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 274, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 258, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 369, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 129, in clone
    arr.append(data.clone(pdf_dest, force_duplicate, ignore_fields))
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 274, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 258, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 369, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 129, in clone
    arr.append(data.clone(pdf_dest, force_duplicate, ignore_fields))
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 266, in clone
    obj = self.get_object()
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 286, in get_object
    return self.pdf.get_object(self)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_reader.py", line 381, in get_object
    retval = self._get_object_from_stream(indirect_reference)  # type: ignore
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_reader.py", line 315, in _get_object_from_stream
    obj_stm: EncodedStreamObject = IndirectObject(stmnum, 0, self).get_object()  # type: ignore
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 286, in get_object
    return self.pdf.get_object(self)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_reader.py", line 442, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 1305, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 562, in read_from_stream
    if isinstance(length, IndirectObject):
  File "/usr/lib/python3.10/typing.py", line 1503, in __instancecheck__
    issubclass(instance.__class__, cls)):
RecursionError: maximum recursion depth exceeded


pubpub-zz · 2024-09-09T19:12:37Z

I see the same behavior on Windows with Python 3.10.5.

However, after upgrading to Python 3.13 (standard build), the file loads successfully once the recursion limit is set to 5000. On Python 3.10, there is a crash with a stack overflow instead.
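A minimal sketch of the workaround described here, assuming the limit is raised through the standard `sys.setrecursionlimit`. The helper `raised_recursion_limit` and the stand-in function `nesting_depth` are hypothetical names used only for illustration; they are not part of pypdf.

```python
import sys
from contextlib import contextmanager


@contextmanager
def raised_recursion_limit(limit):
    """Temporarily raise the interpreter's recursion limit (hypothetical helper)."""
    previous = sys.getrecursionlimit()
    # Only ever raise the limit; keep the current one if it is already higher.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Restore the original limit once the deep operation is done.
        sys.setrecursionlimit(previous)


def nesting_depth(remaining):
    """Stand-in for a deeply recursive clone: recurses `remaining` times."""
    if remaining == 0:
        return 0
    return 1 + nesting_depth(remaining - 1)


# 3000 nested calls exceed the default limit of 1000 but fit under 5000.
with raised_recursion_limit(5000):
    depth_reached = nesting_depth(3000)
```

With pypdf, the same context manager would wrap the call from the report, for example `with raised_recursion_limit(5000): writer = PdfWriter(clone_from='ISO_32000-2-2020_sponsored.pdf')`. Raising the limit only removes the interpreter's safety check; as observed on Python 3.10, the process can still run out of native stack.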

pubpub-zz · 2024-09-12T21:12:56Z

@stefan6419846
I propose closing this and converting it into a discussion to keep it for the record.

stefan6419846 · 2024-09-13T07:39:21Z

I do not think we should convert this into a discussion, as this surely is some bug/limitation. Is there any reason why this would not fail for the reader, but for the writer? In any case, I recommend documenting the reason for this inside our docs and proposing possible workarounds, like increasing the recursion limit (with an example) or splitting large documents beforehand.

pubpub-zz · 2024-09-19T21:08:27Z

> I do not think we should convert this into a discussion, as this surely is some bug/limitation. Is there any reason why this would not fail for the reader, but for the writer?

Yes: with the reader, objects are only read, loaded, and cached into memory when they are required. In the current design, the PdfWriter pulls in (clones) the root object and all linked objects recursively.
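The difference can be illustrated with a toy model. This is a sketch under stated assumptions, not pypdf's actual implementation: the object table, `make_chain`, `clone_recursive`, and `clone_iterative` are all hypothetical. Objects live in a flat table and refer to each other by number, loosely like indirect references. Looking up one object touches only that entry, while a recursive clone needs one stack frame per link in the chain.

```python
import sys


def make_chain(length):
    """Build a flat table of objects, each referencing the next one by number."""
    return {
        number: {"next": number + 1 if number + 1 < length else None}
        for number in range(length)
    }


def clone_recursive(source, ref, dest):
    """Copy `ref` and everything it links to, one stack frame per object."""
    if ref in dest:
        return
    dest[ref] = dict(source[ref])
    following = source[ref]["next"]
    if following is not None:
        clone_recursive(source, following, dest)


def clone_iterative(source, root, dest):
    """Same traversal with an explicit work list instead of the call stack."""
    pending = [root]
    while pending:
        ref = pending.pop()
        if ref in dest:
            continue
        dest[ref] = dict(source[ref])
        following = source[ref]["next"]
        if following is not None:
            pending.append(following)


# A chain slightly longer than the current recursion limit.
chain = make_chain(sys.getrecursionlimit() + 100)

# Reader-style access: resolving a single object needs no recursion at all.
first_object = chain[0]

# Writer-style eager clone: follows every link and exhausts the call stack.
try:
    clone_recursive(chain, 0, {})
    recursive_ok = True
except RecursionError:
    recursive_ok = False

# The explicit work list handles the same chain without deep recursion.
copied = {}
clone_iterative(chain, 0, copied)
```

In this toy model the recursive variant fails as soon as the chain is longer than the recursion limit, which mirrors why only the eager clone path is affected.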

> In any case, I recommend documenting the reason for this inside our docs and propose possible workarounds, like increasing the recursion limit (with an example) or splitting large documents beforehand.

Then I would propose adding the following to the documentation:
"When cloning or merging a document, a recursion error may occur. You can try increasing the recursion limit (see `sys.setrecursionlimit`). You may also try a newer Python version."


stefan6419846 added the PdfWriter label (The PdfWriter component is affected) Sep 8, 2024

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 22, 2024

ENH: add full parameter to PdfWriter constructor

c29a426

allow to load hudge files closes py-pdf#2839

pubpub-zz mentioned this issue Sep 22, 2024

ENH: add full parameter to PdfWriter constructor #2865

Merged

stefan6419846 closed this as completed in #2865 Sep 25, 2024

stefan6419846 pushed a commit that referenced this issue Sep 25, 2024

ENH: Add full parameter to PdfWriter constructor (#2865)

dcd15aa

Allow to load huge files. Closes #2839.
