Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PdfWriter.write() in context manager closes stream when it should not #2905

Closed
alexaryn opened this issue Oct 16, 2024 · 5 comments · Fixed by #2909
Closed

PdfWriter.write() in context manager closes stream when it should not #2905

alexaryn opened this issue Oct 16, 2024 · 5 comments · Fixed by #2909
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF PdfWriter The PdfWriter component is affected

Comments

@alexaryn
Copy link
Contributor

Calling PdfWriter.write(fileobj) unexpectedly closed fileobj causing my program to crash later when it tried to do fileobj.seek(0).

Environment

Which environment were you using when you encountered the problem?

Linux amd64; Python 3.11 and 3.12, but it doesn't matter, as it's a logic bug in the code.

Code

This code will trigger the issue with any PDF input:

def select_pdf_pages(input: BinaryIO, out: BinaryIO, page_list: list[int]) -> None:
    input.seek(0)
    with pypdf.PdfReader(input) as pdf_reader:
        with pypdf.PdfWriter() as pdf_writer:
            for page_num in page_list:
                pdf_writer.add_page(pdf_reader.pages[page_num - 1])
            pdf_writer.write(out)

After calling this, out is closed. It should not be.

Traceback

n/a

Explanation

The problem arises here:

if self.with_as_usage:

which seems like a special case for when write() is called from __exit__(). However, just because the PdfWriter was used in a with ... as statement doesn't mean that the stream passed into write() is the internal self.fileobj. If it belongs to the caller, it should not be messed with.

Potential Solution

Rather than closing in write(), why not close afterward in __exit__()?

Partial Workaround

write_stream() does not have the problematic logic.

@stefan6419846
Copy link
Collaborator

Thanks for the report. Feel free to submit a corresponding PR and test.

@stefan6419846 stefan6419846 added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF PdfWriter The PdfWriter component is affected labels Oct 16, 2024
@pubpub-zz
Copy link
Collaborator

I do not agree with your comment saying that the stream should not be closed : once you are calling write() from the writer point of view the stream has nothing more to do.
if you wan to keep it open, this should work:

def select_pdf_pages(input: BinaryIO, out: BinaryIO, page_list: list[int]) -> None:
    input.seek(0)
    with pypdf.PdfReader(input) as pdf_reader:
        pdf_writer = pypdf.PdfWriter()
        for page_num in page_list:
            pdf_writer.add_page(pdf_reader.pages[page_num - 1])
        pdf_writer.write(out)

@alexaryn
Copy link
Contributor Author

@pubpub-zz I think there are two arguments against keeping the current behavior.

  • Code like pdf_writer.write(out) should be consistent in its effect without looking back to see how pdf_writer was instantiated.
  • Use of context managers should be encouraged in order to prevent resource waste bugs. The above example suffers from such a bug by not calling pdf_writer.close(). As a rule, context managers clean up just the object instantiated in the with statement.

I'd also counter the supposition that the stream is useless after write(). It's up to the caller to decide if it has further uses for the stream. One common pattern in the unix world is to hold open a file that has been unlinked, useful for auto-cleaning temporary files.

@MasterOdin
Copy link
Member

MasterOdin commented Oct 18, 2024

I agree with @alexaryn in that responsibility of closing a stream should rest purely on the caller of PdfWriter, and not PdfWriter itself. Calling pdf_writer.close() or using the context manager should only clean up the resources that PdfWriter itself has created as part of its operation. I apologize for not catching the current behavior when I reviewed #1193. 😬

Looking at the code, there is also the inefficiency where PdfWriter.__init__ is called twice. It's also permissible to call PdfWriter with a PdfReader as the first argument (to simplify cloning), but in doing so with contexts, will hit the following error:

>>> with PdfReader("./sample-files/001-trivial/minimal-document.pdf") as reader:
...   with PdfWriter(reader) as writer:
...     print(writer.fileobj)
...
__init__
__enter__
__init__
<pypdf._reader.PdfReader object at 0x10304e970>
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/Users/mpeveler/code/github/pypdf/pypdf/_writer.py", line 373, in __exit__
    self.write(self.fileobj)
  File "/Users/mpeveler/code/github/pypdf/pypdf/_writer.py", line 1396, in write
    self.write_stream(stream)
  File "/Users/mpeveler/code/github/pypdf/pypdf/_writer.py", line 1367, in write_stream
    object_positions, free_objects = self._write_pdf_structure(stream)
  File "/Users/mpeveler/code/github/pypdf/pypdf/_writer.py", line 1500, in _write_pdf_structure
    stream.write(self.pdf_header.encode() + b"\n")
AttributeError: 'PdfReader' object has no attribute 'write'

I'm not sure what the correct behavior here should be, especially with regards to cloning. Like, if I call:

with PdfWriter("foo.pdf") as writer:

and foo.pdf exists, do I expect that it'll be cloned into the writer? I'd think no? However, I think this does force the second __init__ call to be made as there's no way to tell in the first __init__ that we're in a context. I'd think this could be confusing for end users, but I guess not since no one has written in about this behavior.

@pubpub-zz
Copy link
Collaborator

@MasterOdin / @alexaryn, I've created a new issue with the @MasterOdin's last post as there is some issues there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF PdfWriter The PdfWriter component is affected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants