-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ROB: ignore_eof everywhere for read_until_regex #1521
Conversation
Codecov ReportBase: 91.79% // Head: 91.79% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #1521 +/- ##
==========================================
- Coverage 91.79% 91.79% -0.01%
==========================================
Files 33 33
Lines 6093 6091 -2
Branches 1200 1199 -1
==========================================
- Hits 5593 5591 -2
Misses 323 323
Partials 177 177
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
This was initially motivated by `NumberObject.read_from_stream`, which was calling `read_until_regex` with the default value of `ignore_eof=False` and thus raising exceptions like: ``` PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly ``` py-pdf@431ba70 demonstrates a similar fix for `NameObject.read_from_stream`. From discussion in py-pdf#1505, it was realized that the change to `NumberObject.read_from_stream` had now made ALL callers of `read_until_regex` pass `ignore_eof=True`. It's cleaner to remove the parameter entirely and change the default behaviour.
cd70bae
to
8417a83
Compare
@rraval |
That would also help me. At the moment, I don't know what to do with those PRs. |
@pubpub-zz yes, I have code that looks like this: from PyPDF2 import PdfMerger
from pathlib import Path
bad_pdf = Path(__file__).with_name("bad.pdf")
output_pdf = Path(__file__).with_name("merged.pdf")
merger = PdfMerger(strict=True)
merger.append(str(bad_pdf))
merger.write(str(output_pdf)) Which produces a traceback like this:
Unfortunately I cannot provide Interestingly, if I run with from PyPDF2 import PdfMerger
from pathlib import Path
bad_pdf = Path(__file__).with_name("bad.pdf")
output_pdf = Path(__file__).with_name("merged.pdf")
merger = PdfMerger(strict=False) # <-- this line changed
merger.append(str(bad_pdf))
merger.write(str(output_pdf)) ... the exceptions are turned into warnings:
... and the resulting PDF is valid with the following quirks:
With this patch applied, I get no warnings and no blank pages (all 6 pages are visually equivalent to the original). |
Does https://pypdf.readthedocs.io/en/latest/user/robustness.html help you? The warnings exist to let the user know that the PDF is broken. You can mute them in the user code, but I would not want to remove them (except if they are actually wrong). And |
Not directly, no. But it's good to have a clear definition of the semantics that pypdf intends. Quoting from that page:
I do not know the PDF spec well (... or at all really), but does a stream termination in this fashion explicitly violate the PDF specification? Or is pypdf implementation being overly strict here? If this is indeed a PDF specification violation, then it strikes me as odd that 431ba70 for For what it's worth, I have not been able to find a PDF viewer or processor that fails on the motivating PDF file, I've tried processing via |
@rraval |
I've done as requested. I apologize for not sharing the file here but the file contains sensitive information and I do not have the rights to publish it broadly. |
Thank you @rraval for trusting us. Having example files helps us (well, @pubpub-zz to be honest - I'm mostly just doing sanity checks / releases / answering community questions 😄 ) |
@MartinThoma |
Thank you for your contributions @rraval 🙏 This change will be part of the release tomorrow @pubpub-zz Thank you for helping me with this one again ❤️ |
New Features (ENH): - Add page label support to PdfWriter (#1558) - Accept inline images with space before EI (#1552) - Add circle annotation support (#1556) - Add polygon annotation support (#1557) - Make merging pages produce a deterministic PDF (#1542, #1543) Bug Fixes (BUG): - Fix error in cmap extraction (#1544) - Remove erroneous assertion check (#1564) - Fix dictionary access of optional page label keys (#1562) Robustness (ROB): - Set ignore_eof=True for read_until_regex (#1521) Documentation (DOC): - Paper size (#1550) Developer Experience (DEV): - Fix broken combination of dependencies of docs.txt - Annotate tests appropriately (#1551) [Full Changelog](3.2.1...3.3.0)
This was initially motivated by
NumberObject.read_from_stream
, which was callingread_until_regex
with the default value ofignore_eof=False
and thus raising exceptions like:431ba70 demonstrates a similar fix for
NameObject.read_from_stream
.From discussion in #1505, it was realized that the change to
NumberObject.read_from_stream
had now made ALL callers ofread_until_regex
passignore_eof=True
. It's cleaner to remove the parameter entirely and change the default behaviour.