-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PI: Check duplicate objects in writer._sweep_indirect_references #207
Conversation
Hey! It's been 7 years, but your PR was not forgotten! Would you mind to either re-do the PR or to fix the merge conflicts? If not, I can take care of it and give credit to you via the Github co-authored-by feature |
I can take a look this soon. |
Now I think this is ready to testing. I done one quick test and verified that merged PDF is smaller and looks same. |
Now tests should pass. |
I'm very sorry that I didn't merge it 🙈 There were just too many open issues / PRs - i completely missed that one 🙈 |
Would you mind doing the changes again with the latest I would completely understand if you don't want to do that. In that case I would suggest that I take care of it and give the credit to you in the commit message. With Githubs "co-authored-by" feature you would also be seen as an author of the commit. |
I take a look. |
Codecov Report
@@ Coverage Diff @@
## main #207 +/- ##
==========================================
- Coverage 89.60% 89.59% -0.01%
==========================================
Files 24 24
Lines 4425 4441 +16
Branches 914 917 +3
==========================================
+ Hits 3965 3979 +14
Misses 314 314
- Partials 146 148 +2
Continue to review full report at Codecov.
|
4eb187e
to
9fab63a
Compare
I think it should work now. I did few tests and it worked well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
@pubpub-zz @MasterOdin What do you think? |
@Hatell If nobody has doubts on that one, I'll merge + release latest on Sunday evening. |
I'm running a test over thousands of PDFs with essentially this code: from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader(path)
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
writer.add_metadata(reader.metadata) Then I check the file size before and after. When the difference is over 1 MB, I print it for manual inspection. Here is what I found so far: ( |
Only by this PRThe size reduction also already happens with the current
|
@Hatell Using the approach above, a few files failed that were working before:
|
Thanks for tests. I check what happens in those files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like the usage of self._idnum_hash
and extern_map
share a slightly similar purpose, that could we get rid of the former altogether? With this change, could it be that cases where we would avoid the if
surrounding if
statement of the change we'll now enter as we've not written to extern_map
for those deep self-referential objects.
Now those NoneType object has no attribute is handled. |
Thank you for your contribution! It will be part of the next release 🎉 (I'm uncertain when that will be, but likely on 3rd of July) |
New Features (ENH): - Add writer.pdf_header property (getter and setter) (#1038) Performance Improvements (PI): - Remove b_ call in FloatObject.write_to_stream (#1044) - Check duplicate objects in writer._sweep_indirect_references (#207) Documentation (DOC): - How to surppress exceptions/warnings/log messages (#1037) - Remove hyphen from lossless (#1041) - Compression of content streams (#1040) - Fix inconsistent variable names in add-watermark.md (#1039) - File size reduction - Add CHANGELOG to the rendered docs (#1023) Maintenance (MAINT): - Handle XML error when reading XmpInformation (#1030) - Deduplicate Code / add mutmut config (#1022) Code Style (STY): - Use unnecessary one-line function / class attribute (#1043) - Docstring formatting (#1033) Full Changelog: 2.4.0...2.4.1
Caused-by: #207 Why it wasn't detected by the tests: We don't have any tests that check for the correct result of a merge. We just check for exceptions How we prevent it in future: Unit test was added Risk of the fix: - We will have bigger file sizes again as #207 was effectively reverted - We will need to adjust this test if we change the way we write PDFs Closes: #1062
Caused-by: #207 Why it wasn't detected by the tests: We don't have any tests that check for the correct result of a merge. We just check for exceptions How we prevent it in future: Unit test was added Risk of the fix: - We will have bigger file sizes again as #207 was effectively reverted - We will need to adjust this test if we change the way we write PDFs Closes: #1062
Caused-by: #207 Why it wasn't detected by the tests: We don't have any tests that check for the correct result of a merge. We just check for exceptions How we prevent it in future: Unit test was added Risk of the fix: - We will have bigger file sizes again as #207 was effectively reverted - We will need to adjust this test if we change the way we write PDFs Closes: #1062
PdfFileWriter._sweepIndirectReferences checks if object is already in pdf.
This check will reduce file-size significantly when merge pdfs and there are same objects, for example images.