PI: Use iterative DFS in PdfWriter._sweep_indirect_references #1072
The xref indexes have been updated.
Hm, for some unknown reason py37 and py38 fail, but py39 and py310 were OK.
There are two tests with issues:
I researched them, and those tests "succeeded" only because _sweep_indirect_references hit the recursion limit. That happens because the PDF contains a linked list of over 1000 items.
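To illustrate the failure mode described above, here is a minimal sketch. It does not use PyPDF2 itself; `make_chain`, `sweep_recursive`, and `hits_recursion_limit` are hypothetical names, and nested dicts stand in for the linked PDF objects. It only shows that a recursive walk over a chain longer than the interpreter's recursion limit raises RecursionError.

```python
import sys


def make_chain(length):
    """Build a linked list of nested dicts: {"Next": {"Next": ... None}}."""
    head = None
    for _ in range(length):
        head = {"Next": head}
    return head


def sweep_recursive(node):
    """Hypothetical recursive walk; returns the number of nodes visited."""
    if node is None:
        return 0
    return 1 + sweep_recursive(node["Next"])


def hits_recursion_limit(length):
    """Return True if walking a chain of this length raises RecursionError."""
    try:
        sweep_recursive(make_chain(length))
    except RecursionError:
        return True
    return False
```

With the default limit of 1000, a chain of a little over 1000 items is already enough to trigger the error, which matches the size mentioned above.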
Maybe we should work on getting #351 ready first so that we don't hit the recursion limit anymore?
This can be done with an iterative algorithm like that.
This has now been transformed into an iterative version. Some tests needed an update because the warnings are no longer raised.
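For readers following along, an iterative depth-first traversal can be sketched as below. This is not the code from this PR: `sweep_iterative` is a hypothetical name, and plain dicts and lists stand in for PyPDF2's dictionary and array objects. The point is only that an explicit stack replaces the call stack, so the depth of the structure no longer matters.

```python
def sweep_iterative(root):
    """Visit every dict/list reachable from root without recursion.

    Returns the number of containers visited.
    """
    stack = [root]
    seen = set()  # ids of containers already visited (guards against cycles)
    visited = 0
    while stack:
        node = stack.pop()
        if not isinstance(node, (dict, list)):
            continue  # scalars have no children to sweep
        if id(node) in seen:
            continue
        seen.add(id(node))
        visited += 1
        children = node.values() if isinstance(node, dict) else node
        stack.extend(children)
    return visited
```

A chain of several thousand nested dicts is handled without touching the recursion limit, and self-referencing structures terminate because of the `seen` set.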
Codecov Report
@@            Coverage Diff             @@
##             main    #1072      +/-   ##
==========================================
+ Coverage   91.50%   91.57%   +0.07%
==========================================
  Files          24       24
  Lines        4530     4524       -6
  Branches      927      926       -1
==========================================
- Hits         4145     4143       -2
+ Misses        245      241       -4
  Partials      140      140
Continue to review full report at Codecov.
Wow, this is amazing @Hatell ! Thank you 🙏 🤗 I will review it today, but it might take until the evening :-)
This is now ready for testing. The main changes are:
One fix still needs to be done: recalculate the hashes of all parents if a dictionary or array object value changes.
If data is changed, the keys of all parents are updated. Added checks to the tests to verify that all keys in _idnum_hash are valid.
I think I solved this issue for recalculating hashes when updating a dictionary or array object.
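The idea of detecting duplicates by hash, and of keeping the keys valid after a value changes, can be sketched as follows. This is a simplified illustration and not PyPDF2's implementation: `content_hash`, `Registry`, `idnum_by_hash`, `rehash`, and `keys_valid` are hypothetical names, dictionary keys are assumed to be strings, and nested values are hashed inline rather than through indirect references, so re-hashing the containing object covers its children. The hash function is recursive only for brevity.

```python
import hashlib


def content_hash(obj):
    """Hash a nested structure of dicts, lists and scalars by its content."""
    if isinstance(obj, dict):
        parts = [b"d"]
        for key in sorted(obj):  # assumes mutually comparable (string) keys
            parts.append(content_hash(key))
            parts.append(content_hash(obj[key]))
    elif isinstance(obj, list):
        parts = [b"l"] + [content_hash(item) for item in obj]
    else:
        parts = [b"s", repr(obj).encode("utf-8")]
    return hashlib.sha256(b"".join(parts)).digest()


class Registry:
    """Hypothetical object table that reuses numbers for equal content."""

    def __init__(self):
        self.objects = []        # object number = list index + 1
        self.idnum_by_hash = {}  # content hash -> object number

    def add(self, obj):
        key = content_hash(obj)
        if key in self.idnum_by_hash:
            return self.idnum_by_hash[key]  # duplicate: reuse the number
        self.objects.append(obj)
        idnum = len(self.objects)
        self.idnum_by_hash[key] = idnum
        return idnum

    def rehash(self, idnum):
        """Refresh the key of object idnum after it was mutated in place."""
        stale = [k for k, n in self.idnum_by_hash.items() if n == idnum]
        for k in stale:
            del self.idnum_by_hash[k]
        self.idnum_by_hash[content_hash(self.objects[idnum - 1])] = idnum

    def keys_valid(self):
        """Check that every stored key still matches its object's content."""
        return all(
            content_hash(self.objects[n - 1]) == k
            for k, n in self.idnum_by_hash.items()
        )
```

The `keys_valid` check mirrors the kind of test assertion mentioned above: after any in-place change, a stale key means later duplicate lookups would silently miss.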
Thank you so much for all the effort @Hatell ! I've adjusted the title of the PR and the first message of it. I will use them for the squash commit to represent all of the changes done here. Feel free to adjust if you think there should be something added / adjusted.
If you want, you can also remove the
in
I'm currently letting a bigger test run through. So far, it looks good. I'm still a tiny bit worried as this is such a core part of PyPDF2 😅
Great, and thanks for the help.
Thank you for your contribution ❤️ I'll make a release in a couple of hours
New Features (ENH):
- Add PageObject._get_fonts (#1083)
- Add support for indexed color spaces / BitsPerComponent for decoding PNGs (#1067)

Performance Improvements (PI):
- Use iterative DFS in PdfWriter._sweep_indirect_references (#1072)

Bug Fixes (BUG):
- Let Page.scale also scale the crop-/trim-/bleed-/artbox (#1066)
- Column default for CCITTFaxDecode (#1079)

Robustness (ROB):
- Guard against None-value in _get_outlines (#1060)

Documentation (DOC):
- Stamps and watermarks (#1082)
- OCR vs PDF text extraction (#1081)
- Python Version support
- Formatting of CHANGELOG

Developer Experience (DEV):
- Cache downloaded files (#1070)
- Speed-up for CI (#1069)

Maintenance (MAINT):
- Set page.rotate(angle: int) (#1092)
- Issue #416 was fixed by #1015 (#1078)

Testing (TST):
- Image extraction (#1080)
- Image extraction (#1077)

Code Style (STY):
- Apply black
- Typo in Changelog

Full Changelog: 2.4.2...2.4.3
PdfWriter.external_reference_map
and calculate a hash from every referred object and use that to detect duplicate objects.
Closes #351
Closes #1036