-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #297 : fix corruption in startxref or xref table #788
Conversation
pubpub-zz
commented
Apr 20, 2022
Proposal for #297 Issue is due to corrupted startxref pointer and/or xref table Fixed proposed if incorrect pointer look for all objs to rebuild an xref table Tests updated
Codecov Report
@@ Coverage Diff @@
## main #788 +/- ##
==========================================
+ Coverage 75.21% 75.31% +0.09%
==========================================
Files 11 12 +1
Lines 3514 3609 +95
Branches 809 835 +26
==========================================
+ Hits 2643 2718 +75
- Misses 658 670 +12
- Partials 213 221 +8
Continue to review full report at Codecov.
|
Tests/test_reader.py
Outdated
with pytest.raises(UserWarning): | ||
reader = PdfFileReader(path, "rb") | ||
reader.getPage(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deactivates the test. At least for strict=False
I would expect reader.extractText()
to give something reasonable.
The rest of the PR might be fine, but it does not fix issue297. I will not merge this part.
cf comments in #788 (comment) Test in strict=False and True; use PdfReadWarning instead of UserWarniing to be consistant with other raising
@MartinThoma |
I understand :-) This might help to capture both cases:
|
@MartinThoma ,
|
@pubpub-zz Thank your for pointing it out - yes, that is ok :-) The parametrize option is better as it shows clearer where an issue is (if there was one) + it avoids a bit of code duplication. But leave it as is - that is just a nitpick; I'll polish that in another commit. Thank you 🤗 I'll not merge it for now. In my roof apartment one of the windows just broke out ... I'm still shocked + cannot concentrate. I'll likely do review it again tomorrow when I can focus again 😅 |
@MartinThoma, good luck with your Window(...s 10 or 11 ? That is the most common window which is breaking down 🤣) |
Good morning! There seems to be something off with
As you can see, the byte string starts with Can you help me understand why the |
(minor comment, feel free to ignore: Running black over just the test is fine: |
due to the formatting, the original formatting string includes a double '%%' that is converted to a single during formating. the point is that the .find(b"xref") applies the search on the original function. therefore the 327 points to the r and not the x of the xref. |
applied only on test_reader as recommanded
Thank you - that was what I missed 🙏 👍 Thank you for fixing the tests!
I fear that we might now reject / fail to parse a lot of PDFs that we could parse beforehand. Fixing the xref table if we can is pretty nice - and raising an exception if it's broken as well 👍 |
Test completed to cover old cases loop detected and fix if rebuilding xref table
So did I. I've extended to test to cover both nominal case and incorrect pointer. A loop has been detected and fixed in such cases. |
Thank you very much for all the time and effort you've put into this! I appreciate it! |
@pubpub-zz I've created a follow-up PR that improves this IMHO a little bit. Nothing substantial, but from a maintainers perspective important. I'll add the reasoning in a second: #830 |
This refactoring aims at making maintenance easier: 1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out 2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes 3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 4. `_rebuild_xref_table` was moved to a method for the same reason.
Use PdfReadWarning instead of UserWarning to be consistent Closes #297
This refactoring aims at making maintenance easier: 1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out 2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes 3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 4. `_rebuild_xref_table` was moved to a method for the same reason.
Use PdfReadWarning instead of UserWarning to be consistent Closes py-pdf#297
This refactoring aims at making maintenance easier: 1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out 2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes 3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 4. `_rebuild_xref_table` was moved to a method for the same reason.
Robustness (ROB): - Handle missing destinations in reader (#840) - warn-only in readStringFromStream (#837) - Fix corruption in startxref or xref table (#788 and #830) Documentation (DOC): - Project Governance (#799) - History of PyPDF2 - PDF feature/version support (#816) - More details on text parsing issues (#815) Developer Experience (DEV): - Add benchmark command to Makefile - Ignore IronPython parts for code coverage (#826) Maintenance (MAINT): - Split pdf module (#836) - Separated CCITTFax param parsing/decoding (#841) - Update requirements files Testing (TST): - Use external repository for larger/more PDFs for testing (#820) - Swap incorrect test names (#838) - Add test for PdfFileReader and page properties (#835) - Add tests for PyPDF2.generic (#831) - Add tests for utils, form fields, PageRange (#827) - Add test for ASCII85Decode (#825) - Add test for FlateDecode (#823) - Add test for filters.ASCIIHexDecode (#822) Code Style (STY): - Apply pre-commit (black, isort) + use snake_case variables (#832) - Remove debug code (#828) - Documentation, Variable names (#839) Full Changelog: 1.27.9...1.27.10