Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #297 : fix corruption in startxref or xref table #788

Merged
merged 6 commits into from
Apr 27, 2022
Merged

Fix #297 : fix corruption in startxref or xref table #788

merged 6 commits into from
Apr 27, 2022

Conversation

pubpub-zz
Copy link
Collaborator

Proposal for #297

Issue is due to corrupted startxref pointer and/or xref table
Fixed proposed if incorrect pointer look for all objs to rebuild an xref table

Tests updated

    Proposal for #297

    Issue is due to corrupted startxref pointer and/or xref table
    Fixed proposed if incorrect pointer look for all objs to rebuild an xref table

    Tests updated
@codecov-commenter
Copy link

codecov-commenter commented Apr 21, 2022

Codecov Report

Merging #788 (2852248) into main (e7fe707) will increase coverage by 0.09%.
The diff coverage is 95.34%.

@@            Coverage Diff             @@
##             main     #788      +/-   ##
==========================================
+ Coverage   75.21%   75.31%   +0.09%     
==========================================
  Files          11       12       +1     
  Lines        3514     3609      +95     
  Branches      809      835      +26     
==========================================
+ Hits         2643     2718      +75     
- Misses        658      670      +12     
- Partials      213      221       +8     
Impacted Files Coverage Δ
PyPDF2/pdf.py 81.72% <95.34%> (-0.11%) ⬇️
PyPDF2/merger.py 68.07% <0.00%> (-4.87%) ⬇️
PyPDF2/generic.py 67.84% <0.00%> (ø)
PyPDF2/__init__.py 100.00% <0.00%> (ø)
PyPDF2/papersizes.py 100.00% <0.00%> (ø)
PyPDF2/xmp.py 50.27% <0.00%> (+7.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e7fe707...2852248. Read the comment docs.

@MartinThoma
Copy link
Member

#297

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Apr 21, 2022
Comment on lines 219 to 222
with pytest.raises(UserWarning):
reader = PdfFileReader(path, "rb")
reader.getPage(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deactivates the test. At least for strict=False I would expect reader.extractText() to give something reasonable.

The rest of the PR might be fine, but it does not fix issue297. I will not merge this part.

@MartinThoma MartinThoma added the needs-change The PR/issue cannot be handled as issue and needs to be improved label Apr 22, 2022
cf comments in #788 (comment)

Test in strict=False and True;
use PdfReadWarning instead of UserWarniing to be consistant with other raising
@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
Thanks for your review. I do not master pytest. Test reviewed to check both Strict True and False.

@MartinThoma
Copy link
Member

I understand :-)

This might help to capture both cases:

@pytest.mark.parametrize(
    "set_strict, [True, False]
)
def test_issue297(set_strict):
    .... # the test: you pass set_strict to the PdfFileReader and you only do 'with pytest.raises' if set_strict is True

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma ,
The expected result not being the same I've included both in the test manually isn't this good for you ?

def test_issue297():
    path = os.path.join(RESOURCE_ROOT, "issue-297.pdf")
    with pytest.raises(PdfReadWarning) as exc:
        print(exc)
        reader = PdfFileReader(path,strict=True)
        reader.getPage(0)
    assert ( exc.value.args[0].find("startxref") > 0)
    reader = PdfFileReader(path,strict=False)
    reader.getPage(0)

@MartinThoma
Copy link
Member

@pubpub-zz Thank your for pointing it out - yes, that is ok :-) The parametrize option is better as it shows clearer where an issue is (if there was one) + it avoids a bit of code duplication. But leave it as is - that is just a nitpick; I'll polish that in another commit. Thank you 🤗

I'll not merge it for now. In my roof apartment one of the windows just broke out ... I'm still shocked + cannot concentrate. I'll likely do review it again tomorrow when I can focus again 😅

@pubpub-zz
Copy link
Collaborator Author

@MartinThoma, good luck with your Window(...s 10 or 11 ? That is the most common window which is breaking down 🤣)

@MartinThoma MartinThoma added soon PRs that are almost ready to be merged, issues that get solved pretty soon and removed needs-change The PR/issue cannot be handled as issue and needs to be improved labels Apr 23, 2022
@MartinThoma
Copy link
Member

Good morning!

There seems to be something off with test_read_unknown_zero_pages. You've commented, that this is necessary due to the double-percentage in the beginning (%%PDF-1.7). However, this is a part of the percentage formatting. The actual formatted byte-string looks like this:

%PDF-1.7
1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj
2 0 obj << >> endobj
3 0 obj << >> endobj
4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0] /MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R /Resources << /Font << >> >> /Rotate 0 /Type /Page >> endobj
5 0 obj << /Pages 0 0 R /Type /Catalog >> endobj
xref 1 5
0000000010 00000 n
0000000067 00000 n
0000000088 00000 n
0000000109 00000 n
0000000278 00000 n
trailer << /Root 5 1 R /Size 6 >>
startxref 327
%%EOF'

As you can see, the byte string starts with %PDF-1.7 and not %%PDF-1.7.

Can you help me understand why the -1 is now necessary?

@MartinThoma
Copy link
Member

(minor comment, feel free to ignore: Running black over just the test is fine: black Tests/test_reader.py. Please don't run it over the complete codebase as this would add to many changes, but I already had it running over the Tests. For this reason it will just beautify your additions)

@pubpub-zz
Copy link
Collaborator Author

As you can see, the byte string starts with %PDF-1.7 and not %%PDF-1.7.

Can you help me understand why the -1 is now necessary?

due to the formatting, the original formatting string includes a double '%%' that is converted to a single during formating. the point is that the .find(b"xref") applies the search on the original function. therefore the 327 points to the r and not the x of the xref.
the original code was not checking for the xref not reporting any error being tolerant to 'invalid' pdf data.
with the new code, strict=true will reject the file - not compliant with the spec - but with string=false, the xref table will be rebuilt, being able to accept the file
@MartinThoma , Do you think it would be better to accept this pointer error as it used to be without rebuilding the xref table?

applied only on test_reader as recommanded
@MartinThoma
Copy link
Member

MartinThoma commented Apr 24, 2022

the .find(b"xref") applies the search on the original

Thank you - that was what I missed 🙏 👍 Thank you for fixing the tests!

Do you think it would be better to accept this pointer error as it used to be without rebuilding the xref table?

I fear that we might now reject / fail to parse a lot of PDFs that we could parse beforehand. Fixing the xref table if we can is pretty nice - and raising an exception if it's broken as well 👍
The part that I'm worried about is broken PDFs, that we could still parse before.

Test completed to cover old cases
loop detected and fix if rebuilding xref table
@pubpub-zz
Copy link
Collaborator Author

I fear that we might now reject / fail to parse a lot of PDFs that we could parse beforehand. Fixing the xref table if we can is pretty nice - and raising an exception if it's broken as well 👍
The part that I'm worried about is broken PDFs, that we could still parse before.

So did I. I've extended to test to cover both nominal case and incorrect pointer. A loop has been detected and fixed in such cases.

Tests/test_reader.py Outdated Show resolved Hide resolved
Tests/test_reader.py Outdated Show resolved Hide resolved
Tests/test_reader.py Outdated Show resolved Hide resolved
Tests/test_reader.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

Thank you very much for all the time and effort you've put into this! I appreciate it!

MartinThoma added a commit that referenced this pull request Apr 27, 2022
MartinThoma added a commit that referenced this pull request Apr 27, 2022
@MartinThoma
Copy link
Member

@pubpub-zz I've created a follow-up PR that improves this IMHO a little bit. Nothing substantial, but from a maintainers perspective important. I'll add the reasoning in a second: #830

MartinThoma added a commit that referenced this pull request Apr 27, 2022
This refactoring aims at making maintenance easier:

1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out
2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes
3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 
4. `_rebuild_xref_table` was moved to a method for the same reason.
MartinThoma pushed a commit that referenced this pull request Apr 28, 2022
Use PdfReadWarning instead of UserWarning to be consistent

Closes #297
MartinThoma added a commit that referenced this pull request Apr 28, 2022
This refactoring aims at making maintenance easier:

1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out
2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes
3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 
4. `_rebuild_xref_table` was moved to a method for the same reason.
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022
Use PdfReadWarning instead of UserWarning to be consistent

Closes py-pdf#297
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022
This refactoring aims at making maintenance easier:

1. Too long functions make it hard to grasp the overall behavior. Hence the _get_xref_issues function was split out
2. `_get_xref_issues` is made a static method of the PdfFileReader to show that it belongs to the reader, but doesn't require any of its attributes
3. `_get_xref_issues` makes use of an integer return value instead of raising + catching exceptions. 
4. `_rebuild_xref_table` was moved to a method for the same reason.
MartinThoma added a commit that referenced this pull request May 1, 2022
Robustness (ROB):
-  Handle missing destinations in reader (#840)
-  warn-only in readStringFromStream (#837)
-  Fix corruption in startxref or xref table (#788 and #830)

Documentation (DOC):
-  Project Governance (#799)
-  History of PyPDF2
-  PDF feature/version support (#816)
-  More details on text parsing issues (#815)

Developer Experience (DEV):
-  Add benchmark command to Makefile
-  Ignore IronPython parts for code coverage (#826)

Maintenance (MAINT):
-  Split pdf module (#836)
-  Separated CCITTFax param parsing/decoding (#841)
-  Update requirements files

Testing (TST):
-  Use external repository for larger/more PDFs for testing (#820)
-  Swap incorrect test names (#838)
-  Add test for PdfFileReader and page properties (#835)
-  Add tests for PyPDF2.generic (#831)
-  Add tests for utils, form fields, PageRange (#827)
-  Add test for ASCII85Decode (#825)
-  Add test for FlateDecode (#823)
-  Add test for filters.ASCIIHexDecode (#822)

Code Style (STY):
-  Apply pre-commit (black, isort) + use snake_case variables (#832)
-  Remove debug code (#828)
-  Documentation, Variable names (#839)

Full Changelog: 1.27.9...1.27.10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF soon PRs that are almost ready to be merged, issues that get solved pretty soon
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants