Encounter a valid pdf file but PyPDF2 fail on it #88
Comments
I get the same error:
I have the same issue. I attached 3 PDF files that illustrate the problem.
from pathlib import Path
import PyPDF2
for path in Path(__file__).parent.glob('form*.pdf'):
    with open(path, 'rb') as f:
        pdf = PyPDF2.PdfFileReader(f)
        fields = pdf.getFormTextFields()
    print(path.name, fields)
Output:
This is on Ubuntu 20.10 with Python 3.8.6 and PyPDF2 1.26.0, installed through apt. The error occurs while decoding the xref table (the last one in the file, which is the first one to be read). I can also reproduce this by taking only the contents of the xref stream and feeding that to zlib:
import zlib
data = b'x\x9cbd`\x084a``d`\x08b\x87Pi`*x6\x98\n\xb9\x08\xa6\xc2\xa4\x18\x18\x00\x00\x00\x00\xff\xff'
print(f'data: {data}')
decompressed_data = zlib.decompress(data)
print(f'decompressed_data: {decompressed_data}')
However, if I use zlib's decompressobj instead, decompression works:
import zlib
data = b'x\x9cbd`\x084a``d`\x08b\x87Pi`*x6\x98\n\xb9\x08\xa6\xc2\xa4\x18\x18\x00\x00\x00\x00\xff\xff'
print(f'data: {data}')
decompressed_data = zlib.decompressobj().decompress(data)
print(f'decompressed_data (obj): {decompressed_data}')
I don't understand why this happens, since zlib's documentation says:
and this stream certainly fits in my memory. But it resolves the issue for me, so maybe it would be good to always use this when decompressing Flate-encoded streams.
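A possible explanation, offered as a sketch rather than a confirmed diagnosis of the attached files: `zlib.decompress` expects a complete stream and raises `zlib.error` when the stream ends early, whereas a decompression object simply returns whatever it has decoded so far. The example below builds its own deliberately incomplete stream instead of using the bytes from this thread. The helper name `flate_decode_lenient` is hypothetical and is not part of PyPDF2's API.

```python
import zlib


def flate_decode_lenient(data: bytes) -> bytes:
    """Hypothetical helper: decompress Flate data, tolerating a truncated stream."""
    try:
        # Strict path: needs a complete stream, including the final block
        # and the Adler-32 checksum trailer.
        return zlib.decompress(data)
    except zlib.error:
        # Lenient path: a decompression object returns the bytes it could
        # decode instead of failing because the stream never terminated.
        # Data that is corrupt from the start will still raise zlib.error here.
        return zlib.decompressobj().decompress(data)


# Build an incomplete stream for illustration: a sync flush emits all pending
# output but writes neither the final block nor the checksum trailer.
payload = b"example xref stream contents " * 4
compressor = zlib.compressobj()
truncated = compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)

# The decompression object's `eof` attribute reports whether the end of the
# compressed stream was actually reached.
probe = zlib.decompressobj()
recovered = probe.decompress(truncated)
print("reached end of stream:", probe.eof)
print("payload recovered:", flate_decode_lenient(truncated) == payload)
```

Trying the strict call first keeps the normal behaviour for well-formed files. The trade-off of the fallback is that a genuinely cut-off stream yields partial data silently, so a real implementation may want to log a warning when `eof` is false.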
This should resolve py-pdf#88
Adding fix from pypdf2 py-pdf#88
Co-authored-by: Stef Sijben <[email protected]>
Deprecations (DEP):
- Remove support for Python 2.6 and older (#776)

New Features (ENH):
- Extract document permissions (#320)

Bug Fixes (BUG):
- Clip by trimBox when merging pages, which would otherwise be ignored (#240)
- Add overwriteWarnings parameter PdfFileMerger (#243)
- IndexError for getPage() of decrypted file (#359)
- Handle cases where decodeParms is an ArrayObject (#405)
- Updated PDF fields don't show up when page is written (#412)
- Set Linked Form Value (#414)
- Fix zlib -5 error for corrupt files (#603)
- Fix reading more than last1K for EOF (#642)
- Accidental import

Robustness (ROB):
- Allow extra whitespace before "obj" in readObjectHeader (#567)

Documentation (DOC):
- Link to pdftoc in Sample_Code (#628)
- Working with annotations (#764)
- Structure history

Developer Experience (DEV):
- Add issue templates (#765)
- Add tool to generate changelog

Maintenance (MAINT):
- Use grouped constants instead of string literals (#745)
- Add error module (#768)
- Use decorators for @staticmethod (#775)
- Split long functions (#777)

Testing (TST):
- Run tests in CI once with -OO Flags (#770)
- Filling out forms (#771)
- Add tests for Writer (#772)
- Error cases (#773)
- Check Error messages (#769)
- Regression test for issue #88
- Regression test for issue #327

Code Style (STY):
- Make variable naming more consistent in tests

All changes: 1.27.5...1.27.6
That file can be decompressed by pdftk, but PyPDF2's FlateDecode fails on it:
Here is the data to be decompressed (repr print):
The PDF file can also be opened correctly by Preview on OS X.