missing function to repair non-reference fields in corrupted PDF #2453

Closed
MisterStump opened this issue Feb 12, 2024 · 4 comments · Fixed by #2480
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
workflow-forms: From a user's perspective, forms is the affected feature/workflow

Comments

MisterStump commented Feb 12, 2024

Edit 2/14: Title changed. The underlying problem is that the included sample PDF does not structure its field data correctly. This issue now tracks the possible addition of handling for this form of corruption.


On some PDFs, get_fields and get_form_text_fields both return incomplete data. This has only occurred on 1 or 2 PDFs out of the couple dozen I've encountered. In the included example PDF, there are 15 fields but only 8 are returned. There are no errors or tracebacks that I can find.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19045-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

file_path = "ExampleForm.pdf"
reader = PdfReader(file_path)  # PdfReader accepts a path and opens the file itself

# get_fields() can return None (for example when the document has no form),
# so fall back to an empty dict before calling len()
fields = reader.get_fields() or {}
text_fields = reader.get_form_text_fields()

print("getFields: ", len(fields))
print("getTextFields: ", len(text_fields))
print(text_fields)
# Printed results:
"""
getFields:  8
getTextFields:  8
{'txtAgentExecName': None, 'txtTitleEx': None, 'Entity Name': None, 'Notice Date': '  ', 'Entity Name (Contact Info)': None, 'Zip': None, 'Registered Agent': '  ', 'Registered Agent (Signature)': ' '}
"""

PDF from Example:
ExampleForm.pdf

You may add this to your tests.

Traceback

This is the complete traceback I see:

No traceback. I don't see any errors or warnings.

@pubpub-zz
Collaborator

Your PDF does not conform to the standard: it contains some fields that are not children of another field (no parent/child link) and that are not referenced in the PDF's /Root/AcroForm/Fields array.
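
To illustrate the structure problem described in this comment, here is a minimal sketch using a toy model. Plain Python dicts stand in for PDF dictionary objects; this is not pypdf's API or its implementation. The helper names (`reachable_fields`, `orphan_widgets`) and the field name "City" are hypothetical. The assumption is that a form field is only discovered when it can be reached from /AcroForm/Fields (directly or through /Kids), so a widget annotation that sits on a page but is not listed there goes unnoticed.

```python
# Toy model: plain dicts stand in for PDF dictionaries (NOT pypdf's API).

def reachable_fields(acroform):
    """Collect every field reachable from /AcroForm/Fields via /Kids."""
    found = []
    stack = list(acroform.get("/Fields", []))
    while stack:
        field = stack.pop()
        found.append(field)
        stack.extend(field.get("/Kids", []))
    return found


def orphan_widgets(pages, acroform):
    """Return parentless field widgets that the field tree does not reach."""
    known = {id(f) for f in reachable_fields(acroform)}
    orphans = []
    for page in pages:
        for annot in page.get("/Annots", []):
            is_field = annot.get("/Subtype") == "/Widget" and "/T" in annot
            if is_field and "/Parent" not in annot and id(annot) not in known:
                orphans.append(annot)
    return orphans


listed = {"/Subtype": "/Widget", "/FT": "/Tx", "/T": "Zip"}
orphan = {"/Subtype": "/Widget", "/FT": "/Tx", "/T": "City"}  # hypothetical field
acroform = {"/Fields": [listed]}            # only "Zip" is referenced here
pages = [{"/Annots": [listed, orphan]}]     # both widgets appear on the page

print([a["/T"] for a in orphan_widgets(pages, acroform)])  # ['City']
```

In this model, a lookup that starts from /AcroForm/Fields sees only "Zip", which mirrors the report of 8 fields returned out of 15.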

@MisterStump
Author

Thank you. I don't know enough about the format to have recognized that, but I understand if this is PDF-specific. Should I close this out?

@pubpub-zz
Collaborator

I suggest you leave it open but change the title to "missing function to repair non-reference fields in corrupted PDF". I have in mind a way to propose a new repair feature.

MisterStump changed the title from "get_fields and get_form_text_fields returning incomplete data" to "missing function to repair non-reference fields in corrupted PDF" on Feb 14, 2024
@MisterStump
Author

Updated the title and description, and I will leave it open. Thank you!

stefan6419846 added the workflow-forms and Has MCVE labels on Feb 15, 2024
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Feb 27, 2024
parse page/document annotations for orphan fields and reattach them to AcroForm/Fields
closes py-pdf#2453
stefan6419846 pushed a commit that referenced this issue Feb 28, 2024
Parse page/document annotations for orphan fields and reattach them to AcroForm/Fields
Closes #2453
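
The repair idea named in the commit message (scan annotations for orphan fields and reattach them to AcroForm/Fields) can be sketched on a toy model. This is only an illustration under stated assumptions, not the code from the referenced commit: plain dicts stand in for PDF dictionaries, the function name `reattach_orphans` and the field name "City" are hypothetical, and a widget with a /Parent entry is assumed to be reachable through that parent.

```python
# Toy model: plain dicts stand in for PDF dictionaries (NOT the actual pypdf fix).

def reattach_orphans(catalog, pages):
    """Append parentless field widgets missing from /AcroForm/Fields; return them."""
    fields = catalog.setdefault("/AcroForm", {}).setdefault("/Fields", [])
    reattached = []
    for page in pages:
        for annot in page.get("/Annots", []):
            if annot.get("/Subtype") != "/Widget" or "/T" not in annot:
                continue  # not a form field widget
            if "/Parent" in annot:
                continue  # assumed reachable through its parent field
            if any(annot is f for f in fields):
                continue  # already listed in /Fields
            fields.append(annot)
            reattached.append(annot)
    return reattached


listed = {"/Subtype": "/Widget", "/FT": "/Tx", "/T": "Zip"}
orphan = {"/Subtype": "/Widget", "/FT": "/Tx", "/T": "City"}  # hypothetical field
catalog = {"/AcroForm": {"/Fields": [listed]}}
pages = [{"/Annots": [listed, orphan]}]

fixed = reattach_orphans(catalog, pages)
print([f["/T"] for f in fixed])                             # ['City']
print([f["/T"] for f in catalog["/AcroForm"]["/Fields"]])   # ['Zip', 'City']
```

Running the function a second time on the same structure returns an empty list, since every widget is then already listed in /Fields.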