IndexError: index out of range when encountering a digital certificate/signature #1245

Closed
Bryan-Fagan opened this issue Aug 16, 2022 · 3 comments · Fixed by #1258


@Bryan-Fagan

I'll start by saying I'm very new to Python and PyPDF2. I'm trying to collect all of the fields in a PDF into a dataframe. Eventually I want to process thousands of PDFs that all have the same structure (form) as the baseline and place their fields into the dataframe. This code works great on a PDF without a digital certificate/signature. However, when I run it on a PDF with a digital certificate/signature, I get an error.

I don't really need the digital signature/certificate part of the document, so I think the easiest fix is to skip that field of the PDF. However, I don't know how to do that, since PyPDF2 looks at every field.

I was able to get around the error with try/except, but then nothing was captured from the PDF (i.e. the result was blank).
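One way to scope the try/except workaround is per file, so that a PDF that fails to parse is recorded by name instead of silently producing a blank row. The following is only a minimal sketch under assumptions: `collect_fields` and its `read_fields` parameter are hypothetical names, not part of PyPDF2, and the reader is passed in as a callable so the loop logic can be checked without a real PDF.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple


def collect_fields(
    paths: Iterable[str],
    read_fields: Callable[[str], Optional[Dict[str, Any]]],
) -> Tuple[List[Dict[str, Any]], List[Tuple[str, str]]]:
    """Read form fields from each path, keeping going when one file fails.

    `read_fields` is a hypothetical callable supplied by the caller; it takes
    a path and returns a dict of fields (or None when the PDF has no form).
    """
    rows: List[Dict[str, Any]] = []
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            fields = read_fields(path)
        except Exception as exc:  # e.g. the IndexError from the traceback
            # Record which file failed and why, then move on to the next one.
            failures.append((path, repr(exc)))
            continue
        # One row per PDF; a None result is treated as "no fields".
        rows.append({"file": path, **(fields or {})})
    return rows, failures
```

With the code from this post, `read_fields` could be something like `lambda p: pypdf.PdfFileReader(p, strict=False).getFields()`, and `rows` can then be passed to `pd.DataFrame`. This does not recover the fields of the failing PDF; it only keeps the batch running and tells you which files need attention.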

Environment

Plotly Dash Workspace

$ python -m platform
# TODO: Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-debian-buster-sid

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
# TODO: 2.10.0

Code + PDF

import os

import pandas as pd
import PyPDF2 as pypdf

directory = 'files'
df = pd.DataFrame()  # accumulates one row of field data per PDF

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isfile(f):
        print(f)
        pdf = pypdf.PdfFileReader(f, strict=False)
        print(pdf)
        #information = pdf.getFormTextFields()
        information = pdf.getFields()  # raises IndexError on the signed PDF
        print(information)
        output = pd.DataFrame([information])
        df = pd.concat([df, output], ignore_index=True)

I'll have to play around with the PDF to see if I can post it, as it contains PII.

Traceback

Traceback (most recent call last):
  File "/workspace/app.py", line 77, in <module>
    information = pdf.getFields()
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 526, in getFields
    return self.get_fields(tree, retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 510, in get_fields
    self._build_field(field, retval, fileobj, field_attributes)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 535, in _build_field
    self._check_kids(field, retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 555, in _check_kids
    self.get_fields(kid.get_object(), retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 499, in get_fields
    self._check_kids(tree, retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 555, in _check_kids
    self.get_fields(kid.get_object(), retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 503, in get_fields
    self._build_field(tree, retval, fileobj, field_attributes)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 547, in _build_field
    retval[key] = Field(field)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 1626, in __init__
    self[NameObject(attr)] = data[attr]
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 679, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 251, in get_object
    obj = self.pdf.get_object(self)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 1167, in get_object
    retval, indirect_reference.idnum, indirect_reference.generation
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 741, in decrypt_object
    return cf.decrypt_object(obj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 182, in decrypt_object
    obj[dictkey] = self.decrypt_object(value)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 185, in decrypt_object
    obj[i] = self.decrypt_object(obj[i])
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 182, in decrypt_object
    obj[dictkey] = self.decrypt_object(value)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 176, in decrypt_object
    data = self.strCrypt.decrypt(obj.original_bytes)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 88, in decrypt
    return d[: -d[-1]]
IndexError: index out of range

I believe the best solution would be for the getFields() or getFormFields() methods to skip a field when they encounter a digital signature/certificate.

@Bryan-Fagan
Author

I keep getting an error uploading the PDF.

https://static.e-publishing.af.mil/production/1/af_a1/form/af707/af707.pdf

This (^) is a blank version.

@pubpub-zz
Collaborator

@Bryan-Fagan, thanks for your PDF.
This seems to be related to #1224, and a way forward is proposed there.

pubpub-zz pushed a commit to pubpub-zz/pypdf that referenced this issue Aug 20, 2022
fixes py-pdf#1245: case where AES decrypt returns empty bytestring
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 20, 2022
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 20, 2022
@pubpub-zz
Collaborator

The referenced PR fixes the decrypt error. However, further decoding still needs to look at the XFA data.
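For readers following the traceback: the last frame, `return d[: -d[-1]]`, strips padding by reading the final byte of the decrypted data, and that lookup raises `IndexError` when `d` is empty (the case named in the commit message above). The following is a minimal sketch of guarding that step, not the code from the PR; `strip_padding` is a hypothetical name.

```python
def strip_padding(d: bytes) -> bytes:
    """Drop trailing padding whose length is stored in the last byte.

    Sketch only: an empty input is returned unchanged instead of raising.
    """
    if not d:
        # d[-1] on an empty bytestring raises IndexError, so return early.
        return d
    # The last byte says how many padding bytes to remove from the end.
    return d[: -d[-1]]


# The unguarded expression fails on empty input, as in the traceback.
try:
    b""[-1]
    unguarded_failed = False
except IndexError:
    unguarded_failed = True
```

The guard only avoids the crash for an empty decryption result; it does not say whether the field's content is otherwise decoded correctly.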
