IndexError: index out of range when encountering a digital certificate/signature #1245

Bryan-Fagan · 2022-08-16T10:22:55Z

I'll start by saying that I'm very new to Python and PyPDF2. I'm trying to collect all of the fields in a PDF into a dataframe. Eventually I want to process thousands of PDFs that all have the same structure (form) as the baseline and place them into the dataframe. This code works great on a PDF without a digital certificate/signature. However, when I run it on a PDF with a digital certificate/signature, I get an error.

I don't really need the digital signature/certificate part of the document, so I think the easiest approach is to just skip that field of the PDF. However, I don't know how to do that, since the PyPDF2 package looks at every field.
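For reference, once getFields() returns successfully, signature fields could be dropped afterwards by field type. The sketch below is only an illustration: it uses plain dicts in place of PyPDF2 field objects and assumes each field is dict-like with a "/FT" (field type) key, where "/Sig" is the type the PDF specification assigns to signature fields. The helper names drop_signature_fields and to_row are hypothetical. Note that this would not avoid the error reported here, which is raised inside getFields() itself.

```python
SIGNATURE_TYPE = "/Sig"  # field type used for signature fields in the PDF specification


def drop_signature_fields(fields):
    """Return a copy of the field mapping without signature fields."""
    return {
        name: field
        for name, field in fields.items()
        if field.get("/FT") != SIGNATURE_TYPE
    }


def to_row(fields):
    """Flatten a field mapping to {field name: value} for one dataframe row."""
    return {name: field.get("/V") for name, field in fields.items()}


# Plain dicts standing in for the objects getFields() would return (assumption).
sample = {
    "Name": {"/FT": "/Tx", "/V": "example"},
    "Signature1": {"/FT": "/Sig"},
}

row = to_row(drop_signature_fields(sample))
print(row)
```

A row built this way could be passed to pd.DataFrame([row]) in the same way as in the loop from the report.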

I was able to get around the error with try/except, but then it didn't capture any information from the PDF (i.e. the result was blank).

Environment

Plotly Dash Workspace

$ python -m platform
Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-debian-buster-sid

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.0

Code + PDF

import os

import pandas as pd
import PyPDF2 as pypdf

directory = 'files'
df = pd.DataFrame()  # accumulates one row of form fields per PDF

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isfile(f):
        print(f)
        pdf = pypdf.PdfFileReader(f, strict=False)
        print(pdf)
        #information = pdf.getFormTextFields()
        information = pdf.getFields()  # raises IndexError on the signed PDF
        print(information)
        output = pd.DataFrame([information])
        df = pd.concat([df, output], ignore_index=True)

I'll have to play around with the PDF to see if I can post it, as it contains PII.

Traceback

Traceback (most recent call last):
  File "/workspace/app.py", line 77, in <module>
    information = pdf.getFields()
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 526, in getFields
    return self.get_fields(tree, retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 510, in get_fields
    self._build_field(field, retval, fileobj, field_attributes)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 535, in _build_field
    self._check_kids(field, retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 555, in _check_kids
    self.get_fields(kid.get_object(), retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 499, in get_fields
    self._check_kids(tree, retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 555, in _check_kids
    self.get_fields(kid.get_object(), retval, fileobj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 503, in get_fields
    self._build_field(tree, retval, fileobj, field_attributes)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 547, in _build_field
    retval[key] = Field(field)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 1626, in __init__
    self[NameObject(attr)] = data[attr]
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 679, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/generic.py", line 251, in get_object
    obj = self.pdf.get_object(self)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_reader.py", line 1167, in get_object
    retval, indirect_reference.idnum, indirect_reference.generation
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 741, in decrypt_object
    return cf.decrypt_object(obj)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 182, in decrypt_object
    obj[dictkey] = self.decrypt_object(value)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 185, in decrypt_object
    obj[i] = self.decrypt_object(obj[i])
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 182, in decrypt_object
    obj[dictkey] = self.decrypt_object(value)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 176, in decrypt_object
    data = self.strCrypt.decrypt(obj.original_bytes)
  File "/app/.heroku/python/lib/python3.7/site-packages/PyPDF2/_encryption.py", line 88, in decrypt
    return d[: -d[-1]]
IndexError: index out of range

TODO
I believe the best solution would be for the getFields() or getFormFields() methods to skip a field when they encounter a digital signature/certificate.
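Until the library handles this case, a batch job could at least keep going and record which files fail, instead of ending up with a silently blank result. The following is a minimal sketch under stated assumptions: collect_fields and fake_reader are hypothetical names, and the reader is passed in as a callable so the example runs without PyPDF2 or a real PDF. It does not recover any fields from a failing file.

```python
def collect_fields(paths, read_fields):
    """Read form fields from each path, keeping failures separate.

    read_fields is any callable that takes a path and returns a dict of
    fields (or None when the file has no form).
    """
    rows = []
    failures = []
    for path in paths:
        try:
            fields = read_fields(path)
        except Exception as exc:  # broad on purpose: log the error and move on
            failures.append((path, repr(exc)))
            continue
        # Treat "no form fields" (None) as an empty row for that file.
        rows.append({"file": path, **(fields or {})})
    return rows, failures


def fake_reader(path):
    """Stand-in reader that fails for one file, mimicking the reported error."""
    if "signed" in path:
        raise IndexError("index out of range")
    return {"Name": "example"}


rows, failures = collect_fields(["a.pdf", "signed.pdf"], fake_reader)
print(rows)
print(failures)
```

With the PyPDF2 version from this report, the reader passed in could be something like lambda p: pypdf.PdfFileReader(p, strict=False).getFields(), and the collected rows could then go into pd.DataFrame(rows).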


Bryan-Fagan · 2022-08-16T10:47:27Z

I keep getting an error uploading the PDF.

https://static.e-publishing.af.mil/production/1/af_a1/form/af707/af707.pdf

This (^) is a blank version.

pubpub-zz · 2022-08-20T12:39:17Z

@Bryan-Fagan, thanks for your PDF.
This seems to be related to #1224, and a way forward is proposed there.

fixes py-pdf#1245: case where AES decrypt returns empty bytestring

fix py-pdf#1245

pubpub-zz · 2022-08-20T13:58:35Z

I have referenced a PR that fixes the decrypt error. However, further decoding will need to look at the XFA data.
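To illustrate the decrypt error: the last frame of the traceback strips padding with d[: -d[-1]], which reads the final byte as the padding length. If the decrypted data is an empty bytestring, there is no final byte to read, so the lookup raises IndexError. The sketch below reproduces this in isolation; unpad and safe_unpad are hypothetical names, and the guard is only one possible way to handle the empty case, not necessarily what the referenced PR does.

```python
def unpad(d: bytes) -> bytes:
    """Strip padding using the same slice as the last traceback frame."""
    return d[: -d[-1]]


def safe_unpad(d: bytes) -> bytes:
    """Hypothetical guard: empty input has no padding byte to read."""
    if len(d) == 0:
        return d
    return d[: -d[-1]]


# Normal case: the last byte (3) says how many padding bytes to remove.
padded = b"hello" + bytes([3, 3, 3])
print(unpad(padded))

# Empty case: indexing d[-1] fails before any slicing happens.
try:
    unpad(b"")
    raised = False
except IndexError as exc:
    raised = True
    print("IndexError:", exc)

# The guarded version returns the empty bytestring unchanged.
print(safe_unpad(b""))
```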

Closes #1245

pubpub-zz pushed a commit to pubpub-zz/pypdf that referenced this issue Aug 20, 2022

ROB : fixes AES returns empty

ca3cc72

fixes py-pdf#1245: case where AES decrypt returns empty bytestring

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 20, 2022

ROB : fix decryt returning empty bytestring

7e4351e

fix py-pdf#1245

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 20, 2022

ROB : fix decryt returning empty bytestring

9e68818

fix py-pdf#1245

pubpub-zz mentioned this issue Aug 20, 2022

ROB : fix decryt returning empty bytestring #1258

Merged

MartinThoma closed this as completed in #1258 Aug 21, 2022

MartinThoma pushed a commit that referenced this issue Aug 21, 2022

ROB: Decrypt returns empty bytestring (#1258)

cf3aab4

Closes #1245
