ValueError: invalid literal for int() with base 10: #183

Closed
ghost opened this issue Mar 2, 2015 · 22 comments
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

ghost commented Mar 2, 2015

Using latest version: PyPDF2-1.24.tar.gz
With code:

import PyPDF2 as pyPdf
inputpdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))

ValueError: invalid literal for int() with base 10: '2pGF'

lines of pdf:
line 143 - >>
line 144 - endobj
line 145 - 16 0 obj <</Length 8905 /Filter[/A85 /Fl]>> stream
line 146 - Gb![snip]2bGF[snip]J~>

If I import the full string (Gb![snip]2bGF[snip]J~) into python and use a85decode, I get the proper byte array.
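The manual check described above can be sketched with the standard library alone. This assumes that `/A85` and `/Fl` in the stream dictionary stand for ASCII85 and Flate (zlib) encoding, applied in that order; the helper name `decode_a85_flate` and the sample payload are made up for illustration and are not part of PyPDF2.

```python
import base64
import zlib


def decode_a85_flate(raw: bytes) -> bytes:
    """Undo ASCII85 encoding, then zlib (Flate) compression."""
    data = raw.strip()
    # PDF streams end ASCII85 data with "~>"; a leading "<~" is optional.
    if data.endswith(b"~>"):
        data = data[:-2]
    if data.startswith(b"<~"):
        data = data[2:]
    return zlib.decompress(base64.a85decode(data))


# Round trip with a made-up payload standing in for the real stream content.
payload = b"BT /F1 12 Tf (hello) Tj ET"
encoded = base64.a85encode(zlib.compress(payload)) + b"~>"
decoded = decode_a85_flate(encoded)
```

If the bytes between `stream` and `~>` decode cleanly this way, the stream data itself is probably intact and the failure is more likely in how the object boundaries are located.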

fgeek commented Mar 7, 2015

The sample file at http://bugs.fi/media/afl/pypdf2/pypdf2-afl-invalid-literal-int-with-base-10.pdf (SHA1 9d25406c4a3c9f5ea61bc96f9251d2f7f186ebf7), together with the following Python code, reproduces this issue. The file was produced by fuzzing with American fuzzy lop and https://bitbucket.org/jwilk/python-afl.

import PyPDF2 as pyPdf
input = pyPdf.PdfFileReader(open('pypdf2-afl-invalid-literal-int-with-base-10.pdf', 'rb'))
print "document1.pdf has %d pages." % input.getNumPages()
Traceback (most recent call last):
  File "crasher.py", line 3, in <module>
    print "document1.pdf has %d pages." % input.getNumPages()
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/pdf.py", line 983, in getNumPages
    self._flatten()
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1280, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/generic.py", line 501, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/generic.py", line 177, in getObject
    return self.pdf.getObject(self).getObject()
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1372, in getObject
    idnum, generation = self.readObjectHeader(self.stream)
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1440, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: '\xb1'

StErMi commented May 19, 2016

Did you find a way to work around this issue?

mstamy2 (Collaborator) commented May 20, 2016

Unfortunately, not all the ValueError: invalid lit... issues are as related as they appear to be.

They generally just indicate a parsing error, and they occur frequently when the file deviates from the PDF standard in some way.

The good news is, parsing errors aren't terribly difficult to track down, provided I can access the file that triggers them.

That said, if anyone would like to submit a PDF I would be happy to take a look (the link in the second comment is broken).

fgeek commented May 21, 2016

It is working OK for me (owner of that site).

wget http://bugs.fi/media/afl/pypdf2/pypdf2-afl-invalid-literal-int-with-base-10.pdf
hsalo@tunkki:~$ file pypdf2-afl-invalid-literal-int-with-base-10.pdf
pypdf2-afl-invalid-literal-int-with-base-10.pdf: PDF document, version 1.0
hsalo@tunkki:~$ md5sum pypdf2-afl-invalid-literal-int-with-base-10.pdf
073c37cc362031f5550a89977137621f pypdf2-afl-invalid-literal-int-with-base-10.pdf

mstamy2 (Collaborator) commented May 26, 2016

It seems that PDF is invalid (can't be opened by any conforming reader), so PyPDF2 would be expected to fail when reading it.

That said, it is misleading because it seems to be read successfully; the expected result would be a PdfReadError during the read process instead of crashing on a getNumPages().

If we can find conforming PDFs (i.e. ones that open in Adobe, Foxit, etc.) that exhibit the invalid int... error, they would be very valuable.
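The idea of reporting a dedicated read error instead of a bare ValueError can be sketched as follows. This is not PyPDF2's implementation: `ObjectHeaderError` and `parse_object_header` are hypothetical names, and the sketch only assumes that an indirect object starts with `<id> <generation> obj`, as in the `16 0 obj` line quoted earlier in this thread.

```python
import re

# Matches the "<id> <generation> obj" header of an indirect object.
_HEADER = re.compile(rb"\s*(\d+)\s+(\d+)\s+obj")


class ObjectHeaderError(ValueError):
    """Hypothetical error raised when no object header is found."""


def parse_object_header(data: bytes, offset: int):
    """Return (idnum, generation) for the object header at `offset`."""
    match = _HEADER.match(data, offset)
    if match is None:
        # Show the offending bytes so the report is actionable.
        snippet = data[offset:offset + 16]
        raise ObjectHeaderError(
            f"expected 'N G obj' at offset {offset}, found {snippet!r}"
        )
    return int(match.group(1)), int(match.group(2))


header = parse_object_header(b"16 0 obj <</Length 8905>>", 0)

try:
    parse_object_header(b"2pGF 0 obj", 0)
except ObjectHeaderError as exc:
    message = str(exc)
```

Validating the header before calling `int()` lets the error name the offset and the bytes found there, which is usually the first thing needed to tell a broken cross-reference entry from a genuinely malformed file.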

@JonathanAnderson

I have a file from HSBC that I can open manually but cannot open with this library. I'm happy to PM it to you, @mstamy2, if you're interested.

almereyda commented May 15, 2017

I also ran into this with PDFShuffler and tickets from DB. How can I investigate this further?

adch99 commented Jun 24, 2018

Same error arises when trying to access the numPages attribute in this file. Same error also occurs if we use some other function such as obj.getPage(0).

PyPDF2 version 1.26.0 installed from conda on Anaconda3.

(jeepdf) C:\path\to\folder\jeepdf>python jeepdf\processor.py 2017p1.pdf
Traceback (most recent call last):
  File "jeepdf\processor.py", line 8, in <module>
    print("Number of Pages: ", srcPdf.numPages)
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\pdf.py", line 1158, in <lambda>
    numPages = property(lambda self: self.getNumPages(), None, None)
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\pdf.py", line 1155, in getNumPages
    self._flatten()
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\pdf.py", line 1505, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\pdf.py", line 1599, in getObject
    idnum, generation = self.readObjectHeader(self.stream)
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\pdf.py", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'j'

Looks like the header doesn't have them in an int format. However, the file opens in Foxit and Adobe Reader normally.

@arvindnrbt

Assignment Animas_No_Provisions.pdf

This is one such pdf that is failing. Can anyone take a look and suggest a workaround?

@fschai89

I am also using PyPDF2 version 1.26.0, and the same error occurred.

@patroqueeet

added potential workaround (ugly monkey patch) in #164

@rohanashik

Same problem:
invalid literal for int() with base 10: b'/N'

Can anyone please help solve this?

2017p1.pdf
Assignment.Animas_No_Provisions.pdf

@rohanashik

Has anyone tried this? It works for me:

import PyPDF2

# fileonepath, filetwopath and OutputPath are paths defined elsewhere
input_streams = [fileonepath, filetwopath]

pdfWriter = PyPDF2.PdfFileWriter()
open_files = []  # inputs must stay open until the writer has written them

# loop through all PDFs
for filename in input_streams:
    # rb for read binary
    pdfFileObj = open(filename, 'rb')
    open_files.append(pdfFileObj)
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # add every page of the PDF to the writer
    for pageNum in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)

# save PDF to file, wb for write binary
with open(OutputPath, 'wb') as pdfOutput:
    pdfWriter.write(pdfOutput)

# close the input files once the output has been written
for pdfFileObj in open_files:
    pdfFileObj.close()

@sayak-parabole

I was getting similar errors. Opening the PDF in Adobe Reader showed that the file was PDF version 1.5. After I opened it in Microsoft Word and saved it as a PDF again, it was saved as version 1.7, and the error no longer occurred with the 1.7 file.

@tylerjthomas9

This solution worked for me: https://stackoverflow.com/questions/26242952/pypdf-2-decrypt-not-working. I had to use qpdf to decrypt the file before trying to open it in Python.

qpdf --password='' --decrypt input.pdf output.pdf
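For scripts that need to run this step before opening the file, the same command can be called from Python. This is only a sketch: it assumes the `qpdf` executable is on the PATH, it uses only the two flags shown above, and the function names are illustrative.

```python
import subprocess
from typing import List


def qpdf_decrypt_command(src: str, dst: str, password: str = "") -> List[str]:
    """Build the argument list for the qpdf call shown above."""
    return ["qpdf", f"--password={password}", "--decrypt", src, dst]


def decrypt_with_qpdf(src: str, dst: str, password: str = "") -> None:
    """Run qpdf; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(qpdf_decrypt_command(src, dst, password), check=True)


command = qpdf_decrypt_command("input.pdf", "output.pdf")
```

Calling `decrypt_with_qpdf("input.pdf", "output.pdf")` and then opening `output.pdf` would mirror the manual workflow. Passing the arguments as a list avoids shell quoting issues with the empty password.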

@barkh22g

I had this issue, and it was fixed by opening the PDF in Adobe and saving it as a new document. The file went from version 1.5 to version 1.6, and the issue went away.

ParulParima commented Aug 12, 2021

I got the same error and this worked for me.

Install the pikepdf package.

pikepdf is a Python library allowing creation, manipulation and repair of PDFs. It provides a Pythonic wrapper around the C++ PDF content transformation library, QPDF.

After installing it, run this code:

import pikepdf
from PyPDF2 import PdfFileReader

# pdf_address is the path to the PDF that fails to open
try:
    inputpdf = PdfFileReader(open(pdf_address,'rb'))
except ValueError:
    # let pikepdf rewrite the file in place, then try again
    pdf = pikepdf.open(pdf_address,allow_overwriting_input=True)
    pdf.save(pdf_address)
    inputpdf = PdfFileReader(open(pdf_address,'rb'))

MartinThoma added the is-bug label Apr 7, 2022
@MartinThoma (Member)

PyPDF2 had lots of updates since April 2022. I'm closing this issue now as I suspect that it's solved. If you still encounter it with a recent PyPDF2 version, please let me know.

austinwarnock commented Sep 6, 2022

I was able to recreate this error in PyPDF2==2.10.4 with the following code/pdf.
Generator.pdf

from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter
from PyPDF2.generic import AnnotationBuilder
import io

PATH_TO_PDF = "./Generator.pdf"

merger = PdfFileMerger(strict=False)

with open(PATH_TO_PDF, "rb") as pdf: old = io.BytesIO(pdf.read())

reader = PdfFileReader(old)
writer = PdfFileWriter()

for page in reader.pages:
    writer.add_page(page)
    
annotation = AnnotationBuilder.link(rect=[0,0,100,100], target_page_index=0, fit='/Fit', fit_args=(123,))

writer.add_annotation(page_number=1, annotation=annotation)
writer.write(old)

merger.append(old)

In my testing, it appears to break only when annotations are added to some PDFs with a version number <= 1.4.
I can fix this manually by using Adobe/Bluebeam to update the PDF, but it would be nice to do it programmatically.

EDIT: stack trace

Traceback (most recent call last):
    merger.append(file['stream'], import_outline=False)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_utils.py", line 389, in wrapper
    return func(*args, **kwargs)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_merger.py", line 283, in append
    self.merge(len(self.pages), fileobj, outline_item, pages, import_outline)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_utils.py", line 389, in wrapper
    return func(*args, **kwargs)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_merger.py", line 174, in merge
    pages = (0, len(reader.pages))
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_page.py", line 1708, in __len__
    return self.length_function()
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_reader.py", line 400, in _get_num_pages
    self._flatten()
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_reader.py", line 1044, in _flatten
    self._flatten(page.get_object(), inherit, **addt)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\generic\_base.py", line 163, in get_object
    obj = self.pdf.get_object(self)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_reader.py", line 1132, in get_object
    idnum, generation = self.read_object_header(self.stream)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_reader.py", line 1213, in read_object_header
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'%\xe2\xe3\xcf\xd3'

@pubpub-zz (Collaborator)

@austinwarnock,
can you please paste the full stack trace of the error you are observing?

MartinThoma reopened this Sep 7, 2022
pubpub-zz (Collaborator) commented Sep 7, 2022

@austinwarnock,
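I think I've found the problem: your code writes to old while it is still in use as the reader's input stream, so the buffer later passed to merger.append() is likely no longer a clean PDF.
I have no issue with this code, which writes to a separate buffer: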

from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter
from PyPDF2.generic import AnnotationBuilder
import io

PATH_TO_PDF = "./Generator.pdf"

merger = PdfFileMerger(strict=False)

with open(PATH_TO_PDF, "rb") as pdf: old = io.BytesIO(pdf.read())

reader = PdfFileReader(old)

writer = PdfFileWriter()

for page in reader.pages:
    writer.add_page(page)
    
annotation = AnnotationBuilder.link(rect=[0,0,100,100], target_page_index=0, fit='/Fit', fit_args=(123,))

writer.add_annotation(page_number=1, annotation=annotation)

new = io.BytesIO()
writer.write(new)

merger.append(new)
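Why reusing the input buffer goes wrong can be shown with plain `io.BytesIO` objects. This sketch only illustrates stream mechanics under the assumption that the reader has moved the stream position before the write happens; the byte strings are placeholders, not real PDF data, and the exact position PyPDF2 leaves the stream at is not known from this thread.

```python
import io

original = b"%PDF-1.4 original document"
rewritten = b"%PDF-1.3 rewritten document"

# Reusing the input buffer: the write lands wherever the position was left.
old = io.BytesIO(original)
old.read()            # stands in for a reader consuming the stream
old.write(rewritten)  # appended after the original bytes
mixed = old.getvalue()

# Writing to a fresh buffer keeps the two documents separate.
new = io.BytesIO()
new.write(rewritten)
clean = new.getvalue()
```

Here `mixed` contains both documents back to back, so a parser that follows offsets from the second document can end up on bytes that are not an object header. That would be consistent with the `b'%\xe2\xe3\xcf\xd3'` value in the stack trace above, which looks like the binary comment line from a PDF header.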

@MartinThoma (Member)

Thank you for investigating it @pubpub-zz ❤️
