ValueError: invalid literal for int() with base 10: #183

ghost · 2015-03-02T17:46:58Z

Using latest version: PyPDF2-1.24.tar.gz
With code:

import PyPDF2 as pyPdf
inputpdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))

ValueError: invalid literal for int() with base 10: '2pGF'

lines of pdf:
line 143 - >>
line 144 - endobj
line 145 - 16 0 obj <</Length 8905 /Filter[/A85 /Fl]>> stream
line 146 - Gb![snip]2bGF[snip]J~>

If I import the full string (Gb![snip]2bGF[snip]J~) into python and use a85decode, I get the proper byte array.
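The stream dictionary on line 145 lists two filters, /A85 then /Fl, so a manual check has to undo Ascii85 first and Flate second. Below is a minimal sketch using only the standard library. The helper name `decode_a85_flate` and the sample payload are made up for illustration; the real stream bytes from the PDF are not reproduced here.

```python
import base64
import zlib


def decode_a85_flate(raw: bytes) -> bytes:
    """Undo an [/A85 /Fl] filter chain: Ascii85 first, then Flate (zlib)."""
    data = raw.strip()
    # PDF Ascii85 streams end with the "~>" end-of-data marker; drop it
    # before handing the payload to base64.a85decode.
    if data.endswith(b"~>"):
        data = data[:-2]
    return zlib.decompress(base64.a85decode(data))


# Round trip on a made-up payload (not the stream from the report).
payload = b"BT /F1 12 Tf (hello) Tj ET"
encoded = base64.a85encode(zlib.compress(payload)) + b"~>"
assert decode_a85_flate(encoded) == payload
```

If the stream body decodes cleanly this way, the data itself is probably intact and the failure more likely lies in how the surrounding object structure is parsed.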

fgeek · 2015-03-07T20:55:48Z

Sample file in http://bugs.fi/media/afl/pypdf2/pypdf2-afl-invalid-literal-int-with-base-10.pdf (SHA1 9d25406c4a3c9f5ea61bc96f9251d2f7f186ebf7) with following Python code demonstrates this issue and can be used as a reproducer. Fuzzed with American fuzzy lop and https://bitbucket.org/jwilk/python-afl.

import PyPDF2 as pyPdf
input = pyPdf.PdfFileReader(open('pypdf2-afl-invalid-literal-int-with-base-10.pdf', 'rb'))
print "document1.pdf has %d pages." % input.getNumPages()

Traceback (most recent call last):
  File "crasher.py", line 3, in <module>
    print "document1.pdf has %d pages." % input.getNumPages()
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/pdf.py", line 983, in getNumPages
    self._flatten()
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1280, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/generic.py", line 501, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/generic.py", line 177, in getObject
    return self.pdf.getObject(self).getObject()
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1372, in getObject
    idnum, generation = self.readObjectHeader(self.stream)
  File "/home/fgeek/utils/builds/python/2.7.9/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1440, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: '\xb1'

StErMi · 2016-05-19T08:09:23Z

Did you find a way to work around this issue?

mstamy2 · 2016-05-20T20:48:44Z

Unfortunately, not all the ValueError: invalid lit... issues are as related as they appear to be.

They generally just indicate a parsing error, and they occur frequently when the file deviates from the PDF standard in some way.

The good news is, parsing errors aren't terribly difficult to track down, provided I can access the file that triggers them.

That said, if anyone would like to submit a PDF I would be happy to take a look (the link in the second comment is broken).

fgeek · 2016-05-21T08:47:51Z

It is working OK for me (owner of that site).

wget http://bugs.fi/media/afl/pypdf2/pypdf2-afl-invalid-literal-int-with-base-10.pdf
hsalo@tunkki:$ file pypdf2-afl-invalid-literal-int-with-base-10.pdf
pypdf2-afl-invalid-literal-int-with-base-10.pdf: PDF document, version 1.0
hsalo@tunkki:$ md5sum pypdf2-afl-invalid-literal-int-with-base-10.pdf
073c37cc362031f5550a89977137621f pypdf2-afl-invalid-literal-int-with-base-10.pdf

mstamy2 · 2016-05-26T20:06:16Z

It seems that PDF is invalid (can't be opened by any conforming reader), so PyPDF2 would be expected to fail when reading it.

That said, it is misleading because it seems to be read successfully; the expected result would be a PdfReadError during the read process instead of crashing on a getNumPages().

If we can find conforming PDFs (i.e. opens in Adobe, Foxit, etc.) that exhibit the invalid int... error, they can be very valuable.
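To illustrate the point about failing with a read error instead of a bare ValueError, here is a sketch of an object-header check that validates the "N G obj" shape before converting to int. It is not PyPDF2's implementation: `PdfReadError` is defined locally as a stand-in, and `read_object_header` with a bytes-plus-offset signature is a hypothetical helper.

```python
import re


class PdfReadError(Exception):
    """Local stand-in for a parser-level error type (defined for this sketch)."""


# An indirect object header looks like b"16 0 obj".
_HEADER_RE = re.compile(rb"\s*(\d+)\s+(\d+)\s+obj")


def read_object_header(data: bytes, offset: int):
    """Return (idnum, generation) found at offset, or raise PdfReadError."""
    match = _HEADER_RE.match(data, offset)
    if match is None:
        snippet = data[offset:offset + 16]
        raise PdfReadError(
            f"expected 'N G obj' at offset {offset}, found {snippet!r}"
        )
    return int(match.group(1)), int(match.group(2))


sample = b"%PDF-1.4\n16 0 obj\n<< >>\nendobj\n"
print(read_object_header(sample, 9))   # offset 9 is where "16 0 obj" starts
try:
    read_object_header(sample, 0)      # offset 0 is the "%PDF" header, not an object
except PdfReadError as exc:
    print("rejected:", exc)
```

Reporting the offset and the bytes actually found makes it easier to tell a wrong cross-reference offset from a damaged object.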

JonathanAnderson · 2016-10-12T18:23:23Z

I have a file from hsbc that I can manually open but cannot open with this library. I'm happy to pm it to you @mstamy2 if you're interested.

almereyda · 2017-05-15T10:38:59Z

I also ran into this with PDFShuffler and tickets from DB. How can I investigate this further?

adch99 · 2018-06-24T11:29:09Z

Same error arises when trying to access the numPages attribute in this file. Same error also occurs if we use some other function such as obj.getPage(0).

PyPDF2 version 1.26.0 installed from conda on Anaconda3.

(jeepdf) C:\path\to\folder\jeepdf>python jeepdf\processor.py 2017p1.pdf
Traceback (most recent call last):
  File "jeepdf\processor.py", line 8, in <module>
    print("Number of Pages: ", srcPdf.numPages)
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\pdf.py", line 1158, in <lambda>
    numPages = property(lambda self: self.getNumPages(), None, None)
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\pdf.py", line 1155, in getNumPages
    self._flatten()
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\pdf.py", line 1505, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\pdf.py", line 1599, in getObject
    idnum, generation = self.readObjectHeader(self.stream)
  File "C:\path\to\Anaconda3\envs\jeepdf\lib\site-packages\PyPDF2\pdf.py", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'j'

Looks like the header doesn't have them in an int format. However, the file opens in Foxit and Adobe Reader normally.

arvindnrbt · 2018-10-06T04:37:55Z

Assignment Animas_No_Provisions.pdf

This is one such pdf that is failing. Can anyone take a look and suggest a workaround?

fschai89 · 2018-10-11T04:41:54Z

I am also using PyPDF2 version 1.26.0, and the same error occurred.

patroqueeet · 2019-03-27T08:10:30Z

added potential workaround (ugly monkey patch) in #164

rohanashik · 2020-04-03T18:52:22Z

Same problem:
invalid literal for int() with base 10: b'/N'

Can anyone please help solve this?

2017p1.pdf
Assignment.Animas_No_Provisions.pdf

rohanashik · 2020-04-03T19:28:15Z

Has anyone tried this? It works for me:

import PyPDF2

# fileonepath, filetwopath and OutputPath are paths defined elsewhere
input_streams = [fileonepath, filetwopath]

pdfWriter = PyPDF2.PdfFileWriter()
open_files = []  # keep the inputs open until the output is written

# loop through all PDFs
for filename in input_streams:
    # rb for read binary
    pdfFileObj = open(filename, 'rb')
    open_files.append(pdfFileObj)
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # add every page of the PDF to the writer
    for pageNum in range(pdfReader.numPages):
        pdfWriter.addPage(pdfReader.getPage(pageNum))

# save PDF to file, wb for write binary
with open(OutputPath, 'wb') as pdfOutput:
    pdfWriter.write(pdfOutput)

# close the input files once the output has been written
for f in open_files:
    f.close()

sayak-parabole · 2021-05-20T18:37:20Z

I was getting similar errors. Adobe Reader showed that the file was PDF version 1.5. After opening it in Microsoft Word and saving it as a PDF again, it was saved as version 1.7, and the error no longer occurred with that copy.
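The version can also be checked without a viewer by reading the file header. Below is a small sketch; `pdf_header_version` is a hypothetical helper, and it only looks at the `%PDF-x.y` header, which a /Version entry in the document catalog may override. Re-saving also rewrites the file's structure, so the version number may be a symptom and not the cause; that is an assumption, not something this thread confirms.

```python
import re


def pdf_header_version(first_bytes: bytes):
    """Return the version string from a '%PDF-x.y' header, or None if absent."""
    match = re.search(rb"%PDF-(\d+\.\d+)", first_bytes[:1024])
    return match.group(1).decode("ascii") if match else None


# Made-up header bytes for illustration.
assert pdf_header_version(b"%PDF-1.5\n%\xe2\xe3\xcf\xd3\n") == "1.5"
assert pdf_header_version(b"not a pdf") is None
```

For a real file, pass the first kilobyte: `pdf_header_version(open(path, 'rb').read(1024))`, where `path` is your own file.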

tylerjthomas9 · 2021-05-24T19:41:27Z

This solution worked for me: https://stackoverflow.com/questions/26242952/pypdf-2-decrypt-not-working. I had to use qpdf to decrypt the file before trying to open it in Python.

qpdf --password='' --decrypt input.pdf output.pdf
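The same qpdf call can be made from Python before handing the file to PyPDF2. This is a sketch that assumes the `qpdf` executable is installed and on PATH; the function names are made up for illustration, and only the `--password=` and `--decrypt` options from the command above are used.

```python
import shutil
import subprocess


def qpdf_decrypt_command(src: str, dst: str, password: str = "") -> list:
    """Build the argument list equivalent to the shell command above."""
    return ["qpdf", f"--password={password}", "--decrypt", src, dst]


def decrypt_with_qpdf(src: str, dst: str, password: str = "") -> None:
    """Run qpdf to write a decrypted copy of src to dst."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf executable not found on PATH")
    # check=True raises CalledProcessError if qpdf exits with an error.
    subprocess.run(qpdf_decrypt_command(src, dst, password), check=True)


print(qpdf_decrypt_command("input.pdf", "output.pdf"))
```

Passing the arguments as a list avoids shell quoting issues with passwords and file names that contain spaces.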

barkh22g · 2021-08-11T18:47:20Z

I had this issue, and it was fixed by opening the PDF in Adobe and saving it as a new document. The file went from version 1.5 to version 1.6, and then the issue went away.

ParulParima · 2021-08-12T04:12:42Z

I got the same error and this worked for me

install this package - pikepdf

pikepdf is a Python library allowing creation, manipulation and repair of PDFs. It provides a Pythonic wrapper around the C++ PDF content transformation library, QPDF.

Now, after installing

import pikepdf

And run this code

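from PyPDF2 import PdfFileReader

# pdf_address is the path to your PDF file
try:
    inputpdf = PdfFileReader(open(pdf_address,'rb'))
except ValueError:
    # let pikepdf rewrite the file in place, then read it again
    pdf = pikepdf.open(pdf_address,allow_overwriting_input=True)
    pdf.save(pdf_address)
    inputpdf = PdfFileReader(open(pdf_address,'rb'))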

MartinThoma · 2022-06-26T09:28:14Z

PyPDF2 had lots of updates since April 2022. I'm closing this issue now as I suspect that it's solved. If you still encounter it with a recent PyPDF2 version, please let me know.

austinwarnock · 2022-09-06T13:19:39Z

I was able to recreate this error in PyPDF2==2.10.4 with the following code/pdf.
Generator.pdf

from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter
from PyPDF2.generic import AnnotationBuilder
import io

PATH_TO_PDF = "./Generator.pdf"

merger = PdfFileMerger(strict=False)

with open(PATH_TO_PDF, "rb") as pdf: old = io.BytesIO(pdf.read())

reader = PdfFileReader(old)
writer = PdfFileWriter()

for page in reader.pages:
    writer.add_page(page)
    
annotation = AnnotationBuilder.link(rect=[0,0,100,100], target_page_index=0, fit='/Fit', fit_args=(123,))

writer.add_annotation(page_number=1, annotation=annotation)
writer.write(old)

merger.append(old)

In my testing, it appears to only break when annotations are added to some pdfs with a version number <= 1.4.
I can manually fix this by using adobe/bluebeam to update the pdf, but it would be nice to do it programmatically.

EDIT: stack trace

Traceback (most recent call last):
    merger.append(file['stream'], import_outline=False)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_utils.py", line 389, in wrapper
    return func(*args, **kwargs)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_merger.py", line 283, in append
    self.merge(len(self.pages), fileobj, outline_item, pages, import_outline)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_utils.py", line 389, in wrapper
    return func(*args, **kwargs)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_merger.py", line 174, in merge
    pages = (0, len(reader.pages))
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_page.py", line 1708, in __len__
    return self.length_function()
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_reader.py", line 400, in _get_num_pages
    self._flatten()
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_reader.py", line 1044, in _flatten
    self._flatten(page.get_object(), inherit, **addt)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\generic\_base.py", line 163, in get_object
    obj = self.pdf.get_object(self)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_reader.py", line 1132, in get_object
    idnum, generation = self.read_object_header(self.stream)
  File "REDACTED\Python\Python39\lib\site-packages\PyPDF2\_reader.py", line 1213, in read_object_header
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'%\xe2\xe3\xcf\xd3'

pubpub-zz · 2022-09-06T20:01:45Z

@austinwarnock,
can you please paste the full stack trace of the error you are observing?

pubpub-zz · 2022-09-07T19:12:03Z

@austinwarnock,
I think I've found the problem: your code writes to old while the reader is still using it.
I have no issue with this code:

from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter
from PyPDF2.generic import AnnotationBuilder
import io

PATH_TO_PDF = "./Generator.pdf"

merger = PdfFileMerger(strict=False)

with open(PATH_TO_PDF, "rb") as pdf: old = io.BytesIO(pdf.read())

reader = PdfFileReader(old)

writer = PdfFileWriter()

for page in reader.pages:
    writer.add_page(page)
    
annotation = AnnotationBuilder.link(rect=[0,0,100,100], target_page_index=0, fit='/Fit', fit_args=(123,))

writer.add_annotation(page_number=1, annotation=annotation)

new = io.BytesIO()
writer.write(new)

merger.append(new)
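Why writing into the buffer that is still being read causes trouble can be shown with a toy example. This is an illustration only, not PyPDF2's internals: `LazyReader` is a made-up class that, like a PDF reader, remembers byte offsets and reads from the stream on demand.

```python
import io


class LazyReader:
    """Toy stand-in for a reader that stores offsets and reads on demand."""

    def __init__(self, stream, offsets):
        self.stream = stream
        self.offsets = offsets

    def read_record(self, name, size=4):
        self.stream.seek(self.offsets[name])
        return self.stream.read(size)


def demo(reuse_input: bool) -> bytes:
    old = io.BytesIO(b"AAAABBBB")
    reader = LazyReader(old, {"a": 0, "b": 4})
    # Either overwrite the input buffer or write to a fresh one.
    out = old if reuse_input else io.BytesIO()
    out.seek(0)
    out.write(b"XXXXXX")  # stands in for the writer's output
    return reader.read_record("b")


print(demo(reuse_input=True))   # the stored offset now points at overwritten bytes
print(demo(reuse_input=False))  # the input is untouched, so the record is intact
```

Once the input buffer is overwritten, the stored offsets no longer point at the data they were recorded for, which fits the traceback above, where the file's leading binary comment was read as an object header.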

MartinThoma · 2022-09-09T06:43:41Z

Thank you for investigating it @pubpub-zz ❤️

fgeek mentioned this issue Mar 7, 2015

ValueError: invalid literal for int() with base 10: 'obj' #164

Closed

mstamy2 mentioned this issue May 20, 2016

ValueError: invalid literal for int() with base 10: "5347+01'00')>>" #262

Closed

william-andre mentioned this issue Mar 5, 2021

[FIX] tools: support malformed more malformed pdf odoo/odoo#67283

Closed

This was referenced Mar 19, 2021

[FW][FIX] tools: support malformed more malformed pdf odoo/odoo#68170

Closed

[FW][FIX] tools: support malformed more malformed pdf odoo/odoo#68171

Closed

MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Apr 7, 2022

MartinThoma closed this as completed Jul 9, 2022

MartinThoma reopened this Sep 7, 2022

MartinThoma closed this as completed Sep 9, 2022

pubpub-zz mentioned this issue May 20, 2023

Random whitespaces are inserted when using page.extract_text() #1507

Closed

pubpub-zz mentioned this issue Sep 15, 2023

PageObject._get_fonts() returns embedded as unembedded. #2192

Closed
