Stream has ended unexpectedly error on certain PDF files #99

LunkRat · 2014-05-13T15:51:57Z

We process dozens of PDF files per day in our automated script that uses PyPDF2 version 1.21 as part of its process. A few files have been failing with the error pasted below. I can provide the PDF file that is having this error, just let me know how you would like me to send it. Thanks!

PdfReadWarning: Invalid stream (index 0) within object 62 0: Stream has ended unexpectedly [pdf.py:1128]
Traceback (most recent call last):
  File "d:\scripts\mtx-coverpage\mtx-coverpage.py", line 99, in <module>
    addpage.write(outfile)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\merger.py", line 209, in write
    self.output.write(fileobj)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 277, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 365, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 365, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 350, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 365, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 379, in _sweepIndirectReferences
    newobj = self._sweepIndirectReferences(externMap, newobj)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 370, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 1149, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 1131, in _getObjectFromStream
    raise utils.PdfReadError("Can't read object stream: %s"%e)
PyPDF2.utils.PdfReadError: Can't read object stream: Stream has ended unexpectedly

LunkRat · 2014-05-13T18:26:57Z

The PDF file that triggered this error can be found here: https://drive.google.com/file/d/0B_P1mlgsZIJpRjNnQkxCenUzTkU/edit?usp=sharing

mstamy2 · 2014-05-14T22:28:15Z

Thank you for the detailed bug report! I will try to track down the issue soon, though I will be unavailable for the following week.

mstamy2 · 2014-05-26T21:02:06Z

Hello,
Try passing strict = False to PdfFileReader(). I still need to track down exactly why the exception is thrown, but this should produce the output without error. (If not, let us know)

LunkRat · 2014-05-27T18:34:48Z

Thanks for the response. I am using PdfFileMerger() in this script, and the error occurs on .write. How do I pass strict = False in this case, since I am not calling PdfFileReader() directly?

mstamy2 · 2014-05-27T21:03:09Z

PdfFileMerger() constructor takes a strict parameter as well. It should throw a warning instead of an exception when set to False.

LunkRat · 2014-05-27T21:39:57Z

Thanks! I set strict = False in PdfFileMerger() (thought I had tried that already but must have been my error) and it solved our issue. I still get a warning, but the output file is written as expected so that's great. I'll let you close the issue when you feel that the underlying cause is resolved. Thanks again.
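
For anyone looking for a concrete example of the setting discussed above, here is a minimal sketch. It assumes the PyPDF2 1.x/2.x API used in this thread (PdfFileMerger with a strict keyword); the function name merge_lenient and its parameters are made up for illustration, and later releases renamed some classes, so check your installed version.

```python
def merge_lenient(paths, out_path):
    """Merge the PDFs in `paths` into `out_path` with strict mode off.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    # Imported inside the function so this sketch can be loaded even where
    # PyPDF2 is not installed. PdfFileMerger is the name used in this thread.
    from PyPDF2 import PdfFileMerger

    # strict=False asks PyPDF2 to warn about recoverable problems
    # instead of raising PdfReadError.
    merger = PdfFileMerger(strict=False)
    handles = []
    try:
        for path in paths:
            handle = open(path, "rb")
            handles.append(handle)
            merger.append(handle)
        with open(out_path, "wb") as out:
            merger.write(out)
    finally:
        merger.close()
        for handle in handles:
            handle.close()
```

A call would look like merge_lenient(["file1.pdf", "file2.pdf"], "document-output.pdf"). The input files are kept open until after write() because the merger reads from them lazily.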

andrewstolarz · 2014-05-30T17:40:34Z

Hello,

I have recently been getting a similar error.

Can you please post an example of how and where to implement the fix?

Thank you!

andrewstolarz · 2014-06-02T15:56:17Z

Hello,

Just to give an update: what I did was manually edit the pdf.py file to set strict = False. (I was hoping not to do it this way, as I don't want to run into issues later on when I upgrade.)

However, after running the script again with strict set to False, it splits the PDFs with no problem but still returns an error:

PdfReadWarning: Invalid stream (index 77) within object 1444 0: Stream has ended unexpectedly [pdf.py:1162]
PdfReadWarning: Invalid stream (index 62) within object 2696 0: Stream has ended unexpectedly [pdf.py:1162]

Any ideas?

mstamy2 · 2014-06-02T20:58:58Z

Well, you don't have to change pdf.py in order to set strict to False. You can set the value of strict in the constructor when you first create your PdfFileMerger() or PdfFileReader() object; it defaults to True if you don't specify a value. To specify False, use
input = PdfFileReader([your file], strict = False)

When in strict mode, PyPDF2 quits when encountering this stream error and throws a PdfReadError. When strict is False, it does not raise; instead it gives a warning like the one you saw, then continues with the rest of the program as normal.

Ignoring the error doesn't seem to harm the output in any way (as you noticed), so we need to investigate why the error is thrown at all (maybe PyPDF2 is too strict on slightly 'irregular' PDFs?). Or maybe the error is significant but the output PDFs haven't displayed any symptoms?

Hope that made a little sense.
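
To make the strict/non-strict distinction above concrete, here is a minimal sketch of the general pattern. It is not PyPDF2's actual code: the two classes are stand-ins named after the ones visible in the tracebacks in this thread, and report_stream_problem is a hypothetical helper.

```python
import warnings


class PdfReadError(Exception):
    """Stand-in for the error type seen in the tracebacks above."""


class PdfReadWarning(UserWarning):
    """Stand-in for the warning type seen in the output above."""


def report_stream_problem(message, strict):
    """Raise in strict mode; otherwise warn and let the caller continue."""
    if strict:
        raise PdfReadError(message)
    warnings.warn(message, PdfReadWarning)


# In non-strict mode the problem is recorded as a warning, not raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    report_stream_problem("Stream has ended unexpectedly", strict=False)
```

This is why a warning line can still appear after switching strict off: the problem is still detected, it just no longer stops the program.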

cryptid11 · 2015-01-29T15:18:37Z

Maybe this works for PdfFileReader, but not with PdfFileMerger.

I tried merger = PdfFileMerger(strict = False)

and also merger.append(PdfFileReader(open(os.path.join(files_dir, f), "rb"), strict = False))
but it gives the same problem as before:

  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 405, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 386, in _sweepIndirectReferences
    data[key] = value
  File "/usr/lib/python2.7/site-packages/PyPDF2/generic.py", line 487, in __setitem__
    if not isinstance(key, PdfObject):
RuntimeError: maximum recursion depth exceeded in __instancecheck__

I also set

    def __init__(self, stream, strict=False, warndest = None, overwriteWarnings = True):

at line 891 of pdf.py, and it doesn't work. Any solution?

appurwar · 2018-07-19T14:48:54Z

Try merging your PDFs by using the 'append' and 'merge' functionality of PyPDF2 instead.

I faced the same issue, and the following approach worked for me:

from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()

input1 = open("file1.pdf", "rb")
input2 = open("file2.pdf", "rb")


# add the first 3 pages of first file to output
merger.append(fileobj = input1, pages = (0,3))

# insert the first page of second file into the output beginning after the second page
merger.merge(position = 2, fileobj = input2, pages = (0,1))

# Write to an output PDF document
output = open("document-output.pdf", "wb")
merger.write(output)

Remove the 'pages' argument in 'append' and 'merge' functions to merge files instead of specific pages.

khibma · 2018-08-09T13:45:34Z

I just started to experience this issue when calling PdfFileReader. I haven't changed anything in the code, so maybe a Windows update is responsible? None of the above suggestions to set strict = False seem to help. I have to go in, comment out the file work inside _showwarning, and put pass in the function body to get anywhere.

The only difference between my case and the ones above is that it only happens after I've run the code once. I'm calling this from within ArcGIS (mapping software). I have to close the software and re-open it to get the 1st successful run. This seems to indicate that something is being held onto after the 1st run...but again, it just started happening. I realize this probably doesn't help you move towards a fix: just reporting to up the user count for this.

Edit - "fix":
Despite ignoring strict and setting overwriteWarnings to False, I'd still get the error. I found I can get around the error by resetting Python's built-in warnings to the original stderr.

import sys
import warnings

warnings.resetwarnings()
# warnings.sys is the sys module, so this restores the original stderr
sys.stderr = sys.__stderr__

reportgunner · 2019-11-11T14:54:35Z

@appurwar the error is returned no matter whether append or merge is used. The problem here seems to be the format of the PDF that is being appended, so it's not PyPDF2's fault. A sensible workaround seems to be to reformat the PDF in some other way before passing it to PyPDF2 once this is detected.

strict=False doesn't fix this either; I came here after the error happened with strict=False on.

MartinThoma · 2022-06-26T09:23:18Z

I'm closing this issue now as it seems to be mostly about using strict=False which is the current default. Let me know if you still have this issue (with a full Traceback + example code ... and a PDF if possible)

puri-gagan · 2022-07-06T07:43:14Z

Thanks @mstamy2. Passing strict=False when reading the PDF with PdfFileReader() works great, and the rewritten or merged file isn't harmed. However, if some earlier workaround on the PDF file has affected its structure, the same error occurs even with strict=False. That is not a problem of this package.

Eslafif · 2022-08-10T12:11:28Z

Hello,
I get this error with PdfFileReader() even though I'm using strict=False.
Any help?

Traceback (most recent call last):
  File "/New Volume/projects/Files/PDF/scrap.py", line 30, in <module>
    extracted_data=extracted_data+(pdfReader.getPage(z).extractText().splitlines())
  File "/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 1045, in extractText
    return self.extract_text(Tj_sep=Tj_sep, TJ_sep=TJ_sep)
  File "/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 968, in extract_text
    content = ContentStream(content, self.pdf)
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1088, in __init__
    self.__parseContentStream(stream)
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1119, in __parseContentStream
    operands.append(read_object(stream, None))
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1228, in read_object
    return readStringFromStream(stream)
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 382, in readStringFromStream
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

MartinThoma · 2022-08-10T12:21:45Z

Which version of PyPDF2 do you use?

Eslafif · 2022-08-10T12:45:59Z

Version 2.0.0

pubpub-zz · 2022-08-10T13:06:36Z

@Eslafif,
Can you please upgrade to the latest version to confirm the error is still present? If so, can you provide the PDF file and specify on which page you are getting the issue?

Eslafif · 2022-08-10T14:56:08Z

Updated, and the same error still exists.

pubpub-zz · 2022-08-10T14:57:46Z

> Updated, and the same error still exists.

Can you also provide the PDF file and the page? Without them, no analysis can be done.

Eslafif · 2022-08-10T16:03:56Z

RBL BANK.pdf

This is the page that gives the error.

pubpub-zz · 2022-08-10T18:39:46Z

@Eslafif
I've tried the following code with your file successfully.
import PyPDF2;p=PyPDF2.PdfReader("c:/RBL.BANK.pdf");p.pages[0].extract_text()
Can you confirm that you are getting the same results?

Eslafif · 2022-08-10T19:30:25Z

Tested and giving the same error

pubpub-zz · 2022-08-10T19:34:25Z

Can you share the output please

Eslafif · 2022-08-10T19:43:29Z

Traceback (most recent call last):
  File "/media/New Volume/projects/bank statement/Banks statements/test.py", line 11, in <module>
    extracted_data=pdfReader.pages[17].extract_text()
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 968, in extract_text
    content = ContentStream(content, self.pdf)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1088, in __init__
    self.__parseContentStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1119, in __parseContentStream
    operands.append(read_object(stream, None))
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1228, in read_object
    return readStringFromStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 382, in readStringFromStream
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

pubpub-zz · 2022-08-10T19:46:18Z

You are not using my code with the file you've provided. Can you tell me what the result is with my program, please?

Eslafif · 2022-08-10T19:55:44Z

Traceback (most recent call last):
  File "/media/New Volume/projects/bank statement/Banks statements/test.py", line 7, in <module>
    p.pages[17].extract_text()
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 968, in extract_text
    content = ContentStream(content, self.pdf)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1088, in __init__
    self.__parseContentStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1119, in __parseContentStream
    operands.append(read_object(stream, None))
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1228, in read_object
    return readStringFromStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 382, in readStringFromStream
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

Eslafif · 2022-08-10T19:56:18Z

This is with your code.

The difference is that the file is big, so I only attached the page with the problem.

pubpub-zz · 2022-08-11T10:18:43Z

@Eslafif,
When you extracted the page, the error in the PDF was fixed. Can you confirm this assumption by testing the code on your "small" file?

Meanwhile, looking at #454, I may have found a fix. As a patch, can you modify generic.py at line 495:

                if tok.isdigit():
                    # "The number ddd may consist of one, two, or three
                    # octal digits; high-order overflow shall be ignored.
                    # Three octal digits shall be used, with leading zeros
                    # as needed, if the next character of the string is also
                    # a digit." (PDF reference 7.3.4.2, p 16)
                    for _ in range(2):
                        ntok = stream.read(1)
                        if ntok.isdigit():
                            tok += ntok
                        else:
                            stream.seek(-1, 1)  # <--- to be added
                            break
                    tok = b_(chr(int(tok, base=8)))

I would like to confirm the fix before releasing the PR
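
To show what the added seek changes, here is a small self-contained sketch of the same loop working on an in-memory stream. It is an illustration only, not the library's code: read_octal_escape is a hypothetical name, and the extra check for end of stream before seeking back is an assumption of this sketch, not part of the patch above. Without the seek, the non-digit byte that ends a short escape (for example the closing parenthesis of the string) is swallowed, which presumably is how the parser ends up running off the end of the stream.

```python
import io


def read_octal_escape(stream, first_digit):
    """Read the rest of a \\ddd escape whose first digit is already consumed.

    Returns the decoded byte. Up to two more digits are read; a non-digit
    byte is pushed back so the caller still sees it.
    """
    tok = first_digit
    for _ in range(2):
        ntok = stream.read(1)
        if ntok.isdigit():
            tok += ntok
        else:
            # Step back over the byte we peeked at (nothing to undo at EOF).
            if ntok:
                stream.seek(-1, 1)
            break
    # High-order overflow is ignored, as the quoted PDF reference says.
    return bytes([int(tok, 8) & 0xFF])


# The escape "\05" followed by ")": the backslash and "0" are already consumed.
stream = io.BytesIO(b"5)")
value = read_octal_escape(stream, b"0")
remaining = stream.read()  # the ")" must still be available here
```

With the seek in place, the closing parenthesis is still in the stream after the escape has been decoded.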

pubpub-zz · 2022-08-21T19:26:57Z

@Eslafif,
can you confirm the fix ?

pubpub-zz · 2022-09-03T12:15:43Z

@MartinThoma,
I think we can close this issue

MartinThoma · 2022-09-06T19:37:20Z

I'm closing the issue as I believe it's solved.

If anybody still has this issue with the latest PyPDF2 version, please let us know.

mmariani3 · 2023-03-12T19:59:16Z

## Import
from PyPDF2 import PdfReader

## Declare the PdfReader instance
pdf = PdfReader(open(r'Bank Statements\20230218-statements-9172-.pdf', 'rb'), strict = False)

## Create a new text file and open it in write mode
with open(r'Personal', 'w') as f:
    ## Loop through the PDF pages
    for page in pdf.pages:
        text = page.extract_text()
        ## Write to the text file
        f.write(text)

mmariani3 · 2023-03-12T19:59:59Z

Output:
C:\Users\mmariani\Desktop\py4e\Personal>python Extract_pdf_text.py
Traceback (most recent call last):
  File "C:\Users\mmariani\Desktop\py4e\Personal\Extract_pdf_text.py", line 11, in <module>
    text = page.extract_text()
  File "C:\Users\mmariani\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
  File "C:\Users\mmariani\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_page.py", line 1356, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "C:\Users\mmariani\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic\_data_structures.py", line 877, in __init__
    self.__parse_content_stream(stream_bytes)
  File "C:\Users\mmariani\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic\_data_structures.py", line 929, in __parse_content_stream
    ii = self._read_inline_image(stream)
  File "C:\Users\mmariani\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic\_data_structures.py", line 970, in _read_inline_image
    raise PdfReadError("Unexpected end of stream")
PyPDF2.errors.PdfReadError: Unexpected end of stream

pubpub-zz · 2023-03-12T20:22:53Z

@mmariani3
You are using an old version. Please uninstall PyPDF2 and then install pypdf.
If you can still reproduce the problem, open a new issue and do not forget to provide the PDF.
If it is too private, you can send it to @MartinThoma by email.

mmariani3 · 2023-03-12T22:41:21Z

ah! Works like a charm now

mstamy2 added the Bug label Jun 5, 2014

mauro1855 mentioned this issue Jan 25, 2017

PyPDF fails virantha/pypdfocr#59

Open

craig3050 added a commit to craig3050/DrawingRenamer that referenced this issue Jul 11, 2017

Updated to remove strict mode,

5c48034

relaxes when dealing with slightly dodgy PDF's. py-pdf/pypdf#99

jeromerobert mentioned this issue Feb 5, 2019

Consider using PyPDF2 strict=False mode pdfarranger/pdfarranger#54

Closed

MartinThoma closed this as completed Jun 26, 2022

MartinThoma reopened this Aug 10, 2022

MartinThoma closed this as completed Sep 6, 2022
