Stream has ended unexpectedly error on certain PDF files #99

Closed
LunkRat opened this issue May 13, 2014 · 37 comments
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@LunkRat

LunkRat commented May 13, 2014

We process dozens of PDF files per day in our automated script that uses PyPDF2 version 1.21 as part of its process. A few files have been failing with the error pasted below. I can provide the PDF file that is having this error, just let me know how you would like me to send it. Thanks!

PdfReadWarning: Invalid stream (index 0) within object 62 0: Stream has ended unexpectedly [pdf.py:1128]
Traceback (most recent call last):
  File "d:\scripts\mtx-coverpage\mtx-coverpage.py", line 99, in <module>
    addpage.write(outfile)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\merger.py", line 209, in write
    self.output.write(fileobj)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 277, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 365, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 365, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 350, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 365, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 379, in _sweepIndirectReferences
    newobj = self._sweepIndirectReferences(externMap, newobj)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 370, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 1149, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 1131, in _getObjectFromStream
    raise utils.PdfReadError("Can't read object stream: %s"%e)
PyPDF2.utils.PdfReadError: Can't read object stream: Stream has ended unexpectedly
@LunkRat
Author

LunkRat commented May 13, 2014

The PDF file that triggered this error can be found here: https://drive.google.com/file/d/0B_P1mlgsZIJpRjNnQkxCenUzTkU/edit?usp=sharing

@mstamy2
Collaborator

mstamy2 commented May 14, 2014

Thank you for the detailed bug report! I will try to track down the issue soon, though I will be unavailable for the following week.

@mstamy2
Collaborator

mstamy2 commented May 26, 2014

Hello,
Try passing strict = False to PdfFileReader(). I still need to track down exactly why the exception is thrown, but this should produce the output without error. (If not, let us know)

@LunkRat
Author

LunkRat commented May 27, 2014

Thanks for the response. I am using PdfFileMerger() in this script, and the error occurs on .write. How do I pass strict = False in this case, since I am not calling PdfFileReader() directly?

@mstamy2
Collaborator

mstamy2 commented May 27, 2014

PdfFileMerger() constructor takes a strict parameter as well. It should throw a warning instead of an exception when set to False.
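
For readers looking for a concrete example of the suggestion above, here is a minimal sketch of passing strict=False to the merger. It uses the PdfFileMerger name from the PyPDF2 versions discussed in this thread. The function name merge_pdfs and the merger_cls parameter are hypothetical; the latter exists only so the sketch can be exercised without real PDF files.

```python
def merge_pdfs(paths, output_path, merger_cls=None):
    """Concatenate the PDFs in `paths` into `output_path` in non-strict mode."""
    if merger_cls is None:
        # Imported lazily so the sketch can be loaded without PyPDF2 installed.
        from PyPDF2 import PdfFileMerger
        merger_cls = PdfFileMerger

    # strict=False asks the library to warn about malformed input
    # instead of raising PdfReadError.
    merger = merger_cls(strict=False)
    try:
        for path in paths:
            merger.append(path)  # append() accepts a path or a binary file object
        with open(output_path, "wb") as out:
            merger.write(out)
    finally:
        merger.close()


# Example call (file names are placeholders):
# merge_pdfs(["cover.pdf", "report.pdf"], "combined.pdf")
```

A warning may still be printed for a damaged input, as reported later in the thread; only the exception is suppressed.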

@LunkRat
Author

LunkRat commented May 27, 2014

Thanks! I set strict = False in PdfFileMerger() (thought I had tried that already but must have been my error) and it solved our issue. I still get a warning, but the output file is written as expected so that's great. I'll let you close the issue when you feel that the underlying cause is resolved. Thanks again.

@andrewstolarz

Hello,

I have recently been getting a similar error.

Can you please post an example of how and where to implement the fix?

Thank you!

@andrewstolarz

Hello,

Just to give an update: I manually edited the pdf.py file to set strict = False. (I was hoping not to do it this way, as I don't want to run into issues later on when I upgrade.)

After running the script again with strict set to False, it splits the PDFs without a problem, but it still reports the following:

PdfReadWarning: Invalid stream (index 77) within object 1444 0: Stream has ended unexpectedly [pdf.py:1162]
PdfReadWarning: Invalid stream (index 62) within object 2696 0: Stream has ended unexpectedly [pdf.py:1162]

Any ideas?

@mstamy2
Collaborator

mstamy2 commented Jun 2, 2014

Well, you don't have to change pdf.py in order to set strict to False. You can set the value of strict when you first create your PdfFileMerger() or PdfFileReader() object in its constructor, and it defaults to true if you don't specify a value. To specify False, use
input = PdfFileReader([your file], strict = False)

When in strict mode, PyPDF2 quits when encountering this stream error and throws a PdfReadError. When strict is False, it ignores this error but instead gives a warning like you saw (then continues with rest of program as normal).

Ignoring the error doesn't seem to harm the output in any way (as you noticed), so we need to investigate why the error is thrown at all (maybe PyPDF2 is too strict on slightly 'irregular' PDFs?). Or maybe the error is significant but the output PDFs haven't displayed any symptoms?

Hope that made a little sense.
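
The difference between the two modes can be illustrated with a toy reader. This is only a sketch of the raise-versus-warn pattern described above, not PyPDF2's implementation; StreamError, StreamWarning, and read_exact are hypothetical names.

```python
import warnings


class StreamError(Exception):
    """Stand-in for a parser's hard failure (hypothetical name)."""


class StreamWarning(UserWarning):
    """Stand-in for a parser's soft failure (hypothetical name)."""


def read_exact(data, length, strict=True):
    """Return the first `length` bytes of `data`, handling short input by mode."""
    chunk = data[:length]
    if len(chunk) < length:
        message = "Stream has ended unexpectedly"
        if strict:
            raise StreamError(message)
        # Non-strict: report the problem and carry on with what was read.
        warnings.warn(message, StreamWarning)
    return chunk


# Non-strict mode returns the partial data and records a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    partial = read_exact(b"abc", 5, strict=False)
print(partial, [str(w.message) for w in caught])
```

In this model the non-strict caller receives whatever could be read, so whether the output is intact depends on whether the missing bytes mattered, which is the open question raised above.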

@mstamy2 mstamy2 added the Bug label Jun 5, 2014
@cryptid11

cryptid11 commented Jan 29, 2015

Maybe this works for PdfFileReader, but not for PdfFileMerger.

I tried merger = PdfFileMerger(strict = False)

and also merger.append(PdfFileReader(open(os.path.join(files_dir, f), "rb"), strict = False))
but it gives the same problem as before:

  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 405, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 386, in _sweepIndirectReferences
    data[key] = value
  File "/usr/lib/python2.7/site-packages/PyPDF2/generic.py", line 487, in __setitem__
    if not isinstance(key, PdfObject):
RuntimeError: maximum recursion depth exceeded in __instancecheck__

I also set

    def __init__(self, stream, strict=False, warndest = None, overwriteWarnings = True):

at line 891 of pdf.py, and that doesn't work either. Any solution?

craig3050 added a commit to craig3050/DrawingRenamer that referenced this issue Jul 11, 2017
relaxes when dealing with slightly dodgy PDF's.
py-pdf/pypdf#99
@appurwar
Copy link

appurwar commented Jul 19, 2018

Try merging your PDFs by using the 'append' and 'merge' functionality of PyPDF2 instead.

I faced the same issue and following approach worked for me -

from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()

# open both inputs in binary mode; the with block closes them after writing
with open("file1.pdf", "rb") as input1, open("file2.pdf", "rb") as input2:
    # add the first 3 pages of first file to output
    merger.append(fileobj=input1, pages=(0, 3))

    # insert the first page of second file into the output beginning after the second page
    merger.merge(position=2, fileobj=input2, pages=(0, 1))

    # Write to an output PDF document
    with open("document-output.pdf", "wb") as output:
        merger.write(output)

Remove the 'pages' argument in 'append' and 'merge' functions to merge files instead of specific pages.

@khibma

khibma commented Aug 9, 2018

I just started to experience this issue when calling PdfFileReader. I haven't changed anything in the code; maybe a Windows update? None of the above suggestions to set strict = False seem to help. I have to go in and comment out the file work inside _showwarning, leaving the function as a bare pass, to get anywhere.

The only difference between my case and the ones above is that it only happens after I've run the code once. I'm calling this from within ArcGIS (mapping software). I have to close the software and re-open it to get the first successful run. This seems to indicate that something is being held onto after the first run, but again, it just started happening. I realize this probably doesn't help you move towards a fix; I'm just reporting to up the user count for this.

Edit - "fix":
Despite ignoring strict and setting overwriteWarnings to False, I'd still get the error. I found I can get around the error by resetting Python's built-in warnings to the original stderr.

import sys
import warnings

warnings.resetwarnings()
sys.stderr = sys.__stderr__
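
The workaround above resets global state after the fact. A variation, sketched here with only the standard library, saves the warning hook and stderr before calling into code that may replace them and restores both afterwards. It assumes, as the comment suggests, that the trouble comes from a replaced warnings.showwarning or sys.stderr persisting between runs in a long-lived host process. The name preserved_warning_hooks is hypothetical.

```python
import contextlib
import sys
import warnings


@contextlib.contextmanager
def preserved_warning_hooks():
    """Restore warnings.showwarning and sys.stderr when the block exits."""
    saved_showwarning = warnings.showwarning
    saved_stderr = sys.stderr
    try:
        yield
    finally:
        # Undo any replacement made inside the block, even if it raised.
        warnings.showwarning = saved_showwarning
        sys.stderr = saved_stderr


# Usage: wrap the PDF-handling code.
# with preserved_warning_hooks():
#     ...  # create the reader and process the file here
```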

@reportgunner

@appurwar the error is raised whether append or merge is used. The problem here seems to be the format of the PDF that is being appended, so it's not PyPDF2's fault. A sensible workaround seems to be to reformat the PDF in some other way before passing it to PyPDF2 once this is detected.

strict=False doesn't fix this either; I came here after the error happened with strict=False set.
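
The "detect, then reformat" workaround can be sketched as a small retry wrapper. The thread does not say how the PDF should be rewritten, so the repair step is left as a caller-supplied function (for example, one that runs an external PDF rewriting tool and returns the new path). process_with_repair and its parameters are hypothetical names.

```python
def process_with_repair(path, process, repair, errors=(Exception,)):
    """Run process(path); if it fails, run it once more on repair(path)."""
    try:
        return process(path)
    except errors:
        # The first attempt failed: produce a rewritten copy and retry once.
        # A second failure propagates to the caller unchanged.
        repaired_path = repair(path)
        return process(repaired_path)
```

Passing the library's own read error class as errors keeps unrelated failures, such as a missing file, from triggering the repair step.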

@MartinThoma
Member

I'm closing this issue now as it seems to be mostly about using strict=False which is the current default. Let me know if you still have this issue (with a full Traceback + example code ... and a PDF if possible)

@puri-gagan

puri-gagan commented Jul 6, 2022

Thanks @mstamy2. Passing strict=False when reading the PDF with PdfFileReader() works great, and the rewritten or merged file is not harmed. However, if some workaround has been applied to the PDF file that affects its structure, the same error can occur even with strict=False. That is not a problem with this package.

@Eslafif

Eslafif commented Aug 10, 2022

Hello,
I get this error with PdfFileReader() even though I'm using strict=False.
Any help?

Traceback (most recent call last):
  File "/New Volume/projects/Files/PDF/scrap.py", line 30, in <module>
    extracted_data=extracted_data+(pdfReader.getPage(z).extractText().splitlines())
  File "/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 1045, in extractText
    return self.extract_text(Tj_sep=Tj_sep, TJ_sep=TJ_sep)
  File "/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 968, in extract_text
    content = ContentStream(content, self.pdf)
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1088, in __init__
    self.__parseContentStream(stream)
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1119, in __parseContentStream
    operands.append(read_object(stream, None))
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1228, in read_object
    return readStringFromStream(stream)
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 382, in readStringFromStream
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

@MartinThoma
Member

Which version of PyPDF2 do you use?

@Eslafif

Eslafif commented Aug 10, 2022

Version 2.0.0

@pubpub-zz
Collaborator

pubpub-zz commented Aug 10, 2022

@Eslafif,
can you please upgrade to the latest version to confirm the error is still present? If so, can you provide the PDF file and specify on which page you are getting the issue?

@Eslafif

Eslafif commented Aug 10, 2022

Updated, and the same error still exists.

@pubpub-zz
Collaborator

> Updated, and the same error still exists.

And can you provide the PDF file and the page number? Without them, no analysis can be done.
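
To find out which page to report, one option is to try every page and record the ones that fail. The following is a minimal sketch: find_failing_pages is a hypothetical helper, and it only assumes that each page object has an extract_text() method, as the page objects in the tracebacks in this thread do.

```python
def find_failing_pages(pages):
    """Return (index, error) pairs for pages whose extract_text() raises."""
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:  # broad on purpose: any parse failure is of interest
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return failures


# With a PyPDF2 version that provides PdfReader (the path is a placeholder):
# from PyPDF2 import PdfReader
# print(find_failing_pages(PdfReader("statement.pdf").pages))
```

The indices are zero-based, matching the reader.pages[...] indexing used elsewhere in the thread.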

@MartinThoma MartinThoma reopened this Aug 10, 2022
@Eslafif

Eslafif commented Aug 10, 2022

RBL BANK.pdf

This is the page that gives the error.

@pubpub-zz
Collaborator

@Eslafif
I've tried the following code with your file successfully.
import PyPDF2
p = PyPDF2.PdfReader("c:/RBL.BANK.pdf")
p.pages[0].extract_text()
Can you confirm that you are getting the same results?

@Eslafif

Eslafif commented Aug 10, 2022

Tested and giving the same error

@pubpub-zz
Collaborator

Can you share the output, please?

@Eslafif

Eslafif commented Aug 10, 2022

Traceback (most recent call last):
  File "/media/New Volume/projects/bank statement/Banks statements/test.py", line 11, in <module>
    extracted_data=pdfReader.pages[17].extract_text()
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 968, in extract_text
    content = ContentStream(content, self.pdf)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1088, in __init__
    self.__parseContentStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1119, in __parseContentStream
    operands.append(read_object(stream, None))
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1228, in read_object
    return readStringFromStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 382, in readStringFromStream
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

@pubpub-zz
Collaborator

You are not using my code with the file you've provided. Can you tell me what the result is with my program, please?

@Eslafif

Eslafif commented Aug 10, 2022

Traceback (most recent call last):
  File "/media/New Volume/projects/bank statement/Banks statements/test.py", line 7, in <module>
    p.pages[17].extract_text()
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 968, in extract_text
    content = ContentStream(content, self.pdf)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1088, in __init__
    self.__parseContentStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1119, in __parseContentStream
    operands.append(read_object(stream, None))
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1228, in read_object
    return readStringFromStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 382, in readStringFromStream
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

@Eslafif

Eslafif commented Aug 10, 2022

This is with your code.

The difference is that the file is big, so I only attached the page with the problem.

@pubpub-zz
Collaborator

pubpub-zz commented Aug 11, 2022

@Eslafif,
My assumption is that extracting the page fixed the error in the PDF. Can you confirm this by testing the code on your "small" file?

Meanwhile, looking at #454, I may have found a fix. As a patch, can you modify generic.py line 495:

                if tok.isdigit():
                    # "The number ddd may consist of one, two, or three
                    # octal digits; high-order overflow shall be ignored.
                    # Three octal digits shall be used, with leading zeros
                    # as needed, if the next character of the string is also
                    # a digit." (PDF reference 7.3.4.2, p 16)
                    for _ in range(2):
                        ntok = stream.read(1)
                        if ntok.isdigit():
                            tok += ntok
                        else:
                            stream.seek(-1, 1)  # <--- to be added
                            break
                    tok = b_(chr(int(tok, base=8)))

I would like to confirm the fix before releasing the PR
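
Why the added seek matters can be shown with a stripped-down version of the loop. This is a toy reduction for illustration, not PyPDF2's code: read_octal_escape and its rewind flag are hypothetical. When an escape has fewer than three digits, the loop has already read one byte too many. Without stepping back, that byte is lost; if it happens to be the closing parenthesis of the string, a parser would keep reading past the end of the string, which would be consistent with the "Stream has ended unexpectedly" error.

```python
import io

OCTAL_DIGITS = b"01234567"


def read_octal_escape(stream, rewind=True):
    """Decode the digits of a `\\ddd` escape; the backslash is already consumed."""
    tok = stream.read(1)  # first digit, assumed valid by the caller
    for _ in range(2):
        ntok = stream.read(1)
        if len(ntok) == 1 and ntok in OCTAL_DIGITS:
            tok += ntok
        else:
            # The byte just read is not part of the escape. Step back so the
            # caller sees it again; skip the seek at end of input (empty read).
            if rewind and ntok:
                stream.seek(-1, io.SEEK_CUR)
            break
    return bytes([int(tok, 8) & 0xFF])  # high-order overflow is ignored


# With the seek, the ")" after the two-digit escape is still available.
fixed = io.BytesIO(b"53)X")
print(read_octal_escape(fixed), fixed.read(1))

# Without it, the ")" has been swallowed and the next byte is "X".
buggy = io.BytesIO(b"53)X")
print(read_octal_escape(buggy, rewind=False), buggy.read(1))
```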

@pubpub-zz
Collaborator

@Eslafif,
can you confirm the fix ?

@pubpub-zz
Collaborator

@MartinThoma,
I think we can close this issue

@MartinThoma
Member

I'm closing the issue as I believe it's solved.

If anybody still has this issue with the latest PyPDF2 version, please let us know.

@mmariani3

mmariani3 commented Mar 12, 2023

## Import
from PyPDF2 import PdfReader

## Declare the PdfReader instance
pdf = PdfReader(open(r'Bank Statements\20230218-statements-9172-.pdf', 'rb'), strict = False)

## Create a new text file and open it in write mode
with open(r'Personal', 'w') as f:
    ## Loop through the PDF pages
    for page in pdf.pages:
        text = page.extract_text()
        ## Write to the text file
        f.write(text)

@mmariani3

mmariani3 commented Mar 12, 2023

Output:
C:\Users\mmariani\Desktop\py4e\Personal>python Extract_pdf_text.py
Traceback (most recent call last):
  File "C:\Users\mmariani\Desktop\py4e\Personal\Extract_pdf_text.py", line 11, in <module>
    text = page.extract_text()
  File "C:\Users\mmariani\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
  File "C:\Users\mmariani\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_page.py", line 1356, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "C:\Users\mmariani\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic\_data_structures.py", line 877, in __init__
    self.__parse_content_stream(stream_bytes)
  File "C:\Users\mmariani\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic\_data_structures.py", line 929, in __parse_content_stream
    ii = self._read_inline_image(stream)
  File "C:\Users\mmariani\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic\_data_structures.py", line 970, in _read_inline_image
    raise PdfReadError("Unexpected end of stream")
PyPDF2.errors.PdfReadError: Unexpected end of stream

@pubpub-zz
Collaborator

@mmariani3
You are using an old version. Please uninstall PyPDF2 and then install pypdf.
If you can still reproduce the problem, open a new issue and do not forget to provide the PDF.
If it is too private, you can send it by email to @MartinThoma instead.

@mmariani3

ah! Works like a charm now
