PdfReadWarning: Superfluous whitespace found in object header b'1' b'0' [pdf.py:1666] #576

Closed
kalkovid19 opened this issue Aug 24, 2020 · 12 comments
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
is-robustness-issue: From a users perspective, this is about robustness

Comments

@kalkovid19

kalkovid19 commented Aug 24, 2020

Hi all,
I am converting a PDF file to text for processing. The code was working fine, but recently it started reporting messages like the one below and no longer extracts any text:
PdfReadWarning: Superfluous whitespace found in object header b'1' b'0' [pdf.py:1666]

MCVE

from PyPDF2 import PdfReader

reader = PdfReader("TN_24.08.2020.pdf")
text = reader.pages[0].extract_text()
assert "Directorate" in text, text

My PDF file and processing code are attached:
pdf2txt.py.txt
TN_24.08.2020.pdf

Thanks in advance

@luke4u

luke4u commented Aug 24, 2020

I got the same issue. Can anyone please advise how to resolve it?

@Grazx

Grazx commented Nov 25, 2020

Well, I had the same problem when I was trying to stamp a template PDF (made by me) onto an existing one.
The solution:
I used Foxit Phantom to convert my template file from PDF 1.4 to PDF 1.7, and the "PdfReadWarning:" message stopped showing.

Hope it helps.

EDIT:
I forgot to mention that I also used the "PDF Optimizer" option in Phantom to "flatten" text and objects (more on that in https://www.foxitsoftware.com/blog/pdf-toolkit-pdf-optimizer/)

@lmw0320

lmw0320 commented Feb 16, 2022

Well I had the same problem, since I was trying to stamp a template PDF (made by me), on an existing one. The solution: I used Foxit Phatom to convert my Template file from PDF_1.4 to PDF_1.7 and the error "PdfReadWarning:" stopped showing.

Hope it helps.

EDIT: I forgot to mention I also use the "PDF Optimizer" option in Phantom to "flatten" text and objects (more on that in https://www.foxitsoftware.com/blog/pdf-toolkit-pdf-optimizer/)

Hi, I have a lot of PDF files and want to extract the content of all of them, so your solution is not practical as a routine operation. Do you have a better, code-based solution that avoids processing each file by hand? Thanks.

@MartinThoma
Member

Could somebody add a minimal Python script that shows the issue with the given files?

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Apr 7, 2022
@JGMSPY

JGMSPY commented Apr 15, 2022

Here is the code that produced the problem in question. I more or less arbitrarily picked a PDF file with 4 pages. I also got the error when using a single-page PDF file, even though the resulting file was OK.

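from PyPDF2 import PdfFileReader, PdfFileWriter

# Keep both input files open until the output has been written,
# because page content is read from the underlying streams on demand.
with open("PythonHelp.pdf", "rb") as template_file, \
        open("FactuurModelIkke.pdf", "rb") as watermark_file:
    template = PdfFileReader(template_file)
    watermark = PdfFileReader(watermark_file)
    output = PdfFileWriter()

    # Stamp the first watermark page onto every page of the template.
    for i in range(template.getNumPages()):
        page = template.getPage(i)
        page.mergePage(watermark.getPage(0))
        output.addPage(page)

    # Write the result; the with-block closes the output file afterwards.
    with open("waterMarked_PDF.pdf", "wb") as out_file:
        output.write(out_file)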

FactuurModelIkke.pdf
PythonHelp.pdf

Hope you can solve this.

@MartinThoma MartinThoma added is-robustness-issue From a users perspective, this is about robustness Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Apr 16, 2022
@Rapid1898-code

Hello, I have the same problem.
If anybody finds a solution for it, that would be great.

@MartinThoma
Member

@Rapid1898-code "me too" comments don't provide any value. They distract and prevent devs from working on the issue.

If you want to help, please provide a full minimal example:

  • code
  • pdf
  • traceback
  • environment (python version, py-pdf version)
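A minimal sketch of how the environment details for such a report could be collected, using only the standard library. The function name `describe_environment` and the package names it checks are assumptions made for illustration, not part of PyPDF2.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages=("PyPDF2", "pypdf")):
    """Return the Python version, the platform, and the installed version of each package."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this interpreter's environment.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

The printed lines can be pasted into the report next to the code, the PDF and the traceback.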

@prz38573485

I ran into the same issue.
PdfReadWarning: Superfluous whitespace found in object header b'225' b'0' [_reader.py:891]

@pubpub-zz
Collaborator

All,
What you are reporting are warnings; they do not stop the program. PdfReadWarning messages are not emitted if you pass strict=False when calling the PdfFileReader constructor. In version 1.27 the default value is True; in the current 2.0.0-dev branch (for the next release) it will be changed to False, so by default the warnings will disappear without any change to your programs 😉
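A minimal sketch of opening a file with strict=False, assuming PyPDF2 2.x, where the reader class is called `PdfReader` (as in the MCVE from the first post). The helper name `extract_first_page_text` is made up for illustration.

```python
def extract_first_page_text(path):
    """Open `path` in non-strict mode and return the text of its first page."""
    # Imported inside the function so the sketch can be loaded
    # even where PyPDF2 is not installed.
    from PyPDF2 import PdfReader

    # strict=False asks the reader to tolerate minor deviations from the
    # PDF specification instead of reporting them as PdfReadWarning.
    reader = PdfReader(path, strict=False)
    return reader.pages[0].extract_text()
```

With the file from the first post this would be called as `extract_first_page_text("TN_24.08.2020.pdf")`. On PyPDF2 1.27 the equivalent is `PdfFileReader(stream, strict=False)` on a file opened in binary mode.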

@JGMSPY

JGMSPY commented May 18, 2022

All,
With great help from the samples, I managed to get the routine working without problems. This was my first Python experience and I learned a lot. Thank you.

@MartinThoma
Member

I just checked with the current main branch and the minimal example from the first post - the issue is still there.

@MartinThoma
Member

I've just executed the MCVE from the first post with the latest version of PyPDF2. Seems to work 🎉


9 participants