'utf-16-be' codec can't decode byte 0xXY in position Z: truncated data #988

Closed
MartinThoma opened this issue Jun 14, 2022 · 2 comments
Labels: Has MCVE, is-robustness-issue, workflow-text-extraction

Comments


MartinThoma commented Jun 14, 2022

When trying to extract the text from a PDF, I get an exception.

Environment

$ python -m platform
Linux-5.4.0-113-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.2.0

MCVE

This is a minimal, complete example that reproduces the issue with the PDF 971703.pdf:

from PyPDF2 import PdfReader
reader = PdfReader("971703.pdf")
reader.pages[1].extract_text()
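
Until the decoder is made more tolerant, one possible stopgap is to extract page by page and record the pages that fail, so a single malformed page does not abort the whole document. This is only a sketch: `extract_text_per_page` is a hypothetical helper, not part of PyPDF2, and it assumes only that each page object exposes `extract_text()`.

```python
def extract_text_per_page(pages):
    """Extract text from each page, collecting decode failures instead of raising.

    `pages` is any iterable of objects with an extract_text() method,
    for example PdfReader("971703.pdf").pages (assumed, not verified here).
    Returns (texts, failures); failures holds (page_index, error_message).
    """
    texts, failures = [], []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except UnicodeDecodeError as exc:
            # Keep the page slot so indices still line up with the PDF.
            texts.append("")
            failures.append((index, str(exc)))
    return texts, failures
```

The returned `failures` list makes it easy to report which pages triggered the codec error.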
MartinThoma added the is-bug, workflow-text-extraction, and Has MCVE labels Jun 14, 2022
MartinThoma self-assigned this Jun 14, 2022
MartinThoma (Member, Author) commented

Other PDFs that show the same issue:

MartinThoma added the is-robustness-issue label and removed the is-bug label Jun 14, 2022
pubpub-zz (Collaborator) commented

The data does not respect the expected encoding. A robustness improvement is proposed in the referenced PR.
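
To illustrate the failure mode: UTF-16-BE uses two bytes per code unit, so a byte string with an odd length cannot be fully decoded and Python reports "truncated data". The sketch below reproduces that with made-up bytes and shows one tolerant fallback using the standard `errors="replace"` handler. `decode_utf16_be_lenient` is a hypothetical name, and this is not necessarily the approach taken in the referenced PR.

```python
# 'A' encoded as UTF-16-BE (00 41) followed by one dangling byte.
raw = b"\x00A\x00"

def decode_utf16_be_lenient(data: bytes) -> str:
    """Decode UTF-16-BE, substituting U+FFFD for bytes that cannot be decoded."""
    try:
        return data.decode("utf-16-be")
    except UnicodeDecodeError:
        # Fall back to a tolerant decode instead of propagating the error.
        return data.decode("utf-16-be", errors="replace")

# Strict decoding raises the same kind of error as in the issue title.
try:
    raw.decode("utf-16-be")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = str(exc)

lenient_text = decode_utf16_be_lenient(raw)
```

The lenient variant keeps the decodable prefix, trading exactness for robustness; whether silently substituting characters is acceptable depends on the use case.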

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jun 14, 2022
the data bytes are not matching encoding expectation