'utf-16-be' codec can't decode byte 0xXY in position Z: truncated data #988

MartinThoma · 2022-06-14T16:03:41Z

When trying to extract the text from a PDF, I get an exception.

Environment

$ python -m platform
Linux-5.4.0-113-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.2.0

MCVE

This is a minimal, complete example that shows the issue with the pdf 971703.pdf:

from PyPDF2 import PdfReader
reader = PdfReader("971703.pdf")
reader.pages[1].extract_text()
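The error message in the title can be reproduced without any PDF, which may help when reasoning about the cause. The sketch below is plain Python and does not use PyPDF2. The byte string is made-up sample data, and it is only an assumption that the affected PDFs contain strings with a dangling byte like this:

```python
# UTF-16-BE needs two bytes per code unit, so an odd-length input cannot be
# decoded completely in strict mode.
raw = b"\x00A\x00"  # "A" followed by one dangling byte (made-up sample data)

try:
    raw.decode("utf-16-be")
    message = None
except UnicodeDecodeError as exc:
    # The codec reports the offending byte, its position, and the reason.
    message = str(exc)

print(message)
```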

MartinThoma · 2022-06-14T16:04:20Z

Other PDFs that show the same issue:

pdf/0126270bb6d7c7fa13697df5e8dc0f35.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/971/971703.pdf)
pdf/0d6cf76b240b1b31e2e07e996be98d00.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/974/974126.pdf)
pdf/233f654b83fb5c763154fc33e55ed93e.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/923/923755.pdf)
pdf/2ae57192482a6dbdc5872f8afa16ecae.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/934/934549.pdf)
pdf/3333a2052bcd17a5dca8a5d995a33b6a.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/988/988668.pdf)
pdf/3d8c30d01669a921996be5be5ce097d2.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/924/924157.pdf)
pdf/424bec82297536c16169fbf988a9d299.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/968/968238.pdf)
pdf/46cc1e964bed800f35c5b76e81d69ac3.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/994/994042.pdf)
pdf/5563b37d72219b4f528e7ec49d85cf8b.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/933/933401.pdf)
pdf/556a41ca786d29c59bcce1c2f32af526.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/909/909689.pdf)
pdf/688a80d4470201f5415782c52bffda6d.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/941/941991.pdf)
pdf/78e135de6ff4c96ab0ddb91842445f91.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/932/932107.pdf)
pdf/7e8562a2627a341bd0cd7c22701fcef5.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/991/991764.pdf)
pdf/8384ba773351d45b9a36687f342c16df.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/954/954324.pdf)
pdf/8804b79ce30bd66ff4aad0be2934db58.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/959/959173.pdf)
pdf/8d076f78ab0a6da054eaaa9c4c962cf8.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/934/934687.pdf)
pdf/9b79ce4ca64acb888be2264cda2a3faa.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/924/924162.pdf)
pdf/9e2b254d5950ee5f63f9476d680c9600.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/977/977972.pdf)
pdf/9eb252a78b49cb2a6bf1de161b264ecd.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/956/956263.pdf)
pdf/a3d2f36a76403cb87a5e015da7dfed45.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/919/919871.pdf)
pdf/aa04f319bf04244687cfccd63e4f681d.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/971/971570.pdf)
pdf/afb988a1c5ea0242783e747fec1c90b2.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/928/928472.pdf)
pdf/b3ca7c2a2b2f7f9b1da1df3cc9d947c8.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/989/989792.pdf)
pdf/c5201b9f84b73a914694c949873638bf.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/927/927333.pdf)
pdf/cb55704177563b5f9d96c720c2fe7182.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/957/957479.pdf)
pdf/d1c79d78394a4d9f3f66591c6d11d9ba.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/999/999404.pdf)
pdf/d5ab1d6e22704bf84cb38ce2183c1e00.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/947/947220.pdf)
pdf/d617cfc7df8382c8d12b495cb9cd902f.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/972/972424.pdf)
pdf/d77bf4f69db00b85ff509acdd2120bd8.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/961/961732.pdf)
pdf/e962b3ef86506611a7047f0263e12da7.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/942/942904.pdf)
pdf/eeba7bf4756762ec34c11d634903c7a2.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/999/999348.pdf)
pdf/f8be1c2a842ac9172f63f40871ae690b.pdf (https://corpora.tika.apache.org/base/docs/govdocs1/935/935956.pdf)

pubpub-zz · 2022-06-14T20:25:52Z

The data bytes do not match the expected encoding. A robustness improvement is proposed in the referenced PR.
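One general way to make such decoding more robust is to fall back to a lenient error handler when strict decoding fails. This is only a sketch of that idea: `decode_utf16be_lenient` is a hypothetical helper, not a PyPDF2 function, and it is not necessarily what the referenced PR does.

```python
def decode_utf16be_lenient(data: bytes) -> str:
    """Decode UTF-16-BE bytes, tolerating malformed or truncated input.

    Hypothetical helper for illustration only; not part of PyPDF2.
    """
    try:
        # Strict decoding keeps well-formed data exactly as it is.
        return data.decode("utf-16-be")
    except UnicodeDecodeError:
        # Built-in "replace" handler: undecodable bytes become U+FFFD
        # instead of raising an exception.
        return data.decode("utf-16-be", errors="replace")


well_formed = decode_utf16be_lenient(b"\x00A\x00B")
truncated = decode_utf16be_lenient(b"\x00A\x00")  # dangling final byte
print(repr(well_formed), repr(truncated))
```

The trade-off is that extraction keeps going and returns partial text rather than failing, at the cost of replacement characters appearing where the data was malformed.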

MartinThoma self-assigned this Jun 14, 2022

MartinThoma added is-robustness-issue From a users perspective, this is about robustness and removed is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF labels Jun 14, 2022

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jun 14, 2022

ROB : utf-16-be' codec can't decode (...) py-pdf#988

641804f

the data bytes are not matching encoding expectation

MartinThoma mentioned this issue Jun 15, 2022

ROB : utf-16-be' codec can't decode (...) #988 #995

Merged

MartinThoma closed this as completed in 034d7a9 Jun 15, 2022
