Unhandled Exception in the library crashes code `binascii.Error: Odd-length string` #1370

change-is-constant · 2022-09-27T02:09:30Z

What happened? What were you trying to achieve?
I am trying to read text from a PDF file, but I get an error that I don't think is caused by my code. The code crashes on the particular PDF linked below.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.22000-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.9

Code + PDF

This is a minimal, complete example that shows the issue:

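    import PyPDF2

    # The function name is only illustrative; the original snippet was the body of a larger function.
    def first_page_has_keywords(afile, keywords):
        if afile.endswith(".pdf"):
            with open(afile, "rb") as fh:
                pdfReader = PyPDF2.PdfFileReader(fh, strict=False)
                texts = pdfReader.getPage(0).extract_text()  # the ERROR happens on this line
            for kw in keywords:
                if kw not in texts:
                    return False
            return True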

PDF file: https://d4duas3s44z1s.cloudfront.net/Dat2020Live/document/test/question/Question_Report_299.pdf
Please don't add it to your tests!

Traceback

This is the complete Traceback I see:

File "path\to\there\Python\Python310\lib\site-packages\PyPDF2\_cmap.py", line 291, in parse_bfrange
    unhexlify(fmt % a).decode(
binascii.Error: Odd-length string
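The failing call in the traceback, `unhexlify(fmt % a)`, can be reproduced in isolation. The sketch below is not the library's code: `hex_to_text` is a hypothetical helper, and it only assumes that the format string ends up with an odd number of hex digits, which is one plausible way for an unusual cmap entry to trigger this error.

```python
from binascii import Error as BinasciiError, unhexlify


def hex_to_text(code: int, width: int) -> str:
    """Format `code` as `width` hex digits and decode the bytes as UTF-16-BE.

    Hypothetical helper for illustration; it mirrors the shape of the
    failing line in the traceback, not the library's actual implementation.
    """
    fmt = b"%%0%dX" % width  # width=4 gives b"%04X"
    return unhexlify(fmt % code).decode("utf-16-be", "surrogatepass")


# An even number of hex digits converts cleanly: b"0041" -> "A"
print(hex_to_text(0x41, 4))

# An odd number of hex digits (b"00041") cannot be split into whole bytes.
try:
    hex_to_text(0x41, 5)
except BinasciiError as exc:
    print(exc)  # Odd-length string
```

`unhexlify` needs two hex digits per byte, so any code path that builds a hex string of odd length fails before decoding even starts.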

change-is-constant · 2022-09-27T02:24:34Z

Also, I am getting these unwanted log messages on the console for some files, even after passing strict=False:

Multiple definitions in dictionary at byte 0x83b for key /Filter
Multiple definitions in dictionary at byte 0x358a for key /Filter
Multiple definitions in dictionary at byte 0x3b57 for key /Filter
Multiple definitions in dictionary at byte 0x42fe for key /Filter
Multiple definitions in dictionary at byte 0x4709 for key /Filter
Multiple definitions in dictionary at byte 0x4b08 for key /Filter
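Messages like these can usually be silenced with the standard `logging` module. This is a minimal sketch, assuming the library reports them through loggers named after its modules (children of `"PyPDF2"`, or `"pypdf"` in newer releases); the child logger name `"PyPDF2.generic"` below is only used to demonstrate the inheritance.

```python
import logging

# Assumption: the warnings are emitted via loggers under "PyPDF2" / "pypdf".
# Raising the parent's level hides WARNING messages from all child loggers
# that have no level of their own.
for name in ("PyPDF2", "pypdf"):
    logging.getLogger(name).setLevel(logging.ERROR)

# A child logger without its own level inherits the parent's effective level.
child = logging.getLogger("PyPDF2.generic")
print(child.isEnabledFor(logging.WARNING))  # False
print(child.isEnabledFor(logging.ERROR))    # True
```

This hides the messages only; it does not change how the file is parsed.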

pubpub-zz · 2022-09-27T21:34:36Z

Thanks for this example.
Your file shows some very odd but acceptable cmap definitions. I've produced a fix that improves robustness.
I've also noted that the non-Roman characters are not read properly, but they cannot be copied using Acrobat Reader either: the 'Translation table' within the file is not correct.

About the logs: I have not been able to reproduce them after the fix (I did not try before it). Can you confirm?

pubpub-zz · 2022-09-28T16:45:25Z

Adding a test sample with similar results:
cmap1370.pdf

MatteoRiva95 · 2024-02-05T08:15:23Z

Hello everyone, I know this is an old topic, but I am facing the same issue (binascii.Error: Odd-length string + Multiple definitions in dictionary at byte 0x3f42b for key /MediaBox) and I do not understand how it was fixed. I am using PyPDFDirectoryLoader from LangChain and the exact error is:

File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py", line 247, in load
    raise e
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py", line 239, in load
    sub_docs = loader.load()
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py", line 162, in load
    return list(self.lazy_load())
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py", line 172, in lazy_load
    yield from self.parser.parse(blob)
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/base.py", line 102, in parse
    return list(self.lazy_parse(blob))
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/parsers/pdf.py", line 95, in lazy_parse
    yield from [
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/parsers/pdf.py", line 97, in
    page_content=page.extract_text()
File "/usr/local/lib/python3.10/dist-packages/pypdf/_page.py", line 2076, in extract_text
    return self._extract_text(
File "/usr/local/lib/python3.10/dist-packages/pypdf/_page.py", line 1588, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
File "/usr/local/lib/python3.10/dist-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
File "/usr/local/lib/python3.10/dist-packages/pypdf/_cmap.py", line 58, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
File "/usr/local/lib/python3.10/dist-packages/pypdf/_cmap.py", line 237, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
File "/usr/local/lib/python3.10/dist-packages/pypdf/_cmap.py", line 313, in process_cm_line
    multiline_rg = parse_bfrange(line, map_dict, int_entry, multiline_rg)
File "/usr/local/lib/python3.10/dist-packages/pypdf/_cmap.py", line 372, in parse_bfrange
    ] = unhexlify(fmt2 % c).decode("utf-16-be", "surrogatepass")
binascii.Error: Odd-length string

Can you explain, please? Thank you so much in advance!
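Until a library-side fix is available for a given file, one possible workaround is to extract text page by page and skip the pages that raise. This is a sketch under stated assumptions, not the fix applied in the library: `extract_text_per_page` and `FakePage` are hypothetical names, and the helper only relies on each page object having an `extract_text()` method.

```python
import binascii
from typing import Iterable, List


def extract_text_per_page(pages: Iterable) -> List[str]:
    """Return one string per page, using "" for pages whose cmap fails to parse.

    Hypothetical helper: `pages` is any iterable of objects with an
    `extract_text()` method, such as the pages of a PDF reader object.
    """
    texts = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except binascii.Error as exc:
            # Keep going instead of letting one bad page abort the whole load.
            print(f"page {index}: skipped ({exc})")
            texts.append("")
    return texts


class FakePage:
    """Stand-in page used only to demonstrate the helper without a PDF file."""

    def __init__(self, text=None):
        self._text = text

    def extract_text(self):
        if self._text is None:
            raise binascii.Error("Odd-length string")
        return self._text


print(extract_text_per_page([FakePage("hello"), FakePage(), FakePage("world")]))
```

With a real file, the pages of the reader object would be passed in place of the stand-in objects. Skipped pages lose their text, so this only keeps a batch job running; it does not repair the extraction.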

stefan6419846 · 2024-02-05T08:16:58Z

See #2216 for the current state.

MatteoRiva95 · 2024-02-05T08:30:51Z

@stefan6419846 thank you for your reply. I have already commented on the issue you linked, but the author could not give me a real solution to the problem. I wondered if someone here could give me a detailed one.

stefan6419846 · 2024-02-05T08:47:58Z

Just commenting on multiple threads about the same issue will most likely not resolve your problem. In the case of #1370, the problem was already solved by a further change; see the details in #1372. You are always invited to further debug/analyze #2216 and propose a corresponding PR to improve pypdf.

pubpub-zz mentioned this issue Sep 27, 2022

ROB: Cope with cmap from #1370 #1372

Merged

MartinThoma closed this as completed in #1372 Sep 28, 2022

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 28, 2022

TST : adding test for py-pdf#1370

d600e99

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 28, 2022

TST : adding test for py-pdf#1370

f8797fc

MartinThoma pushed a commit that referenced this issue Sep 29, 2022

TST: Adding test for #1370 (#1375)

9d870a2
