
Unhandled Exception in the library crashes code binascii.Error: Odd-length string #1370

Closed
change-is-constant opened this issue Sep 27, 2022 · 7 comments · Fixed by #1372

change-is-constant commented Sep 27, 2022

What happened? What were you trying to achieve?
I am trying to read text from a PDF file, but I get an error that I don't think is caused by my code. The code crashes for the particular PDF linked below.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.22000-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.9

Code + PDF

This is a minimal, complete example that shows the issue:

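    import PyPDF2

    def page_has_keywords(afile, keywords):  # hypothetical wrapper; the original was a bare function body
        if afile.endswith(".pdf"):
            with open(afile, "rb") as fh:  # close the file once the text is extracted
                pdfReader = PyPDF2.PdfFileReader(fh, strict=False)
                texts = pdfReader.getPage(0).extract_text()  # the ERROR happens on this line
            # True only if every keyword occurs in the first page's text
            return all(kw in texts for kw in keywords)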

PDF file: https://d4duas3s44z1s.cloudfront.net/Dat2020Live/document/test/question/Question_Report_299.pdf
Please don't add this file to your tests!

Traceback

This is the complete Traceback I see:

File "path\to\there\Python\Python310\lib\site-packages\PyPDF2\_cmap.py", line 291, in parse_bfrange
    unhexlify(fmt % a).decode(
binascii.Error: Odd-length string
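
For readers unfamiliar with this error: `binascii.unhexlify` only accepts an even number of hex digits, so a value formatted to an odd width fails in exactly this way. The sketch below reproduces the failure in isolation and shows one possible guard (rounding the width up to an even number). The helper names are hypothetical, and this is not the fix merged in #1372.

```python
from binascii import Error as HexError, unhexlify


def decode_hex_code(value: int, width: int) -> str:
    """Render `value` as `width` hex digits, then decode the bytes as UTF-16-BE."""
    hex_digits = f"{value:0{width}X}"
    return unhexlify(hex_digits).decode("utf-16-be", "surrogatepass")


def decode_hex_code_padded(value: int, width: int) -> str:
    """Same as above, but round the width up to the next even number first."""
    return decode_hex_code(value, width + width % 2)


def odd_width_fails(value: int, width: int) -> bool:
    """Return True if decoding with this width raises binascii.Error."""
    try:
        decode_hex_code(value, width)
    except HexError:
        return True
    return False
```

With a width of 3, `decode_hex_code(0x41, 3)` builds the string `041` and raises `binascii.Error`, whereas `decode_hex_code_padded(0x41, 3)` builds `0041` and decodes it.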
@change-is-constant
Author

Also, I am getting these unwanted log messages on the console for some files, even with strict=False:

Multiple definitions in dictionary at byte 0x83b for key /Filter
Multiple definitions in dictionary at byte 0x358a for key /Filter
Multiple definitions in dictionary at byte 0x3b57 for key /Filter
Multiple definitions in dictionary at byte 0x42fe for key /Filter
Multiple definitions in dictionary at byte 0x4709 for key /Filter
Multiple definitions in dictionary at byte 0x4b08 for key /Filter

@pubpub-zz
Collaborator

Thanks for this example.
Your file shows some very odd but acceptable cmap definitions. I've produced a fix that improves robustness.
I've also noted that the non-Roman characters are not read properly, but they cannot be copied using Acrobat Reader either: the 'translation table' within the file is not correct.

About the logs: I have not been able to reproduce them after the fix (I did not try before it). Can you confirm?
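
For context, a `bfrange` entry in a ToUnicode cmap has the form `<lo> <hi> <dst>` and maps the character codes `lo` to `hi` onto consecutive Unicode values starting at `dst`. The sketch below parses only that simple form and pads the destination to whole UTF-16 code units (4 hex digits each), so an odd number of digits in the file does not break decoding. The function name is hypothetical, the array form `<lo> <hi> [<a> <b> ...]` is not handled, and this is not pypdf's actual implementation.

```python
import re
from binascii import unhexlify

HEX_TOKEN = re.compile(r"<([0-9A-Fa-f]+)>")


def parse_simple_bfrange(line: str) -> dict[int, str]:
    """Map each source code in `<lo> <hi> <dst>` to its Unicode string."""
    lo_hex, hi_hex, dst_hex = HEX_TOKEN.findall(line)
    lo, hi, dst = int(lo_hex, 16), int(hi_hex, 16), int(dst_hex, 16)
    # Round the digit count up to a multiple of 4 (one UTF-16 code unit).
    width = max(4, -(-len(dst_hex) // 4) * 4)
    mapping = {}
    for offset in range(hi - lo + 1):
        hex_digits = f"{dst + offset:0{width}X}"
        mapping[lo + offset] = unhexlify(hex_digits).decode(
            "utf-16-be", "surrogatepass"
        )
    return mapping
```

For example, `parse_simple_bfrange("<0001> <0003> <0041>")` maps codes 1, 2 and 3 to `A`, `B` and `C`.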

@pubpub-zz
Collaborator

Adding a test sample with similar results:
cmap1370.pdf

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 28, 2022
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 28, 2022
MartinThoma pushed a commit that referenced this issue Sep 29, 2022
@MatteoRiva95

Hello everyone, I know this is an old topic, but I am facing the same issue (binascii.Error: Odd-length string + Multiple definitions in dictionary at byte 0x3f42b for key /MediaBox) and I do not understand how it was fixed. I am using PyPDFDirectoryLoader from LangChain and the exact error is:

File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py", line 247, in load
    raise e
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py", line 239, in load
    sub_docs = loader.load()
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py", line 162, in load
    return list(self.lazy_load())
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py", line 172, in lazy_load
    yield from self.parser.parse(blob)
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/base.py", line 102, in parse
    return list(self.lazy_parse(blob))
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/parsers/pdf.py", line 95, in lazy_parse
    yield from [
File "/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/parsers/pdf.py", line 97, in
    page_content=page.extract_text()
File "/usr/local/lib/python3.10/dist-packages/pypdf/_page.py", line 2076, in extract_text
    return self._extract_text(
File "/usr/local/lib/python3.10/dist-packages/pypdf/_page.py", line 1588, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
File "/usr/local/lib/python3.10/dist-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
File "/usr/local/lib/python3.10/dist-packages/pypdf/_cmap.py", line 58, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
File "/usr/local/lib/python3.10/dist-packages/pypdf/_cmap.py", line 237, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
File "/usr/local/lib/python3.10/dist-packages/pypdf/_cmap.py", line 313, in process_cm_line
    multiline_rg = parse_bfrange(line, map_dict, int_entry, multiline_rg)
File "/usr/local/lib/python3.10/dist-packages/pypdf/_cmap.py", line 372, in parse_bfrange
    ] = unhexlify(fmt2 % c).decode("utf-16-be", "surrogatepass")
binascii.Error: Odd-length string

Can you explain, please? Thank you so much in advance!
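
Until the underlying parsing problem is resolved, one possible stopgap is to extract text page by page and skip any page that raises this error. The sketch below assumes only that each page object has an `extract_text()` method, as pypdf pages do. The function name is hypothetical, and the approach loses the text of the failing pages instead of fixing the cause.

```python
import binascii
from typing import Iterable, List, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method, such as a pypdf page."""

    def extract_text(self) -> str: ...


def extract_text_tolerant(pages: Iterable[PageLike]) -> List[str]:
    """Extract text page by page; use "" for pages whose cmap cannot be decoded."""
    texts = []
    for number, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except binascii.Error as exc:
            # Report the failing page and keep going with the rest.
            print(f"page {number}: text extraction failed ({exc})")
            texts.append("")
    return texts
```

With pypdf this could be called as `extract_text_tolerant(PdfReader(path).pages)`. A loader that calls `extract_text()` internally, as in the traceback above, would need to be replaced or wrapped for this to take effect.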

@stefan6419846
Collaborator

See #2216 for the current state.

@MatteoRiva95

@stefan6419846 thank you for your reply. I have already commented on the blog you posted, but the author could not give me a real solution to the problem. I wondered if someone here could give me a detailed one.

@stefan6419846
Collaborator

Just commenting on multiple threads about the same problem will most likely not resolve your issue. In the case of #1370, the problem could already be solved by a further change; see the details in #1372. You are always welcome to debug/analyze #2216 further and propose a corresponding PR to improve pypdf.
