TypeError when extracting text from PDF: Unsupported operand type(s) for '/' (IndirectObject and float) #3020

HEKUCHAN · 2024-12-26T08:15:34Z

Hello, my name is Heitor Hirose.

I’m trying to extract text from a PDF issued by the Japanese government using the PyPDFLoader from the langchain_community.document_loaders.pdf module, and then feed it into an LLM.

However, I keep encountering a TypeError when reading the PDF.
You can find the PyPDFLoader documentation here.

I’ve also tried using pypdf on its own, but the same error occurs.

Below, I’ve included my environment information, relevant code snippets, and the full traceback.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

$ python -c "import langchain_community;print(langchain_community.__version__)"
0.3.13

Code + PDF

Target PDF : https://www.soumu.go.jp/main_content/000973465.pdf

This is the code using langchain-community:

from langchain_community.document_loaders.pdf import PyPDFLoader

PDF_URL = "https://www.soumu.go.jp/main_content/000981511.pdf"

loader = PyPDFLoader(PDF_URL)
doc = loader.load() # TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
print(doc)

This is a minimal example using pypdf only:

from pypdf import PdfReader

reader = PdfReader('./pdf/000981511.pdf')
pages = reader.pages

pages[0].extract_text() # TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
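
The snippet above only reads the first page. To see which pages of a document raise the error, each page can be tried separately. This is a minimal sketch, not part of pypdf: `extract_text_per_page` is a hypothetical helper name, and it only assumes that each page object has an `extract_text()` method, as pypdf pages do.

```python
from typing import Iterable, List, Optional, Tuple


def extract_text_per_page(
    pages: Iterable,
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Try extract_text() on each page; record the error message instead of stopping."""
    results = []
    for index, page in enumerate(pages):
        try:
            results.append((index, page.extract_text(), None))
        except TypeError as exc:
            # Only the TypeError from this report is caught; other errors propagate.
            results.append((index, None, str(exc)))
    return results


# Usage with pypdf (path taken from the snippet above):
#     from pypdf import PdfReader
#     reader = PdfReader("./pdf/000981511.pdf")
#     for index, text, error in extract_text_per_page(reader.pages):
#         print(index, "FAILED: " + error if error else "ok")
```

Each result is a tuple of page index, extracted text (or `None`), and error message (or `None`), so failing pages can be listed without losing the text of the pages that do work.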

Traceback

This is the complete traceback I see (using the PDF loader from langchain-community):

Traceback (most recent call last):
  File "/tmp/ipykernel_46282/1470047492.py", line 18, in <module>
    doc = loader.load()
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 31, in load
    return list(self.lazy_load())
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/pdf.py", line 257, in lazy_load
    yield from self.parser.parse(blob)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 127, in parse
    return list(self.lazy_parse(blob))
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 125, in lazy_parse
    yield from [
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 127, in <listcomp>
    page_content=_extract_text_from_page(page=page)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 117, in _extract_text_from_page
    return page.extract_text(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 60, in build_char_map_from_dict
    half_space_width = compute_space_width(font_width_map, space_key_char) / 2.0
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'

This is the complete traceback I see (using pypdf only):

Traceback (most recent call last):
  File "/tmp/ipykernel_50010/3692206826.py", line 10, in <module>
    page.extract_text()
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 60, in build_char_map_from_dict
    half_space_width = compute_space_width(font_width_map, space_key_char) / 2.0
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
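
Both tracebacks end at the same line: the value returned by `compute_space_width` is an `IndirectObject` (a reference to another object in the PDF) rather than a number, so dividing it by `2.0` fails. The sketch below illustrates that failure mode and one way to avoid it by resolving the reference first. It is an illustration under assumptions, not necessarily how pypdf handles this internally: `resolve_number` and `FakeIndirect` are hypothetical names, and the only pypdf behavior relied on is that `IndirectObject` provides `get_object()`.

```python
def resolve_number(value) -> float:
    """Return value as a float, following an indirect reference first if there is one."""
    # pypdf's IndirectObject exposes get_object(); plain Python numbers do not.
    if hasattr(value, "get_object"):
        value = value.get_object()
    return float(value)


class FakeIndirect:
    """Stand-in for an indirect reference, so the sketch runs without a PDF."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        return self._target


width = FakeIndirect(500)
try:
    width / 2.0  # same kind of failure as in the traceback
except TypeError as exc:
    print(exc)

half_space_width = resolve_number(width) / 2.0  # 250.0
```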


HEKUCHAN · 2024-12-26T10:22:16Z

I found that this issue has already been resolved (see #2967), so I will close it.
Thank you for your contributions to the OSS project.

HEKUCHAN changed the title ~~TypeError in PyPDF when extracting text from PDF: Unsupported operand type(s) for '/' (IndirectObject and float)~~ TypeError when extracting text from PDF: Unsupported operand type(s) for '/' (IndirectObject and float) Dec 26, 2024

HEKUCHAN closed this as completed Dec 26, 2024
