TypeError when extracting text from PDF: Unsupported operand type(s) for '/' (IndirectObject and float) #3020

Closed
HEKUCHAN opened this issue Dec 26, 2024 · 1 comment

HEKUCHAN commented Dec 26, 2024

Hello, my name is Heitor Hirose.

I’m trying to extract text from a PDF issued by the Japanese government using the PyPDFLoader from the langchain_community.document_loaders.pdf module, and then feed it into an LLM.

However, I keep encountering a TypeError when reading the PDF.
You can find the PyPDFLoader documentation here.

I’ve also tried using pypdf on its own, but the same error occurs.

Below, I’ve included my environment information, relevant code snippets, and the full traceback.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

$ python -c "import langchain_community;print(langchain_community.__version__)"
0.3.13

Code + PDF

Target PDF : https://www.soumu.go.jp/main_content/000973465.pdf

This is the code using langchain-community:

from langchain_community.document_loaders.pdf import PyPDFLoader

PDF_URL = "https://www.soumu.go.jp/main_content/000981511.pdf"

loader = PyPDFLoader(PDF_URL)
doc = loader.load() # TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
print(doc)

This is a minimal reproduction using pypdf only:

from pypdf import PdfReader

reader = PdfReader('./pdf/000981511.pdf')
pages = reader.pages

pages[0].extract_text() # TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'

Traceback

This is the complete traceback I see (using the PDF loader from langchain-community):

Traceback (most recent call last):
  File "/tmp/ipykernel_46282/1470047492.py", line 18, in <module>
    doc = loader.load()
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 31, in load
    return list(self.lazy_load())
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/pdf.py", line 257, in lazy_load
    yield from self.parser.parse(blob)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 127, in parse
    return list(self.lazy_parse(blob))
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 125, in lazy_parse
    yield from [
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 127, in <listcomp>
    page_content=_extract_text_from_page(page=page)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 117, in _extract_text_from_page
    return page.extract_text(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 60, in build_char_map_from_dict
    half_space_width = compute_space_width(font_width_map, space_key_char) / 2.0
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'

This is the complete traceback I see (using pypdf only):

Traceback (most recent call last):
  File "/tmp/ipykernel_50010/3692206826.py", line 10, in <module>
    page.extract_text()
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 60, in build_char_map_from_dict
    half_space_width = compute_space_width(font_width_map, space_key_char) / 2.0
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
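
Both tracebacks end in the same pypdf frame, so the failure is raised per page inside `extract_text()`. As a stopgap only, and not a fix for the underlying bug, a pipeline could catch the `TypeError` page by page so that one failing page does not abort the whole document. The sketch below assumes that approach; the helper names `extract_pages_tolerantly` and `extract_pdf` are hypothetical and not part of pypdf or langchain-community.

```python
from typing import Iterable, List, Optional, Tuple


def extract_pages_tolerantly(pages: Iterable) -> List[Tuple[int, Optional[str]]]:
    """Call extract_text() on each page; record None for pages that raise TypeError.

    Hypothetical helper: works with any objects exposing extract_text(),
    such as the pages of a pypdf PdfReader.
    """
    results: List[Tuple[int, Optional[str]]] = []
    for index, page in enumerate(pages):
        try:
            results.append((index, page.extract_text()))
        except TypeError as exc:
            # Report the failure and keep going instead of aborting the document.
            print(f"page {index}: text extraction failed ({exc})")
            results.append((index, None))
    return results


def extract_pdf(path: str) -> List[Tuple[int, Optional[str]]]:
    """Open a PDF with pypdf and extract text from every page, tolerating failures."""
    # Imported lazily so the helper above can be used without pypdf installed.
    from pypdf import PdfReader

    return extract_pages_tolerantly(PdfReader(path).pages)
```

Calling `extract_pdf('./pdf/000981511.pdf')` would then return a list of `(page_index, text)` pairs, with `None` in place of the text for any page that hits the error.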

HEKUCHAN changed the title from "TypeError in PyPDF when extracting text from PDF: Unsupported operand type(s) for '/' (IndirectObject and float)" to "TypeError when extracting text from PDF: Unsupported operand type(s) for '/' (IndirectObject and float)" on Dec 26, 2024

HEKUCHAN commented Dec 26, 2024

I found that this issue has already been resolved in #2967, so I will close it.
Thank you for your contributions to the OSS project.
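
For anyone who lands here with the same traceback: this thread does not state which pypdf release contains the fix from #2967, so the sketch below only reports the locally installed versions for comparison against the pypdf changelog. It uses the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# The environment in this report had pypdf 5.1.0 and langchain-community 0.3.13.
for name in ("pypdf", "langchain-community"):
    print(f"{name}: {installed_version(name)}")
```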
