I’m trying to extract text from a PDF issued by the Japanese government using the PyPDFLoader from the langchain_community.document_loaders.pdf module, and then feed it into an LLM.
However, I keep encountering a TypeError when reading the PDF.
You can find the PyPDFLoader documentation here.
I’ve also tried using pypdf on its own, but the same error occurs.
Below, I’ve included my environment information, relevant code snippets, and the full traceback.
Environment
Which environment were you using when you encountered the problem?
from langchain_community.document_loaders.pdf import PyPDFLoader
PDF_URL = "https://www.soumu.go.jp/main_content/000981511.pdf"
loader = PyPDFLoader(PDF_URL)
doc = loader.load()  # TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
print(doc)
This is a minimal example using pypdf only:
from pypdf import PdfReader
reader = PdfReader('./pdf/000981511.pdf')
pages = reader.pages
pages[0].extract_text()  # TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
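To narrow down which pages trigger the error, a per-page wrapper along the following lines may help. This is only a diagnostic sketch, not a fix: `extract_text_per_page` is a hypothetical helper name, and the only assumption about the pages is that each one exposes `extract_text()`, as pypdf page objects do.

```python
from typing import Iterable, List, Optional, Tuple


def extract_text_per_page(
    pages: Iterable,
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Call extract_text() on each page, recording failures instead of aborting.

    Returns a list of (page_index, text_or_None, error_message_or_None).
    """
    results = []
    for index, page in enumerate(pages):
        try:
            results.append((index, page.extract_text(), None))
        except TypeError as exc:
            # Same exception type as in the report; keep the message for triage.
            results.append((index, None, str(exc)))
    return results


# Possible usage with pypdf (not executed here):
# from pypdf import PdfReader
# reader = PdfReader("./pdf/000981511.pdf")
# for index, text, error in extract_text_per_page(reader.pages):
#     print(index, ("FAILED: " + error) if error else len(text))
```

Pages that fail are reported with their index and the exception message, so the remaining pages can still be passed on to the LLM while the failing ones are investigated.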
Traceback
This is the complete traceback I see (using the PDF loader from langchain-community):
Traceback (most recent call last):
  File "/tmp/ipykernel_46282/1470047492.py", line 18, in <module>
    doc = loader.load()
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 31, in load
    return list(self.lazy_load())
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/pdf.py", line 257, in lazy_load
    yield from self.parser.parse(blob)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 127, in parse
    return list(self.lazy_parse(blob))
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 125, in lazy_parse
    yield from [
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 127, in <listcomp>
    page_content=_extract_text_from_page(page=page)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 117, in _extract_text_from_page
    return page.extract_text(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 60, in build_char_map_from_dict
    half_space_width = compute_space_width(font_width_map, space_key_char) / 2.0
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
This is the complete traceback I see (using pypdf only):
Traceback (most recent call last):
  File "/tmp/ipykernel_50010/3692206826.py", line 10, in <module>
    page.extract_text()
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/home/hekuta/works/testArea/.venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 60, in build_char_map_from_dict
    half_space_width = compute_space_width(font_width_map, space_key_char) / 2.0
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
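Both tracebacks end at the same expression, where the return value of `compute_space_width(...)` is divided by `2.0`. The message suggests that this value is still an unresolved indirect reference rather than a plain number. The sketch below reproduces that failure mode in isolation. `FakeIndirectObject` and `as_number` are hypothetical stand-ins written for illustration, not pypdf code; the only pypdf behaviour assumed is that an indirect object can be resolved with `get_object()`.

```python
class FakeIndirectObject:
    """Stand-in for an unresolved PDF indirect reference (hypothetical)."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        # Return the object this reference points to.
        return self._target


def as_number(value) -> float:
    """Resolve a reference-like value, if needed, before doing arithmetic."""
    if hasattr(value, "get_object"):
        value = value.get_object()
    return float(value)


width = FakeIndirectObject(500)

try:
    width / 2.0  # the reference itself does not support division
except TypeError as exc:
    error_message = str(exc)

# Resolving the reference first makes the division well-defined.
half_space_width = as_number(width) / 2.0
```

Whether resolving the reference at this point is the appropriate fix inside pypdf is an open question; the sketch only shows why dividing the unresolved object fails.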
HEKUCHAN changed the title from "TypeError in PyPDF when extracting text from PDF: Unsupported operand type(s) for '/' (IndirectObject and float)" to "TypeError when extracting text from PDF: Unsupported operand type(s) for '/' (IndirectObject and float)" on Dec 26, 2024
Hello, My name is Heitor Hirose.
Code + PDF
Target PDF : https://www.soumu.go.jp/main_content/000973465.pdf