PDFMinerParser Bug: Failed to Recognize Filter Type Error #27153

moyueheng · 2024-10-06T18:37:45Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

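from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
from langchain_core.documents.base import Blob

pdf_miner_parser = PDFMinerParser(extract_images=True)
# Open in binary mode: a PDF is binary data, not text.
with open("examp.pdf", "rb") as f:
    blob = Blob(data=f.read())
documents = pdf_miner_parser.parse(blob)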

Error Message and Stack Trace (if applicable)

  File "agents/pdf2md/pdf2md_agent.py", line 30, in _process_file
    documents = pdf_miner_parser.parse(blob)
  File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 127, in parse
    return list(self.lazy_parse(blob))
  File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 215, in lazy_parse
    content = text_io.getvalue() + self._extract_images_from_page(
  File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 239, in _extract_images_from_page
    if img.stream["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
AttributeError: 'list' object has no attribute 'name'

Description

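The `Filter` entry of an image stream can be a list of filters rather than a single filter object, so `img.stream["Filter"].name` raises the `AttributeError` shown in the traceback. I think I can fix this bug.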
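For illustration, here is a minimal sketch of one way to avoid the `AttributeError`: accept either a single filter or a list of filters, and collect the names before any membership check. `filter_names` and `FakeLiteral` are hypothetical names introduced only for this example; they are not part of LangChain or pdfminer, and `FakeLiteral` merely stands in for any object exposing a `.name` attribute, as the traceback implies. This is not necessarily how the merged fix is implemented.

```python
from typing import Any, List


def filter_names(filter_value: Any) -> List[str]:
    """Return the filter names for a single filter or a list of filters.

    Hypothetical helper: handles objects with a `.name` attribute,
    plain strings such as "/DCTDecode", and lists or tuples of either.
    """
    if filter_value is None:
        return []
    # Wrap a single value so both cases go through the same loop.
    if isinstance(filter_value, (list, tuple)):
        values = list(filter_value)
    else:
        values = [filter_value]

    names = []
    for value in values:
        # Use `.name` when present, otherwise fall back to the value itself.
        name = getattr(value, "name", value)
        # Defensive: decode the name if it is stored as bytes.
        if isinstance(name, bytes):
            name = name.decode("ascii", errors="replace")
        names.append(str(name).lstrip("/"))
    return names


class FakeLiteral:
    """Stand-in (assumption) for a parser object that exposes `.name`."""

    def __init__(self, name: Any) -> None:
        self.name = name


if __name__ == "__main__":
    # Illustrative set only; not the contents of any LangChain constant.
    example_filters = {"DCTDecode"}
    stream_filter = [FakeLiteral("FlateDecode"), FakeLiteral("DCTDecode")]
    # Membership is checked against every name instead of one `.name`.
    print(any(n in example_filters for n in filter_names(stream_filter)))
```

With a helper like this, the check in `_extract_images_from_page` could test each returned name against the filter list instead of assuming a single object.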

System Info

python -m langchain_core.sys_info

System Information

OS: Linux
OS Version: #187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023
Python Version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]

Package Information

langchain_core: 0.3.9
langchain: 0.3.2
langchain_community: 0.3.1
langsmith: 0.1.131
langchain_openai: 0.2.2
langchain_text_splitters: 0.3.0
langserve: 0.3.0

Optional packages not installed

langgraph

Other Dependencies

aiohttp: 3.10.9
async-timeout: 4.0.3
dataclasses-json: 0.6.7
fastapi: 0.115.0
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.51.0
orjson: 3.10.7
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.5.2
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.35
sse-starlette: 1.8.2
tenacity: 8.5.0
tiktoken: 0.8.0
typing-extensions: 4.12.2


- **PR title**: "community: fix PDF Filter Type Error"
- **Description:** fix PDF Filter Type Error
- **Issue:** #27153
- **Dependencies:** no
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified.

Co-authored-by: Erick Friis

dosubot · 2025-01-05T16:07:09Z

Hi, @moyueheng. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:

The issue involves a bug in the PDFMinerParser component of LangChain.
An AttributeError occurs due to incorrect access of a 'name' attribute on a list object.
The error arises when the parser fails to identify the filter type during PDF processing.
You have provided example code and a PDF file to demonstrate the issue.
There have been no further comments or activity on this issue.

Next Steps:

Please confirm if this issue is still relevant with the latest version of LangChain. If so, you can keep the discussion open by commenting here.
If there is no response, the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

…from PDF. (#29378)

- **Description:** Fixes the failure to recognize images in the `_extract_images_from_page()` method when reading `xObject[obj]["/Filter"]`, whose value can be either a string or a list of strings. Also resolves the bug where vectorization by Faiss fails with `IndexError: list index out of range` because image extraction fails on a PDF containing only images.

  ![69a60f3f6bd474641b9126d74bb18f7e](https://github.com/user-attachments/assets/dc9e098d-2862-49f7-93b0-00f1056727dc)
- **Issue:** Fixes #15227, #22892, #26652, #27153. Related issue: #7067.
- **Dependencies:** None
- **Twitter handle:** None

Co-authored-by: Chester Curme

dosubot bot added the 🤖:bug label Oct 6, 2024

moyueheng mentioned this issue Oct 6, 2024

fix: 🐛 PDF Filter Type Error #27154

Merged

dosubot bot added the stale label Jan 5, 2025

dosubot bot closed this as not planned Jan 12, 2025

dosubot bot removed the stale label Jan 12, 2025

jiangtongxueya mentioned this issue Jan 23, 2025

community: Fix the problem of error reporting when OCR extracts text from PDF. #29378

Merged
