PDFMinerParser Bug: Failed to Recognize Filter Type Error #27153

Closed
5 tasks done
moyueheng opened this issue Oct 6, 2024 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

Contributor

moyueheng commented Oct 6, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

example.pdf

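from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
from langchain_core.documents.base import Blob

pdf_miner_parser = PDFMinerParser(extract_images=True)
# Read the PDF as bytes; text mode cannot decode binary PDF data.
with open("example.pdf", "rb") as f:
    blob = Blob(data=f.read())
    pdf_miner_parser.parse(blob)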

Error Message and Stack Trace (if applicable)

  File "agents/pdf2md/pdf2md_agent.py", line 30, in _process_file
    documents = pdf_miner_parser.parse(blob)
  File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 127, in parse
    return list(self.lazy_parse(blob))
  File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 215, in lazy_parse
    content = text_io.getvalue() + self._extract_images_from_page(
  File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 239, in _extract_images_from_page
    if img.stream["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
AttributeError: 'list' object has no attribute 'name'

Description

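With extract_images=True, PDFMinerParser reads img.stream["Filter"].name as if the Filter entry were always a single name. For this PDF the entry is a list, so the attribute access raises AttributeError. I think I can fix this bug.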
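
The stack trace above points at a `Filter` entry that is a list rather than a single name object. A minimal sketch of one way to handle both shapes follows. It is not the fix that was merged: `Name` is a hypothetical stand-in for whatever object exposes `.name`, and `LOSSLESS_FILTERS` is an illustrative subset, since the contents of `_PDF_FILTER_WITHOUT_LOSS` are not shown in this issue.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Name:
    """Hypothetical stand-in for a PDF name object that exposes `.name`."""

    name: str


# Illustrative subset only (assumption); not the real constant's contents.
LOSSLESS_FILTERS = {"FlateDecode", "LZWDecode"}


def filter_names(filter_value: Any) -> List[str]:
    """Return filter names whether the entry is one name or a list of names."""
    if filter_value is None:
        return []
    if isinstance(filter_value, (list, tuple)):
        values = list(filter_value)
    else:
        values = [filter_value]
    # Objects with `.name` contribute that attribute; plain strings pass through.
    return [getattr(v, "name", str(v)) for v in values]


def uses_lossless_filter(filter_value: Any) -> bool:
    """True if any listed filter is in the (assumed) lossless set."""
    return any(name in LOSSLESS_FILTERS for name in filter_names(filter_value))


print(filter_names(Name("FlateDecode")))
print(filter_names([Name("FlateDecode"), Name("DCTDecode")]))
```

Normalizing to a list first keeps the membership check in one place. Whether a chain of several filters should be judged by any entry, the first, or the last is a design choice this sketch leaves open.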

System Info

python -m langchain_core.sys_info

System Information

OS: Linux
OS Version: #187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023
Python Version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]

Package Information

langchain_core: 0.3.9
langchain: 0.3.2
langchain_community: 0.3.1
langsmith: 0.1.131
langchain_openai: 0.2.2
langchain_text_splitters: 0.3.0
langserve: 0.3.0

Optional packages not installed

langgraph

Other Dependencies

aiohttp: 3.10.9
async-timeout: 4.0.3
dataclasses-json: 0.6.7
fastapi: 0.115.0
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.51.0
orjson: 3.10.7
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.5.2
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.35
sse-starlette: 1.8.2
tenacity: 8.5.0
tiktoken: 0.8.0
typing-extensions: 4.12.2

@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Oct 6, 2024
efriis added a commit that referenced this issue Dec 13, 2024
 **PR title**: "community: fix PDF Filter Type Error"


  - **Description:** Fix PDF Filter Type Error
  - **Issue:** Fixes #27153
  - **Dependencies:** None

- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

---------

Co-authored-by: Erick Friis <[email protected]>

dosubot bot commented Jan 5, 2025

Hi, @moyueheng. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • The issue involves a bug in the PDFMinerParser component of LangChain.
  • An AttributeError occurs due to incorrect access of a 'name' attribute on a list object.
  • The error arises when the parser fails to identify the filter type during PDF processing.
  • You have provided example code and a PDF file to demonstrate the issue.
  • There have been no further comments or activity on this issue.

Next Steps:

  • Please confirm if this issue is still relevant with the latest version of LangChain. If so, you can keep the discussion open by commenting here.
  • If there is no response, the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jan 5, 2025
@dosubot dosubot bot closed this as not planned Jan 12, 2025
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jan 12, 2025
ccurme added a commit that referenced this issue Jan 23, 2025
…from PDF. (#29378)

- **Description:** Fixes image extraction in the ```_extract_images_from_page()```
method, which failed because the value of ```xObject[obj]["/Filter"]``` can be
either a string or a list of strings. It also resolves a bug where
vectorization with Faiss fails with ```IndexError: list index out of range```
because image extraction fails on a PDF containing only images.

![69a60f3f6bd474641b9126d74bb18f7e](https://github.com/user-attachments/assets/dc9e098d-2862-49f7-93b0-00f1056727dc)

- **Issue:**
    Fixes the following issues:
    [#15227](#15227), [#22892](#22892), [#26652](#26652), [#27153](#27153)
    Related issues:
    [#7067](#7067)

- **Dependencies:** None
- **Twitter handle:** None

---------

Co-authored-by: Chester Curme <[email protected]>
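
The commit description above mentions that vectorization fails with `IndexError: list index out of range` when nothing can be extracted from an image-only PDF. As a rough illustration of a caller-side guard (not part of the referenced commit), the sketch below drops empty pages and fails early with a clearer message. `Doc` is a hypothetical stand-in for a loaded page, and `texts_for_indexing` is an invented helper name.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded page: only its extracted text."""

    page_content: str


def non_empty_texts(docs: Iterable[Doc]) -> List[str]:
    """Keep only pages whose text is non-empty after stripping whitespace."""
    return [d.page_content for d in docs if d.page_content.strip()]


def texts_for_indexing(docs: Iterable[Doc]) -> List[str]:
    """Raise a descriptive error instead of letting an empty list reach the index."""
    texts = non_empty_texts(docs)
    if not texts:
        raise ValueError("No text was extracted; nothing to vectorize.")
    return texts


pages = [Doc(""), Doc("  \n"), Doc("Figure 1: system overview")]
print(texts_for_indexing(pages))
```

Checking before building the index turns an opaque downstream error into one that names the actual cause, which is an extraction step that produced no text.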