PDFMinerParser Bug: Failed to Recognize Filter Type Error #27153
Labels
🤖:bug
Related to a bug, vulnerability, unexpected error with an existing feature
Comments
dosubot (bot) added the 🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature) label on Oct 6, 2024
efriis added a commit that referenced this issue on Dec 13, 2024
Thank you for contributing to LangChain!

- **PR title**: "community: fix PDF Filter Type Error"
- **Description:** fix PDF Filter Type Error
- **Issue:** the issue #27153 it fixes
- **Dependencies:** no
- **Twitter handle:** if your PR gets announced, and you'd like a mention, we'll gladly shout you out!
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in langchain.

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: Erick Friis
Hi, @moyueheng. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.
Issue Summary:
Next Steps:
Thank you for your understanding and contribution!
dosubot (bot) added the stale (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) label on Jan 5, 2025
dosubot (bot) removed the stale (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) label on Jan 12, 2025
ccurme added a commit that referenced this issue on Jan 23, 2025
…from PDF. (#29378)

- **Description:** Fixes image extraction in the `_extract_images_from_page()` method, which failed to recognize images because the value of `xObject[obj]["/Filter"]` can be either a string or a list of strings. It also resolves the bug where vectorization by Faiss fails with `IndexError: list index out of range` because image extraction failed on a PDF containing only images. ![69a60f3f6bd474641b9126d74bb18f7e](https://github.com/user-attachments/assets/dc9e098d-2862-49f7-93b0-00f1056727dc)
- **Issue:** Fixes the following issues: #15227, #22892, #26652, #27153. Related issues: #7067.
- **Dependencies:** None
- **Twitter handle:** None

---------

Co-authored-by: Chester Curme
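The Faiss symptom mentioned in this commit message can be illustrated with a small guard. This is only a sketch under an assumption: that the `IndexError: list index out of range` arises because a PDF containing only images yields no extracted text, so an empty list of texts reaches the vector store. The helper names `non_empty_texts` and `texts_for_indexing` are hypothetical and are not part of LangChain.

```python
from typing import Iterable, List


def non_empty_texts(page_contents: Iterable[str]) -> List[str]:
    """Drop pages whose extracted content is empty or whitespace only."""
    return [text for text in page_contents if text and text.strip()]


def texts_for_indexing(page_contents: Iterable[str]) -> List[str]:
    """Return the texts that are safe to embed, or fail with a clear message.

    Assumption: passing an empty list on to the vector store is what
    surfaces as the opaque IndexError, so we raise a descriptive error
    before that point instead.
    """
    texts = non_empty_texts(page_contents)
    if not texts:
        raise ValueError(
            "No text was extracted from the PDF; "
            "check whether image extraction succeeded."
        )
    return texts
```

Calling `texts_for_indexing` on the parsed page contents before building the index turns the failure into an explicit message about extraction, rather than an index error deep inside the vector store.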
Checked other resources
Example Code
example.pdf
Error Message and Stack Trace (if applicable)
  File "agents/pdf2md/pdf2md_agent.py", line 30, in _process_file
    documents = pdf_miner_parser.parse(blob)
  File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 127, in parse
    return list(self.lazy_parse(blob))
  File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 215, in lazy_parse
    content = text_io.getvalue() + self._extract_images_from_page(
  File "/share_data/nfs_share/myh_dev/02-DP/md-is-all-you-need/.venv/lib/python3.10/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 239, in _extract_images_from_page
    if img.stream["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
AttributeError: 'list' object has no attribute 'name'
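The traceback shows `.name` being read from a `Filter` entry that is a list rather than a single name object. One possible way to handle both shapes is sketched below; it is not the fix that was merged. `FakeLiteral` is a stand-in for a name object exposing `.name`, `LOSSLESS_FILTERS` is an assumed subset used only for illustration (the real `_PDF_FILTER_WITHOUT_LOSS` constant may differ), and `filter_names` and `is_lossless` are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Any, List

# Assumed subset of lossless filters, for illustration only; the real
# _PDF_FILTER_WITHOUT_LOSS constant in langchain_community may differ.
LOSSLESS_FILTERS = {"FlateDecode", "LZWDecode", "RunLengthDecode"}


@dataclass
class FakeLiteral:
    """Stand-in for a PDF name object that exposes a ``.name`` attribute."""

    name: str


def filter_names(raw: Any) -> List[str]:
    """Normalize a Filter entry to a list of plain filter names.

    Accepts a single name object, a plain string such as "/FlateDecode",
    a list or tuple of either, or None when no filter is set.
    """
    if raw is None:
        return []
    items = raw if isinstance(raw, (list, tuple)) else [raw]
    names = []
    for item in items:
        # Name objects carry the value in .name; strings are used directly.
        name = getattr(item, "name", item)
        if isinstance(name, bytes):
            name = name.decode("ascii", errors="replace")
        names.append(str(name).lstrip("/"))
    return names


def is_lossless(raw: Any) -> bool:
    """Design assumption: a filter chain is lossless only if every filter is."""
    names = filter_names(raw)
    return bool(names) and all(name in LOSSLESS_FILTERS for name in names)
```

With a helper like this, the failing check could be written as `is_lossless(img.stream["Filter"])`, which no longer depends on whether the entry is a single object or a list. Requiring every filter in the chain to be lossless is a choice made for this sketch; taking only the first or last filter would be an alternative.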
Description
I think I can fix this bug
System Info
python -m langchain_core.sys_info
System Information
Package Information
Optional packages not installed
Other Dependencies