Does HTML Doc Loader accept .htm sites? #14545

Closed
2 of 14 tasks
jimmy2933 opened this issue Dec 11, 2023 · 2 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@jimmy2933

System Info

LangChain 0.0.331, macOS Monterey, Python 3.10.9

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

from langchain.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader("https://www.sec.gov/ix?doc=/Archives/edgar/data/40987/000004098720000010/gpc-12312019x10k.htm")
documents = loader.load()

FileNotFoundError: [Errno 2] No such file or directory: 'https://www.sec.gov/ix?doc=/Archives/edgar/data/40987/000004098720000010/gpc-12312019x10k.htm'

Expected behavior

The .htm file loads successfully.


dosubot bot commented Dec 11, 2023

🤖

Hello @jimmy2933! 👋 I'm Dosu, a friendly bot who's here to assist while you're waiting for a human maintainer. I can help you solve bugs, answer questions, and guide you to become a contributor. Remember, no issue is too big or small for us to tackle together! 😄

Based on the information you've provided, the FileNotFoundError suggests that UnstructuredHTMLLoader treated the URL as a local file path, so the .htm extension itself is unlikely to be the cause. A similar issue in the LangChain repository mentions a regression in the WebBaseLoader, but that loader is not used in your reproduction.

The UnstructuredHTMLLoader uses a method called _get_elements() to load and partition the HTML file into different elements. Here is the code for the _get_elements() method:

def _get_elements(self) -> List:
    from unstructured.partition.html import partition_html

    return partition_html(filename=self.file_path, **self.unstructured_kwargs)

(Source: libs/langchain/langchain/document_loaders/html.py)

This method passes self.file_path to the partition_html function from the unstructured.partition.html module as its filename argument, so the value is read as a path to a local file. Passing a URL there would produce the FileNotFoundError you're seeing, whether the name ends in .htm or .html.
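As a rough illustration of why the call in the reproduction fails, the following sketch uses only the standard library. It assumes the value given as filename ends up being opened like any local file. The helper looks_like_url is hypothetical and is not part of LangChain or unstructured.

```python
from urllib.parse import urlparse


def looks_like_url(source: str) -> bool:
    """Return True when source is an http(s) URL rather than a local path."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


source = "https://www.sec.gov/ix?doc=/Archives/edgar/data/40987/000004098720000010/gpc-12312019x10k.htm"

# Opening the URL string as if it were a local path fails because no such
# file exists on disk. The file extension plays no role in the failure.
try:
    with open(source) as f:
        f.read()
    opened = True
except OSError as exc:  # FileNotFoundError is a subclass of OSError
    opened = False
    print(f"{type(exc).__name__}: {exc}")

print(looks_like_url(source), opened)
```

A check like this could run before constructing the loader, so that a URL is downloaded first instead of being passed straight through as a file path.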

However, without more information, it's hard to say for certain what the exact issue is. Could you please provide more details about the error message you're receiving? Specifically, it would be helpful to know if there's any additional information in the error message about where in the code the error is occurring. This could help pinpoint the exact cause of the issue.


@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Dec 11, 2023
@chyroc
Contributor

chyroc commented Jan 3, 2024

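Try downloading the page to a local file first, then pass that path to the loader:

import requests
from langchain.document_loaders import UnstructuredHTMLLoader

source_url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/40987/000004098720000010/gpc-12312019x10k.htm"
# Download the page; fail early on an HTTP error instead of saving an error page.
r = requests.get(source_url, timeout=30)
r.raise_for_status()
# Save a local copy, since UnstructuredHTMLLoader expects a file path, not a URL.
with open("12312019x10k.htm", "w", encoding="utf-8") as f:
    f.write(r.text)
loader = UnstructuredHTMLLoader("12312019x10k.htm")
documents = loader.load()
print(documents)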
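The same download-then-load idea can be wrapped in a small helper. This is only a sketch under stated assumptions: download_html and load_documents are hypothetical names, the User-Agent string is a placeholder, and sending one is a guess that some sites may reject requests lacking an identifying header. The download step uses the standard library instead of requests.

```python
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_html(url: str, user_agent: str = "example-app admin@example.com") -> Path:
    """Download url into a temporary file and return the file's path."""
    # The User-Agent value is a placeholder; replace it with your own details.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        data = response.read()
    # Keep the original extension when the URL path has one, else use .html.
    suffix = Path(urlparse(url).path).suffix or ".html"
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    tmp.write(data)
    tmp.close()
    return Path(tmp.name)


def load_documents(source: str):
    """Load source with UnstructuredHTMLLoader, downloading it first if it is a URL."""
    # Imported lazily so download_html can be used without LangChain installed.
    from langchain.document_loaders import UnstructuredHTMLLoader

    is_remote = urlparse(source).scheme in ("http", "https")
    local_path = download_html(source) if is_remote else Path(source)
    return UnstructuredHTMLLoader(str(local_path)).load()
```

Calling load_documents with the URL from the issue would then download the page before handing a local path to the loader, while a plain file name such as "12312019x10k.htm" is passed through unchanged. The temporary file is not deleted automatically, so clean it up once the documents are loaded.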

dosubot bot added the stale label (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) on Apr 3, 2024
dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 10, 2024
dosubot bot removed the stale label on Apr 10, 2024