Does HTML Doc Loader accept .htm sites? #14545

jimmy2933 · 2023-12-11T16:25:09Z

dosubot · 2023-12-11T16:25:27Z

🤖

Hello @jimmy2933! 👋 I'm Dosu, a friendly bot who's here to assist while you're waiting for a human maintainer. I can help you solve bugs, answer questions, and guide you to become a contributor. Remember, no issue is too big or small for us to tackle together! 😄

From what you've described, the UnstructuredHTMLLoader is failing to load .htm files. A similar issue in the LangChain repository attributes this kind of failure to a regression in the WebBaseLoader, which may be related.

The UnstructuredHTMLLoader uses a method called _get_elements() to load and partition the HTML file into different elements. Here is the code for the _get_elements() method:

def _get_elements(self) -> List:
    from unstructured.partition.html import partition_html

    return partition_html(filename=self.file_path, **self.unstructured_kwargs)

(Source: libs/langchain/langchain/document_loaders/html.py)

This method passes the file path straight to the partition_html function from the unstructured.partition.html module, so whether a .htm file loads depends on how partition_html handles it. If partition_html cannot handle .htm files, that could be the cause of the issue you're experiencing.

However, without more information, it's hard to say for certain what the exact issue is. Could you please provide more details about the error message you're receiving? Specifically, it would be helpful to know if there's any additional information in the error message about where in the code the error is occurring. This could help pinpoint the exact cause of the issue.
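While gathering those details, one quick check is whether the saved file contains readable text at all, or is mostly script. The sketch below uses only the Python standard library and is not part of LangChain or unstructured; the class and function names are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class VisibleTextCollector(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(markup: str) -> str:
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextCollector()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)
```

Reading the saved file and calling visible_text on its contents gives a rough idea of how much text a loader could extract. If the result is empty or very short, the problem is more likely the downloaded content than the .htm extension.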


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


chyroc · 2024-01-03T08:12:23Z

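Try this:

import requests
from langchain.document_loaders import UnstructuredHTMLLoader

source_url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/40987/000004098720000010/gpc-12312019x10k.htm"
# Download the page and fail early on an HTTP error instead of saving an error page.
r = requests.get(source_url)
r.raise_for_status()
# Write with an explicit encoding so the result does not depend on the platform default.
with open("12312019x10k.htm", "w", encoding="utf-8") as f:
    f.write(r.text)
# The loader takes a local file path; the .htm path is handed to the partitioner unchanged.
loader = UnstructuredHTMLLoader("12312019x10k.htm")
documents = loader.load()
print(documents)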
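One thing to watch with a URL of this shape: the address contains /ix?doc=, which looks like a viewer page that wraps the filing named in the doc query parameter. That is an assumption about the site, not something confirmed in this thread. If it holds, downloading the viewer URL may save the viewer shell instead of the filing. A minimal sketch for extracting the wrapped document URL, using only the standard library (resolve_viewer_url is a hypothetical helper name):

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def resolve_viewer_url(url: str) -> str:
    """Return the URL named by the `doc` query parameter, or the input unchanged.

    Assumes (unverified) that `doc` holds a path relative to the same host.
    """
    doc_values = parse_qs(urlsplit(url).query).get("doc")
    if not doc_values:
        return url  # no `doc` parameter, nothing to unwrap
    # urljoin resolves an absolute path like /Archives/... against the host.
    return urljoin(url, doc_values[0])
```

Passing the result to requests.get in place of the viewer URL would fetch the document directly, if the assumption above is right.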

dosubot bot added the "Ɑ: doc loader" and "🤖:bug" labels on Dec 11, 2023

dosubot bot added the "stale" label on Apr 3, 2024

dosubot bot closed this as not planned on Apr 10, 2024

dosubot bot removed the "stale" label on Apr 10, 2024
