WebBaseLoader interprets incorrectly the web_path parameter #11180

Closed
2 of 14 tasks
mrtj opened this issue Sep 28, 2023 · 2 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@mrtj
Contributor

mrtj commented Sep 28, 2023

System Info

LangChain version: 0.0.304

Who can help?

@eyurt

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

from langchain.document_loaders.web_base import WebBaseLoader
loader = WebBaseLoader("https://www.google.com")
docs = loader.load()

Result: RequestException is raised with the message "Invalid URL 'h': No scheme supplied. Perhaps you meant https://h?"

Expected behavior

The page contents load correctly into documents.

@mrtj
Contributor Author

mrtj commented Sep 28, 2023

Background

The latest update to WebBaseLoader introduces a regression: if you pass a string to the (now deprecated) web_path parameter, it is incorrectly interpreted and parsed as a sequence. As a result, when you call the load() method, an invalid URI (just the first letter of the original URI) is passed to BeautifulSoup.

Debugging hints

If you pass a string to web_path, the following condition evaluates to true, because a Python str is itself a Sequence. As a result, the individual characters of the string are saved as a list in web_paths:

elif isinstance(web_path, Sequence):
    self.web_paths = list(web_path)
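The behaviour can be reproduced without LangChain. The following is a minimal sketch, not the library's actual code: `buggy_normalize` and `fixed_normalize` are hypothetical helper names that only mirror the order of the isinstance checks. It shows why a str matches the Sequence branch, and one possible way to avoid that by testing for str first.

```python
from collections.abc import Sequence


def buggy_normalize(web_path):
    """Hypothetical helper: tests Sequence first, like the quoted branch."""
    if isinstance(web_path, Sequence):
        # A str is a Sequence of one-character strings, so it is split here.
        return list(web_path)
    return [web_path]


def fixed_normalize(web_path):
    """Hypothetical helper: tests str before the more general Sequence."""
    if isinstance(web_path, str):
        return [web_path]
    if isinstance(web_path, Sequence):
        return list(web_path)
    raise TypeError(f"Unsupported web_path type: {type(web_path).__name__}")


url = "https://www.google.com"
print(isinstance(url, Sequence))   # True
print(buggy_normalize(url)[0])     # h
print(fixed_normalize(url))        # ['https://www.google.com']
```

The first element of the buggy result is the single character 'h', which matches the "Invalid URL 'h'" error reported above.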

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Sep 28, 2023
@mrtj
Contributor Author

mrtj commented Sep 28, 2023

Duplicate of #11095
