
fix: handle URL line breaks with new URLContentLoader #28344

Open. Wants to merge 2 commits into base: master.

VaibhavLakshmiS

Description:

Problem:
When loading text documents with DirectoryLoader, URLs that contain line breaks or special characters (e.g., hyphens, slashes) are split into fragments in the loaded text.

Solution:

  • Added a new URLContentLoader class that extends DirectoryLoader.
  • Implemented _fix_special_chars_in_urls to handle fragmented URLs by consolidating broken lines into complete URLs.
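The consolidation step described above could look roughly like the following. This is a minimal sketch and not the PR's actual code: the standalone function name `fix_special_chars_in_urls`, the regular expressions, and the rule "only join when the break sits next to a hyphen or slash" are all assumptions made for illustration.

```python
import re

# Assumption: a line "ends in a URL" if its last whitespace-free token starts with http(s)://
_TRAILING_URL = re.compile(r"https?://\S*$")
# Assumption: a continuation fragment is a line made only of characters valid in a URL.
_FRAGMENT = re.compile(r"^[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+$")
# Characters the PR description names as break points.
_BREAK_CHARS = ("-", "/")


def fix_special_chars_in_urls(text: str) -> str:
    """Join lines that look like the continuation of a URL broken across lines."""
    merged: list[str] = []
    for line in text.split("\n"):
        fragment = line.strip()
        previous = merged[-1] if merged else ""
        if (
            _TRAILING_URL.search(previous)
            and _FRAGMENT.match(fragment)
            and (previous.endswith(_BREAK_CHARS) or fragment.startswith(_BREAK_CHARS))
        ):
            # The break happened right at a hyphen or slash: glue the pieces together.
            merged[-1] = previous + fragment
        else:
            merged.append(line)
    return "\n".join(merged)


broken = "See https://example.com/docs/\ngetting-\nstarted\nNext paragraph."
print(fix_special_chars_in_urls(broken))
```

The heuristic is deliberately conservative: a fragment that shares its line with ordinary prose (anything containing a space) is left alone, which avoids merging unrelated text into a URL at the cost of missing some breaks.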

Testing:

  • Added unit tests to verify:
    1. Handling of fragmented URLs with line breaks.
    2. URLs containing special characters.
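Tests for the two cases listed above might be shaped like this. The PR's real tests are not shown here, so `rejoin_url_lines` is a hypothetical stand-in for the loader's repair step, and the example URLs are made up.

```python
import re
import unittest

# Hypothetical stand-in for the repair step under test:
# drop a newline that directly follows a hyphen or slash inside a URL.
_BREAK = re.compile(r"(https?://\S*[-/])\n(?=\S)")


def rejoin_url_lines(text: str) -> str:
    previous = None
    # Repeat until nothing changes, so URLs broken in several places are fully rejoined.
    while previous != text:
        previous, text = text, _BREAK.sub(r"\1", text)
    return text


class URLRepairTests(unittest.TestCase):
    def test_line_break_after_slash(self):
        self.assertEqual(
            rejoin_url_lines("https://example.com/docs/\nintro"),
            "https://example.com/docs/intro",
        )

    def test_hyphen_at_break(self):
        self.assertEqual(
            rejoin_url_lines("https://example.com/getting-\nstarted"),
            "https://example.com/getting-started",
        )

    def test_plain_text_is_untouched(self):
        text = "first line\nsecond line"
        self.assertEqual(rejoin_url_lines(text), text)
```

Saved as a module, this can be run with `python -m unittest`. The third case is worth keeping in any real suite, since a repair step that is too eager would also merge ordinary lines.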

Documentation

  • Added docstrings for all methods in URLContentLoader.

Dependencies

  • None

Checklist:

  • Added unit tests for all new functionalities.
  • Verified that the code adheres to LangChain's linting standards.
  • Included detailed documentation for users and developers.

Linked Issue:
Fixes #23849


Contributors

- Added `URLContentLoader` to address issues with URL line breaks and special characters.
- Included tests to validate `URLContentLoader` functionality.
dosubot bot added the size:L label (this PR changes 100-499 lines, ignoring generated files) on Nov 26, 2024.


dosubot bot added the community (related to langchain-community) and Ɑ: doc loader (related to document loader module, not documentation) labels on Nov 26, 2024.
Projects
Status: Triage
Development

Successfully merging this pull request may close these issues.

DirectoryLoader converting characters randomly into new line characters?