feat: Add BSHTMLLoader support and enhance error handling for document loading #1166
This PR enhances the DocumentLoader class by adding support for HTML documents and improving error handling. The changes include:
Added BSHTMLLoader support:
Imported BSHTMLLoader from langchain_community.document_loaders
Added handlers for both .html and .htm file extensions
BSHTMLLoader parses HTML with BeautifulSoup and extracts the page text (with the page title as metadata) instead of loading the raw markup as plain text
Improved error handling:
Added a dedicated try/except block around document loading operations
Enhanced error messages to differentiate between HTML-specific and general document loading failures
Provides better debugging information when loading fails
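The error-handling behaviour described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `StubLoader`, `describe_failure`, and `load_document` are hypothetical names, and the stub only stands in for a LangChain-style loader that exposes `.load()`.

```python
HTML_EXTENSIONS = ("html", "htm")


class StubLoader:
    """Hypothetical stand-in for a loader object that exposes .load()."""

    def __init__(self, file_path, fail=False):
        self.file_path = file_path
        self.fail = fail

    def load(self):
        if self.fail:
            raise ValueError("malformed input")
        return [f"contents of {self.file_path}"]


def describe_failure(file_path, file_extension, exc):
    """Build a message that separates HTML failures from other failures."""
    kind = "HTML document" if file_extension in HTML_EXTENSIONS else "document"
    return f"Failed to load {kind}: {file_path} ({exc})"


def load_document(file_path, file_extension, loader):
    """Return the loaded documents, or an empty list if loading fails."""
    try:
        return loader.load()
    except Exception as exc:  # broad on purpose: report the error, keep going
        print(describe_failure(file_path, file_extension, exc))
        return []
```

Returning an empty list on failure is an assumption made here so that one unreadable file does not abort a whole batch; the actual change may handle this differently.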
Together, these changes let DocumentLoader ingest .html and .htm files and report loading failures more precisely.
Technical Details:
File modified: gpt_researcher/document/document.py
Added BSHTMLLoader to the imported loaders
Updated loader_dict to include HTML file extensions
Implemented specific error handling for HTML document loading failures
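As an illustration of the extension mapping, the sketch below routes both HTML extensions to `BSHTMLLoader`. It is written under assumptions rather than copied from `document.py`: `LOADER_CLASSES`, `file_extension`, and `pick_loader_class` are hypothetical names, the keys are assumed to be extensions without the leading dot, and the import is guarded only so the snippet stays importable where LangChain is not installed.

```python
from pathlib import Path

try:
    from langchain_community.document_loaders import BSHTMLLoader, TextLoader
except ImportError:  # LangChain not installed: keep the sketch importable
    BSHTMLLoader = TextLoader = None

# Assumed shape: extension (no leading dot) -> loader class.
LOADER_CLASSES = {
    "txt": TextLoader,
    "html": BSHTMLLoader,
    "htm": BSHTMLLoader,  # same loader for the short extension
}


def file_extension(file_path):
    """Return the lower-cased extension without the dot, e.g. 'html'."""
    return Path(file_path).suffix.lower().lstrip(".")


def pick_loader_class(file_path):
    """Look up the loader class for a path, or None if unsupported."""
    return LOADER_CLASSES.get(file_extension(file_path))
```

With LangChain and `beautifulsoup4` installed, `pick_loader_class("page.htm")("page.htm").load()` would be expected to return a list of documents holding the extracted page text.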