
langchain: Replace lxml and XSLT with BeautifulSoup in HTMLHeaderTextSplitter for Improved Large HTML File Processing #27678

Merged
48 commits merged on Jan 20, 2025
Changes from 1 commit
0771f8e
Update html.py
AhmedTammaa Oct 28, 2024
c52667a
Merge branch 'master' into patch-1
AhmedTammaa Oct 29, 2024
8dc8e46
Merge branch 'master' into patch-1
AhmedTammaa Nov 8, 2024
d4efd97
Update html.py
AhmedTammaa Nov 8, 2024
73c001c
Update html.py
AhmedTammaa Nov 8, 2024
9119fe9
Update html.py
AhmedTammaa Nov 8, 2024
7e0ce8e
Update html.py
AhmedTammaa Nov 8, 2024
d604fd1
Merge branch 'master' into patch-1
AhmedTammaa Nov 8, 2024
dfe4ee4
Merge branch 'master' into patch-1
eyurtsev Dec 13, 2024
6bfc158
Update html.py
AhmedTammaa Dec 16, 2024
b84f13c
Update test_text_splitters.py
AhmedTammaa Dec 16, 2024
6a2f1e9
Merge branch 'master' into patch-1
AhmedTammaa Dec 17, 2024
17ae8b9
added import Tuple
AhmedTammaa Dec 17, 2024
be9de90
Merge branch 'master' into patch-1
AhmedTammaa Dec 17, 2024
851ba7e
Merge branch 'master' into patch-1
AhmedTammaa Dec 17, 2024
0306951
added beautifulsoup4 to poetry depedencies
AhmedTammaa Dec 17, 2024
09e7852
Merge branch 'master' into patch-1
AhmedTammaa Dec 18, 2024
ae50b32
discarded bs4 dependency
AhmedTammaa Dec 18, 2024
f9a93d0
Removed uncessary module docstring, updated docstring of HTMLHeaderTe…
AhmedTammaa Dec 18, 2024
438aedd
improved docstring for the class `HTMLHeaderTextSplitter`
AhmedTammaa Dec 18, 2024
d573723
removed typing from docstring when type is hinted.
AhmedTammaa Dec 18, 2024
405ea70
Merge branch 'master' into patch-1
AhmedTammaa Dec 19, 2024
f6e45e2
Merge branch 'master' into patch-1
AhmedTammaa Dec 19, 2024
617e04a
Merge branch 'master' into patch-1
AhmedTammaa Dec 19, 2024
b82bfc9
added pytest mark require bs4
AhmedTammaa Dec 19, 2024
4297787
added requirement bs4 marker for the test cases
AhmedTammaa Dec 19, 2024
c2107b1
all test function involving HTMLHeaderTextSplitter has bs4 requirment…
AhmedTammaa Dec 19, 2024
4261885
added bs4 import in the split_file_function and removed it from top l…
AhmedTammaa Dec 19, 2024
567318a
fixing linting errors and improved documentation for HTMLHeaderTextSp…
AhmedTammaa Dec 19, 2024
53685eb
fixed docstring issue and sorted imports
AhmedTammaa Dec 19, 2024
9ff0bfa
sorted imports and defined `nodes` in `_generate_documents` docstring
AhmedTammaa Dec 19, 2024
aeae28c
updated import order
AhmedTammaa Dec 19, 2024
e67f6bd
fixed all linting issues with Ruff
AhmedTammaa Dec 20, 2024
3b8a547
Merge branch 'master' into patch-1
AhmedTammaa Dec 20, 2024
cdd62b7
removed extra blank space from `_finalize_chunk`
AhmedTammaa Dec 20, 2024
b4d4e57
added types for untyped function paramters. Typed `stack` variable as…
AhmedTammaa Dec 20, 2024
d7ea998
fixed "line too long" in test_text_splitters
AhmedTammaa Dec 20, 2024
2bf3726
fixed linter issues in test_text_splitter.py
AhmedTammaa Dec 20, 2024
7dd9f15
fixed mypy issues
AhmedTammaa Dec 20, 2024
456c36a
fixed all formatting issues and checked with pre-commit
AhmedTammaa Dec 20, 2024
533bc90
Merge branch 'master' into patch-1
AhmedTammaa Dec 20, 2024
f31e4b7
Merge branch 'master' into patch-1
AhmedTammaa Dec 20, 2024
bbe5616
simplified HTMLHeaderSplitter Logic
AhmedTammaa Dec 21, 2024
5637dc7
improved documentation and formatting
AhmedTammaa Dec 21, 2024
4aaa912
Merge branch 'master' into patch-1
AhmedTammaa Dec 23, 2024
be08dad
Merge branch 'master' into patch-1
AhmedTammaa Jan 6, 2025
e905614
Merge branch 'master' into patch-1
AhmedTammaa Jan 9, 2025
73e2ae2
Merge branch 'master' into patch-1
eyurtsev Jan 20, 2025
Update html.py
updated according to linter tests
AhmedTammaa authored Nov 8, 2024
commit d4efd97db21e071f72284ac67b54767a23266634
175 changes: 87 additions & 88 deletions libs/text-splitters/langchain_text_splitters/html.py
@@ -9,7 +9,9 @@
 from langchain_core.documents import Document

 from langchain_text_splitters.character import RecursiveCharacterTextSplitter

+from bs4 import BeautifulSoup
+from bs4.element import Tag
+from langchain.docstore.document import Document
class ElementType(TypedDict):
"""Element type as typed dict."""
@@ -91,104 +93,101 @@ def split_text(self, text: str) -> List[Document]:
return self.split_text_from_file(StringIO(text))


     def split_text_from_file(self, file: Any) -> List[Document]:
         """Split HTML file using BeautifulSoup.

         Args:
             file: HTML file path or file-like object.

         Returns:
             List of Document objects with page_content and metadata.
         """
-        from bs4 import BeautifulSoup
-        from langchain.docstore.document import Document
-        import bs4

         # Read the HTML content from the file or file-like object
         if isinstance(file, str):
             with open(file, 'r', encoding='utf-8') as f:
                 html_content = f.read()
         else:
             # Assuming file is a file-like object
             html_content = file.read()

         # Parse the HTML content using BeautifulSoup
         soup = BeautifulSoup(html_content, 'html.parser')

         # Extract the header tags and their corresponding metadata keys
         headers_to_split_on = [tag[0] for tag in self.headers_to_split_on]
         header_mapping = dict(self.headers_to_split_on)

         documents = []

         # Find the body of the document
         body = soup.body if soup.body else soup

         # Find all header tags in the order they appear
         all_headers = body.find_all(headers_to_split_on)

         # If there's content before the first header, collect it
         first_header = all_headers[0] if all_headers else None
         if first_header:
             pre_header_content = ''
             for elem in first_header.find_all_previous():
-                if isinstance(elem, bs4.Tag):
+                if isinstance(elem, Tag):
                     text = elem.get_text(separator=' ', strip=True)
                     if text:
                         pre_header_content = text + ' ' + pre_header_content
             if pre_header_content.strip():
                 documents.append(Document(
                     page_content=pre_header_content.strip(),
                     metadata={}  # No metadata since there's no header
                 ))
         else:
             # If no headers are found, return the whole content
             full_text = body.get_text(separator=' ', strip=True)
             if full_text.strip():
                 documents.append(Document(
                     page_content=full_text.strip(),
                     metadata={}
                 ))
             return documents

         # Process each header and its associated content
         for header in all_headers:
             current_metadata = {}
             header_name = header.name
             header_text = header.get_text(separator=' ', strip=True)
             current_metadata[header_mapping[header_name]] = header_text

             # Collect all sibling elements until the next header of the same or higher level
             content_elements = []
             for sibling in header.find_next_siblings():
                 if sibling.name in headers_to_split_on:
                     # Stop at the next header
                     break
-                if isinstance(sibling, bs4.Tag):
+                if isinstance(sibling, Tag):
                     content_elements.append(sibling)

             # Get the text content of the collected elements
             current_content = ''
             for elem in content_elements:
                 text = elem.get_text(separator=' ', strip=True)
                 if text:
                     current_content += text + ' '

             # Create a Document if there is content
             if current_content.strip():
                 documents.append(Document(
                     page_content=current_content.strip(),
                     metadata=current_metadata.copy()
                 ))
             else:
                 # If there's no content, but we have metadata, still create a Document
                 documents.append(Document(
                     page_content='',
                     metadata=current_metadata.copy()
                 ))

         return documents

class HTMLSectionSplitter:
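
For context, the header-splitting strategy in this diff can be sketched with only the standard library. `HeaderSplitter` below is a hypothetical illustration, not part of langchain: it starts a new chunk at each configured header tag, records the header text as metadata, and attaches the following text to that chunk.

```python
# Stdlib-only sketch of the header-splitting strategy the PR implements with
# BeautifulSoup. HeaderSplitter is a hypothetical illustration, not langchain API.
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    def __init__(self, headers_to_split_on):
        super().__init__()
        self.headers = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.chunks = []            # finished (metadata, text) pairs
        self.current_meta = {}
        self.current_text = []
        self.in_header = None

    def _flush(self):
        text = " ".join(self.current_text).strip()
        if text:
            self.chunks.append((dict(self.current_meta), text))
        self.current_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.headers:
            self._flush()               # close the previous chunk
            self.in_header = tag
            self.current_meta = {}      # per-header metadata, as in the diff

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.in_header:
            self.current_meta[self.headers[self.in_header]] = data
        else:
            self.current_text.append(data)

    def close(self):
        super().close()
        self._flush()                   # emit the trailing chunk


html_doc = "<h1>Intro</h1><p>Welcome.</p><h2>Setup</h2><p>Install it.</p>"
splitter = HeaderSplitter([("h1", "Header 1"), ("h2", "Header 2")])
splitter.feed(html_doc)
splitter.close()
print(splitter.chunks)
# → [({'Header 1': 'Intro'}, 'Welcome.'), ({'Header 2': 'Setup'}, 'Install it.')]
```

The PR's actual implementation uses BeautifulSoup's `find_all` and `find_next_siblings` on a fully parsed tree rather than a streaming parser, which is also what lets it collect content that appears before the first header.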