
langchain: Replace lxml and XSLT with BeautifulSoup in HTMLHeaderTextSplitter for Improved Large HTML File Processing #27678

Merged
48 commits merged into langchain-ai:master on Jan 20, 2025

Conversation

@AhmedTammaa (Contributor) commented Oct 28, 2024

This pull request updates the HTMLHeaderTextSplitter by replacing the split_text_from_file method's implementation. The original method used lxml and XSLT to process HTML files, which raised `lxml.etree.XSLTApplyError: maxHead` on large HTML documents due to limitations in the XSLT processor. Fixes #13149

By switching to BeautifulSoup (bs4), we achieve:

  • Improved Performance and Reliability: BeautifulSoup efficiently processes large HTML files without the errors associated with lxml and XSLT.
  • Simplified Dependencies: Removes the dependency on lxml and external XSLT files, relying instead on the widely used beautifulsoup4 library.
  • Maintained Functionality: The new method replicates the original behavior, ensuring compatibility with existing code and preserving the extraction of content and metadata.

Issue:

This change addresses problems with processing large HTML files in the existing HTMLHeaderTextSplitter implementation, where users encountered `lxml.etree.XSLTApplyError: maxHead` on large HTML documents.

Dependencies:

  • BeautifulSoup (beautifulsoup4): The beautifulsoup4 library is now used for parsing HTML content.
    • Installation: pip install beautifulsoup4

Code Changes:

Updated the split_text_from_file method in HTMLHeaderTextSplitter as follows:

def split_text_from_file(self, file: Any) -> List[Document]:
    """Split HTML file using BeautifulSoup.

    Args:
        file: HTML file path or file-like object.

    Returns:
        List of Document objects with page_content and metadata.
    """
    import bs4
    from bs4 import BeautifulSoup
    from langchain_core.documents import Document

    # Read the HTML content from the file or file-like object
    if isinstance(file, str):
        with open(file, 'r', encoding='utf-8') as f:
            html_content = f.read()
    else:
        # Assuming file is a file-like object
        html_content = file.read()

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract the header tags and their corresponding metadata keys
    headers_to_split_on = [tag[0] for tag in self.headers_to_split_on]
    header_mapping = dict(self.headers_to_split_on)

    documents = []

    # Find the body of the document
    body = soup.body if soup.body else soup

    # Find all header tags in the order they appear
    all_headers = body.find_all(headers_to_split_on)

    # If there's content before the first header, collect it
    first_header = all_headers[0] if all_headers else None
    if first_header:
        pre_header_content = ''
        for elem in first_header.find_all_previous():
            if isinstance(elem, bs4.Tag):
                text = elem.get_text(separator=' ', strip=True)
                if text:
                    pre_header_content = text + ' ' + pre_header_content
        if pre_header_content.strip():
            documents.append(Document(
                page_content=pre_header_content.strip(),
                metadata={}  # No metadata since there's no header
            ))
    else:
        # If no headers are found, return the whole content
        full_text = body.get_text(separator=' ', strip=True)
        if full_text.strip():
            documents.append(Document(
                page_content=full_text.strip(),
                metadata={}
            ))
        return documents

    # Process each header and its associated content
    for header in all_headers:
        current_metadata = {}
        header_name = header.name
        header_text = header.get_text(separator=' ', strip=True)
        current_metadata[header_mapping[header_name]] = header_text

        # Collect all sibling elements until the next header we split on
        content_elements = []
        for sibling in header.find_next_siblings():
            if sibling.name in headers_to_split_on:
                # Stop at the next header
                break
            if isinstance(sibling, bs4.Tag):
                content_elements.append(sibling)

        # Get the text content of the collected elements
        current_content = ''
        for elem in content_elements:
            text = elem.get_text(separator=' ', strip=True)
            if text:
                current_content += text + ' '

        # Create a Document if there is content
        if current_content.strip():
            documents.append(Document(
                page_content=current_content.strip(),
                metadata=current_metadata.copy()
            ))
        else:
            # If there's no content, but we have metadata, still create a Document
            documents.append(Document(
                page_content='',
                metadata=current_metadata.copy()
            ))

    return documents
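
For reference, a minimal usage sketch of the updated splitter (the file name here is illustrative; the constructor and split_text_from_file are the existing public API of langchain_text_splitters):

from langchain_text_splitters import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
)
# Accepts a file path or a file-like object, mirroring the method above.
docs = splitter.split_text_from_file("example.html")
for doc in docs:
    print(doc.metadata, doc.page_content[:80])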

@AhmedTammaa (Contributor Author)

Hi @eyurtsev

Could you take a look at this, please?

@AhmedTammaa marked this pull request as draft on November 8, 2024.
@eyurtsev (Collaborator)

@AhmedTammaa this is a great change -- it'll make it much easier to understand what's going on.

Given that you've already taken a deep dive into this code, would you be able to help define the precise semantics of what the splitter does?

Based on the code:

class HTMLHeaderTextSplitter:
    """
    Splitting HTML files based on specified headers.
    Requires lxml package.
    """
    def __init__(
        self,
        headers_to_split_on: List[Tuple[str, str]],
        return_each_element: bool = False,
    ):
        """Create a new HTMLHeaderTextSplitter.
        Args:
            headers_to_split_on: list of tuples of headers we want to track mapped to
                (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4,
                h5, h6 e.g. [("h1", "Header 1"), ("h2", "Header 2")].
            return_each_element: Return each element w/ associated headers.
        """

It's a bit hard to understand from the doc-string, but what's the actual content of each document chunk?

  • What does the content for a given header contain? (Does the h1 header contain the h2 header if we're splitting on h1 and h2?)
  • Can we split only on h2, but not h1?
  • What metadata does each document contain?

@eyurtsev self-requested a review on December 11, 2024.
@AhmedTammaa (Contributor Author)

Hi @eyurtsev

Thanks for the feedback! Here’s what’s going on:

  • The splitter takes an HTML page and creates Document objects, each representing a chunk of text under a certain “header context.”
  • Before the first header appears, any text is grouped into a “pre-header” chunk with no metadata.
  • Once we hit a header (like <h1>, <h2>, etc.), we finalize the previous chunk and start a new one. The headers you choose in headers_to_split_on determine what levels we actually split on and what gets stored in the metadata. For example:
    • [("h1", "Header 1"), ("h2", "Header 2")] means <h1> text goes into "Header 1" in metadata and <h2> text goes into "Header 2".
  • If you only pick h2 and h3 to split on, the code ignores h1 for splitting and metadata. It just won’t create a new chunk at h1 (see the short sketch after this list).
  • Each Document’s page_content is basically the combined text of <p> elements (and whatever tags you specify) under the current headers. Its metadata is a dict of header levels (e.g. {"Header 1": "Some Title", "Header 2": "Some Subsection"}).
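
For example, a minimal configuration along those lines (illustrative HTML and names, not taken from this PR's tests):

from langchain_text_splitters import HTMLHeaderTextSplitter

# Split only on <h2> and <h3>; <h1> is neither a split point nor metadata.
splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h2", "Header 2"), ("h3", "Header 3")]
)
docs = splitter.split_text("<h1>Title</h1><h2>Section</h2><p>Body text.</p>")
# Each resulting Document carries metadata like {"Header 2": "Section"}.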

I have also modified the provided code and run further tests.

I prepared a small notebook on Google Colab to show use cases of the old vs. new code. It is public, and anyone with the link can see it. The last experiment shows the main reason for the change, where the old class fails on larger files; I tested it on a generated HTML file.

If you have any more questions, please let me know.
Thanks for your effort on Langchain; it makes my life easier, and that's why I am eager to contribute to it :)

@eyurtsev (Collaborator)

@AhmedTammaa OK, this makes sense, and I'll be very happy to merge the code if we can do the following:

  1. Add a unit test for at least one, ideally two more test cases:

A) split on h1 only (but document contains h2 / h3 tags)
B) split on h1 only (but document does not contain any header tags)

The existing unit tests are here: https://github.com/langchain-ai/langchain/blob/master/libs/text-splitters/tests/unit_tests/test_text_splitters.py#L1638-L1638

You can use pytest.mark.parametrize(), and I'd suggest just asking ChatGPT (or your favorite chat model) to write it. It'll help make sure that the semantics are well defined.

  2. Address the issue w/ the imports (flagged via a comment)

  3. If you're able, improve the doc-string for the API reference (here too you can probably ask your favorite chat model to update the API reference using Google-style doc-strings and the content of your explanation + test cases)

@AhmedTammaa (Contributor Author)

Alright @eyurtsev,

I will be working on that, and I will do a lot of testing before pushing. I am currently studying the old behaviour deeply so that I can reproduce it (ideally exactly, or with only minor changes).

The way I will be testing is by asking an LLM to generate some sample documents and comparing the old output to the new output, which should be the same (at the moment it isn't).

Or can there be more flexibility?

@eyurtsev (Collaborator)

"I compare old output to new output which should be the same (at the moment it isn't)."

Likely OK if the output isn't the same. Based on how the splitter is supposed to work, it sounds like we can well define what the behavior should be for each of the test cases.

So let's create some simple test cases, and verify that the output we get is as expected.

@AhmedTammaa (Contributor Author) commented Dec 16, 2024

Hi @eyurtsev,

Thanks for answering my query.

Here is a summary of the behaviour and the added test cases.

Summary of HTMLHeaderTextSplitter Behavior

  1. Document Chunk Content:

    • Headers: Each specified header (e.g., <h1>, <h2>) becomes a separate Document containing the header text.
    • Content: Text following a header up to the next header of the same or higher level is grouped into a Document. When splitting on multiple headers (e.g., <h1> and <h2>), higher-level headers do not contain lower-level ones; instead, metadata reflects the hierarchy.
  2. Splitting Only on Specific Headers:

    • You can choose to split on any combination of headers. For example, splitting only on <h2> while ignoring <h1> is supported by specifying only ("h2", "Header 2") in headers_to_split_on.
  3. Metadata in Each Document:

    • Structure: Each Document includes metadata mapping the specified headers to their respective texts.
    • Hierarchy: For nested headers, metadata accumulates to reflect the document structure. For example, a <h2> under a <h1> will have both Header 1 and Header 2 in its metadata.

Test Cases Overview

  • Test Case A:
    Splits on <h1>, <h2>, and <h3> within a nested structure, ensuring each header and its content are correctly segmented with appropriate metadata.

  • Test Case B:
    Splits on <h1> only in a document that contains no headers, resulting in a single aggregated Document without metadata.

  • Added Test Cases
    I have added edge cases where elements are at different depths, nested structures, etc.
    Here is the full test set now:

Test Code
from typing import Any, Callable, List, Tuple

import pytest
from langchain_core.documents import Document
from langchain_text_splitters import HTMLHeaderTextSplitter


@pytest.fixture
def html_header_splitter_splitter_factory() -> Callable[
    [List[Tuple[str, str]]], HTMLHeaderTextSplitter
]:
    """
    Fixture to create an HTMLHeaderTextSplitter instance with given headers.
    This factory allows dynamic creation of splitters with different headers.
    """

    def _create_splitter(
        headers_to_split_on: List[Tuple[str, str]],
    ) -> HTMLHeaderTextSplitter:
        return HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

    return _create_splitter


@pytest.mark.parametrize(
    "headers_to_split_on, html_input, expected_documents, test_case",
    [
        (
            # Test Case 1: Split on h1 and h2
            [("h1", "Header 1"), ("h2", "Header 2")],
            """
            <html>
                <body>
                    <h1>Introduction</h1>
                    <p>This is the introduction.</p>
                    <h2>Background</h2>
                    <p>Background information.</p>
                    <h1>Conclusion</h1>
                    <p>Final thoughts.</p>
                </body>
            </html>
            """,
            [
                Document(
                    page_content="Introduction",
                    metadata={"Header 1": "Introduction"}
                ),
                Document(
                    page_content="This is the introduction.",
                    metadata={"Header 1": "Introduction"}
                ),
                Document(
                    page_content="Background",
                    metadata={
                        "Header 1": "Introduction",
                        "Header 2": "Background"
                    }
                ),
                Document(
                    page_content="Background information.",
                    metadata={
                        "Header 1": "Introduction",
                        "Header 2": "Background"
                    }
                ),
                Document(
                    page_content="Conclusion",
                    metadata={"Header 1": "Conclusion"}
                ),
                Document(
                    page_content="Final thoughts.",
                    metadata={"Header 1": "Conclusion"}
                )
            ],
            "Simple headers and paragraphs"
        ),
        (
            # Test Case 2: Nested headers with h1, h2, and h3
            [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")],
            """
            <html>
                <body>
                    <div>
                        <h1>Main Title</h1>
                        <div>
                            <h2>Subsection</h2>
                            <p>Details of subsection.</p>
                            <div>
                                <h3>Sub-subsection</h3>
                                <p>More details.</p>
                            </div>
                        </div>
                    </div>
                    <h1>Another Main Title</h1>
                    <p>Content under another main title.</p>
                </body>
            </html>
            """,
            [
                Document(
                    page_content="Main Title",
                    metadata={"Header 1": "Main Title"}
                ),
                Document(
                    page_content="Subsection",
                    metadata={
                        "Header 1": "Main Title",
                        "Header 2": "Subsection"
                    }
                ),
                Document(
                    page_content="Details of subsection.",
                    metadata={
                        "Header 1": "Main Title",
                        "Header 2": "Subsection"
                    }
                ),
                Document(
                    page_content="Sub-subsection",
                    metadata={
                        "Header 1": "Main Title",
                        "Header 2": "Subsection",
                        "Header 3": "Sub-subsection"
                    }
                ),
                Document(
                    page_content="More details.",
                    metadata={
                        "Header 1": "Main Title",
                        "Header 2": "Subsection",
                        "Header 3": "Sub-subsection"
                    }
                ),
                Document(
                    page_content="Another Main Title",
                    metadata={"Header 1": "Another Main Title"}
                ),
                Document(
                    page_content="Content under another main title.",
                    metadata={"Header 1": "Another Main Title"}
                )
            ],
            "Nested headers with h1, h2, and h3"
        ),
        (
            # Test Case 3: No headers
            [("h1", "Header 1")],
            """
            <html>
                <body>
                    <p>Paragraph one.</p>
                    <p>Paragraph two.</p>
                    <div>
                        <p>Paragraph three.</p>
                    </div>
                </body>
            </html>
            """,
            [
                Document(
                    page_content="Paragraph one.  \nParagraph two.  \nParagraph three.",
                    metadata={}
                )
            ],
            "No headers present"
        ),
        (
            # Test Case 4: Multiple headers of the same level
            [("h1", "Header 1")],
            """
            <html>
                <body>
                    <h1>Chapter 1</h1>
                    <p>Content of chapter 1.</p>
                    <h1>Chapter 2</h1>
                    <p>Content of chapter 2.</p>
                    <h1>Chapter 3</h1>
                    <p>Content of chapter 3.</p>
                </body>
            </html>
            """,
            [
                Document(
                    page_content="Chapter 1",
                    metadata={"Header 1": "Chapter 1"}
                ),
                Document(
                    page_content="Content of chapter 1.",
                    metadata={"Header 1": "Chapter 1"}
                ),
                Document(
                    page_content="Chapter 2",
                    metadata={"Header 1": "Chapter 2"}
                ),
                Document(
                    page_content="Content of chapter 2.",
                    metadata={"Header 1": "Chapter 2"}
                ),
                Document(
                    page_content="Chapter 3",
                    metadata={"Header 1": "Chapter 3"}
                ),
                Document(
                    page_content="Content of chapter 3.",
                    metadata={"Header 1": "Chapter 3"}
                )
            ],
            "Multiple headers of the same level"
        ),
        (
            # Test Case 5: Headers with no content
            [("h1", "Header 1"), ("h2", "Header 2")],
            """
            <html>
                <body>
                    <h1>Header 1</h1>
                    <h2>Header 2</h2>
                    <h1>Header 3</h1>
                </body>
            </html>
            """,
            [
                Document(
                    page_content="Header 1",
                    metadata={"Header 1": "Header 1"}
                ),
                Document(
                    page_content="Header 2",
                    metadata={
                        "Header 1": "Header 1",
                        "Header 2": "Header 2"
                    }
                ),
                Document(
                    page_content="Header 3",
                    metadata={"Header 1": "Header 3"}
                )
            ],
            "Headers with no associated content"
        ),
    ]
)
def test_html_header_text_splitter(
    html_header_splitter_splitter_factory: Any,
    headers_to_split_on: List[Tuple[str, str]],
    html_input: str,
    expected_documents: List[Document],
    test_case: str
):
    """
    Test the HTML header text splitter.

    Args:
        html_header_splitter_splitter_factory (Any): Factory function to create
            the HTML header splitter.
        headers_to_split_on (List[Tuple[str, str]]): List of headers to split on.
        html_input (str): The HTML input string to be split.
        expected_documents (List[Document]): List of expected Document objects.
        test_case (str): Description of the test case.

    Raises:
        AssertionError: If the number of documents or their content/metadata
            does not match the expected values.
    """

    splitter = html_header_splitter_splitter_factory(headers_to_split_on=headers_to_split_on)
    docs = splitter.split_text(html_input)

    assert len(docs) == len(expected_documents), (
        f"Test Case '{test_case}' Failed: Number of documents mismatch. "
        f"Expected {len(expected_documents)}, got {len(docs)}."
    )
    for idx, (doc, expected) in enumerate(zip(docs, expected_documents), start=1):
        assert doc.page_content == expected.page_content, (
            f"Test Case '{test_case}' Failed at Document {idx}: "
            f"Content mismatch.\nExpected: {expected.page_content}\nGot: {doc.page_content}"
        )
        assert doc.metadata == expected.metadata, (
            f"Test Case '{test_case}' Failed at Document {idx}: "
            f"Metadata mismatch.\nExpected: {expected.metadata}\nGot: {doc.metadata}"
        )


@pytest.mark.parametrize(
    "headers_to_split_on, html_content, expected_output, test_case",
    [
        (
            # Test Case A: Split on h1 and h2 with h3 in content
            [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")],
            """
            <!DOCTYPE html>
            <html>
            <body>
                <div>
                    <h1>Foo</h1>
                    <p>Some intro text about Foo.</p>
                    <div>
                        <h2>Bar main section</h2>
                        <p>Some intro text about Bar.</p>
                        <h3>Bar subsection 1</h3>
                        <p>Some text about the first subtopic of Bar.</p>
                        <h3>Bar subsection 2</h3>
                        <p>Some text about the second subtopic of Bar.</p>
                    </div>
                    <div>
                        <h2>Baz</h2>
                        <p>Some text about Baz</p>
                    </div>
                    <br>
                    <p>Some concluding text about Foo</p>
                </div>
            </body>
            </html>
            """,
            [
                Document(
                    metadata={'Header 1': 'Foo'},
                    page_content='Foo'
                ),
                Document(
                    metadata={'Header 1': 'Foo'},
                    page_content='Some intro text about Foo.'
                ),
                Document(
                    metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'},
                    page_content='Bar main section'
                ),
                Document(
                    metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'},
                    page_content='Some intro text about Bar.'
                ),
                Document(
                    metadata={
                        'Header 1': 'Foo',
                        'Header 2': 'Bar main section',
                        'Header 3': 'Bar subsection 1'
                    },
                    page_content='Bar subsection 1'
                ),
                Document(
                    metadata={
                        'Header 1': 'Foo',
                        'Header 2': 'Bar main section',
                        'Header 3': 'Bar subsection 1'
                    },
                    page_content='Some text about the first subtopic of Bar.'
                ),
                Document(
                    metadata={
                        'Header 1': 'Foo',
                        'Header 2': 'Bar main section',
                        'Header 3': 'Bar subsection 2'
                    },
                    page_content='Bar subsection 2'
                ),
                Document(
                    metadata={
                        'Header 1': 'Foo',
                        'Header 2': 'Bar main section',
                        'Header 3': 'Bar subsection 2'
                    },
                    page_content='Some text about the second subtopic of Bar.'
                ),
                Document(
                    metadata={'Header 1': 'Foo', 'Header 2': 'Baz'},
                    page_content='Baz'
                ),
                Document(
                    metadata={'Header 1': 'Foo'},
                    page_content='Some text about Baz  \nSome concluding text about Foo'
                )
            ],
            "Test Case A: Split on h1, h2, and h3 with nested headers"
        ),
        (
            # Test Case B: Split on h1 only without any headers
            [("h1", "Header 1")],
            """
            <html>
                <body>
                    <p>Paragraph one.</p>
                    <p>Paragraph two.</p>
                    <p>Paragraph three.</p>
                </body>
            </html>
            """,
            [
                Document(
                    metadata={},
                    page_content='Paragraph one.  \nParagraph two.  \nParagraph three.'
                )
            ],
            "Test Case B: Split on h1 only without any headers"
        )
    ]
)
def test_additional_html_header_text_splitter(
    html_header_splitter_splitter_factory: Any,
    headers_to_split_on: List[Tuple[str, str]],
    html_content: str,
    expected_output: List[Document],
    test_case: str
):
    """
    Test the HTML header text splitter.

    Args:
        html_header_splitter_splitter_factory (Any): Factory function to create
            the HTML header splitter.
        headers_to_split_on (List[Tuple[str, str]]): List of headers to split on.
        html_content (str): HTML content to be split.
        expected_output (List[Document]): Expected list of Document objects.
        test_case (str): Description of the test case.

    Raises:
        AssertionError: If the number of documents or their content/metadata
            does not match the expected output.
    """
    splitter = html_header_splitter_splitter_factory(headers_to_split_on=headers_to_split_on)
    docs = splitter.split_text(html_content)

    assert len(docs) == len(expected_output), (
        f"{test_case} Failed: Number of documents mismatch. "
        f"Expected {len(expected_output)}, got {len(docs)}."
    )
    for idx, (doc, expected) in enumerate(zip(docs, expected_output), start=1):
        assert doc.page_content == expected.page_content, (
            f"{test_case} Failed at Document {idx}: "
            f"Content mismatch.\nExpected: {expected.page_content}\nGot: {doc.page_content}"
        )
        assert doc.metadata == expected.metadata, (
            f"{test_case} Failed at Document {idx}: "
            f"Metadata mismatch.\nExpected: {expected.metadata}\nGot: {doc.metadata}"
        )


@pytest.mark.parametrize(
    "headers_to_split_on, html_content, expected_output, test_case",
    [
        (
            # Test Case C: Split on h1, h2, and h3 with no headers present
            [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")],
            """
            <html>
                <body>
                    <p>Just some random text without headers.</p>
                    <div>
                        <span>More text here.</span>
                    </div>
                </body>
            </html>
            """,
            [
                Document(
                    page_content='Just some random text without headers.  \nMore text here.',
                    metadata={}
                )
            ],
            "Test Case C: Split on h1, h2, and h3 without any headers"
        )
    ]
)
def test_no_headers_with_multiple_splitters(
    html_header_splitter_splitter_factory: Any,
    headers_to_split_on: List[Tuple[str, str]],
    html_content: str,
    expected_output: List[Document],
    test_case: str
):
    """
    Test HTML content splitting without headers using multiple splitters.
    Args:
        html_header_splitter_splitter_factory (Any): Factory to create the
            HTML header splitter.
        headers_to_split_on (List[Tuple[str, str]]): List of headers to split on.
        html_content (str): HTML content to be split.
        expected_output (List[Document]): Expected list of Document objects
            after splitting.
        test_case (str): Description of the test case.
    Raises:
        AssertionError: If the number of documents or their content/metadata
            does not match the expected output.
    """
    splitter = html_header_splitter_splitter_factory(headers_to_split_on=headers_to_split_on)
    docs = splitter.split_text(html_content)

    assert len(docs) == len(expected_output), (
        f"{test_case} Failed: Number of documents mismatch. "
        f"Expected {len(expected_output)}, got {len(docs)}."
    )
    for idx, (doc, expected) in enumerate(zip(docs, expected_output), start=1):
        assert doc.page_content == expected.page_content, (
            f"{test_case} Failed at Document {idx}: "
            f"Content mismatch.\nExpected: {expected.page_content}\nGot: {doc.page_content}"
        )
        assert doc.metadata == expected.metadata, (
            f"{test_case} Failed at Document {idx}: "
            f"Metadata mismatch.\nExpected: {expected.metadata}\nGot: {doc.metadata}"
        )

Currently, the class is stable and passes all of these tests. I am doing code cleanup and performance optimization.

The current class also treats the document as a tree and uses no recursion, so we avoid future problems like exceeding the recursion limit.

@AhmedTammaa (Contributor Author) commented Dec 16, 2024

@eyurtsev So I have committed the new updates. Please let me know if there are further changes we need to make.

@AhmedTammaa marked this pull request as ready for review on December 17, 2024.
@AhmedTammaa (Contributor Author)

Hi @eyurtsev, I think you can have a look now. I have made the required changes. I figured out how to make the dependency optional by looking at other scripts in the repo. I made the documentation much clearer.
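
(For context, the optional-dependency pattern used elsewhere in the repo looks roughly like this; a sketch of the idea, not the exact code added in this PR:)

try:
    from bs4 import BeautifulSoup  # optional dependency, imported lazily
except ImportError as e:
    raise ImportError(
        "Unable to import BeautifulSoup. "
        "Please install it with `pip install beautifulsoup4`."
    ) from e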

I have learnt a lot from this one. It is my first contribution to a project on such a scale, and I hope it gets merged. I am eager to contribute more in the future with the experience I have gained from this one.

Lastly, if we require any more changes please let me know.

""",
[
Document(
page_content="Introduction", metadata={"Header 1": "Introduction"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Headers: Each specified header (e.g., <h1>, <h2>) becomes a separate Document containing the header text."

This is different from the behavior before. Which situation is this useful for? Is this for a case where an HTML document only contains the header and no other content inside the header?

How would this be used downstream?

Should we include a boolean flag to toggle this behavior?

@AhmedTammaa (Contributor Author) replied:

When I was checking the old behaviour, it sometimes created a document containing only the header text, and then the next one contained what's in between. So I wrote it to consistently create a document containing the header text, followed by one containing the content under it. There is no strong reason for doing it like that, so I can change it.

(Two resolved review threads on libs/text-splitters/langchain_text_splitters/html.py, now outdated.)
@AhmedTammaa (Contributor Author)

@eyurtsev I have simplified the implementation following your suggestions. Now:

  • No nodes or data classes for them.
  • No tree building; instead, splitting happens on the fly.
  • The overall code is much more maintainable than before.

Please let me know if we need to make further modifications.

@eyurtsev (Collaborator) commented Jan 8, 2025

Hi @AhmedTammaa, sorry for the delay; I will review in a bit -- I was out for the holidays for a couple of weeks.

@AhmedTammaa (Contributor Author)

Hi @eyurtsev,

I know you probably have a lot of code to review since the break. This is just a gentle reminder about this PR :)

Review thread from a collaborator on the old XSLT lookup that this PR removes:

# document transformation for "structure-aware" chunking is handled with xsl.
# see comments in html_chunks_with_headers.xslt for more detailed information.
xslt_path = pathlib.Path(__file__).parent / "xsl/html_chunks_with_headers.xslt"
We should probably delete this file as well (can be done in a separate PR)

Review thread from a collaborator on the new file-reading logic in split_text_from_file:

    with open(file, "r", encoding="utf-8") as f:
        html_content = f.read()
else:
    html_content = file.read()

Implementation looks like it's working based on the unit tests, so I think we're good.

I suspect that there's a way to simplify further; I would've tried the following:

  1. Create a private method that accepts HTML source and returns a generator over Documents.
  2. The method itself would use a queue to do a tree traversal and keep a variable tracking the DOM path and the content currently associated with that path (a rough sketch of this idea follows below).
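
Not the code that was merged, but a rough sketch of that suggestion (assumes bs4 and langchain_core.documents.Document; header text is folded into metadata only):

from collections import deque
from typing import Dict, Iterator, List, Tuple

from bs4 import BeautifulSoup, Tag
from langchain_core.documents import Document


def _documents_from_html(
    html: str, headers_to_split_on: List[Tuple[str, str]]
) -> Iterator[Document]:
    """Walk the DOM with a deque, tracking the active header path."""
    soup = BeautifulSoup(html, "html.parser")
    header_map = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
    active: Dict[str, str] = {}             # header tag -> header text
    buffer: List[str] = []                  # text gathered under `active`

    def flush() -> Iterator[Document]:
        # Emit the text accumulated under the current header context.
        if buffer:
            metadata = {header_map[t]: s for t, s in active.items()}
            yield Document(page_content=" ".join(buffer), metadata=metadata)
            buffer.clear()

    queue = deque((soup.body or soup).children)
    while queue:
        node = queue.popleft()
        if not isinstance(node, Tag):
            continue  # skip bare strings and comments for simplicity
        if node.name in header_map:
            yield from flush()
            level = int(node.name[1])
            # Headers at the same or a deeper level fall out of scope.
            active = {t: s for t, s in active.items() if int(t[1]) < level}
            active[node.name] = node.get_text(strip=True)
        elif node.find(list(header_map)) is not None:
            # A tracked header is nested inside: descend instead of flattening.
            queue.extendleft(reversed(list(node.children)))
        else:
            text = node.get_text(separator=" ", strip=True)
            if text:
                buffer.append(text)
    yield from flush()

A public split_text method could then just return list(_documents_from_html(text, self.headers_to_split_on)).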

@AhmedTammaa (Contributor Author) replied:

Sure, I will try to enhance it even further and create a new PR. The new PR will also delete the file you suggested be removed.

Thank you very much for your invaluable review of my code. I definitely learnt a lot, and I am eager to contribute more to Langchain!

@eyurtsev merged commit d3ed9b8 into langchain-ai:master on Jan 20, 2025. 45 checks passed.
ccurme pushed a commit that referenced this pull request Jan 23, 2025
…29340)

This pull request removes the now-unused html_chunks_with_headers.xslt
file from the codebase. In a previous update ([PR
#27678](#27678)), the
HTMLHeaderTextSplitter class was refactored to utilize BeautifulSoup
instead of lxml and XSLT for HTML processing. As a result, the
html_chunks_with_headers.xslt file is no longer necessary and can be
safely deleted to maintain code cleanliness and reduce potential
confusion.

Issue: N/A

Dependencies: N/A
Successfully merging this pull request may close these issues:

  • HTMLHeaderTextSplitter won't run (maxHead)