
langchain: Replace lxml and XSLT with BeautifulSoup in HTMLHeaderTextSplitter for Improved Large HTML File Processing #27678

Merged
48 commits merged into langchain-ai:master on Jan 20, 2025

Conversation

@AhmedTammaa (Contributor) commented Oct 28, 2024

This pull request updates the HTMLHeaderTextSplitter by replacing the split_text_from_file method's implementation. The original method used lxml and XSLT to process HTML files, which raised `lxml.etree.XSLTApplyError: maxHead` on large HTML documents due to limitations in the XSLT processor. Fixes #13149

By switching to BeautifulSoup (bs4), we achieve:

  • Improved Performance and Reliability: BeautifulSoup efficiently processes large HTML files without the errors associated with lxml and XSLT.
  • Simplified Dependencies: Removes the dependency on lxml and external XSLT files, relying instead on the widely used beautifulsoup4 library.
  • Maintained Functionality: The new method replicates the original behavior, ensuring compatibility with existing code and preserving the extraction of content and metadata.

Issue:

This change addresses problems with processing large HTML files in the existing HTMLHeaderTextSplitter implementation, where users encountered `lxml.etree.XSLTApplyError: maxHead` on large HTML documents.

Dependencies:

  • BeautifulSoup (beautifulsoup4): The beautifulsoup4 library is now used for parsing HTML content.
    • Installation: pip install beautifulsoup4

Code Changes:

Updated the split_text_from_file method in HTMLHeaderTextSplitter as follows:

def split_text_from_file(self, file: Any) -> List[Document]:
    """Split HTML file using BeautifulSoup.

    Args:
        file: HTML file path or file-like object.

    Returns:
        List of Document objects with page_content and metadata.
    """
    import bs4
    from bs4 import BeautifulSoup
    from langchain_core.documents import Document

    # Read the HTML content from the file or file-like object
    if isinstance(file, str):
        with open(file, 'r', encoding='utf-8') as f:
            html_content = f.read()
    else:
        # Assuming file is a file-like object
        html_content = file.read()

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract the header tags and their corresponding metadata keys
    headers_to_split_on = [tag[0] for tag in self.headers_to_split_on]
    header_mapping = dict(self.headers_to_split_on)

    documents = []

    # Find the body of the document
    body = soup.body if soup.body else soup

    # Find all header tags in the order they appear
    all_headers = body.find_all(headers_to_split_on)

    # If there's content before the first header, collect it
    first_header = all_headers[0] if all_headers else None
    if first_header:
        pre_header_content = ''
        for elem in first_header.find_all_previous():
            if isinstance(elem, bs4.Tag):
                text = elem.get_text(separator=' ', strip=True)
                if text:
                    pre_header_content = text + ' ' + pre_header_content
        if pre_header_content.strip():
            documents.append(Document(
                page_content=pre_header_content.strip(),
                metadata={}  # No metadata since there's no header
            ))
    else:
        # If no headers are found, return the whole content
        full_text = body.get_text(separator=' ', strip=True)
        if full_text.strip():
            documents.append(Document(
                page_content=full_text.strip(),
                metadata={}
            ))
        return documents

    # Process each header and its associated content
    for header in all_headers:
        current_metadata = {}
        header_name = header.name
        header_text = header.get_text(separator=' ', strip=True)
        current_metadata[header_mapping[header_name]] = header_text

        # Collect all sibling elements until the next header we split on
        content_elements = []
        for sibling in header.find_next_siblings():
            if sibling.name in headers_to_split_on:
                # Stop at the next header
                break
            if isinstance(sibling, bs4.Tag):
                content_elements.append(sibling)

        # Get the text content of the collected elements
        current_content = ''
        for elem in content_elements:
            text = elem.get_text(separator=' ', strip=True)
            if text:
                current_content += text + ' '

        # Create a Document if there is content
        if current_content.strip():
            documents.append(Document(
                page_content=current_content.strip(),
                metadata=current_metadata.copy()
            ))
        else:
            # If there's no content, but we have metadata, still create a Document
            documents.append(Document(
                page_content='',
                metadata=current_metadata.copy()
            ))

    return documents
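
For reference, a minimal usage sketch of the updated splitter (the file name here is illustrative; the constructor and split_text_from_file are the existing public API of langchain_text_splitters):

from langchain_text_splitters import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
)
# Accepts a file path or a file-like object, mirroring the method above.
docs = splitter.split_text_from_file("example.html")
for doc in docs:
    print(doc.metadata, doc.page_content[:80])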

@AhmedTammaa (Contributor Author)

Hi @eyurtsev

Could you take a look at this, please?

@AhmedTammaa marked this pull request as draft on November 8, 2024.
@eyurtsev (Collaborator)

@AhmedTammaa this is a great change -- it'll make it much easier to understand what's going on.

Given that you've already taken a deep dive into this code, would you be able to help define the precise semantics of what the splitter does?

Based on the code:

class HTMLHeaderTextSplitter:
    """
    Splitting HTML files based on specified headers.
    Requires lxml package.
    """
    def __init__(
        self,
        headers_to_split_on: List[Tuple[str, str]],
        return_each_element: bool = False,
    ):
        """Create a new HTMLHeaderTextSplitter.
        Args:
            headers_to_split_on: list of tuples of headers we want to track mapped to
                (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4,
                h5, h6 e.g. [("h1", "Header 1"), ("h2", "Header 2")].
            return_each_element: Return each element w/ associated headers.
        """

It's a bit hard to understand from the doc-string, but what's the actual content of each document chunk?

  • What does the content for a given header contain? (Does the h1 header contain the h2 header if we're splitting on h1 and h2?)
  • Can we split only on h2, but not h1?
  • What metadata does each document contain?

@eyurtsev self-requested a review on December 11, 2024.
@AhmedTammaa (Contributor Author)

Hi @eyurtsev

Thanks for the feedback! Here’s what’s going on:

  • The splitter takes an HTML page and creates Document objects, each representing a chunk of text under a certain “header context.”
  • Before the first header appears, any text is grouped into a “pre-header” chunk with no metadata.
  • Once we hit a header (like <h1>, <h2>, etc.), we finalize the previous chunk and start a new one. The headers you choose in headers_to_split_on determine what levels we actually split on and what gets stored in the metadata. For example:
    • [("h1", "Header 1"), ("h2", "Header 2")] means <h1> text goes into "Header 1" in metadata and <h2> text goes into "Header 2".
  • If you only pick h2 and h3 to split on, the code ignores h1 for splitting and metadata. It just won’t create a new chunk at h1 (see the short sketch after this list).
  • Each Document’s page_content is basically the combined text of <p> elements (and whatever tags you specify) under the current headers. Its metadata is a dict of header levels (e.g. {"Header 1": "Some Title", "Header 2": "Some Subsection"}).
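
For example, a minimal configuration along those lines (illustrative HTML and names, not taken from this PR's tests):

from langchain_text_splitters import HTMLHeaderTextSplitter

# Split only on <h2> and <h3>; <h1> is neither a split point nor metadata.
splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h2", "Header 2"), ("h3", "Header 3")]
)
docs = splitter.split_text("<h1>Title</h1><h2>Section</h2><p>Body text.</p>")
# Each resulting Document carries metadata like {"Header 2": "Section"}.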

I have also modified the provided code and run further tests.

I prepared a small notebook on Google Colab to show use cases of the old vs. new code. It is public, and anyone with the link can see it. The last experiment shows the main reason for the change, where the old class fails on larger files; I tested it on a generated HTML file.

If you have any more questions, please let me know.
Thanks for your effort on Langchain; it makes my life easier, and that's why I am eager to contribute to it :)

@eyurtsev (Collaborator)

@AhmedTammaa OK, this makes sense, and I'll be very happy to merge the code if we can do the following:

  1. Add a unit test for at least one, ideally two more test cases:

A) split on h1 only (but document contains h2 / h3 tags)
B) split on h1 only (but document does not contain any header tags)

The existing unit tests are here: https://github.com/langchain-ai/langchain/blob/master/libs/text-splitters/tests/unit_tests/test_text_splitters.py#L1638-L1638

You can use pytest.mark.parametrize(), and I'd suggest just asking ChatGPT (or your favorite chat model) to write it. It'll help make sure that the semantics are well defined.

  2. Address the issue w/ the imports (flagged via a comment)

  3. If you're able, improve the doc-string for the API reference (here too you can probably ask your favorite chat model to update the API reference using Google-style doc-strings and the content of your explanation + test cases)

@AhmedTammaa (Contributor Author)

Alright @eyurtsev,

I will be working on that, and I will do a lot of testing before pushing. I am currently studying the old behaviour deeply so that I can reproduce it (ideally exactly, or with only minor changes).

The way I will be testing is by asking an LLM to generate some sample documents and comparing the old output to the new output, which should be the same (at the moment it isn't).

Or can there be more flexibility?

@eyurtsev (Collaborator)

"I compare old output to new output which should be the same (at the moment it isn't)."

Likely OK if the output isn't the same. Based on how the splitter is supposed to work, it sounds like we can well define what the behavior should be for each of the test cases.

So let's create some simple test cases, and verify that the output we get is as expected.

@AhmedTammaa (Contributor Author) commented Dec 16, 2024

Hi @eyurtsev,

Thanks for answering my query.

Here is a summary of the behaviour and the added test cases.

Summary of HTMLHeaderTextSplitter Behavior

  1. Document Chunk Content:

    • Headers: Each specified header (e.g., <h1>, <h2>) becomes a separate Document containing the header text.
    • Content: Text following a header up to the next header of the same or higher level is grouped into a Document. When splitting on multiple headers (e.g., <h1> and <h2>), higher-level headers do not contain lower-level ones; instead, metadata reflects the hierarchy.
  2. Splitting Only on Specific Headers:

    • You can choose to split on any combination of headers. For example, splitting only on <h2> while ignoring <h1> is supported by specifying only ("h2", "Header 2") in headers_to_split_on.
  3. Metadata in Each Document:

    • Structure: Each Document includes metadata mapping the specified headers to their respective texts.
    • Hierarchy: For nested headers, metadata accumulates to reflect the document structure. For example, a <h2> under a <h1> will have both Header 1 and Header 2 in its metadata.

Test Cases Overview

  • Test Case A:
    Splits on <h1>, <h2>, and <h3> within a nested structure, ensuring each header and its content are correctly segmented with appropriate metadata.

  • Test Case B:
    Splits on <h1> only in a document that contains no headers, resulting in a single aggregated Document without metadata.

  • Added Test Cases
    I have added edge cases where elements are at different depths, nested structures, etc.
    Here is the full test set now:

Test Code
from typing import Any, Callable, List, Tuple

import pytest
from langchain_core.documents import Document
from langchain_text_splitters import HTMLHeaderTextSplitter


@pytest.fixture
def html_header_splitter_splitter_factory() -> Callable[
    [List[Tuple[str, str]]], HTMLHeaderTextSplitter
]:
    """
    Fixture to create an HTMLHeaderTextSplitter instance with given headers.
    This factory allows dynamic creation of splitters with different headers.
    """

    def _create_splitter(
        headers_to_split_on: List[Tuple[str, str]],
    ) -> HTMLHeaderTextSplitter:
        return HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

    return _create_splitter


@pytest.mark.parametrize(
    "headers_to_split_on, html_input, expected_documents, test_case",
    [
        (
            # Test Case 1: Split on h1 and h2
            [("h1", "Header 1"), ("h2", "Header 2")],
            """
            <html>
                <body>
                    <h1>Introduction</h1>
                    <p>This is the introduction.</p>
                    <h2>Background</h2>
                    <p>Background information.</p>
                    <h1>Conclusion</h1>
                    <p>Final thoughts.</p>
                </body>
            </html>
            """,
            [
                Document(
                    page_content="Introduction",
                    metadata={"Header 1": "Introduction"}
                ),
                Document(
                    page_content="This is the introduction.",
                    metadata={"Header 1": "Introduction"}
                ),
                Document(
                    page_content="Background",
                    metadata={
                        "Header 1": "Introduction",
                        "Header 2": "Background"
                    }
                ),
                Document(
                    page_content="Background information.",
                    metadata={
                        "Header 1": "Introduction",
                        "Header 2": "Background"
                    }
                ),
                Document(
                    page_content="Conclusion",
                    metadata={"Header 1": "Conclusion"}
                ),
                Document(
                    page_content="Final thoughts.",
                    metadata={"Header 1": "Conclusion"}
                )
            ],
            "Simple headers and paragraphs"
        ),
        (
            # Test Case 2: Nested headers with h1, h2, and h3
            [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")],
            """
            <html>
                <body>
                    <div>
                        <h1>Main Title</h1>
                        <div>
                            <h2>Subsection</h2>
                            <p>Details of subsection.</p>
                            <div>
                                <h3>Sub-subsection</h3>
                                <p>More details.</p>
                            </div>
                        </div>
                    </div>
                    <h1>Another Main Title</h1>
                    <p>Content under another main title.</p>
                </body>
            </html>
            """,
            [
                Document(
                    page_content="Main Title",
                    metadata={"Header 1": "Main Title"}
                ),
                Document(
                    page_content="Subsection",
                    metadata={
                        "Header 1": "Main Title",
                        "Header 2": "Subsection"
                    }
                ),
                Document(
                    page_content="Details of subsection.",
                    metadata={
                        "Header 1": "Main Title",
                        "Header 2": "Subsection"
                    }
                ),
                Document(
                    page_content="Sub-subsection",
                    metadata={
                        "Header 1": "Main Title",
                        "Header 2": "Subsection",
                        "Header 3": "Sub-subsection"
                    }
                ),
                Document(
                    page_content="More details.",
                    metadata={
                        "Header 1": "Main Title",
                        "Header 2": "Subsection",
                        "Header 3": "Sub-subsection"
                    }
                ),
                Document(
                    page_content="Another Main Title",
                    metadata={"Header 1": "Another Main Title"}
                ),
                Document(
                    page_content="Content under another main title.",
                    metadata={"Header 1": "Another Main Title"}
                )
            ],
            "Nested headers with h1, h2, and h3"
        ),
        (
            # Test Case 3: No headers
            [("h1", "Header 1")],
            """
            <html>
                <body>
                    <p>Paragraph one.</p>
                    <p>Paragraph two.</p>
                    <div>
                        <p>Paragraph three.</p>
                    </div>
                </body>
            </html>
            """,
            [
                Document(
                    page_content="Paragraph one.  \nParagraph two.  \nParagraph three.",
                    metadata={}
                )
            ],
            "No headers present"
        ),
        (
            # Test Case 4: Multiple headers of the same level
            [("h1", "Header 1")],
            """
            <html>
                <body>
                    <h1>Chapter 1</h1>
                    <p>Content of chapter 1.</p>
                    <h1>Chapter 2</h1>
                    <p>Content of chapter 2.</p>
                    <h1>Chapter 3</h1>
                    <p>Content of chapter 3.</p>
                </body>
            </html>
            """,
            [
                Document(
                    page_content="Chapter 1",
                    metadata={"Header 1": "Chapter 1"}
                ),
                Document(
                    page_content="Content of chapter 1.",
                    metadata={"Header 1": "Chapter 1"}
                ),
                Document(
                    page_content="Chapter 2",
                    metadata={"Header 1": "Chapter 2"}
                ),
                Document(
                    page_content="Content of chapter 2.",
                    metadata={"Header 1": "Chapter 2"}
                ),
                Document(
                    page_content="Chapter 3",
                    metadata={"Header 1": "Chapter 3"}
                ),
                Document(
                    page_content="Content of chapter 3.",
                    metadata={"Header 1": "Chapter 3"}
                )
            ],
            "Multiple headers of the same level"
        ),
        (
            # Test Case 5: Headers with no content
            [("h1", "Header 1"), ("h2", "Header 2")],
            """
            <html>
                <body>
                    <h1>Header 1</h1>
                    <h2>Header 2</h2>
                    <h1>Header 3</h1>
                </body>
            </html>
            """,
            [
                Document(
                    page_content="Header 1",
                    metadata={"Header 1": "Header 1"}
                ),
                Document(
                    page_content="Header 2",
                    metadata={
                        "Header 1": "Header 1",
                        "Header 2": "Header 2"
                    }
                ),
                Document(
                    page_content="Header 3",
                    metadata={"Header 1": "Header 3"}
                )
            ],
            "Headers with no associated content"
        ),
    ]
)
def test_html_header_text_splitter(
    html_header_splitter_splitter_factory: Any,
    headers_to_split_on: List[Tuple[str, str]],
    html_input: str,
    expected_documents: List[Document],
    test_case: str
):
    """
    Test the HTML header text splitter.

    Args:
        html_header_splitter_splitter_factory (Any): Factory function to create
            the HTML header splitter.
        headers_to_split_on (List[Tuple[str, str]]): List of headers to split on.
        html_input (str): The HTML input string to be split.
        expected_documents (List[Document]): List of expected Document objects.
        test_case (str): Description of the test case.

    Raises:
        AssertionError: If the number of documents or their content/metadata
            does not match the expected values.
    """

    splitter = html_header_splitter_splitter_factory(headers_to_split_on=headers_to_split_on)
    docs = splitter.split_text(html_input)

    assert len(docs) == len(expected_documents), (
        f"Test Case '{test_case}' Failed: Number of documents mismatch. "
        f"Expected {len(expected_documents)}, got {len(docs)}."
    )
    for idx, (doc, expected) in enumerate(zip(docs, expected_documents), start=1):
        assert doc.page_content == expected.page_content, (
            f"Test Case '{test_case}' Failed at Document {idx}: "
            f"Content mismatch.\nExpected: {expected.page_content}\nGot: {doc.page_content}"
        )
        assert doc.metadata == expected.metadata, (
            f"Test Case '{test_case}' Failed at Document {idx}: "
            f"Metadata mismatch.\nExpected: {expected.metadata}\nGot: {doc.metadata}"
        )


@pytest.mark.parametrize(
    "headers_to_split_on, html_content, expected_output, test_case",
    [
        (
            # Test Case A: Split on h1 and h2 with h3 in content
            [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")],
            """
            <!DOCTYPE html>
            <html>
            <body>
                <div>
                    <h1>Foo</h1>
                    <p>Some intro text about Foo.</p>
                    <div>
                        <h2>Bar main section</h2>
                        <p>Some intro text about Bar.</p>
                        <h3>Bar subsection 1</h3>
                        <p>Some text about the first subtopic of Bar.</p>
                        <h3>Bar subsection 2</h3>
                        <p>Some text about the second subtopic of Bar.</p>
                    </div>
                    <div>
                        <h2>Baz</h2>
                        <p>Some text about Baz</p>
                    </div>
                    <br>
                    <p>Some concluding text about Foo</p>
                </div>
            </body>
            </html>
            """,
            [
                Document(
                    metadata={'Header 1': 'Foo'},
                    page_content='Foo'
                ),
                Document(
                    metadata={'Header 1': 'Foo'},
                    page_content='Some intro text about Foo.'
                ),
                Document(
                    metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'},
                    page_content='Bar main section'
                ),
                Document(
                    metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'},
                    page_content='Some intro text about Bar.'
                ),
                Document(
                    metadata={
                        'Header 1': 'Foo',
                        'Header 2': 'Bar main section',
                        'Header 3': 'Bar subsection 1'
                    },
                    page_content='Bar subsection 1'
                ),
                Document(
                    metadata={
                        'Header 1': 'Foo',
                        'Header 2': 'Bar main section',
                        'Header 3': 'Bar subsection 1'
                    },
                    page_content='Some text about the first subtopic of Bar.'
                ),
                Document(
                    metadata={
                        'Header 1': 'Foo',
                        'Header 2': 'Bar main section',
                        'Header 3': 'Bar subsection 2'
                    },
                    page_content='Bar subsection 2'
                ),
                Document(
                    metadata={
                        'Header 1': 'Foo',
                        'Header 2': 'Bar main section',
                        'Header 3': 'Bar subsection 2'
                    },
                    page_content='Some text about the second subtopic of Bar.'
                ),
                Document(
                    metadata={'Header 1': 'Foo', 'Header 2': 'Baz'},
                    page_content='Baz'
                ),
                Document(
                    metadata={'Header 1': 'Foo'},
                    page_content='Some text about Baz  \nSome concluding text about Foo'
                )
            ],
            "Test Case A: Split on h1, h2, and h3 with nested headers"
        ),
        (
            # Test Case B: Split on h1 only without any headers
            [("h1", "Header 1")],
            """
            <html>
                <body>
                    <p>Paragraph one.</p>
                    <p>Paragraph two.</p>
                    <p>Paragraph three.</p>
                </body>
            </html>
            """,
            [
                Document(
                    metadata={},
                    page_content='Paragraph one.  \nParagraph two.  \nParagraph three.'
                )
            ],
            "Test Case B: Split on h1 only without any headers"
        )
    ]
)
def test_additional_html_header_text_splitter(
    html_header_splitter_splitter_factory: Any,
    headers_to_split_on: List[Tuple[str, str]],
    html_content: str,
    expected_output: List[Document],
    test_case: str
):
    """
    Test the HTML header text splitter.

    Args:
        html_header_splitter_splitter_factory (Any): Factory function to create
            the HTML header splitter.
        headers_to_split_on (List[Tuple[str, str]]): List of headers to split on.
        html_content (str): HTML content to be split.
        expected_output (List[Document]): Expected list of Document objects.
        test_case (str): Description of the test case.

    Raises:
        AssertionError: If the number of documents or their content/metadata
            does not match the expected output.
    """
    splitter = html_header_splitter_splitter_factory(headers_to_split_on=headers_to_split_on)
    docs = splitter.split_text(html_content)

    assert len(docs) == len(expected_output), (
        f"{test_case} Failed: Number of documents mismatch. "
        f"Expected {len(expected_output)}, got {len(docs)}."
    )
    for idx, (doc, expected) in enumerate(zip(docs, expected_output), start=1):
        assert doc.page_content == expected.page_content, (
            f"{test_case} Failed at Document {idx}: "
            f"Content mismatch.\nExpected: {expected.page_content}\nGot: {doc.page_content}"
        )
        assert doc.metadata == expected.metadata, (
            f"{test_case} Failed at Document {idx}: "
            f"Metadata mismatch.\nExpected: {expected.metadata}\nGot: {doc.metadata}"
        )


@pytest.mark.parametrize(
    "headers_to_split_on, html_content, expected_output, test_case",
    [
        (
            # Test Case C: Split on h1, h2, and h3 with no headers present
            [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")],
            """
            <html>
                <body>
                    <p>Just some random text without headers.</p>
                    <div>
                        <span>More text here.</span>
                    </div>
                </body>
            </html>
            """,
            [
                Document(
                    page_content='Just some random text without headers.  \nMore text here.',
                    metadata={}
                )
            ],
            "Test Case C: Split on h1, h2, and h3 without any headers"
        )
    ]
)
def test_no_headers_with_multiple_splitters(
    html_header_splitter_splitter_factory: Any,
    headers_to_split_on: List[Tuple[str, str]],
    html_content: str,
    expected_output: List[Document],
    test_case: str
):
    """
    Test HTML content splitting without headers using multiple splitters.
    Args:
        html_header_splitter_splitter_factory (Any): Factory to create the
            HTML header splitter.
        headers_to_split_on (List[Tuple[str, str]]): List of headers to split on.
        html_content (str): HTML content to be split.
        expected_output (List[Document]): Expected list of Document objects
            after splitting.
        test_case (str): Description of the test case.
    Raises:
        AssertionError: If the number of documents or their content/metadata
            does not match the expected output.
    """
    splitter = html_header_splitter_splitter_factory(headers_to_split_on=headers_to_split_on)
    docs = splitter.split_text(html_content)

    assert len(docs) == len(expected_output), (
        f"{test_case} Failed: Number of documents mismatch. "
        f"Expected {len(expected_output)}, got {len(docs)}."
    )
    for idx, (doc, expected) in enumerate(zip(docs, expected_output), start=1):
        assert doc.page_content == expected.page_content, (
            f"{test_case} Failed at Document {idx}: "
            f"Content mismatch.\nExpected: {expected.page_content}\nGot: {doc.page_content}"
        )
        assert doc.metadata == expected.metadata, (
            f"{test_case} Failed at Document {idx}: "
            f"Metadata mismatch.\nExpected: {expected.metadata}\nGot: {doc.metadata}"
        )

Currently, the class is stable and passes all of these tests. I am doing code cleanup and performance optimization.

The current class also treats the document as a tree and uses no recursion, so we avoid future problems like exceeding the recursion limit.

@AhmedTammaa (Contributor Author) commented Dec 16, 2024

@eyurtsev So I have committed the new updates. Please let me know if there are further changes we need to make.

@AhmedTammaa marked this pull request as ready for review on December 17, 2024.
@AhmedTammaa (Contributor Author)

Hi @eyurtsev, I think you can have a look now. I have made the required changes. I figured out how to make the dependency optional by looking at other scripts in the repo. I made the documentation much clearer.
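
(For context, the optional-dependency pattern used elsewhere in the repo looks roughly like this; a sketch of the idea, not the exact code added in this PR:)

try:
    from bs4 import BeautifulSoup  # optional dependency, imported lazily
except ImportError as e:
    raise ImportError(
        "Unable to import BeautifulSoup. "
        "Please install it with `pip install beautifulsoup4`."
    ) from e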

I have learnt a lot from this one. It is my first contribution to a project on such a scale, and I hope it gets merged. I am eager to contribute more in the future with the experience I have gained from this one.

Lastly, if we require any more changes please let me know.

""",
[
Document(
page_content="Introduction", metadata={"Header 1": "Introduction"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Headers: Each specified header (e.g., <h1>, <h2>) becomes a separate Document containing the header text."

This is different from the behavior before. Which situation is this useful for? Is this for a case where an HTML document only contains the header and no other content inside the header?

How would this be used downstream?

Should we include a boolean flag to toggle this behavior?

@AhmedTammaa (Contributor Author) replied:

When I was checking the old behaviour, it sometimes created a document containing only the header text, and then the next one contained what's in between. So I wrote it to consistently create a document containing the header text, followed by one containing the content under it. There is no strong reason for doing it like that, so I can change it.

(Two resolved review threads on libs/text-splitters/langchain_text_splitters/html.py, now outdated.)
@AhmedTammaa (Contributor Author)

@eyurtsev I have simplified the implementation following your suggestions. Now:

  • No nodes or data classes for them.
  • No tree building; instead, splitting happens on the fly.
  • The overall code is much more maintainable than before.

Please let me know if we need to make further modifications.

@eyurtsev (Collaborator) commented Jan 8, 2025

Hi @AhmedTammaa, sorry for the delay; I will review in a bit -- I was out for the holidays for a couple of weeks.

@AhmedTammaa (Contributor Author)

Hi @eyurtsev,

I know you probably have a lot of code to review since the break. This is just a gentle reminder about this PR :)

Review thread from a collaborator on the old XSLT lookup that this PR removes:

# document transformation for "structure-aware" chunking is handled with xsl.
# see comments in html_chunks_with_headers.xslt for more detailed information.
xslt_path = pathlib.Path(__file__).parent / "xsl/html_chunks_with_headers.xslt"
We should probably delete this file as well (can be done in a separate PR)

Review thread from a collaborator on the new file-reading logic in split_text_from_file:

    with open(file, "r", encoding="utf-8") as f:
        html_content = f.read()
else:
    html_content = file.read()

Implementation looks like it's working based on the unit tests, so I think we're good.

I suspect that there's a way to simplify further; I would've tried the following:

  1. Create a private method that accepts HTML source and returns a generator over Documents.
  2. The method itself would use a queue to do a tree traversal and keep a variable tracking the DOM path and the content currently associated with that path (a rough sketch of this idea follows below).
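
Not the code that was merged, but a rough sketch of that suggestion (assumes bs4 and langchain_core.documents.Document; header text is folded into metadata only):

from collections import deque
from typing import Dict, Iterator, List, Tuple

from bs4 import BeautifulSoup, Tag
from langchain_core.documents import Document


def _documents_from_html(
    html: str, headers_to_split_on: List[Tuple[str, str]]
) -> Iterator[Document]:
    """Walk the DOM with a deque, tracking the active header path."""
    soup = BeautifulSoup(html, "html.parser")
    header_map = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
    active: Dict[str, str] = {}             # header tag -> header text
    buffer: List[str] = []                  # text gathered under `active`

    def flush() -> Iterator[Document]:
        # Emit the text accumulated under the current header context.
        if buffer:
            metadata = {header_map[t]: s for t, s in active.items()}
            yield Document(page_content=" ".join(buffer), metadata=metadata)
            buffer.clear()

    queue = deque((soup.body or soup).children)
    while queue:
        node = queue.popleft()
        if not isinstance(node, Tag):
            continue  # skip bare strings and comments for simplicity
        if node.name in header_map:
            yield from flush()
            level = int(node.name[1])
            # Headers at the same or a deeper level fall out of scope.
            active = {t: s for t, s in active.items() if int(t[1]) < level}
            active[node.name] = node.get_text(strip=True)
        elif node.find(list(header_map)) is not None:
            # A tracked header is nested inside: descend instead of flattening.
            queue.extendleft(reversed(list(node.children)))
        else:
            text = node.get_text(separator=" ", strip=True)
            if text:
                buffer.append(text)
    yield from flush()

A public split_text method could then just return list(_documents_from_html(text, self.headers_to_split_on)).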

@AhmedTammaa (Contributor Author) replied:

Sure, I will try to enhance it even further and create a new PR. The new PR will also delete the file you suggested be removed.

Thank you very much for your invaluable review of my code. I definitely learnt a lot, and I am eager to contribute more to Langchain!

@eyurtsev merged commit d3ed9b8 into langchain-ai:master on Jan 20, 2025. 45 checks passed.
ccurme pushed a commit that referenced this pull request Jan 23, 2025
…29340)

This pull request removes the now-unused html_chunks_with_headers.xslt
file from the codebase. In a previous update ([PR
#27678](#27678)), the
HTMLHeaderTextSplitter class was refactored to utilize BeautifulSoup
instead of lxml and XSLT for HTML processing. As a result, the
html_chunks_with_headers.xslt file is no longer necessary and can be
safely deleted to maintain code cleanliness and reduce potential
confusion.

Issue: N/A

Dependencies: N/A
Successfully merging this pull request may close these issues:

  • HTMLHeaderTextSplitter won't run (maxHead)