langchain: Replace lxml and XSLT with BeautifulSoup in HTMLHeaderTextSplitter for Improved Large HTML File Processing #27678
Conversation
Used bs4 for larger HTML file processing
Hi @eyurtsev, could you take a look at this, please?
updated according to linter tests
@AhmedTammaa this is a great change -- it'll make it much easier to understand what's going on. Given that you've already taken a deep dive into this code, would you be able to help define the precise semantics for what the splitter does? Based on the code:

```python
class HTMLHeaderTextSplitter:
    """
    Splitting HTML files based on specified headers.
    Requires lxml package.
    """

    def __init__(
        self,
        headers_to_split_on: List[Tuple[str, str]],
        return_each_element: bool = False,
    ):
        """Create a new HTMLHeaderTextSplitter.

        Args:
            headers_to_split_on: list of tuples of headers we want to track mapped to
                (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4,
                h5, h6 e.g. [("h1", "Header 1"), ("h2", "Header 2")].
            return_each_element: Return each element w/ associated headers.
        """
```

It's a bit hard to understand from the doc-string, but what's the actual content of each document chunk?
Hi @eyurtsev, thanks for the feedback! Here's what's going on:
I have also modified the provided code and made further tests. I prepared a small notebook on Google Colab to show the use cases of the old vs. new code; it is public and anyone with the link can see it. The last experiment shows the main reason for the change, where the old class fails with larger files (I tested it on a generated HTML file). If you have any more questions, please let me know.
@AhmedTammaa OK, this makes sense, and I will be very happy to merge the code if we can do the following:
A) split on h1 only (but the document contains h2 / h3 tags). The existing unit tests are here: https://github.com/langchain-ai/langchain/blob/master/libs/text-splitters/tests/unit_tests/test_text_splitters.py#L1638-L1638 You can use
Alright @eyurtsev, I will be working on that. I will do a lot of testing before pushing. I am currently studying the old behaviour deeply so I can reproduce it, ideally exactly, or with only minor changes. The way I will be testing is by asking an LLM to generate some sample documents and comparing the old output to the new output, which should be the same (at the moment it isn't). Or can there be more flexibility?
Likely OK if the output isn't the same. Based on how the splitter is supposed to work, it sounds like we can define precisely what the behavior should be for each of the test cases. So let's create some simple test cases and verify that the output we get is as expected.
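Case (A) above can be pinned down by writing the expected output as a fixture first. A hedged sketch -- the exact expected chunks here are an assumption about the intended semantics, not taken from the repository's tests:

```python
# Hypothetical fixture for case (A): split on h1 only, while the document
# also contains h2/h3 tags. The expectation (an assumption about the
# intended semantics) is that h2/h3 text folds into the body of the
# enclosing h1 section rather than starting new chunks.
html = """
<h1>Chapter 1</h1>
<p>Intro text.</p>
<h2>Section 1.1</h2>
<p>Details.</p>
"""

expected = [
    {
        "page_content": "Intro text. Section 1.1 Details.",
        "metadata": {"Header 1": "Chapter 1"},
    },
]

# Sanity-check the fixture itself: only the tracked header key ("h1",
# mapped to "Header 1") may appear in any chunk's metadata.
for doc in expected:
    assert set(doc["metadata"]) <= {"Header 1"}
```

Agreeing on fixtures like this first makes the later implementation change easy to verify mechanically.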
Hi @eyurtsev, thanks for answering my query. Here is a summary of the behaviour and the added test cases. Summary of
Rewrote the HTMLHeaderSplitter with BS4 to support large files
Added extra tests for the new HTML class
@eyurtsev So I have committed the new updates. Please let me know if there are further changes we need to make.
Hi @eyurtsev, I think you can have a look now. I have made the required changes. I figured out how to make the dependency optional by looking at other scripts in the repo, and I made the documentation much clearer. I have learnt a lot from this one. It is my first contribution to a project of such a scale, and I hope it gets merged. I am eager to contribute more in the future with the experience I have gained from this one. Lastly, if we require any more changes, please let me know.
""", | ||
[ | ||
Document( | ||
page_content="Introduction", metadata={"Header 1": "Introduction"} |
Headers: Each specified header (e.g., `<h1>`, `<h2>`) becomes a separate Document containing the header text.
This is different from the behavior before. Which situation is this useful for? Is this for a case where an HTML document only contains the header and no other content inside the header?
How would this be used downstream?
Should we include a boolean flag to toggle this behavior?
When I was checking the old behaviour, it sometimes created a document containing only the header text, with the next one containing what's in between. So I wrote it to consistently create a document containing the header text, followed by one containing the content inside. There is no strong reason for doing it like that, so I can change it.
@eyurtsev I have simplified the implementation following your suggestions. Now,
Please let me know if we need to make further modifications.
Hi @AhmedTammaa, sorry for the delay -- will review in a bit. I was out for the holidays for a couple of weeks.
Hi @eyurtsev, I know you probably have a lot of code to review since the break. This is just a gentle reminder for this PR :)
```python
# document transformation for "structure-aware" chunking is handled with xsl.
# see comments in html_chunks_with_headers.xslt for more detailed information.
xslt_path = pathlib.Path(__file__).parent / "xsl/html_chunks_with_headers.xslt"
```
We should probably delete this file as well (can be done in a separate PR)
with open(file, "r", encoding="utf-8") as f: | ||
html_content = f.read() | ||
else: | ||
html_content = file.read() |
Implementation looks like it's working based on unit tests, so I think we're good.
I suspect that there's a way to further simplify, I would've tried the following:
- create a private method that accepts html source and yields a generator over documents
- the method itself would use a queue to do a tree traversal and have some variable to keep track of dom path and current content associated with the given dom path.
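That simplification could be sketched roughly as follows. This is stdlib-only for illustration (the real code would parse with BeautifulSoup rather than `xml.etree`, and the names `split_chunks` / `header_map` are invented here): an explicit deque-based depth-first walk that tracks which headers are currently in scope and what body text belongs to them.

```python
from collections import deque
import xml.etree.ElementTree as ET


def split_chunks(html, header_map):
    """Explicit-stack DOM walk; header_map like {"h1": "Header 1", "h2": "Header 2"}.

    Assumes well-formed markup with a single root element (an illustration-only
    simplification; bs4 would handle real-world HTML).
    """
    root = ET.fromstring(html)
    chunks, meta, buf = [], {}, []

    def flush():
        text = " ".join(" ".join(buf).split())
        buf.clear()
        if text:
            chunks.append((text, dict(meta)))

    stack = deque([root])
    while stack:
        item = stack.pop()
        if isinstance(item, str):            # tail text queued by a parent below
            buf.append(item)
            continue
        if item.tag in header_map:
            flush()                          # a new header closes the open chunk
            level = int(item.tag[1])
            for tag, key in header_map.items():
                if int(tag[1]) >= level:     # leaving same-or-deeper sections
                    meta.pop(key, None)
            meta[header_map[item.tag]] = "".join(item.itertext()).strip()
            continue
        if item.text:
            buf.append(item.text)
        todo = []
        for child in item:                   # preserve document order:
            todo.append(child)               # child subtree first,
            if child.tail:
                todo.append(child.tail)      # then the text after its close tag
        stack.extend(reversed(todo))         # reversed so the first child pops first
    flush()
    return chunks
```

The single pass keeps one mutable `meta` dict as the "DOM path" of active headers and one `buf` of pending text, which is what makes the generator-style refactor straightforward.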
Sure, I will try to enhance it even further and create a new PR. The new PR will also delete the file you suggested removing.
Thank you very much for your invaluable review of my code. I definitely learnt a lot, and I am eager to contribute more to LangChain!
(#29340) This pull request removes the now-unused html_chunks_with_headers.xslt file from the codebase. In a previous update ([PR #27678](#27678)), the HTMLHeaderTextSplitter class was refactored to utilize BeautifulSoup instead of lxml and XSLT for HTML processing. As a result, the html_chunks_with_headers.xslt file is no longer necessary and can be safely deleted to maintain code cleanliness and reduce potential confusion. Issue: N/A Dependencies: N/A
This pull request updates `HTMLHeaderTextSplitter` by replacing the `split_text_from_file` method's implementation. The original method used `lxml` and XSLT for processing HTML files, which caused `lxml.etree.XSLTApplyError: maxHead` when handling large HTML documents, due to limitations in the XSLT processor. Fixes #13149

By switching to BeautifulSoup (`bs4`), we achieve:

- Removal of the dependency on `lxml` and XSLT.
- No need for `lxml` and external XSLT files, relying instead on the widely used `beautifulsoup4` library.

Issue:
This change addresses issues related to processing large HTML files with the existing `HTMLHeaderTextSplitter` implementation. It resolves problems where users encounter `lxml.etree.XSLTApplyError: maxHead` due to large HTML documents.

Dependencies:

- BeautifulSoup (`beautifulsoup4`): the `beautifulsoup4` library is now used for parsing HTML content (`pip install beautifulsoup4`).

Code Changes:
Updated the `split_text_from_file` method in `HTMLHeaderTextSplitter` as follows:
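The code itself is cut off in this extract. As a rough, hedged sketch of the BeautifulSoup-based approach -- not the merged implementation, and with an invented name `split_html_by_headers` returning plain dicts instead of LangChain `Document` objects -- the idea looks like this:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def split_html_by_headers(html, headers_to_split_on):
    """Illustrative sketch of bs4-based header splitting.

    headers_to_split_on: e.g. [("h1", "Header 1"), ("h2", "Header 2")]
    """
    header_map = dict(headers_to_split_on)           # {"h1": "Header 1", ...}
    soup = BeautifulSoup(html, "html.parser")
    docs, meta, buf = [], {}, []

    def flush():
        text = " ".join(t for t in buf if t.strip()).strip()
        buf.clear()
        if text:
            docs.append({"page_content": text, "metadata": dict(meta)})

    for node in soup.descendants:                    # single pass, no XSLT
        if isinstance(node, Tag) and node.name in header_map:
            flush()                                  # a new header closes the open chunk
            level = int(node.name[1])
            for tag, key in header_map.items():
                if int(tag[1]) >= level:             # leaving same-or-deeper sections
                    meta.pop(key, None)
            meta[header_map[node.name]] = node.get_text(strip=True)
        elif isinstance(node, Comment):
            continue                                 # skip HTML comments
        elif isinstance(node, NavigableString):
            parent = node.parent
            if parent is None or parent.name in header_map:
                continue                             # header text lives in metadata
            if parent.name in ("script", "style"):
                continue                             # not document content
            buf.append(str(node).strip())
    flush()
    return docs
```

Because this is a single streaming pass over `soup.descendants` rather than an XSLT transform, it sidesteps the processor limits that triggered the `maxHead` error on large files.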