[Bug]: SemanticSplitterNodeParser produces chunks that are too large for embedding model #12270
Comments
Hey @benbenz, good to see you back with another intriguing puzzle for us to solve! Hope you've been doing well. Thank you for bringing this issue to our attention. It seems like you're encountering a problem where the SemanticSplitterNodeParser produces chunks that exceed the maximum input size of your embedding model. Firstly, you can implement a mechanism to check the size of the chunks against the maximum input size of your embedding model before they are processed. If a chunk exceeds this size, you can further split it into smaller parts. However, it's crucial to maintain the character indexes of each chunk relative to the original content. Here's an approach to achieve this:
Here's a code snippet illustrating how you might implement these steps in the context of your SemanticSplitterNodeParser:

def ensure_chunk_size(self, sentences, max_size):
"""
Ensure that each chunk is within the maximum size limit.
If a chunk exceeds the limit, it is further split.
This function returns a list of chunks that are all within the size limit.
"""
processed_chunks = []
for sentence in sentences:
combined_sentence = sentence["combined_sentence"]
if len(combined_sentence) > max_size:
# Further split the combined_sentence into smaller parts
parts = self.split_to_fit(combined_sentence, max_size)
processed_chunks.extend(parts)
else:
processed_chunks.append(sentence)
return processed_chunks
def split_to_fit(self, text, max_size):
"""
Split the text into parts where each part is within the max_size limit.
This is a simplistic approach and can be replaced with a more sophisticated method.
"""
parts = []
current_part = ""
for sentence in text.split('. '):
if len(current_part) + len(sentence) + 1 > max_size:
parts.append({"sentence": current_part, "combined_sentence": current_part})
current_part = sentence
else:
current_part += (". " + sentence).strip()
if current_part:
parts.append({"sentence": current_part, "combined_sentence": current_part})
return parts
# Modify the test_split_and_permutated function, or wherever the splitting logic is applied,
# to include a call to ensure_chunk_size, passing the maximum size your embedding model can handle.
# For example:
max_embedding_size = 512  # Assuming the embedding model can handle up to 512 characters
sentences = node_parser._build_sentence_groups(text_splits)
sentences_within_limit = node_parser.ensure_chunk_size(sentences, max_embedding_size)

This approach ensures that each chunk processed by the embedding model stays within its maximum input size. You can find more information about this in the LlamaIndex codebase. I hope this helps! If you have any further questions, please don't hesitate to ask.
The only alternative is chaining with a second splitter. I'm not really sure what else you could do. Not sure what you meant by "but it caused the character indexes to be lost" -- if the chunk is too big, indeed it will be split, and might lose some context.
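For anyone wanting a concrete shape for that chaining, here is a minimal sketch, not an official recipe: it assumes llama-index 0.10.x, a configured Settings.embed_model, an existing documents list, and placeholder chunk sizes:

from llama_index.core import Settings
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SemanticSplitterNodeParser, SentenceSplitter

# First pass splits on semantic boundaries; second pass re-splits any node that is
# still larger than the SentenceSplitter's chunk_size, so oversized chunks never
# reach the embedding model.
semantic_splitter = SemanticSplitterNodeParser(embed_model=Settings.embed_model)
safety_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)  # placeholder sizes

pipeline = IngestionPipeline(transformations=[semantic_splitter, safety_splitter])
nodes = pipeline.run(documents=documents)  # `documents` is assumed to be defined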
I had tried to chain it with the SentenceSplitterNodeParser and it worked as far as limiting chunk size goes.
The issue was that the character offset (start_idx) is computed in relation to each node coming out of the semantic parser, not the original document. So we basically lose the character offset in relation to each Document, sadly.
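If someone needs offsets relative to the original Document after a second split, one best-effort workaround (a rough sketch, not a LlamaIndex feature; it assumes the node text still occurs verbatim in the source document, which breaks if the splitter normalizes whitespace or the same text repeats) is to re-anchor the offsets by searching the document text:

def restore_char_offsets(documents, nodes):
    """Best-effort re-anchoring of node offsets against the original documents."""
    doc_text_by_id = {doc.doc_id: doc.text for doc in documents}
    for node in nodes:
        source_text = doc_text_by_id.get(node.ref_doc_id)
        if source_text is None:
            continue
        start = source_text.find(node.text)
        if start != -1:
            node.start_char_idx = start
            node.end_char_idx = start + len(node.text)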
I ended up implementing a safety net with this class:

from typing import Sequence, Any, List
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
import logging
from llama_index.core.schema import BaseNode, Document, ObjectType, TextNode
from llama_index.core.constants import DEFAULT_CHUNK_SIZE
from llama_index.core.node_parser.text.sentence import SENTENCE_CHUNK_OVERLAP


class SafeSemanticSplitter(SemanticSplitterNodeParser):
    safety_chunker: SentenceSplitter = SentenceSplitter(
        chunk_size=DEFAULT_CHUNK_SIZE * 4, chunk_overlap=SENTENCE_CHUNK_OVERLAP
    )

    def _parse_nodes(
        self,
        nodes: Sequence[BaseNode],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> List[BaseNode]:
        all_nodes: List[BaseNode] = super()._parse_nodes(
            nodes=nodes, show_progress=show_progress, **kwargs
        )
        # If any semantic chunk exceeds the safety chunker's limit,
        # fall back to static chunking for the whole batch.
        all_good = True
        for node in all_nodes:
            if node.get_type() == ObjectType.TEXT:
                node: TextNode = node
                if self.safety_chunker._token_size(node.text) > self.safety_chunker.chunk_size:
                    logging.info(
                        "Chunk size too big after semantic chunking: switching to static chunking"
                    )
                    all_good = False
                    break
        if not all_good:
            all_nodes = self.safety_chunker._parse_nodes(
                nodes, show_progress=show_progress, **kwargs
            )
        return all_nodes
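For context, usage would presumably look something like the sketch below; the buffer_size and breakpoint_percentile_threshold values are placeholders, and it assumes Settings.embed_model is configured and a documents list exists:

from llama_index.core import Settings

splitter = SafeSemanticSplitter(
    buffer_size=1,                        # placeholder value
    breakpoint_percentile_threshold=95,   # placeholder value
    embed_model=Settings.embed_model,
)
nodes = splitter.get_nodes_from_documents(documents)  # `documents` is assumed to be defined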
This worked for me:

from llama_index.core import Settings  # added import; Settings.embed_model is used below
from llama_index.core.node_parser import SemanticSplitterNodeParser, SentenceSplitter
from openai import BadRequestError

unsafe_splitter = SemanticSplitterNodeParser(
    buffer_size=2,
    breakpoint_percentile_threshold=75,
    embed_model=Settings.embed_model,
    show_progress=True,
    include_metadata=True,
)
safe_splitter = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=32,
    include_metadata=True,
)

all_nodes = []
documents_count = len(documents)
for i, document in enumerate(documents):
    print(f"Processing document {i} of {documents_count}.")
    nodes = []
    try:
        nodes = unsafe_splitter.get_nodes_from_documents([document])
    except BadRequestError:
        print("Parsing error: openai bad request. Parse by safe splitter.")
        nodes = safe_splitter.get_nodes_from_documents([document])
    all_nodes.extend(nodes)
Bug Description
SemanticSplitterNodeParser is producing nodes with chunks that are too large for the embedding model. It would be nice if it had a maximum length safeguard to avoid that. This occurs especially on text data that is structured like a large indexing file listing file names and their metadata, for example (this is how I triggered the error). I tried to chain it with a SentenceSplitterNodeParser as a safety layer, but that caused the character indexes to be lost in relation to the original content.
Maybe I am not using it correctly, or there is a node processor that could come to the rescue?
Version
0.10.19
Steps to Reproduce
Relevant Logs/Tracebacks
No response