
[Bug]: SemanticSplitterNodeParser produces chunks that are too large for embedding model #12270

Closed
benbenz opened this issue Mar 26, 2024 · 5 comments
Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments

benbenz commented Mar 26, 2024

Bug Description

SemanticSplitterNodeParser is producing nodes whose chunks are too large for the embedding model. It would be nice if it had a safety maximum length to avoid that. This occurs especially on text data that is structured like a large index file listing file names and their metadata, for example (this is how I triggered the error). I tried to chain it with a SentenceSplitterNodeParser as a safety layer, but that caused the character indexes to be lost relative to the original content.
Maybe I am not using it correctly, or is there a node processor that could come to the rescue?

Version

0.10.19

Steps to Reproduce

  • add a large index file in the format
File: /local/pathtofile1
Creation date: 2024-01-15
Modified date: 2024-02-01 
Modified date time: 2024-02-01 14:03:60.000
File size: 230
File type: doc

File: /local/pathtofile2
Creation date: 2024-01-15
Modified date: 2024-02-01 
Modified date time: 2024-02-01 14:03:60.000
File size: 230
File type: doc

....
  • use the SemanticSplitterNodeParser for indexing (cf. the semantic node parser documentation page); a minimal sketch is shown below
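
A minimal sketch of that indexing step, following the semantic node parser documentation page (the file name and the OpenAI embedding model here are only illustrative assumptions):

# Minimal reproduction sketch; file name and embedding model are illustrative.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

documents = SimpleDirectoryReader(input_files=["large_index_file.txt"]).load_data()
splitter = SemanticSplitterNodeParser.from_defaults(embed_model=OpenAIEmbedding())
# Fails when a semantic chunk exceeds the embedding model's maximum input length
nodes = splitter.get_nodes_from_documents(documents)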

Relevant Logs/Tracebacks

No response

benbenz added the "bug" (Something isn't working) and "triage" (Issue needs to be triaged/prioritized) labels on Mar 26, 2024

dosubot bot commented Mar 26, 2024

Hey @benbenz, good to see you back with another intriguing puzzle for us to solve! Hope you've been doing well.

Thank you for bringing this issue to our attention. It seems like you're encountering a problem where the SemanticSplitterNodeParser is producing chunks that are too large for the embedding model, especially when processing large indexing files. This is a known issue and there are a few ways to handle it.

Firstly, you can implement a mechanism to check the size of the chunks against the maximum input size of your embedding model before they are processed. If a chunk exceeds this size, you can further split the chunk into smaller parts. However, it's crucial to maintain the character indexes of each chunk relative to the original content.

Here's an approach to achieve this:

  1. Determine the maximum input size (in terms of characters or tokens) that your embedding model can handle.
  2. Modify the SemanticSplitterNodeParser to include a check for the size of the chunks against this maximum size before processing them.
  3. If a chunk exceeds the maximum size, split it into smaller parts, ensuring that each part is within the size limit. Implement a method to split the chunks efficiently while preserving their semantic integrity as much as possible.
  4. Keep track of the character indexes for each chunk in relation to the original document. This can be done by maintaining a running total of characters processed and adjusting the indexes of subsequent chunks accordingly.

Here's a code snippet illustrating how you might implement these steps in the context of your SemanticSplitterNodeParser:

def ensure_chunk_size(self, sentences, max_size):
    """
    Ensure that each chunk is within the maximum size limit.
    If a chunk exceeds the limit, it is further split.
    Returns a list of chunks that are all within the size limit.
    """
    processed_chunks = []
    for sentence in sentences:
        combined_sentence = sentence["combined_sentence"]
        if len(combined_sentence) > max_size:
            # Further split the combined_sentence into smaller parts
            parts = self.split_to_fit(combined_sentence, max_size)
            processed_chunks.extend(parts)
        else:
            processed_chunks.append(sentence)
    return processed_chunks

def split_to_fit(self, text, max_size):
    """
    Split the text into parts where each part is within the max_size limit.
    This is a simplistic approach and can be replaced with a more sophisticated method.
    """
    parts = []
    current_part = ""
    for sentence in text.split('. '):
        if current_part and len(current_part) + len(sentence) + 2 > max_size:
            # Flush the current part and start a new one with this sentence
            parts.append({"sentence": current_part, "combined_sentence": current_part})
            current_part = sentence
        else:
            # Append the sentence, re-inserting the ". " separator removed by split()
            current_part = sentence if not current_part else current_part + ". " + sentence
    if current_part:
        parts.append({"sentence": current_part, "combined_sentence": current_part})
    return parts

# Modify the test_split_and_permutated function or wherever the splitting logic is applied
# to include a call to ensure_chunk_size, passing the maximum size your embedding model can handle.
# For example:
max_embedding_size = 512  # Assuming the embedding model can handle up to 512 characters
sentences = node_parser._build_sentence_groups(text_splits)
sentences_within_limit = node_parser.ensure_chunk_size(sentences, max_embedding_size)

This approach ensures that each chunk processed by the SemanticSplitterNodeParser is within the size limit of the embedding model. It can also be extended to maintain character indexes by keeping a running offset into the original text as chunks are split (see the sketch below). Remember, the split_to_fit method provided here is quite basic and may need to be refined to better preserve the semantic integrity of the text.
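
As an illustration of step 4 above, here is one hedged sketch (not existing LlamaIndex code; attach_char_offsets is a hypothetical helper) of how a running offset into the original document could be maintained for the parts produced by split_to_fit:

# Hypothetical helper: attach character offsets, relative to the original
# document text, to each split part. Assumes the parts appear verbatim and
# in order in the original text; otherwise it falls back to a running cursor.
def attach_char_offsets(original_text, parts):
    cursor = 0
    for part in parts:
        chunk = part["combined_sentence"]
        start = original_text.find(chunk, cursor)
        if start == -1:
            # Splitting may have altered whitespace or punctuation
            start = cursor
        part["start_char_idx"] = start
        part["end_char_idx"] = start + len(chunk)
        cursor = part["end_char_idx"]
    return parts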

You can find more information about this in the LlamaIndex codebase.

I hope this helps! If you have any further questions, please don't hesitate to ask.


logan-markewich (Collaborator) commented:

The only alternative is chaining with a second splitter. I'm not really sure what else you could do

Not sure what you meant by "but it caused the character indexes to be lost" -- if the chunk is too big, indeed it will be split, and it might lose some context.
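
As a rough sketch of that chaining (this is an assumption about how one might wire it up, not a confirmed fix; the OpenAI embedding model and chunk sizes are illustrative), an IngestionPipeline can run the semantic splitter first and then hard-cap oversized chunks with a SentenceSplitter:

# Sketch: chain the semantic splitter with a sentence splitter as a size cap.
# Embedding model and chunk sizes are illustrative assumptions.
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SemanticSplitterNodeParser, SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser.from_defaults(embed_model=OpenAIEmbedding()),
        SentenceSplitter(chunk_size=512, chunk_overlap=50),  # second-pass size cap
    ]
)
nodes = pipeline.run(documents=[Document(text="File: /local/pathtofile1 ...")])

The second pass re-splits any oversized semantic chunk, which is where the character-offset concern discussed below comes from.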

benbenz (Author) commented Mar 26, 2024

The only alternative is chaining with a second splitter. I'm not really sure what else you could do

I had tried chaining it with the SentenceSplitterNodeParser, and it worked as far as limiting chunk size goes

Not sure what you meant by "but it caused the character indexes to be lost" -- if the chunk is too big, indeed it will be split, and it might lose some context.

The issue was that the character offset (start_idx) is computed relative to each node coming out of the semantic parser, not the original document. So we basically lose the character offset relative to each Document, sadly
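
A workaround sketch for that (hypothetical, not a built-in postprocessor; reanchor_offsets is an invented helper that assumes each node's text appears verbatim in its source document) would be to recompute the offsets against the original documents after the second split:

# Hypothetical post-processing sketch: re-anchor start/end character offsets
# of the final nodes against their source documents.
def reanchor_offsets(documents, nodes):
    docs_by_id = {doc.doc_id: doc for doc in documents}
    for node in nodes:
        doc = docs_by_id.get(node.ref_doc_id)
        if doc is None:
            continue
        start = doc.text.find(node.text)
        if start != -1:
            node.start_char_idx = start
            node.end_char_idx = start + len(node.text)
    return nodes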

benbenz (Author) commented Mar 27, 2024

I ended up implementing a safety net with this class:

from typing import Any, List, Sequence
import logging

from llama_index.core.constants import DEFAULT_CHUNK_SIZE
from llama_index.core.node_parser import (
    SemanticSplitterNodeParser,
    SentenceSplitter,
)
from llama_index.core.node_parser.text.sentence import SENTENCE_CHUNK_OVERLAP
from llama_index.core.schema import BaseNode, ObjectType, TextNode


class SafeSemanticSplitter(SemanticSplitterNodeParser):
    """Semantic splitter that falls back to static chunking when a chunk is too big."""

    safety_chunker: SentenceSplitter = SentenceSplitter(
        chunk_size=DEFAULT_CHUNK_SIZE * 4, chunk_overlap=SENTENCE_CHUNK_OVERLAP
    )

    def _parse_nodes(
        self,
        nodes: Sequence[BaseNode],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> List[BaseNode]:
        all_nodes: List[BaseNode] = super()._parse_nodes(
            nodes=nodes, show_progress=show_progress, **kwargs
        )
        # Check whether any semantic chunk exceeds the safety token budget
        all_good = True
        for node in all_nodes:
            if node.get_type() == ObjectType.TEXT:
                node: TextNode = node
                if self.safety_chunker._token_size(node.text) > self.safety_chunker.chunk_size:
                    logging.info(
                        "Chunk size too big after semantic chunking: switching to static chunking"
                    )
                    all_good = False
                    break
        # Fall back to plain sentence splitting for the whole batch
        if not all_good:
            all_nodes = self.safety_chunker._parse_nodes(
                nodes, show_progress=show_progress, **kwargs
            )
        return all_nodes
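
For completeness, a usage sketch for the class above (the embedding model choice and the input document are illustrative assumptions):

# Usage sketch; embedding model and document are illustrative.
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SafeSemanticSplitter.from_defaults(embed_model=OpenAIEmbedding())
nodes = splitter.get_nodes_from_documents(
    [Document(text="File: /local/pathtofile1 ...")]
)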

dosubot added the "stale" label (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) on Jun 26, 2024
dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 3, 2024
dosubot removed the "stale" label on Jul 3, 2024
arxor commented Aug 21, 2024

This worked for me:

from llama_index.core import Settings
from llama_index.core.node_parser import SemanticSplitterNodeParser, SentenceSplitter
from openai import BadRequestError

unsafe_splitter = SemanticSplitterNodeParser(
    buffer_size=2,
    breakpoint_percentile_threshold=75,
    embed_model=Settings.embed_model,
    show_progress=True,
    include_metadata=True,
)

safe_splitter = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=32,
    include_metadata=True,
)

all_nodes = []

# `documents` is assumed to be a list of Document objects loaded earlier
documents_count = len(documents)

for i, document in enumerate(documents):
    print(f"Processing document {i} of {documents_count}.")
    nodes = []
    try:
        # Try semantic splitting first
        nodes = unsafe_splitter.get_nodes_from_documents([document])
    except BadRequestError:
        # Chunk exceeded the embedding model's input limit; fall back to the sentence splitter
        print("Parsing error: openai bad request. Parse by safe splitter.")
        nodes = safe_splitter.get_nodes_from_documents([document])

    all_nodes.extend(nodes)
