
[Bug]: SemanticSplitterNodeParser produces chunks that are too large for embedding model #12270

Closed
benbenz opened this issue Mar 26, 2024 · 5 comments
Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments

benbenz commented Mar 26, 2024

Bug Description

SemanticSplitterNodeParser is producing nodes whose chunks are too large for the embedding model. It would be nice if it had a safety maximum length to avoid that. This occurs especially on text data that is structured like a large index file listing file names and their metadata, for example (this is how I triggered the error). I tried to chain it with a SentenceSplitterNodeParser as a safety layer, but that caused the character indexes to be lost relative to the original content.
Maybe I am not using it correctly, or is there a node processor that could come to the rescue?

Version

0.10.19

Steps to Reproduce

  • add a large index file in the format
File: /local/pathtofile1
Creation date: 2024-01-15
Modified date: 2024-02-01 
Modified date time: 2024-02-01 14:03:60.000
File size: 230
File type: doc

File: /local/pathtofile2
Creation date: 2024-01-15
Modified date: 2024-02-01 
Modified date time: 2024-02-01 14:03:60.000
File size: 230
File type: doc

....
  • use the SemanticSplitterNodeParser for indexing (cf. the semantic node parser documentation page); a minimal sketch is shown below
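
A minimal sketch of that indexing step, following the semantic node parser documentation page (the file name and the OpenAI embedding model here are only illustrative assumptions):

# Minimal reproduction sketch; file name and embedding model are illustrative.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

documents = SimpleDirectoryReader(input_files=["large_index_file.txt"]).load_data()
splitter = SemanticSplitterNodeParser.from_defaults(embed_model=OpenAIEmbedding())
# Fails when a semantic chunk exceeds the embedding model's maximum input length
nodes = splitter.get_nodes_from_documents(documents)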

Relevant Logs/Tracebacks

No response

benbenz added the "bug" (Something isn't working) and "triage" (Issue needs to be triaged/prioritized) labels on Mar 26, 2024

dosubot bot commented Mar 26, 2024

Hey @benbenz, good to see you back with another intriguing puzzle for us to solve! Hope you've been doing well.

Thank you for bringing this issue to our attention. It seems like you're encountering a problem where the SemanticSplitterNodeParser is producing chunks that are too large for the embedding model, especially when processing large indexing files. This is a known issue and there are a few ways to handle it.

Firstly, you can implement a mechanism to check the size of the chunks against the maximum input size of your embedding model before they are processed. If a chunk exceeds this size, you can further split the chunk into smaller parts. However, it's crucial to maintain the character indexes of each chunk relative to the original content.

Here's an approach to achieve this:

  1. Determine the maximum input size (in terms of characters or tokens) that your embedding model can handle.
  2. Modify the SemanticSplitterNodeParser to include a check for the size of the chunks against this maximum size before processing them.
  3. If a chunk exceeds the maximum size, split it into smaller parts, ensuring that each part is within the size limit. Implement a method to split the chunks efficiently while preserving their semantic integrity as much as possible.
  4. Keep track of the character indexes for each chunk in relation to the original document. This can be done by maintaining a running total of characters processed and adjusting the indexes of subsequent chunks accordingly.

Here's a code snippet illustrating how you might implement these steps in the context of your SemanticSplitterNodeParser:

def ensure_chunk_size(self, sentences, max_size):
    """
    Ensure that each chunk is within the maximum size limit.
    If a chunk exceeds the limit, it is further split.
    Returns a list of chunks that are all within the size limit.
    """
    processed_chunks = []
    for sentence in sentences:
        combined_sentence = sentence["combined_sentence"]
        if len(combined_sentence) > max_size:
            # Further split the combined_sentence into smaller parts
            parts = self.split_to_fit(combined_sentence, max_size)
            processed_chunks.extend(parts)
        else:
            processed_chunks.append(sentence)
    return processed_chunks

def split_to_fit(self, text, max_size):
    """
    Split the text into parts where each part is within the max_size limit.
    This is a simplistic approach and can be replaced with a more sophisticated method.
    """
    parts = []
    current_part = ""
    for sentence in text.split('. '):
        if current_part and len(current_part) + len(sentence) + 2 > max_size:
            # Flush the current part and start a new one with this sentence
            parts.append({"sentence": current_part, "combined_sentence": current_part})
            current_part = sentence
        else:
            # Append the sentence, re-inserting the ". " separator removed by split()
            current_part = sentence if not current_part else current_part + ". " + sentence
    if current_part:
        parts.append({"sentence": current_part, "combined_sentence": current_part})
    return parts

# Modify the test_split_and_permutated function or wherever the splitting logic is applied
# to include a call to ensure_chunk_size, passing the maximum size your embedding model can handle.
# For example:
max_embedding_size = 512  # Assuming the embedding model can handle up to 512 characters
sentences = node_parser._build_sentence_groups(text_splits)
sentences_within_limit = node_parser.ensure_chunk_size(sentences, max_embedding_size)

This approach ensures that each chunk processed by the SemanticSplitterNodeParser is within the size limit of the embedding model. It can also be extended to maintain character indexes by keeping a running offset into the original text as chunks are split (see the sketch below). Remember, the split_to_fit method provided here is quite basic and may need to be refined to better preserve the semantic integrity of the text.
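
As an illustration of step 4 above, here is one hedged sketch (not existing LlamaIndex code; attach_char_offsets is a hypothetical helper) of how a running offset into the original document could be maintained for the parts produced by split_to_fit:

# Hypothetical helper: attach character offsets, relative to the original
# document text, to each split part. Assumes the parts appear verbatim and
# in order in the original text; otherwise it falls back to a running cursor.
def attach_char_offsets(original_text, parts):
    cursor = 0
    for part in parts:
        chunk = part["combined_sentence"]
        start = original_text.find(chunk, cursor)
        if start == -1:
            # Splitting may have altered whitespace or punctuation
            start = cursor
        part["start_char_idx"] = start
        part["end_char_idx"] = start + len(chunk)
        cursor = part["end_char_idx"]
    return parts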

You can find more information about this in the LlamaIndex codebase.

I hope this helps! If you have any further questions, please don't hesitate to ask.


logan-markewich (Collaborator) commented:

The only alternative is chaining with a second splitter. I'm not really sure what else you could do

Not sure what you meant by "but it caused the character indexes to be lost" -- if the chunk is too big, indeed it will be split, and it might lose some context.
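
As a rough sketch of that chaining (this is an assumption about how one might wire it up, not a confirmed fix; the OpenAI embedding model and chunk sizes are illustrative), an IngestionPipeline can run the semantic splitter first and then hard-cap oversized chunks with a SentenceSplitter:

# Sketch: chain the semantic splitter with a sentence splitter as a size cap.
# Embedding model and chunk sizes are illustrative assumptions.
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SemanticSplitterNodeParser, SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser.from_defaults(embed_model=OpenAIEmbedding()),
        SentenceSplitter(chunk_size=512, chunk_overlap=50),  # second-pass size cap
    ]
)
nodes = pipeline.run(documents=[Document(text="File: /local/pathtofile1 ...")])

The second pass re-splits any oversized semantic chunk, which is where the character-offset concern discussed below comes from.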

benbenz (Author) commented Mar 26, 2024

The only alternative is chaining with a second splitter. I'm not really sure what else you could do

I had tried chaining it with the SentenceSplitterNodeParser, and it worked as far as limiting chunk size goes

Not sure what you meant by "but it caused the character indexes to be lost" -- if the chunk is too big, indeed it will be split, and it might lose some context.

The issue was that the character offset (start_idx) is computed relative to each node coming out of the semantic parser, not the original document. So we basically lose the character offset relative to each Document, sadly
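
A workaround sketch for that (hypothetical, not a built-in postprocessor; reanchor_offsets is an invented helper that assumes each node's text appears verbatim in its source document) would be to recompute the offsets against the original documents after the second split:

# Hypothetical post-processing sketch: re-anchor start/end character offsets
# of the final nodes against their source documents.
def reanchor_offsets(documents, nodes):
    docs_by_id = {doc.doc_id: doc for doc in documents}
    for node in nodes:
        doc = docs_by_id.get(node.ref_doc_id)
        if doc is None:
            continue
        start = doc.text.find(node.text)
        if start != -1:
            node.start_char_idx = start
            node.end_char_idx = start + len(node.text)
    return nodes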

benbenz (Author) commented Mar 27, 2024

I ended up implementing a safety net with this class:

from typing import Any, List, Sequence
import logging

from llama_index.core.constants import DEFAULT_CHUNK_SIZE
from llama_index.core.node_parser import (
    SemanticSplitterNodeParser,
    SentenceSplitter,
)
from llama_index.core.node_parser.text.sentence import SENTENCE_CHUNK_OVERLAP
from llama_index.core.schema import BaseNode, ObjectType, TextNode


class SafeSemanticSplitter(SemanticSplitterNodeParser):
    """Semantic splitter that falls back to static chunking when a chunk is too big."""

    safety_chunker: SentenceSplitter = SentenceSplitter(
        chunk_size=DEFAULT_CHUNK_SIZE * 4, chunk_overlap=SENTENCE_CHUNK_OVERLAP
    )

    def _parse_nodes(
        self,
        nodes: Sequence[BaseNode],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> List[BaseNode]:
        all_nodes: List[BaseNode] = super()._parse_nodes(
            nodes=nodes, show_progress=show_progress, **kwargs
        )
        # Check whether any semantic chunk exceeds the safety token budget
        all_good = True
        for node in all_nodes:
            if node.get_type() == ObjectType.TEXT:
                node: TextNode = node
                if self.safety_chunker._token_size(node.text) > self.safety_chunker.chunk_size:
                    logging.info(
                        "Chunk size too big after semantic chunking: switching to static chunking"
                    )
                    all_good = False
                    break
        # Fall back to plain sentence splitting for the whole batch
        if not all_good:
            all_nodes = self.safety_chunker._parse_nodes(
                nodes, show_progress=show_progress, **kwargs
            )
        return all_nodes
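
For completeness, a usage sketch for the class above (the embedding model choice and the input document are illustrative assumptions):

# Usage sketch; embedding model and document are illustrative.
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SafeSemanticSplitter.from_defaults(embed_model=OpenAIEmbedding())
nodes = splitter.get_nodes_from_documents(
    [Document(text="File: /local/pathtofile1 ...")]
)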

dosubot added the "stale" label (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) on Jun 26, 2024
dosubot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 3, 2024
dosubot removed the "stale" label on Jul 3, 2024
arxor commented Aug 21, 2024

This worked for me:

from llama_index.core import Settings
from llama_index.core.node_parser import SemanticSplitterNodeParser, SentenceSplitter
from openai import BadRequestError

unsafe_splitter = SemanticSplitterNodeParser(
    buffer_size=2,
    breakpoint_percentile_threshold=75,
    embed_model=Settings.embed_model,
    show_progress=True,
    include_metadata=True,
)

safe_splitter = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=32,
    include_metadata=True,
)

all_nodes = []

# `documents` is assumed to be a list of Document objects loaded earlier
documents_count = len(documents)

for i, document in enumerate(documents):
    print(f"Processing document {i} of {documents_count}.")
    nodes = []
    try:
        # Try semantic splitting first
        nodes = unsafe_splitter.get_nodes_from_documents([document])
    except BadRequestError:
        # Chunk exceeded the embedding model's input limit; fall back to the sentence splitter
        print("Parsing error: openai bad request. Parse by safe splitter.")
        nodes = safe_splitter.get_nodes_from_documents([document])

    all_nodes.extend(nodes)
