Semantic Chunking Chunk Size Bug #11

Open
seankim658 opened this issue Jul 8, 2024 · 3 comments

@seankim658
Member

seankim658 commented Jul 8, 2024

LlamaIndex's SemanticSplitterNodeParser can sometimes produce chunks that are too large for the embedding model. Unfortunately, semantic chunking has no max-length option to prevent this.

I will eventually have to subclass SemanticSplitterNodeParser and add a two-level safety net that naively splits oversized chunks into sub-chunks so they stay under the embedding model's input token limit.
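
A minimal sketch of what such a two-level fallback might look like, written as plain Python so it does not depend on any LlamaIndex internals. The names `count_tokens`, `_pack`, and `split_oversized` are hypothetical, and the whitespace-based counter is only a stand-in for the embedding model's real tokenizer. Level one repacks whole sentences; level two falls back to packing individual words when a single sentence is still too long.

```python
import re
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def _pack(units: List[str], max_tokens: int, counter: Callable[[str], int]) -> List[str]:
    """Greedily join units with spaces while the joined piece stays within max_tokens.

    A unit that is too large on its own ends up alone in its own piece.
    """
    pieces: List[str] = []
    current = ""
    for unit in units:
        candidate = f"{current} {unit}".strip()
        if counter(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                pieces.append(current)
            current = unit
    if current:
        pieces.append(current)
    return pieces


def split_oversized(
    text: str,
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Return `text` unchanged if it fits, otherwise as smaller sub-chunks."""
    if counter(text) <= max_tokens:
        return [text]

    # Level 1: split on sentence boundaries and repack whole sentences.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result: List[str] = []
    for piece in _pack(sentences, max_tokens, counter):
        if counter(piece) <= max_tokens:
            result.append(piece)
        else:
            # Level 2: one sentence alone is too long, so pack it word by word.
            # A single word longer than max_tokens would still slip through.
            result.extend(_pack(piece.split(), max_tokens, counter))
    return result


if __name__ == "__main__":
    sample = "One two three. Four five six seven eight nine ten. Eleven."
    for part in split_oversized(sample, max_tokens=4):
        print(count_tokens(part), repr(part))
```

In practice this would presumably be applied to the text of each node the semantic splitter returns, with `counter` swapped for a tokenizer that matches the embedding model.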

Reference:
run-llama/llama_index#12270

@seankim658 seankim658 added the bug Something isn't working label Jul 8, 2024
@seankim658 seankim658 self-assigned this Jul 8, 2024
@a-gorczew

I'm observing the same issue, and not just sometimes but for every library I try to parse with it. Unless this is fixed, this node parser seems useless. The error I'm observing:

\venv\lib\site-packages\openai\_base_client.py", line 993, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 8193 tokens (8193 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
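
The request above failed at 8193 tokens against a limit of 8192, so the chunk was a single token over. A rough sketch of a pre-flight check that flags such chunks before they reach the embedding call follows. `find_oversized` and `count_tokens` are hypothetical names, and the whitespace counter is only a placeholder for the model's real tokenizer, so the counts will differ from what the API reports.

```python
from typing import Callable, List, Tuple

MAX_INPUT_TOKENS = 8192  # limit quoted in the error message above


def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def find_oversized(
    chunks: List[str],
    limit: int = MAX_INPUT_TOKENS,
    counter: Callable[[str], int] = count_tokens,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every chunk whose count exceeds `limit`."""
    oversized: List[Tuple[int, int]] = []
    for index, chunk in enumerate(chunks):
        n_tokens = counter(chunk)
        if n_tokens > limit:
            oversized.append((index, n_tokens))
    return oversized


if __name__ == "__main__":
    chunks = [" ".join(["x"] * 8192), " ".join(["x"] * 8193)]
    # Only the second chunk is over the limit, by exactly one token.
    print(find_oversized(chunks))
```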

@seankim658
Member Author

@a-gorczew yeah, I haven't played around with it much since initially running into the chunk size issue. I think I tried a few different breakpoint_percentile_threshold values but not much else, as it's been low priority.
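
For context on why tuning breakpoint_percentile_threshold alone does not cap chunk size: as far as I understand it, the splitter breaks wherever the embedding distance between adjacent sentence groups exceeds the given percentile of all distances. The sketch below assumes that behavior, uses made-up distances, and the helper names `percentile` and `breakpoints` are hypothetical. A lower threshold gives more breaks, but the span between two breaks is still unbounded.

```python
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between ranks, q in the range 0..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def breakpoints(distances: List[float], threshold_percentile: float) -> List[int]:
    """Indices whose distance is strictly above the chosen percentile cutoff."""
    cutoff = percentile(distances, threshold_percentile)
    return [i for i, d in enumerate(distances) if d > cutoff]


if __name__ == "__main__":
    # Made-up distances between consecutive sentence groups.
    distances = [0.1, 0.2, 0.15, 0.9, 0.05, 0.3, 0.12, 0.8]
    print(breakpoints(distances, 95))
    print(breakpoints(distances, 50))
```

Even at a low threshold, a long run of similar sentences produces no break, which is why a hard size limit has to be enforced separately.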

@jzhao62

jzhao62 commented Nov 22, 2024

[Screenshot: Screenshot_20241121_184446]

I observe this as well; the chunks can be extremely large. We need a way to gracefully limit the chunk size to what the chosen embedding model accepts.

3 participants