Token indices sequence length is longer than the specified maximum sequence length for this model (2503 > 1024). Running this sequence through the model will result in indexing errors? #3242

Closed
Bruce337f opened this issue May 11, 2023 · 7 comments

Comments

@Bruce337f

https://github.com/jerryjliu/llama_index/issues/987#issuecomment-1493259768

I tried to use the method described here, but it doesn't work. Is it because the embedding ada model only supports a maximum of 1024 tokens?

NOTE: set a chunk size limit to < 1024 tokens

service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
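
A minimal sketch for checking the chunk token counts directly, assuming the tiktoken package and a hypothetical chunks list of the split text; worth noting that the 1024 in the warning matches the GPT-2 tokenizer's default maximum sequence length (text-embedding-ada-002 itself accepts up to 8191 tokens), so a check like this shows whether the chunks are actually oversized:

import tiktoken

# Encoding used by text-embedding-ada-002; swap in another encoding if needed.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# `chunks` is a hypothetical list of the split document strings.
chunks = ["first chunk of text...", "second chunk of text..."]
for i, chunk in enumerate(chunks):
    n = count_tokens(chunk)
    if n > 1024:
        print(f"chunk {i}: {n} tokens, larger than the warning threshold")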

@logan-markewich
Collaborator

Are your documents in English? If not, you might also want to consider using a different text splitter, such as the recursive character splitter from LangChain:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(..., node_parser=SimpleNodeParser(text_splitter=RecursiveCharacterTextSplitter()))
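
A slightly fuller sketch of the same setup, assuming the langchain and llama_index versions in use in this thread; the chunk_size and chunk_overlap values are examples only, and RecursiveCharacterTextSplitter measures them in characters by default:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import ServiceContext

# Character-based splitter; 512 characters stays comfortably under 1024 tokens
# for most languages.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)
node_parser = SimpleNodeParser(text_splitter=text_splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)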

@Bruce337f
Author

Thanks, I'll give that a try!

@Shane-Khong commented May 13, 2023

Bump:

  • Did RecursiveCharacterTextSplitter result in a better outcome? I'm trying to index a 280+ page document using LlamaIndex and running into the same issue.
  • If yes, can I include both RecursiveCharacterTextSplitter() and prompt_helper() in the service_context() parameters? Of course, the prompt_helper parameters would not overlap with RecursiveCharacterTextSplitter's; I'm just wondering whether you can stack both within service_context() (see the sketch after this comment).

Also, linking to similar topic to assist with resolving quicker: #987
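
A sketch of stacking both, assuming the PromptHelper and ServiceContext signatures from the llama_index versions in use at the time of this thread (parameter values are examples only); prompt_helper and node_parser are separate keyword arguments, so they should not conflict:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index import ServiceContext, PromptHelper
from llama_index.node_parser import SimpleNodeParser

prompt_helper = PromptHelper(
    max_input_size=4096,   # context window of the LLM (example value)
    num_output=256,        # tokens reserved for the completion
    max_chunk_overlap=20,  # overlap used when repacking retrieved context
)
node_parser = SimpleNodeParser(
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)
)
service_context = ServiceContext.from_defaults(
    prompt_helper=prompt_helper,
    node_parser=node_parser,
)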

@abhishek22-ai commented May 25, 2023

Hi @jerryjliu,
I have gone through the suggestions mentioned here and in the previous issue on the same topic. Here's the code snippet:

self.backup_separators = ["\n\n", "\n", " ", ".", ",", "!", ""]
# Text splitter
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=self.chunk_size,
    chunk_overlap=self.chunk_overlap,
    separators=self.backup_separators,
)
# Service Context
self.service_context = ServiceContext.from_defaults(
    llm_predictor=self.llm_predictor,
    prompt_helper=self.prompt_helper,
    embed_model=self.embedding_model,
    node_parser=SimpleNodeParser(text_splitter=self.text_splitter),
    chunk_size_limit=self.chunk_size,
)

But we still hit the error after the warning message:
Token indices sequence length is longer than the specified maximum sequence length for this model (7367 > 1024). Running this sequence through the model will result in indexing errors?

ValueError: Got a larger chunk overlap (150) than chunk size (-2975), should be smaller.
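
A hedged reading of that ValueError (an assumption, not confirmed in the thread): the chunk size turns negative when the tokens already consumed by the prompt and the reserved output exceed the model's max input size, so any positive overlap then fails the check. A toy sketch of the arithmetic with illustrative numbers (not the exact ones from the traceback):

# Illustrative numbers only; names are hypothetical, not PromptHelper internals.
max_input_size = 4096          # assumed LLM context window
num_output = 256               # tokens reserved for the completion
prompt_tokens = 7367           # oversized text reported by the warning above

available_chunk_size = max_input_size - num_output - prompt_tokens
print(available_chunk_size)    # negative, so no room is left for a chunk

chunk_overlap = 150
if chunk_overlap > available_chunk_size:
    print("overlap larger than chunk size -> the ValueError above")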

@sahil-springworks
Contributor

@jerryjliu @logan-markewich @ravi03071991 Do we have any available solution for this?

@logan-markewich
Collaborator

@sahil-springworks @abhishek22-ai @Shane-Khong @Bruce337f

If any of you can provide a document+code that reliably reproduces the issue, that would help immensely. It's nearly impossible to track this down otherwise.

@dosubot bot commented Sep 9, 2023

Hi, @Bruce337f! I'm Dosu, and I'm here to help the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you are experiencing indexing errors because the token indices sequence length is longer than the specified maximum sequence length for the model. You mentioned that the embedding ada model only supports a maximum of 1024 tokens, and you're wondering if that's the cause of the issue. There have been suggestions in the comments to try a different text splitter, along with code snippets, but it seems that the issue still persists.

To help us further investigate and resolve this issue, we kindly request that you provide a document and code that reliably reproduces the problem. This will greatly assist us in understanding the root cause and finding a solution.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation. We look forward to your response.

@dosubot dosubot bot added the stale label on Sep 9, 2023
@dosubot dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 16, 2023
@dosubot dosubot bot removed the stale label on Sep 16, 2023
@dosubot dosubot bot mentioned this issue on Mar 6, 2024