Token indices sequence length is longer than the specified maximum sequence length for this model (2503 > 1024). Running this sequence through the model will result in indexing errors? #3242

Closed
Bruce337f opened this issue May 11, 2023 · 7 comments

Comments

@Bruce337f

https://github.com/jerryjliu/llama_index/issues/987#issuecomment-1493259768

I tried to use the method described here, but it doesn't work. Is it because the embedding ada model only supports a maximum of 1024 tokens?

NOTE: set a chunk size limit to < 1024 tokens

service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
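
A minimal sketch for checking the chunk token counts directly, assuming the tiktoken package and a hypothetical chunks list of the split text; worth noting that the 1024 in the warning matches the GPT-2 tokenizer's default maximum sequence length (text-embedding-ada-002 itself accepts up to 8191 tokens), so a check like this shows whether the chunks are actually oversized:

import tiktoken

# Encoding used by text-embedding-ada-002; swap in another encoding if needed.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# `chunks` is a hypothetical list of the split document strings.
chunks = ["first chunk of text...", "second chunk of text..."]
for i, chunk in enumerate(chunks):
    n = count_tokens(chunk)
    if n > 1024:
        print(f"chunk {i}: {n} tokens, larger than the warning threshold")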

@logan-markewich
Collaborator

Are your documents in English? If not, you might also want to consider using a different text splitter, such as the recursive character splitter from LangChain:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(..., node_parser=SimpleNodeParser(text_splitter=RecursiveCharacterTextSplitter()))
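
A slightly fuller sketch of the same setup, assuming the langchain and llama_index versions in use in this thread; the chunk_size and chunk_overlap values are examples only, and RecursiveCharacterTextSplitter measures them in characters by default:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import ServiceContext

# Character-based splitter; 512 characters stays comfortably under 1024 tokens
# for most languages.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)
node_parser = SimpleNodeParser(text_splitter=text_splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)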

@Bruce337f
Author

Thanks, I'll give that a try!

@Shane-Khong commented May 13, 2023

Bump:

  • Did RecursiveCharacterTextSplitter result in a better outcome? I'm trying to index a 280+ page document using LlamaIndex and running into the same issue.
  • If yes, can I include both RecursiveCharacterTextSplitter() and prompt_helper() in the service_context() parameters? Of course, the prompt_helper parameters would not overlap with RecursiveCharacterTextSplitter's; I'm just wondering whether you can stack both within service_context() (see the sketch after this comment).

Also, linking to similar topic to assist with resolving quicker: #987
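
A sketch of stacking both, assuming the PromptHelper and ServiceContext signatures from the llama_index versions in use at the time of this thread (parameter values are examples only); prompt_helper and node_parser are separate keyword arguments, so they should not conflict:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index import ServiceContext, PromptHelper
from llama_index.node_parser import SimpleNodeParser

prompt_helper = PromptHelper(
    max_input_size=4096,   # context window of the LLM (example value)
    num_output=256,        # tokens reserved for the completion
    max_chunk_overlap=20,  # overlap used when repacking retrieved context
)
node_parser = SimpleNodeParser(
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)
)
service_context = ServiceContext.from_defaults(
    prompt_helper=prompt_helper,
    node_parser=node_parser,
)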

@abhishek22-ai commented May 25, 2023

Hi @jerryjliu,
I have gone through the suggestions mentioned here and in the previous issue on the same topic. Here's the code snippet:

self.backup_separators = ["\n\n", "\n", " ", ".", ",", "!", ""]
# Text splitter
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=self.chunk_size,
    chunk_overlap=self.chunk_overlap,
    separators=self.backup_separators,
)
# Service Context
self.service_context = ServiceContext.from_defaults(
    llm_predictor=self.llm_predictor,
    prompt_helper=self.prompt_helper,
    embed_model=self.embedding_model,
    node_parser=SimpleNodeParser(text_splitter=self.text_splitter),
    chunk_size_limit=self.chunk_size,
)

But we still hit the error after the warning message:
Token indices sequence length is longer than the specified maximum sequence length for this model (7367 > 1024). Running this sequence through the model will result in indexing errors?

ValueError: Got a larger chunk overlap (150) than chunk size (-2975), should be smaller.
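
A hedged reading of that ValueError (an assumption, not confirmed in the thread): the chunk size turns negative when the tokens already consumed by the prompt and the reserved output exceed the model's max input size, so any positive overlap then fails the check. A toy sketch of the arithmetic with illustrative numbers (not the exact ones from the traceback):

# Illustrative numbers only; names are hypothetical, not PromptHelper internals.
max_input_size = 4096          # assumed LLM context window
num_output = 256               # tokens reserved for the completion
prompt_tokens = 7367           # oversized text reported by the warning above

available_chunk_size = max_input_size - num_output - prompt_tokens
print(available_chunk_size)    # negative, so no room is left for a chunk

chunk_overlap = 150
if chunk_overlap > available_chunk_size:
    print("overlap larger than chunk size -> the ValueError above")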

@sahil-springworks
Contributor

@jerryjliu @logan-markewich @ravi03071991 Do we have any available solution for this?

@logan-markewich
Collaborator

@sahil-springworks @abhishek22-ai @Shane-Khong @Bruce337f

If any of you can provide a document+code that reliably reproduces the issue, that would help immensely. It's nearly impossible to track this down otherwise.

@dosubot bot commented Sep 9, 2023

Hi, @Bruce337f! I'm Dosu, and I'm here to help the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you are experiencing indexing errors because the token indices sequence length is longer than the specified maximum sequence length for the model. You mentioned that the embedding ada model only supports a maximum of 1024 tokens, and you're wondering if that's the cause of the issue. There have been suggestions in the comments to try a different text splitter, along with code snippets, but it seems that the issue still persists.

To help us further investigate and resolve this issue, we kindly request that you provide a document and code that reliably reproduces the problem. This will greatly assist us in understanding the root cause and finding a solution.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation. We look forward to your response.

@dosubot dosubot bot added the stale label on Sep 9, 2023
@dosubot dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 16, 2023
@dosubot dosubot bot removed the stale label on Sep 16, 2023
@dosubot dosubot bot mentioned this issue on Mar 6, 2024