Token indices sequence length is longer than the specified maximum sequence length for this model (2503 > 1024). Running this sequence through the model will result in indexing errors? #3242
Are your documents in English? If not, you might also want to consider using a different text splitter, maybe the recursive character splitter from LangChain:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(..., node_parser=SimpleNodeParser(text_splitter=RecursiveCharacterTextSplitter()))
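For anyone else landing here, a slightly fuller version of that suggestion is sketched below. This is only a sketch: the import paths and constructor arguments match the 0.5/0.6-era LlamaIndex API referenced in this thread and may differ in newer releases, the chunk_size and chunk_overlap values are illustrative, and the "./data" path is a placeholder. Note that LangChain's RecursiveCharacterTextSplitter counts characters by default, not tokens, unless you pass a token-based length_function.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index import GPTSimpleVectorIndex, ServiceContext, SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser

# Recursive character splitter; 512 characters per chunk with a small overlap
# keeps chunks well under the 1024-token threshold mentioned in the warning.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)
node_parser = SimpleNodeParser(text_splitter=text_splitter)

service_context = ServiceContext.from_defaults(node_parser=node_parser)

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data directory
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)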
Thanks, I need to try that!
Bump. Also linking to a similar topic to assist with resolving this quicker: #987
Hi @jerryjliu, we still face the error after the warning message.
@jerryjliu @logan-markewich @ravi03071991 Do we have any available solution for this?
@sahil-springworks @abhishek22-ai @Shane-Khong @Bruce337f If any of you can provide a document and code that reliably reproduces the issue, that would help immensely. It's nearly impossible to track this down otherwise.
Hi, @Bruce337f! I'm Dosu, and I'm here to help the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you are experiencing indexing errors because the token indices sequence length is longer than the specified maximum sequence length for the model. You mentioned that the embedding ada model only supports a maximum of 1024 tokens, and you're wondering if that's the cause of the issue. There have been suggestions in the comments to try a different text splitter, along with code snippets, but it seems that the issue still persists.

To help us further investigate and resolve this issue, we kindly request that you provide a document and code that reliably reproduces the problem. This will greatly assist us in understanding the root cause and finding a solution.

Before we proceed, we would like to confirm whether this issue is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation. We look forward to your response.
https://github.com/jerryjliu/llama_index/issues/987#issuecomment-1493259768
I tried to use the method here, but it doesn't work. Is it because the embedding ada model only supports a maximum of 1024 tokens?
NOTE: set a chunk size limit to < 1024 tokens

from llama_index import GPTSimpleVectorIndex, ServiceContext

# llm_predictor and documents are defined earlier in the original snippet
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
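A side note on the 1024 figure: as far as I understand, that warning comes from the Hugging Face GPT-2 tokenizer that LlamaIndex uses internally for token counting, not from the ada embedding model itself, whose token limit is considerably higher (around 8k tokens, if I recall correctly). One way to check is to count the tokens of a chunk directly. A minimal, hypothetical check with tiktoken (assuming the tiktoken package is installed and you can grab one of the chunks produced by the node parser) might look like this:

import tiktoken

# Encoding actually used by text-embedding-ada-002 (cl100k_base)
encoding = tiktoken.encoding_for_model("text-embedding-ada-002")

chunk_text = "..."  # substitute one of the chunks produced by the node parser
print(len(encoding.encode(chunk_text)))  # true token count for the embedding model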