
Token indices sequence length is longer than the specified maximum sequence length for this model (3793 > 1024). Running this sequence through the model will result in indexing errors #987

Closed
claysauruswrecks opened this issue Mar 30, 2023 · 12 comments

Comments

@claysauruswrecks

Initially I thought the error was due to the loader not splitting chunks, but I'm still getting the mentioned error after adding a splitter. Maybe it's coming from OpenAI's API?

Bugfix branch: https://github.com/claysauruswrecks/llama-hub/tree/bugfix/github-repo-splitter

import pickle
import os
import logging
from llama_index import GPTSimpleVectorIndex

assert (
    os.getenv("OPENAI_API_KEY") is not None
), "Please set the OPENAI_API_KEY environment variable."

from llama_index import download_loader

logging.basicConfig(level=logging.DEBUG)

LLAMA_HUB_CONTENTS_URL = "https://raw.githubusercontent.com/claysauruswrecks/llama-hub/bugfix/github-repo-splitter"
LOADER_HUB_PATH = "/loader_hub"
LOADER_HUB_URL = LLAMA_HUB_CONTENTS_URL + LOADER_HUB_PATH

download_loader(
    "GithubRepositoryReader", loader_hub_url=LOADER_HUB_URL, refresh_cache=True
)

from llama_index.readers.llamahub_modules.github_repo import (
    GithubClient,
    GithubRepositoryReader,
)

docs = None

# Reuse previously fetched documents if a local pickle cache exists
if os.path.exists("docs.pkl"):
    with open("docs.pkl", "rb") as f:
        docs = pickle.load(f)

# Otherwise fetch the repository contents from GitHub and cache them
if docs is None:
    github_client = GithubClient(os.getenv("GITHUB_TOKEN"))
    loader = GithubRepositoryReader(
        github_client,
        owner="jerryjliu",
        repo="llama_index",
        filter_directories=(
            ["gpt_index", "docs"],
            GithubRepositoryReader.FilterType.INCLUDE,
        ),
        filter_file_extensions=([".py"], GithubRepositoryReader.FilterType.INCLUDE),
        verbose=True,
        concurrent_requests=10,
    )

    docs = loader.load_data(commit_sha="1b739e1fcd525f73af4a7131dd52c7750e9ca247")

    with open("docs.pkl", "wb") as f:
        pickle.dump(docs, f)

# Build the vector index and run a test query
index = GPTSimpleVectorIndex.from_documents(docs)

index.query("Explain each LlamaIndex class?")
@claysauruswrecks
Author

It appears I might be able to address this by using the PromptHelper to split after the loader's execution.

From Kapa.ai


Here's an example of how to set up a PromptHelper with custom parameters:

from llama_index import PromptHelper

# Set maximum input size
max_input_size = 1024
# Set number of output tokens
num_output = 256
# Set maximum chunk overlap
max_chunk_overlap = 20

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)


Then, you can create a ServiceContext with the PromptHelper:

from llama_index import ServiceContext, LLMPredictor
from langchain import OpenAI

# Define LLM
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))

service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)


Finally, you can build your index with the service_context:

from llama_index import GPTSimpleVectorIndex
from your_data_loading_module import documents

index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)


By using the PromptHelper with the appropriate parameters, you can ensure that the input text does not exceed the model's maximum token limit and avoid the indexing errors.

For more information, refer to the PromptHelper documentation (https://gpt-index.readthedocs.io/en/latest/reference/service_context/prompt_helper.html).

@jerryjliu
Collaborator

@claysauruswrecks instead of setting the prompt helper, one thing you can try to do is set the chunk_size_limit in the ServiceContext.

Just do

# NOTE: set a chunk size limit to < 1024 tokens 
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

does that work for you?

@claysauruswrecks
Author

@jerryjliu - Excellent, yes. I also now see the notebook examples. I will open a PR to clarify in the docs.

@karottc

karottc commented Apr 4, 2023

@jerryjliu

However, after setting it up like this, the response from response = index.query("query something") has also become shorter and loses information.

@jerryjliu
Collaborator

By default similarity_top_k=1; you can increase similarity_top_k in the index.query call.
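
For reference, a minimal sketch of that suggestion against the older GPTSimpleVectorIndex.query API used earlier in this thread (the query string is just an example; exact keyword support may vary by version):

# Retrieve more chunks per query so the synthesized answer draws on more context
response = index.query(
    "Explain each LlamaIndex class?",
    similarity_top_k=3,  # default is 1
)
print(response)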

@bisonliao

Is it possible to process a document set of 2,000 text files, each around 5,000 words?
I want to use LlamaIndex to process my website docs and then build a smart assistant.
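
For what it's worth, a minimal sketch of that workflow, assuming the text files sit in a local directory (the path here is hypothetical) and reusing the chunk_size_limit approach suggested above; exact APIs may differ between llama_index versions:

from langchain import OpenAI
from llama_index import (
    GPTSimpleVectorIndex,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
)

# Load all text files from a local folder (hypothetical path)
documents = SimpleDirectoryReader("./website_docs").load_data()

llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))

# Keep chunks well under the 1024-token limit discussed above
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, chunk_size_limit=512
)

index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
response = index.query("What does this website cover?", similarity_top_k=3)
print(response)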

@pramitchoudhary

# NOTE: set a chunk size limit to < 1024 tokens 
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)

Any concern about not exposing the other PromptHelper params via ServiceContext.from_defaults, especially max_chunk_overlap?

@Shane-Khong

Shane-Khong commented May 13, 2023

# NOTE: set a chunk size limit to < 1024 tokens 
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)

Any concern about not exposing the other PromptHelper params via ServiceContext.from_defaults, especially max_chunk_overlap?

I have a similar question, so hopefully I'm not repeating here: does [directly passing the chunk_size_limit=512 parameter into service_context] do the same thing as [setting chunk_size_limit=512 in prompt_helper, and then passing prompt_helper as a parameter into service_context]?
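
For reference, a minimal sketch of the two configurations being compared, built only from the constructs shown earlier in this thread; whether they behave identically may depend on the llama_index version:

from langchain import OpenAI
from llama_index import LLMPredictor, PromptHelper, ServiceContext

llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))

# Route 1: pass chunk_size_limit directly to the service context
service_context_a = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, chunk_size_limit=512
)

# Route 2: configure a PromptHelper explicitly, which also exposes max_chunk_overlap
prompt_helper = PromptHelper(
    max_input_size=1024,
    num_output=256,
    max_chunk_overlap=20,
    chunk_size_limit=512,
)
service_context_b = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, prompt_helper=prompt_helper
)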

@Shane-Khong

Shane-Khong commented May 13, 2023

Also, will setting chunk_size_limit = 512 result in a better outcome than chunk_size_limit = 2000 when summarising a 280-page document?

@dxiaosa

dxiaosa commented May 27, 2023

@claysauruswrecks instead of setting the prompt helper, one thing you can try to do is set the chunk_size_limit in the ServiceContext.

Just do

# NOTE: set a chunk size limit to < 1024 tokens 
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

does that work for you?

Hello, the "text-davinci-003" model can accept at most 4,097 tokens, so I wonder why we still get the warning "Token indices sequence length is longer than the specified maximum sequence length for this model (2503 > 1024)"?

@Majidbadal

I believe this issue is about the max output tokens, not the input tokens.

@dosubot

dosubot bot commented Sep 25, 2023

Hi, @claysauruswrecks! I'm Dosu, and I'm here to help the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you raised is related to a token indices sequence length being longer than the specified maximum sequence length for a model. You suspect that the error may be coming from OpenAI's API and have provided a bugfix branch for reference. There have been discussions about using PromptHelper or setting the chunk_size_limit in the ServiceContext to address the issue. Some users have also raised questions about the impact on response length and the possibility of processing large documents.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LlamaIndex repository!

dosubot bot added the stale label on Sep 25, 2023
dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Oct 2, 2023
dosubot bot removed the stale label on Oct 2, 2023