
[Question]: How to split or separate Document data into nodes? #7569

Closed
1 task done
iriye opened this issue Sep 6, 2023 · 2 comments
Labels
question (Further information is requested)

Comments


iriye commented Sep 6, 2023

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

from llama_index import VectorStoreIndex, download_loader

JSONReader = download_loader("JSONReader")
loader = JSONReader()
documents = loader.load_data("data.json")  # placeholder path to your JSON file
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
When I use this code, the resulting prompt size is very large. How can I split the document text, or how can I separate it into smaller nodes?

iriye added the question label Sep 6, 2023

dosubot bot commented Sep 6, 2023

🤖

Hello,

Based on your question, it seems you want to split your large documents into smaller chunks when using the VectorStoreIndex.from_documents method in LlamaIndex. One option is to split the documents yourself before passing them to from_documents.

Here is a simple example of how you can do this:

from llama_index.schema import Document

def split_document(document: Document, chunk_size: int):
    """Split a document's text into fixed-size character chunks."""
    text = document.text
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Preserve the original document's metadata on every chunk
    return [Document(text=chunk, metadata=document.metadata) for chunk in chunks]

# Assume `documents` is your list of large documents
smaller_documents = []
for document in documents:
    smaller_documents.extend(split_document(document, chunk_size=1000))

index = VectorStoreIndex.from_documents(smaller_documents)

In this example, split_document is a function that splits the text of a document into chunks of a specified size (in this case, 1000 characters), and creates a new Document object for each chunk. The metadata of the original document is preserved for each chunk.

Please adjust chunk_size to suit your data. This is a simple example and might not cover every case, especially if your documents have complex structures or specific requirements.

Additionally, you might want to consider the chunk_size parameter of LlamaIndex's SentenceSplitter class, which caps the number of tokens each chunk of text can contain when the text is split. A larger chunk_size produces larger chunks, and a smaller one produces smaller chunks.
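
For example, here is a minimal sketch of sentence-aware splitting, assuming the LlamaIndex 0.8.x import path llama_index.text_splitter and that documents is already loaded (the chunk_size and chunk_overlap values are only illustrative):

from llama_index import VectorStoreIndex
from llama_index.schema import Document
from llama_index.text_splitter import SentenceSplitter

# Sketch only: split each document into sentence-aware chunks of roughly
# chunk_size tokens, then rebuild Document objects that keep the metadata.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=20)

smaller_documents = []
for document in documents:  # `documents` is assumed to be loaded already
    for chunk in splitter.split_text(document.text):
        smaller_documents.append(Document(text=chunk, metadata=document.metadata))

index = VectorStoreIndex.from_documents(smaller_documents)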

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

logan-markewich (Collaborator) commented

You can also just decrease the chunk size in the service context

from llama_index import ServiceContext
# chunk_size controls how many tokens each chunk gets when documents are split
service_context = ServiceContext.from_defaults(chunk_size=512, ...)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=2)
