
[Question]: How to split or separate Document data into nodes? #7569

Closed
1 task done
iriye opened this issue Sep 6, 2023 · 2 comments
Labels
question (Further information is requested)

Comments


iriye commented Sep 6, 2023

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

from llama_index import VectorStoreIndex, download_loader

JSONReader = download_loader("JSONReader")
loader = JSONReader()
documents = loader.load_data("data.json")  # placeholder path to your JSON file
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
When I use this code, the resulting prompt size is very large. How can I split the document text, or how can I separate it into smaller nodes?

iriye added the question label Sep 6, 2023

dosubot bot commented Sep 6, 2023

🤖

Hello,

Based on your question, it seems you want to split your large documents into smaller chunks when using the VectorStoreIndex.from_documents method in LlamaIndex. One option is to split the documents yourself before passing them to from_documents.

Here is a simple example of how you can do this:

from llama_index.schema import Document

def split_document(document: Document, chunk_size: int):
    """Split a document's text into fixed-size character chunks."""
    text = document.text
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Preserve the original document's metadata on every chunk
    return [Document(text=chunk, metadata=document.metadata) for chunk in chunks]

# Assume `documents` is your list of large documents
smaller_documents = []
for document in documents:
    smaller_documents.extend(split_document(document, chunk_size=1000))

index = VectorStoreIndex.from_documents(smaller_documents)

In this example, split_document is a function that splits the text of a document into chunks of a specified size (in this case, 1000 characters), and creates a new Document object for each chunk. The metadata of the original document is preserved for each chunk.

Please adjust chunk_size to suit your data. This is a simple example and might not cover every case, especially if your documents have complex structures or specific requirements.

Additionally, you might want to consider the chunk_size parameter of LlamaIndex's SentenceSplitter class, which caps the number of tokens each chunk of text can contain when the text is split. A larger chunk_size produces larger chunks, and a smaller one produces smaller chunks.
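
For example, here is a minimal sketch of sentence-aware splitting, assuming the LlamaIndex 0.8.x import path llama_index.text_splitter and that documents is already loaded (the chunk_size and chunk_overlap values are only illustrative):

from llama_index import VectorStoreIndex
from llama_index.schema import Document
from llama_index.text_splitter import SentenceSplitter

# Sketch only: split each document into sentence-aware chunks of roughly
# chunk_size tokens, then rebuild Document objects that keep the metadata.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=20)

smaller_documents = []
for document in documents:  # `documents` is assumed to be loaded already
    for chunk in splitter.split_text(document.text):
        smaller_documents.append(Document(text=chunk, metadata=document.metadata))

index = VectorStoreIndex.from_documents(smaller_documents)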

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

logan-markewich (Collaborator) commented

You can also just decrease the chunk size in the service context

from llama_index import ServiceContext
# chunk_size controls how many tokens each chunk gets when documents are split
service_context = ServiceContext.from_defaults(chunk_size=512, ...)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=2)
