This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

WIP: github_repo chunk with TextSplitter #154

Conversation

@claysauruswrecks (Author)

No description provided.

@jerryjliu (Collaborator) left a comment


Out of curiosity, what's the use case for adding a text splitter within the document loader? You could always do the text splitting afterwards, right?
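A minimal sketch of that "split afterwards" approach, assuming the 0.5-era llama_index and langchain APIs used elsewhere in this thread; the splitter parameters and the `split_documents` helper are illustrative, not from this PR:

```python
# Sketch: re-chunk Documents after the loader returns them, rather than
# splitting inside the loader itself.
from langchain.text_splitter import CharacterTextSplitter
from llama_index import Document

splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

def split_documents(documents):
    """Split each loaded Document into a list of smaller Documents."""
    chunks = []
    for doc in documents:
        for piece in splitter.split_text(doc.text):
            chunks.append(Document(piece))
    return chunks
```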

@claysauruswrecks (Author) commented Apr 2, 2023

@jerryjliu - It wasn't immediately clear to me that this was the case. I looked at how the other loaders were splitting text and figured this might be an avenue to address the error message, but it did not help. Kapa.ai was helpful in that it suggested I use something like index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context), which I will try next. If it works, I will update this PR to clarify these points for anyone else going forward. Thank you for all the great work!

@claysauruswrecks (Author)

For reference, I included the code in issue run-llama/llama_index#987.

@claysauruswrecks (Author)

@jerryjliu - After refactoring with the recommended changes, the same error appears. Here is the revised code:

```python
import pickle
import os
import logging
from llama_index import GPTSimpleVectorIndex, PromptHelper, ServiceContext, LLMPredictor
from langchain import OpenAI

# Set maximum input size
max_input_size = 1000
# Set number of output tokens
num_output = 256
# Set maximum chunk overlap
max_chunk_overlap = 20

prompt_helper = PromptHelper(
    max_input_size,
    num_output,
    max_chunk_overlap,
)

# Define LLM
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.7, model_name="text-davinci-003"))

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, prompt_helper=prompt_helper
)

assert (
    os.getenv("OPENAI_API_KEY") is not None
), "Please set the OPENAI_API_KEY environment variable."

from llama_index import download_loader

logging.basicConfig(level=logging.DEBUG)

# LLAMA_HUB_CONTENTS_URL = "https://raw.githubusercontent.com/claysauruswrecks/llama-hub/bugfix/github-repo-splitter"
# LOADER_HUB_PATH = "/loader_hub"
# LOADER_HUB_URL = LLAMA_HUB_CONTENTS_URL + LOADER_HUB_PATH

# download_loader(
#     "GithubRepositoryReader", loader_hub_url=LOADER_HUB_URL, refresh_cache=True
# )

download_loader("GithubRepositoryReader")

from llama_index.readers.llamahub_modules.github_repo import (
    GithubClient,
    GithubRepositoryReader,
)

docs = None

if os.path.exists("docs.pkl"):
    with open("docs.pkl", "rb") as f:
        docs = pickle.load(f)

if docs is None:
    github_client = GithubClient(os.getenv("GITHUB_TOKEN"))
    loader = GithubRepositoryReader(
        github_client,
        owner="jerryjliu",
        repo="llama_index",
        filter_directories=(
            ["gpt_index", "docs"],
            GithubRepositoryReader.FilterType.INCLUDE,
        ),
        filter_file_extensions=([".py"], GithubRepositoryReader.FilterType.INCLUDE),
        verbose=True,
        concurrent_requests=10,
    )

    docs = loader.load_data(commit_sha="1b739e1fcd525f73af4a7131dd52c7750e9ca247")

    with open("docs.pkl", "wb") as f:
        pickle.dump(docs, f)

index = GPTSimpleVectorIndex.from_documents(docs, service_context=service_context)

response = index.query("Explain each LlamaIndex class?")
print(response)
```
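If the underlying error is a chunk-size/token-limit problem, a variant worth trying is to cap the chunk size at the ServiceContext level. This is a sketch assuming the same llama_index version as the script above (`chunk_size_limit` was a `ServiceContext.from_defaults` keyword in that era); the value 512 is illustrative, not from this PR:

```python
# Sketch: cap chunk size so every Document is re-chunked before embedding.
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
    chunk_size_limit=512,  # illustrative value, not from this PR
)
index = GPTSimpleVectorIndex.from_documents(docs, service_context=service_context)
```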

@jerryjliu (Collaborator)

@claysauruswrecks - see my reply here and let me know if this works for you: run-llama/llama_index#987

@EmanuelCampos (Collaborator)

Closing for now due to inactivity; it also sounds like this feature shouldn't be implemented in the loader.
