Github integration #5257

mudler · 2023-05-25T16:27:21Z

Feature request

It would be amazing to scan and load content from the GitHub API, such as PRs, issues and discussions.

Motivation

This would allow users to ask questions about the history of the project, issues that other users might have found, and much more!

Your contribution

I'm not really a Python developer, so it would take me a while to figure out all the changes required.

UmerHA · 2023-05-26T19:57:23Z

Sounds interesting! I'm on it :)

# Creates GitHubLoader (#5257) GitHubLoader is a DocumentLoader that loads issues and PRs from GitHub. Fixes #5257 --------- Co-authored-by: Dev 2049 <[email protected]>

mudler · 2023-05-31T12:28:25Z

@UmerHA @dev2049 thank you!

I'm trying this now, but I'm failing to use it with chroma:

Traceback (most recent call last):
  File "/app/main.py", line 76, in <module>
    build_knowledgebase(SITEMAP)
  File "/app/app/memory_ops.py", line 117, in build_knowledgebase
    db = Chroma.from_documents(texts, embeddings, persist_directory=PERSIST_DIRECTORY, client_settings=CHROMA_SETTINGS)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 433, in from_documents
    return cls.from_texts(
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 401, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/usr/local/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 160, in add_texts
    self._collection.add(
  File "/usr/local/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 101, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 355, in _validate_embedding_set
    validate_metadatas(maybe_cast_one_to_many(metadatas))
  File "/usr/local/lib/python3.11/site-packages/chromadb/api/types.py", line 120, in validate_metadatas
    validate_metadata(metadata)
  File "/usr/local/lib/python3.11/site-packages/chromadb/api/types.py", line 109, in validate_metadata
    raise ValueError(
ValueError: Expected metadata value to be a str, int, or float, got []
INFO:chromadb.db.duckdb:Persisting DB to disk, putting it in the save folder: /memory/chromadb

any ideas?

UmerHA · 2023-05-31T12:43:37Z

@mudler it seems chroma only accepts str, int and float values for metadata, not lists. GitHubIssuesLoader, however, also returns the metadata field labels as a list.

As a quick fix, you could parse that metadata field and stringify it.

@dev2049 To prevent this error, should all DocLoaders only return str/int/float for metadata, or should we add a parse method to chroma that stringifies (and de-stringifies) lists?
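
A minimal sketch of what the second option (stringify and de-stringify) could look like, independent of chroma itself. The helper names `encode_metadata` and `decode_metadata` and the `json:` marker prefix are hypothetical and not part of langchain or chroma; the sketch only assumes that metadata values are JSON-serializable.

```python
import json

# Hypothetical marker so encoded values can be told apart from plain strings.
JSON_PREFIX = "json:"
# The value types chroma accepted according to the error message above.
PLAIN_TYPES = (str, int, float)


def encode_metadata(metadata):
    """Turn every unsupported value (list, None, bool, ...) into a marked JSON string."""
    encoded = {}
    for key, value in metadata.items():
        is_plain = type(value) in PLAIN_TYPES
        # A plain string that already starts with the marker is encoded too,
        # so that decoding never misreads it.
        clashes = isinstance(value, str) and value.startswith(JSON_PREFIX)
        if is_plain and not clashes:
            encoded[key] = value
        else:
            encoded[key] = JSON_PREFIX + json.dumps(value)
    return encoded


def decode_metadata(metadata):
    """Reverse encode_metadata: parse marked strings, pass everything else through."""
    decoded = {}
    for key, value in metadata.items():
        if isinstance(value, str) and value.startswith(JSON_PREFIX):
            decoded[key] = json.loads(value[len(JSON_PREFIX):])
        else:
            decoded[key] = value
    return decoded
```

Compared with joining a list on commas, a marked JSON string keeps the round trip lossless, at the cost of making the stored value harder to filter on directly.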

mudler · 2023-05-31T13:42:51Z

tried this with no luck:

    fixed_texts = []
    for text in texts:
        if 'metadata' in text and isinstance(text['metadata'], list):
            text['metadata'] = ','.join(text['metadata'])
        fixed_texts.append(text)

    print(f"Creating embeddings. May take some minutes...")
    db = Chroma.from_documents(fixed_texts, embeddings, persist_directory=PERSIST_DIRECTORY, client_settings=CHROMA_SETTINGS)

I guess I'll be waiting for a fix(?) or am I doing something wrong here?

UmerHA · 2023-05-31T14:27:03Z

Almost correct :) It's not metadata itself that is a list, but metadata["labels"].

Here's a full working example:

import os
os.environ["GITHUB_TOKEN"] =  "..."
os.environ["OPENAI_API_KEY"] = "..."

from langchain.document_loaders import GitHubIssuesLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = GitHubIssuesLoader(
    repo="hwchase17/langchain",
    creator="UmerHA",
)

def fix_metadata(original_metadata):
    new_metadata = {}
    for k, v in original_metadata.items():
        if type(v) in [str, int, float]:
            # str, int, float are the types chroma can handle
            new_metadata[k] = v
        elif isinstance(v, list):
            # e.g. labels; join the items into one comma-separated string
            new_metadata[k] = ','.join(map(str, v))
        else:
            # e.g. None, bool
            new_metadata[k] = str(v)
    return new_metadata

docs = loader.load()
for doc in docs:
    doc.metadata = fix_metadata(doc.metadata)

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()

db = Chroma.from_documents(docs, embeddings)

mudler · 2023-05-31T16:16:47Z

right! 🤦 thanks for the snippet, that seems to do the trick!

banyalshipu · 2023-06-08T14:09:44Z

When using GitHubIssuesLoader, I only get the number of comments in the response.
I want to fetch all the comments on the pull requests.
Is there any way to do that?

UmerHA · 2023-06-13T09:02:25Z

@banyalshipu that's currently not possible.
The GitHub issues API only returns the number of comments and a comments URL. To return the comments themselves, one would need to extend GitHubIssuesLoader to fetch and process that URL.
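
As a rough sketch of that extra step, outside the loader: the GitHub REST API lists the comments of an issue or PR at `/repos/{owner}/{repo}/issues/{number}/comments`. The function names below are hypothetical, and the sketch assumes you have the repo name and issue number at hand (for example from the document metadata). It is not how GitHubIssuesLoader works today.

```python
import json
import urllib.request


def comments_url(repo, issue_number):
    """Build the REST endpoint that lists comments for one issue or PR."""
    return f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments"


def http_get_json(url, token=None):
    """Fetch a URL and decode its JSON body (performs a network call)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_comments(repo, issue_number, get_json=http_get_json, per_page=100):
    """Collect all comment bodies for an issue, walking the pages in order."""
    bodies = []
    page = 1
    while True:
        url = f"{comments_url(repo, issue_number)}?per_page={per_page}&page={page}"
        batch = get_json(url)
        bodies.extend(comment["body"] for comment in batch)
        # A short (or empty) page means there is nothing left to fetch.
        if len(batch) < per_page:
            return bodies
        page += 1
```

Calling `fetch_comments("hwchase17/langchain", 5257)` would make unauthenticated requests; to send a token, pass something like `get_json=lambda url: http_get_json(url, token="...")`.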

deviantony · 2023-11-29T01:22:51Z

Is there any interest in enhancing this loader to support loading issue comments? How hard would that be to achieve in your opinion, @UmerHA? I might give it a go.

Also, a side question: can this loader load discussions as well as issues, or only issues and PRs?

UmerHA · 2023-11-29T12:00:04Z

Is there any interest in enhancing this loader to support loading issue comments? How hard would that be to achieve in your opinion, @UmerHA? I might give it a go.

I don't think it's very complicated. As said above, you can get the comment URLs. You would then have to fetch each URL.

Also, a side question: can this loader load discussions as well as issues, or only issues and PRs?

Discussions and issues are the same thing in GitHub, aren't they?

deviantony · 2023-11-29T21:21:00Z

Thanks for the update.

I don't think that discussions are grouped under issues in the API. I did a quick search, and I don't think the REST API offers support for discussions. They might be available in the GraphQL API.
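
For reference, a minimal sketch of how discussions could be read through the GraphQL API, which exposes a `discussions` connection on a repository. The helper names `build_payload` and `fetch_discussions` are hypothetical, the sketch fetches only the first page, and nothing like this exists in the loader.

```python
import json
import urllib.request

GRAPHQL_URL = "https://api.github.com/graphql"

# Query for the first N discussions of a repository.
DISCUSSIONS_QUERY = """
query($owner: String!, $name: String!, $first: Int!) {
  repository(owner: $owner, name: $name) {
    discussions(first: $first) {
      nodes { number title body url }
    }
  }
}
"""


def build_payload(repo, first=20):
    """Build the JSON payload for a repo given as 'owner/name'."""
    owner, name = repo.split("/", 1)
    return {
        "query": DISCUSSIONS_QUERY,
        "variables": {"owner": owner, "name": name, "first": first},
    }


def fetch_discussions(repo, token, first=20):
    """POST the query and return the list of discussion nodes (network call)."""
    data = json.dumps(build_payload(repo, first)).encode("utf-8")
    request = urllib.request.Request(
        GRAPHQL_URL,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return result["data"]["repository"]["discussions"]["nodes"]
```

The GraphQL endpoint requires an access token, so `fetch_discussions` takes one as a mandatory argument.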

mudler changed the title ~~Github issue integration~~ Github integration May 25, 2023

dev2049 added the "03 enhancement" label (Enhancement of existing functionality) May 26, 2023

UmerHA mentioned this issue May 29, 2023

DocumentLoader for GitHub #5408

Merged

dev2049 closed this as completed in #5408 May 30, 2023

dev2049 added a commit that referenced this issue May 30, 2023

DocumentLoader for GitHub (#5408)

8259f9b

vowelparrot pushed a commit that referenced this issue May 31, 2023

DocumentLoader for GitHub (#5408)

0cc2bd4
