Github integration #5257

Closed
mudler opened this issue May 25, 2023 · 11 comments · Fixed by #5408
Labels
03 enhancement Enhancement of existing functionality

Comments

@mudler
Contributor

mudler commented May 25, 2023

Feature request

It would be amazing to scan and load content from the GitHub API, such as PRs, issues, and discussions.

Motivation

This would make it possible to ask questions about the history of the project, issues that other users have run into, and much more!

Your contribution

I'm not really a Python developer, so it would take me a while to figure out all the changes required.

mudler changed the title from "Github issue integration" to "Github integration" on May 25, 2023
dev2049 added the "03 enhancement" (Enhancement of existing functionality) label on May 26, 2023
@UmerHA
Contributor

UmerHA commented May 26, 2023

Sounds interesting! I'm on it :)

dev2049 added a commit that referenced this issue May 30, 2023
# Creates GitHubLoader (#5257)

GitHubLoader is a DocumentLoader that loads issues and PRs from GitHub.

Fixes #5257

---------

Co-authored-by: Dev 2049 <[email protected]>
@mudler
Contributor Author

mudler commented May 31, 2023

@UmerHA @dev2049 thank you!

I'm trying this now, but I'm failing to use it with chroma:

Traceback (most recent call last):
  File "/app/main.py", line 76, in <module>
    build_knowledgebase(SITEMAP)
  File "/app/app/memory_ops.py", line 117, in build_knowledgebase
    db = Chroma.from_documents(texts, embeddings, persist_directory=PERSIST_DIRECTORY, client_settings=CHROMA_SETTINGS)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 433, in from_documents
    return cls.from_texts(
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 401, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/usr/local/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 160, in add_texts
    self._collection.add(
  File "/usr/local/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 101, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 355, in _validate_embedding_set
    validate_metadatas(maybe_cast_one_to_many(metadatas))
  File "/usr/local/lib/python3.11/site-packages/chromadb/api/types.py", line 120, in validate_metadatas
    validate_metadata(metadata)
  File "/usr/local/lib/python3.11/site-packages/chromadb/api/types.py", line 109, in validate_metadata
    raise ValueError(
ValueError: Expected metadata value to be a str, int, or float, got []
INFO:chromadb.db.duckdb:Persisting DB to disk, putting it in the save folder: /memory/chromadb

any ideas?

@UmerHA
Contributor

UmerHA commented May 31, 2023

@mudler it seems chroma only accepts str, int & float values for metadata, and not lists. GitHubIssuesLoader, however, also returns the metadata field labels as a list.

As a quick fix, you could parse that metadata field and stringify it.

@dev2049 To prevent this error, should all DocLoaders only return str/int/float for metadata, or should we add a parse method to chroma that stringifies (and de-stringifies) lists?
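
To make the "stringify and de-stringify" option concrete, here is a minimal sketch of what such a round trip could look like. It is not part of chroma or LangChain; `encode_metadata`, `decode_metadata`, and the `json:` prefix are hypothetical names chosen for illustration, and the sketch assumes the only values chroma accepts are str, int, and float, as the error message suggests.

```python
import json

# Hypothetical marker so encoded values can be recognised when reading back.
PREFIX = "json:"


def encode_metadata(metadata):
    """Return a copy where every value is a str, int, or float."""
    encoded = {}
    for key, value in metadata.items():
        # bool is a subclass of int, so it is checked explicitly and encoded too.
        if isinstance(value, (str, int, float)) and not isinstance(value, bool):
            encoded[key] = value
        else:
            # Lists, None, bools, etc. become a prefixed JSON string.
            encoded[key] = PREFIX + json.dumps(value)
    return encoded


def decode_metadata(metadata):
    """Reverse encode_metadata for values carrying the prefix."""
    decoded = {}
    for key, value in metadata.items():
        # Caveat: a genuine string that starts with the prefix would be misread.
        if isinstance(value, str) and value.startswith(PREFIX):
            decoded[key] = json.loads(value[len(PREFIX):])
        else:
            decoded[key] = value
    return decoded
```

Compared with joining on commas, going through JSON keeps labels that themselves contain commas intact, at the cost of less readable stored values.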

@mudler
Contributor Author

mudler commented May 31, 2023

> @mudler it seems chroma only accepts str, int & float values for metadata, and not lists. GitHubIssuesLoader, however, also returns the metadata field labels as a list.
>
> As a quick fix, you could parse that metadata field and stringify it.
>
> @dev2049 To prevent this error, should all DocLoaders only return str/int/float for metadata, or should we add a parse method to chroma that stringifies (and de-stringifies) lists?

tried this with no luck:

    fixed_texts = []
    for text in texts:
        if 'metadata' in text and isinstance(text['metadata'], list):
            text['metadata'] = ','.join(text['metadata'])
        fixed_texts.append(text)

    print(f"Creating embeddings. May take some minutes...")
    db = Chroma.from_documents(fixed_texts, embeddings, persist_directory=PERSIST_DIRECTORY, client_settings=CHROMA_SETTINGS)

I guess I'll be waiting for a fix(?) or am I doing something wrong here?

@UmerHA
Contributor

UmerHA commented May 31, 2023

Almost correct :) It's not metadata that is a list, but metadata["labels"].

Here's a full working example:

import os
os.environ["GITHUB_TOKEN"] =  "..."
os.environ["OPENAI_API_KEY"] = "..."

from langchain.document_loaders import GitHubIssuesLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = GitHubIssuesLoader(
    repo="hwchase17/langchain",
    creator="UmerHA",
)

def fix_metadata(original_metadata):
    new_metadata = {}
    for k, v in original_metadata.items():
        if type(v) in [str, int, float]:
            # str, int, float are the types chroma can handle
            new_metadata[k] = v
        elif isinstance(v, list):
            # e.g. labels; cast items to str so non-string entries don't break join
            new_metadata[k] = ','.join(str(item) for item in v)
        else:
            # e.g. None, bool
            new_metadata[k] = str(v)
    return new_metadata

docs = loader.load()
for doc in docs:
    doc.metadata = fix_metadata(doc.metadata)

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()

# Pass the split documents (the original snippet referenced an undefined `texts`)
db = Chroma.from_documents(docs, embeddings)

@mudler
Contributor Author

mudler commented May 31, 2023

right! 🤦 thanks for the snippet, that seems to do the trick!

vowelparrot pushed a commit that referenced this issue May 31, 2023
# Creates GitHubLoader (#5257)

GitHubLoader is a DocumentLoader that loads issues and PRs from GitHub.

Fixes #5257

---------

Co-authored-by: Dev 2049 <[email protected]>
@banyalshipu

When using GitHubIssuesLoader, I only get the number of comments in the response.
I want to fetch all the comments on the pull requests.
Is there any way to do that?

@UmerHA
Contributor

UmerHA commented Jun 13, 2023

@banyalshipu that's currently not possible.
The GitHub issues API only returns the number of comments and a comments URL. To return the comments themselves, one would need to extend GitHubIssuesLoader to process that URL.
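
As a rough illustration of that extension, the sketch below fetches the JSON behind a comments URL and flattens it to text. It is not part of GitHubIssuesLoader: `fetch_comments` and `comments_to_text` are hypothetical helpers, the sketch assumes each comment object carries `user.login` and `body` fields, and it ignores pagination, so only the first page of comments would be returned.

```python
import json
import urllib.request


def fetch_comments(comments_url, token=None):
    """GET the comments URL of an issue and return the parsed JSON list."""
    request = urllib.request.Request(
        comments_url,
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        # A token is optional for public repos but raises the rate limit.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def comments_to_text(comments):
    """Join comments into one string, one 'author: body' block per comment."""
    blocks = []
    for comment in comments:
        author = comment["user"]["login"]
        body = comment.get("body") or ""
        blocks.append(f"{author}: {body}")
    return "\n\n".join(blocks)
```

The resulting text could then be appended to the page content of the issue's document before splitting and embedding.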

Undertone0809 pushed a commit to Undertone0809/langchain that referenced this issue Jun 19, 2023
# Creates GitHubLoader (langchain-ai#5257)

GitHubLoader is a DocumentLoader that loads issues and PRs from GitHub.

Fixes langchain-ai#5257

---------

Co-authored-by: Dev 2049 <[email protected]>
@deviantony

Is there any interest in enhancing this loader to support loading issue comments? How hard would that be to achieve, in your opinion, @UmerHA? I might give it a go.

Also, a side question: can this loader load discussions as well as issues, or only issues and PRs?

@UmerHA
Contributor

UmerHA commented Nov 29, 2023

> Is there any interest in enhancing this loader to support loading issue comments? How hard would that be to achieve, in your opinion, @UmerHA? I might give it a go.

I don't think it's very complicated. As said above, you can get the comment URLs. You would then have to fetch each URL.

> Also, a side question: can this loader load discussions as well as issues, or only issues and PRs?

Discussions and issues are the same thing in GitHub, aren't they?

@deviantony

Thanks for the update.

I don't think discussions are grouped under issues in the API. I did a quick search, and it doesn't look like the REST API supports discussions. They might be available in the GraphQL API.
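
For anyone who wants to try the GraphQL route, here is a minimal sketch of what a request might look like. It assumes the GraphQL schema exposes a `discussions` connection on `repository` with `title`, `body`, and `url` fields, which should be checked against the GitHub GraphQL documentation; `build_payload` and `fetch_discussions` are hypothetical helper names, and a token is required for the GraphQL endpoint.

```python
import json
import urllib.request

# Assumed query shape; verify field names against the GitHub GraphQL schema.
DISCUSSIONS_QUERY = """
query($owner: String!, $name: String!, $first: Int!) {
  repository(owner: $owner, name: $name) {
    discussions(first: $first) {
      nodes { title body url }
    }
  }
}
"""


def build_payload(repo, first=10):
    """Build the GraphQL request body for a repo given as 'owner/name'."""
    owner, name = repo.split("/", 1)
    return {
        "query": DISCUSSIONS_QUERY,
        "variables": {"owner": owner, "name": name, "first": first},
    }


def fetch_discussions(repo, token, first=10):
    """POST the query and return the list of discussion nodes."""
    request = urllib.request.Request(
        "https://api.github.com/graphql",
        data=json.dumps(build_payload(repo, first)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["data"]["repository"]["discussions"]["nodes"]
```

Each returned node could be wrapped in a Document in the same way the loader handles issues, with `url` kept as metadata.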
