Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix chroma update_document to embed entire documents, fixes a characer-wise embedding bug #5584

Merged
merged 8 commits into from
Jun 2, 2023

Conversation

cnellington
Copy link
Contributor

@cnellington cnellington commented Jun 1, 2023

Chroma update_document full document embeddings bugfix

Chroma update_document takes a single document, but treats the page_content sting of that document as a list when getting the new document embedding.

This is a two-fold problem, where the resulting embedding for the updated document is incorrect (it's only an embedding of the first character in the new page_content) and it calls the embedding function for every character in the new page_content string, using many tokens in the process.

Fixes #5582

Before submitting

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

Tagging @dev2049 for vectorstore bugfix

Copy link
Contributor

@dev2049 dev2049 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch!

tests/integration_tests/vectorstores/test_chroma.py Outdated Show resolved Hide resolved
@dev2049
Copy link
Contributor

dev2049 commented Jun 2, 2023

thanks @cnellington!

@dev2049 dev2049 merged commit c5a7a85 into langchain-ai:master Jun 2, 2023
@danielchalef danielchalef mentioned this pull request Jun 5, 2023
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023
…r-wise embedding bug (langchain-ai#5584)

# Chroma update_document full document embeddings bugfix

Chroma update_document takes a single document, but treats the
page_content sting of that document as a list when getting the new
document embedding.

This is a two-fold problem, where the resulting embedding for the
updated document is incorrect (it's only an embedding of the first
character in the new page_content) and it calls the embedding function
for every character in the new page_content string, using many tokens in
the process.

Fixes langchain-ai#5582


Co-authored-by: Caleb Ellington <[email protected]>
This was referenced Jun 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Chroma.update_document bug
2 participants