
fix: run out of memory due to indexing/embedding documents #12882

Open
wants to merge 11 commits into base: main
8 changes: 6 additions & 2 deletions api/core/rag/datasource/vdb/vector_factory.py
@@ -152,9 +152,13 @@ def get_vector_factory(vector_type: str) -> type[AbstractVectorFactory]:
raise ValueError(f"Vector store {vector_type} is not supported.")

     def create(self, texts: Optional[list] = None, **kwargs):
+        max_batch_documents = 1000
         if texts:
-            embeddings = self._embeddings.embed_documents([document.page_content for document in texts])
-            self._vector_processor.create(texts=texts, embeddings=embeddings, **kwargs)
Collaborator:

add_text() won't create collection

bowenliang123 (Contributor), Jan 22, 2025:

BTW, it's curious to find that self._vector_processor.create is used in both create and also add_texts in vector_factory.py, which may possibly cause repeated index creation (the distributed lock in Redis avoids it), even without this PR.

Contributor (Author):

can we merge this PR first?

rayshaw001 (Contributor, Author), Jan 22, 2025:

> BTW, it's curious to find that self._vector_processor.create is used in both create and also add_texts in vector_factory.py, which may possibly cause repeated index creation (the distributed lock in Redis avoids it), even without this PR.

Should vector.add_texts call _vector_processor.add_texts instead of _vector_processor.create at line 164?
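For illustration only, a sketch of what that swap might look like inside Vector.add_texts; the duplicate-filtering helper name and the exact _vector_processor.add_texts signature are assumptions, not confirmed from the codebase:

```python
def add_texts(self, documents: list[Document], **kwargs):
    if kwargs.get("duplicate_check", False):
        # hypothetical helper; the real duplicate filtering is elided in the diff below
        documents = self._filter_duplicate_texts(documents)
    embeddings = self._embeddings.embed_documents([document.page_content for document in documents])
    # append to the existing collection instead of re-creating it via create()
    self._vector_processor.add_texts(documents=documents, embeddings=embeddings, **kwargs)
```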

Contributor (Author):

@bowenliang123 @JohnJyong please check

+            for i in range(0, len(texts), max_batch_documents):
+                batch_documents = texts[i : i + max_batch_documents]
+                batch_contents = [document.page_content for document in batch_documents]
+                batch_embeddings = self._embeddings.embed_documents(batch_contents)
+                self._vector_processor.create(texts=batch_documents, embeddings=batch_embeddings, **kwargs)
bowenliang123 (Contributor), Jan 21, 2025:

It's risky and wrong to repeatedly create the collections and the underlying indexes in the vdb, which may cause inconsistency or errors.
Correct it to:

  1. Create the collection first, with an empty array.
  2. Loop over the batched documents and use add_texts to append to the existing collection (see the sketch below).
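For illustration, a minimal sketch of that suggested flow, assuming the underlying _vector_processor exposes both create and add_texts, that create tolerates an empty batch, and that add_texts accepts documents/embeddings keyword arguments (none of which is confirmed here):

```python
def create(self, texts: Optional[list] = None, **kwargs):
    max_batch_documents = 1000
    if texts:
        # create the collection and its index exactly once, with an empty batch
        self._vector_processor.create(texts=[], embeddings=[], **kwargs)
        # then embed and append in batches so peak memory stays bounded
        for i in range(0, len(texts), max_batch_documents):
            batch_documents = texts[i : i + max_batch_documents]
            batch_embeddings = self._embeddings.embed_documents(
                [document.page_content for document in batch_documents]
            )
            self._vector_processor.add_texts(documents=batch_documents, embeddings=batch_embeddings, **kwargs)
```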

rayshaw001 (Contributor, Author), Jan 21, 2025:

Is add_texts thread safe? add_texts filters duplicated documents, and there are 10 workers running concurrently.
[screenshot]
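To illustrate the "distributed lock in Redis" point mentioned above, here is a generic sketch using redis-py; this is not Dify's actual lock code, and the key names are made up:

```python
import redis

redis_client = redis.Redis(host="localhost", port=6379)

def create_collection_once(collection_name: str, do_create) -> None:
    # hypothetical key names, for illustration only
    lock_name = f"vector_indexing_lock_{collection_name}"
    created_flag = f"vector_{collection_name}_created"
    # only one worker at a time can hold the lock, so concurrent workers
    # cannot race each other into creating the same collection/index
    with redis_client.lock(lock_name, timeout=60, blocking_timeout=30):
        if not redis_client.get(created_flag):
            do_create()
            redis_client.set(created_flag, 1)
```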


     def add_texts(self, documents: list[Document], **kwargs):
         if kwargs.get("duplicate_check", False):