-
Notifications
You must be signed in to change notification settings - Fork 9.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: run out of memory due to indexing/embedding documents #12882
base: main
Are you sure you want to change the base?
Changes from 9 commits
5079ed7
42189fb
7e71501
a9ae412
bb06588
8dd961b
6cd40a5
aef34ba
e3358b9
57e6d89
87f16fc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -152,9 +152,13 @@ def get_vector_factory(vector_type: str) -> type[AbstractVectorFactory]: | |
raise ValueError(f"Vector store {vector_type} is not supported.") | ||
|
||
def create(self, texts: Optional[list] = None, **kwargs): | ||
max_batch_documents = 1000 | ||
if texts: | ||
embeddings = self._embeddings.embed_documents([document.page_content for document in texts]) | ||
self._vector_processor.create(texts=texts, embeddings=embeddings, **kwargs) | ||
for i in range(0, len(texts), max_batch_documents): | ||
batch_documents = texts[i : i + max_batch_documents] | ||
batch_contents = [document.page_content for document in batch_documents] | ||
batch_embeddings = self._embeddings.embed_documents(batch_contents) | ||
self._vector_processor.create(texts=batch_documents, embeddings=batch_embeddings, **kwargs) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's risky and wrong to repeatedly create the collections and the underlying indexes in vdb, which may cause inconsistency or errors.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
|
||
def add_texts(self, documents: list[Document], **kwargs): | ||
if kwargs.get("duplicate_check", False): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add_text()
won't ceate collectionThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, it's curiously to find out
self._vector_processor.create
is used in both thecreate
and also theadd_texts
invector_factory.py
, which may possibly cause repeated index creation (distrubuted lock in redis avoiding it), even without this PR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we merge this PR first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should
vector.add_texts
call_vector_processor.add_texts
instead of_vector_processor.create
at line 164?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bowenliang123 @JohnJyong please check