feat: Add batching for querying in ElasticsearchDocumentStore and OpenSearchDocumentStore
#5063
Conversation
Looks quite good to me already. One change request: use 10_000 as the default instead of 1_000 if possible. The `write_documents` method of the `OpenSearchDocumentStore` should be updated too (see the comment). Ideally, we could extend the test cases to check that `write_documents` also picks up the `batch_size` specified in the init of `OpenSearchDocumentStore`/`ElasticsearchDocumentStore`.
@@ -165,6 +166,8 @@ def __init__(
            index type and knn parameters). If `0`, training doesn't happen automatically but needs
            to be triggered manually via the `train_index` method.
            Default: `None`
+       :param batch_size: Number of Documents to index at once / Number of queries to execute at once. If you face
The `write_documents` method of the `OpenSearchDocumentStore` also has a `batch_size` parameter with a default of 10_000. If we introduce a `batch_size` param in the init of the document store, we should also use `self.batch_size or batch_size` in `write_documents` and make the parameter `batch_size: Optional[int] = None` in the signature of the `write_documents` method of the `OpenSearchDocumentStore`.

Can we make the default 10_000 instead of 1_000 then to prevent a breaking change?
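The suggested fallback pattern could look roughly like this. This is a simplified stand-in class, not the real `OpenSearchDocumentStore`; the chunking body only illustrates where the effective batch size would be used:

```python
from typing import List, Optional


class OpenSearchDocumentStore:  # simplified stand-in, not the real haystack class
    def __init__(self, batch_size: int = 10_000):
        # Instance-level default, used whenever a method gets no explicit batch_size.
        self.batch_size = batch_size

    def write_documents(
        self, documents: List[dict], batch_size: Optional[int] = None
    ) -> List[List[dict]]:
        # Fall back to the instance default when the caller passes None.
        batch_size = batch_size or self.batch_size
        # Split the documents into chunks of at most batch_size; in the real store,
        # each chunk would be sent to OpenSearch in one `bulk` request.
        return [
            documents[i : i + batch_size]
            for i in range(0, len(documents), batch_size)
        ]
```

With this shape, `OpenSearchDocumentStore(batch_size=500).write_documents(docs)` batches by 500, while an explicit `write_documents(docs, batch_size=100)` still overrides the instance default.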
I fixed `write_documents` and changed the default back to 10_000. I initially changed it to 1_000 because I found that value in the Elasticsearch documentation, but that information is probably outdated, as it refers to a former version.
Pull Request Test Coverage Report for Build 5146692208
💛 - Coveralls
LGTM! 👍
Related Issues

Proposed Changes:
This PR introduces an instance parameter `batch_size` to `ElasticsearchDocumentStore` and `OpenSearchDocumentStore` that allows setting the number of queries / documents that are passed to Elasticsearch / OpenSearch using `msearch` or `bulk`, respectively.

How did you test it?
I added some unit tests.
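The splitting described above can be sketched independently of the document stores. The function names here are illustrative, not from the PR; in the real code each chunk would become one `msearch` (or `bulk`) request:

```python
from typing import Iterator, List


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def query_batched(queries: List[str], batch_size: int = 10_000) -> List[int]:
    # In the real document store, each chunk would be sent as a single msearch
    # request; here we just record the chunk sizes to show how a large query
    # list is split below the payload limit.
    return [len(chunk) for chunk in iter_batches(queries, batch_size)]
```

For example, 25,000 queries with the default batch size of 10_000 would be sent as three requests of 10,000, 10,000, and 5,000 queries, instead of one oversized request.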
Notes for the reviewer
This is needed for cases where `EmbeddingRetriever`'s `run_batch` method is called with several thousand queries. Without this change, we run into HTTP Error 413 `Transport Error: Payload too large`.

Checklist
- The PR title uses one of the prefixes `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.