bug: make ElasticSearchDocumentStore use `batch_size` in `get_documents_by_id` #3166
Conversation
I have a feeling this is going to be slower because of the network round trips the ES and OS clients would make. Out of curiosity, did you test this in a way where the search is performed multiple times?
If we confirm it's slower, I would consider either not implementing this "by choice" or putting a caveat in the documentation.
Honestly, I saw the issue and simply submitted this PR. However, I understand and share your point of view. We can decide not to implement this behavior, or add a caveat to the documentation and raise a warning for the user. @masci please let me know if, in your opinion, it is worth making some tests to evaluate the retrieval times based on the `batch_size`.
I think it's worth testing your branch to get a sense of the performance penalty, in order to make an informed decision. I don't have much bandwidth now but I'll try it out.
I made some tests on my branch (you can find them in this Colab notebook). I used ~17k short documents and tested several `batch_size` values. Here are the results:
Even if the tests are very crude, it emerges that, as expected, using a small `batch_size` slows down retrieval. I see two alternative possibilities:
@masci WDYT?
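The shape of the timing comparison described above can be sketched as follows. This is a self-contained illustration, not the actual notebook code: `time_retrieval` and `simulated_fetch` are hypothetical names, and the Elasticsearch round trip is simulated with a fixed per-request sleep so the latency effect of small batches is visible without a cluster.

```python
import time
from typing import Callable, List


def time_retrieval(ids: List[str], batch_size: int,
                   fetch: Callable[[List[str]], list]) -> float:
    """Return elapsed seconds for retrieving all ids in batch_size chunks."""
    t0 = time.perf_counter()
    for start in range(0, len(ids), batch_size):
        fetch(ids[start:start + batch_size])
    return time.perf_counter() - t0


def simulated_fetch(chunk: List[str]) -> list:
    # Stand-in for one request to the cluster: each request pays a fixed
    # latency, so smaller batches mean more requests and more total time.
    time.sleep(0.001)
    return chunk


ids = [str(i) for i in range(1_000)]
slow = time_retrieval(ids, batch_size=10, fetch=simulated_fetch)     # 100 requests
fast = time_retrieval(ids, batch_size=1_000, fetch=simulated_fetch)  # 1 request
```

Under this model, `slow` accumulates roughly 100 request latencies versus one for `fast`, which matches the intuition that batching trades throughput for smaller individual requests.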
@anakin87 I've been thinking about a use case for this feature that's not speed, and I found one: this would be useful to avoid sending the cluster requests that are too big for it to handle. In this case, the performance penalty would be a price users are willing to pay in order to reduce pressure on the cluster. Let's go with option number 2 then; I would just add a warning note in the docstrings, no need to emit warnings IMO.
After some usual git mess 😄, |
LGTM, waiting for the docs team to have a look at the wording before merging.
@masci could you request a review from the docs team?
Fetch documents by specifying a list of text id strings.

:param ids: list of document IDs. Be aware that passing a large number of ids might lead
    to performance issues. Note that Elasticsearch limits the number of results to 10,000 documents by default.
Let's capitalize the beginning of argument descriptions (i.e. "List" instead of "list"). Can we give the user a sense of what a large number of ids is? Is 10K ok? 100K? Does it depend on how much is already indexed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I do not know.
This passage was already part of the original docstring.
In my tests, I retrieved 17k documents with no particular issue.
@masci @ZanSara As you can see in the logs, it seems that the CI is failing due to a problem similar to the one addressed in #3199.
Related Issues
`ElasticSearchDocumentStore` does not use `batch_size` in `get_documents_by_id` #3153
Proposed Changes:
The `batch_size` parameter wasn't used in `get_documents_by_id`. Now the method uses `batch_size`, making several queries based on this parameter. Implementation inspired by `SQLDocumentStore`.
How did you test it?
Manual verification
Checklist