
bug: make ElasticSearchDocumentStore use batch_size in get_documents_by_id #3166

Merged: 5 commits, Sep 26, 2022
Conversation

@anakin87 (Member) commented Sep 5, 2022

Related Issues

Proposed Changes:

The batch_size parameter wasn't used in get_documents_by_id.
Now the method uses batch_size, splitting the retrieval into several queries based on this parameter.
Implementation inspired by SQLDocumentStore.
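The chunking idea behind the change can be sketched as follows. This is a minimal, self-contained illustration, not Haystack's actual API: the dict-backed store and the function signature are assumptions for the sketch, standing in for the Elasticsearch client call.

```python
from typing import Dict, List

def get_documents_by_id(
    store: Dict[str, dict], ids: List[str], batch_size: int = 10_000
) -> List[dict]:
    """Fetch documents in chunks of at most `batch_size` ids per query."""
    documents: List[dict] = []
    for i in range(0, len(ids), batch_size):
        chunk = ids[i : i + batch_size]
        # In the real document store, each chunk becomes one Elasticsearch
        # query; here a dict lookup stands in for the search call.
        documents.extend(store[doc_id] for doc_id in chunk if doc_id in store)
    return documents
```

Splitting the id list this way trades extra network roundtrips for smaller individual requests, which is the trade-off discussed below.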

How did you test it?

Manual verification

Checklist

@anakin87 anakin87 requested a review from a team as a code owner September 5, 2022 20:21
@anakin87 anakin87 requested review from masci and removed request for a team September 5, 2022 20:21
@anakin87 anakin87 marked this pull request as draft September 5, 2022 21:50
@anakin87 anakin87 marked this pull request as ready for review September 5, 2022 21:50
@masci (Contributor) left a comment
I have a feeling this is going to be slower because of the network roundtrips the ES and OS clients would make. Out of curiosity, did you test this in a way where the search is performed multiple times?

If we confirm it's slower, I would consider either not implementing this "by choice" or putting a caveat in the documentation.

@anakin87 (Member, Author) commented Sep 6, 2022

Honestly, I saw the issue and simply submitted this PR.
I only verified that with this PR the documents can actually be retrieved in batches; I haven't asked myself many questions about performance.

However, I understand and share your point of view: if the batch_size is small, document retrieval can become slow...

We can decide not to implement this behavior, or add a caveat to the documentation and raise a warning for the user.

@masci please let me know if, in your opinion, it is worth running some tests to evaluate retrieval times for different batch_size values.

@masci (Contributor) commented Sep 8, 2022

I think it's worth testing your branch to get a sense of the performance penalty, so we can make an informed decision. I don't have much bandwidth right now, but I'll try it out.

@anakin87 (Member, Author) commented Sep 9, 2022

I ran some tests on my branch (you can find them in this Colab notebook).

I used ~17k short documents and tested several batch_size values between 10,000 (the max) and 1, with five runs each.

Here are the results:
| batch_size | run | time (s) |
|-----------:|----:|---------:|
| 10000 | 0 | 1.68436 |
| 10000 | 1 | 1.52868 |
| 10000 | 2 | 1.51483 |
| 10000 | 3 | 1.01676 |
| 10000 | 4 | 1.08534 |
| 5000 | 0 | 1.56209 |
| 5000 | 1 | 1.10215 |
| 5000 | 2 | 0.880323 |
| 5000 | 3 | 1.15715 |
| 5000 | 4 | 0.867851 |
| 1000 | 0 | 1.47007 |
| 1000 | 1 | 1.05159 |
| 1000 | 2 | 1.03378 |
| 1000 | 3 | 0.999881 |
| 1000 | 4 | 1.30527 |
| 500 | 0 | 1.06979 |
| 500 | 1 | 1.66162 |
| 500 | 2 | 1.47191 |
| 500 | 3 | 1.07159 |
| 500 | 4 | 1.13922 |
| 100 | 0 | 2.32787 |
| 100 | 1 | 1.85083 |
| 100 | 2 | 1.65543 |
| 100 | 3 | 1.51368 |
| 100 | 4 | 1.73016 |
| 50 | 0 | 1.89533 |
| 50 | 1 | 1.6341 |
| 50 | 2 | 1.70037 |
| 50 | 3 | 1.73407 |
| 50 | 4 | 1.95229 |
| 10 | 0 | 5.19101 |
| 10 | 1 | 4.6031 |
| 10 | 2 | 4.28554 |
| 10 | 3 | 3.79545 |
| 10 | 4 | 3.54379 |
| 5 | 0 | 6.28028 |
| 5 | 1 | 5.501 |
| 5 | 2 | 5.09717 |
| 5 | 3 | 5.06938 |
| 5 | 4 | 5.23445 |
| 1 | 0 | 18.9656 |
| 1 | 1 | 18.6559 |
| 1 | 2 | 18.4501 |
| 1 | 3 | 19.6474 |
| 1 | 4 | 18.7701 |

Even though these tests are very crude, they show, as expected, that retrieval time grows considerably when batch_size is very low.
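A timing loop along these lines could produce measurements like the table above. This is a minimal sketch under the assumption of a generic `fetch` callable (taking the id list and a batch size) standing in for the actual document store call; the helper name is hypothetical.

```python
import time
from typing import Callable, List, Tuple

def time_batched_retrieval(
    fetch: Callable[[List[str], int], list],
    ids: List[str],
    batch_sizes: List[int],
    runs: int = 5,
) -> List[Tuple[int, int, float]]:
    """Return one (batch_size, run_index, elapsed_seconds) tuple per run."""
    results: List[Tuple[int, int, float]] = []
    for batch_size in batch_sizes:
        for run in range(runs):
            start = time.perf_counter()
            fetch(ids, batch_size)  # e.g. get_documents_by_id with this batch_size
            results.append((batch_size, run, time.perf_counter() - start))
    return results
```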

I see two alternative possibilities:

  • we don't want the user to use low batch_size values, so we don't implement this option
  • we offer this option, but warn the user of possible performance drops (in the documentation and/or by raising a warning).

@masci WDYT?

@masci (Contributor) commented Sep 10, 2022

@anakin87 I've been thinking about a use case for this feature that isn't about speed, and I found one: it would be useful to avoid sending the cluster requests that are too big for it to handle. In this case, the performance penalty is a price users are willing to pay in order to reduce pressure on the cluster.

Let's go with option number 2 then. I would just add a warning note in the docstrings; no need to emit warnings, IMO.

@anakin87 anakin87 requested a review from a team as a code owner September 10, 2022 18:32

@anakin87 anakin87 marked this pull request as draft September 10, 2022 18:34
@anakin87 anakin87 marked this pull request as ready for review September 10, 2022 18:46
@anakin87 (Member, Author)

After the usual git mess 😄, I have added some details about the batch_size parameter to the docstrings.

@anakin87 anakin87 requested a review from masci September 12, 2022 08:41
@masci (Contributor) left a comment
LGTM, waiting for the docs team to have a look at wording before merging

@anakin87 (Member, Author)

@masci could you request a review from the docs team?

@brandenchan brandenchan self-requested a review September 20, 2022 08:06
The docstring under review (diff excerpt, reordered):

    Fetch documents by specifying a list of text id strings.

    :param ids: list of document IDs. Be aware that passing a large number of ids might lead
        to performance issues. Note that Elasticsearch limits the number of results
        to 10,000 documents by default.
A Contributor commented on this excerpt:

Let's capitalize the beginning of argument descriptions (i.e. "List" instead of "list"). Can we give the user a sense of what a large number of ids is? Is 10K ok? 100K? Does it depend on how much is already indexed?

@anakin87 (Member, Author) replied:

Honestly I do not know.
This passage was already part of the original docstring.

@anakin87 (Member, Author) added:

In my tests, I retrieved 17k documents with no particular issue.

haystack/document_stores/elasticsearch.py: 3 review threads (outdated, resolved)
@anakin87 anakin87 marked this pull request as draft September 20, 2022 21:42
@anakin87 anakin87 marked this pull request as ready for review September 20, 2022 21:43
@anakin87 anakin87 marked this pull request as draft September 20, 2022 23:03
@anakin87 anakin87 marked this pull request as ready for review September 20, 2022 23:04
@anakin87 (Member, Author)

@masci @ZanSara As you can see in the logs, CI seems to be failing for a problem similar to the one addressed in #3199.

@anakin87 anakin87 marked this pull request as draft September 21, 2022 12:54
@anakin87 anakin87 marked this pull request as ready for review September 21, 2022 12:55
@masci masci merged commit b579b9d into deepset-ai:main Sep 26, 2022
Development

Successfully merging this pull request may close these issues.

ElasticSearchDocumentStore does not use batch_size in get_documents_by_id
3 participants