[BUG] Poor recall if search clients is greater than 40 for cohere-10m workload in vector search #347
Comments
Up to 15 clients, recall values are all non-zero. Beyond 18 clients, zero values start to appear. This is with 3 nodes.
There's a proposal to look into index/search clients scaling in OSB (more info can be found here). What's the load generation host configuration?
16 vCPUs and 64 GB of memory. CPU utilization on the load generator with 40 clients was less than 40%.
When you mention 3 nodes, are you saying that there are three LG hosts, or a single load generation host running OSB against a 3-node cluster?
Single load generator with 3 cluster nodes
@layavadi To help with the investigation, could you attach some charts associated with the tests you have been running? It'd be good to include three charts:
Discussed this offline with @layavadi and @VijayanB. This issue occurs when the user specifies more clients than the number of CPU cores in the load generation host. After closer inspection, it might be related to the recall implementation. Will work closely with @VijayanB to better understand the recall implementation in OSB and make improvements if necessary.
Steps to reproduce:
The summary report will have recall@k < 0.9. If you replace the search client count with 5, recall@k will be 0.9/1, which is the expected behavior.
Recall is calculated here: https://github.com/opensearch-project/opensearch-benchmark/blob/main/osbenchmark/worker_coordinator/runner.py#L1268
The vector search query is created here: https://github.com/opensearch-project/opensearch-benchmark/blob/main/osbenchmark/workload/params.py#L1182
Neighbors are retrieved here: https://github.com/opensearch-project/opensearch-benchmark/blob/main/osbenchmark/workload/params.py#L1163-L1175
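For readers following those links, here is a minimal, self-contained sketch of the recall@k idea; it is illustrative only and not the OSB implementation linked above (the function name and inputs are assumptions).

```python
# Minimal sketch of recall@k, assuming the retrieved document IDs and the
# ground-truth neighbor IDs for each query are already available.
# Illustrative only; not the OSB code linked above.

def recall_at_k(retrieved_ids: list[str], true_neighbors: list[str], k: int) -> float:
    """Fraction of the top-k ground-truth neighbors found in the top-k results."""
    expected = set(true_neighbors[:k])
    if not expected:
        return 0.0
    found = sum(1 for doc_id in retrieved_ids[:k] if doc_id in expected)
    return found / min(k, len(expected))

# Example: 4 of the 5 expected neighbors were returned -> recall@5 = 0.8
print(recall_at_k(["a", "b", "c", "x", "e"], ["a", "b", "c", "d", "e"], k=5))
```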
@VijayanB I've reproduced the setup and have run the tests.

Setup
LG host: c5.2xlarge

Results
Both show poor mean recall@k:
20 search clients from report
5 search clients from report

Both tests had 0% error rates. Confirmed that this is using the …. I have attached both test execution JSON files for more info. Is there anything that I am doing differently from you?
Synced with Vijayan and he experienced the same phenomenon. He switched to … Will try with a …
Based on @VijayanB's suggestion, moved to a 16-core machine and created a script that reran the same test with various client counts. Recall does indeed decrease when there are more clients than cores. Will need to look into the architecture in OSB.
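For context, a hedged sketch of what such a rerun script could look like. The `execute-test` CLI options shown are standard OSB flags, but the `search_clients` workload parameter name, the params file, the client-count values, and the target host are assumptions and may differ from the actual script used here.

```python
# Sketch of a script that reruns the vectorsearch workload with varying
# client counts, assuming OSB is installed and a base vectorsearch params
# file (dataset paths, index settings, etc.) already exists.
# "search_clients" is the assumed name of the parameter controlling query
# concurrency; adjust names and paths to your setup.
import json
import subprocess

BASE_PARAMS_FILE = "vectorsearch-params.json"   # assumption: your existing params file
TARGET_HOSTS = "https://localhost:9200"         # assumption: your cluster endpoint

with open(BASE_PARAMS_FILE) as f:
    base_params = json.load(f)

for clients in (2, 4, 8, 16, 24, 32, 40):
    # Write a per-run params file with the desired client count.
    params = dict(base_params, search_clients=clients)
    run_file = f"params-{clients}-clients.json"
    with open(run_file, "w") as f:
        json.dump(params, f)

    print(f"Running vectorsearch with {clients} search clients...")
    subprocess.run(
        [
            "opensearch-benchmark", "execute-test",
            "--workload=vectorsearch",
            "--pipeline=benchmark-only",
            f"--target-hosts={TARGET_HOSTS}",
            f"--workload-params={run_file}",
        ],
        check=True,  # each run prints a summary report that includes mean recall@k
    )
```

Comparing the mean recall@k from each run's summary report against the client count is what surfaced the drop once clients exceed the core count.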
Included a short-term fix in OSB and have now updated the vectorsearch README.
@IanHoang Can this be closed, with the long-term fix tracked in an RFC or meta issue?
Yes, this can be closed. An RFC will be more appropriate to track the long-term fix for this.
What is the bug?
When executing the vector search workload with a large number of search clients (such that each client gets < 5% of the queries), recall is very poor. This is not a problem with the vector search algorithm, since for the same dataset recall is 0.9 when the number of search clients is substantially lower.
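To make the "< 5% of queries" condition concrete, a small illustrative calculation follows; the total query count is a placeholder, not the actual cohere-10m figure.

```python
# Illustrative only: per-client share of a fixed query budget.
# total_queries is a placeholder, not the actual cohere-10m query count.
total_queries = 10_000

for clients in (5, 20, 40, 80):
    per_client = total_queries // clients
    share = per_client / total_queries
    print(f"{clients:>2} clients -> {per_client} queries each ({share:.1%} of the total)")
```

With 20 clients each client handles exactly 5% of the queries; at 40 clients and above, each client's share drops below that threshold.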
How can one reproduce the bug?
Execute the 10m-corpus vector search workload with search clients > 40.
What is the expected behavior?
Recall should not be impacted
What is your host/environment?
N/A
Do you have any screenshots?
N/A
Do you have any additional context?
N/A