[BUG] Poor recall if search clients is greater than 40 for cohere-10m workload in vector search #347

Closed
VijayanB opened this issue Jul 18, 2024 · 15 comments
Labels: bug (Something isn't working)

Comments

@VijayanB (Member)

What is the bug?

When executing a vector search workload with a large number of search clients (i.e., when each client gets < 5% of the queries), recall is very poor. This is not a problem with the vector search algorithm, since recall for the same dataset is 0.9 when the number of search clients is substantially lower.

How can one reproduce the bug?

Execute the 10M-corpus vector search workload with more than 40 search clients.

What is the expected behavior?

Recall should not be impacted by the number of search clients.

What is your host/environment?

N/A

Do you have any screenshots?

N/A

Do you have any additional context?

N/A

VijayanB added the bug (Something isn't working) and untriaged labels on Jul 18, 2024
@layavadi

Up to 15 clients, recall values are all non-zero. Beyond 18 clients, zero values start to appear. This is with 3 nodes.

@IanHoang (Collaborator) commented Jul 24, 2024

There's a proposal to look into index/search client scaling in OSB (more info can be found here).

What's the load generation host configuration?

@layavadi

16 vCPUs and 64 GB of memory. CPU utilization on the load generator with 40 clients was less than 40%.

@IanHoang (Collaborator) commented Jul 25, 2024

> Up to 15 clients, recall values are all non-zero. Beyond 18 clients, zero values start to appear. This is with 3 nodes.

When you mention 3 nodes, are you saying that there are three LG hosts, or a single load generation host running OSB against a 3-node cluster?

@layavadi commented Jul 26, 2024 via email

@IanHoang (Collaborator)

@layavadi To help with the investigation, could you attach some charts associated with the tests you have been running? It would be good to include three charts:

  • LG Host CPU Utilization
  • Cluster CPU Utilization
  • Cluster Search Throughput

@IanHoang (Collaborator) commented Oct 2, 2024

Discussed this offline with @layavadi and @VijayanB. This issue occurs when the user specifies more clients than the number of CPU cores on the load generation host. After closer inspection, it might be related to the recall implementation.

Will work closely with @VijayanB to better understand the recall implementation in OSB and make improvements if necessary.

@VijayanB (Member, Author) commented Oct 25, 2024

Steps to reproduce:

  1. Create an EC2 instance of type c5.2xlarge
  2. Install OSB on that EC2 instance
  3. Have a 2.17 (or any 2.x) OpenSearch cluster endpoint
  4. Copy the file https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/main/vectorsearch/params/corpus/1million/faiss-cohere-768-dp.json to the local file system
  5. Update the param file by replacing "cohere-1m" with "cohere" in two places
  6. Add "search_clients" to the param file and set it to 20 (a sketch of the resulting edits appears at the end of this comment)
  7. Execute the no-train-test workload:
export ENDPOINT=http://<cluster_endpoint>:80
export PROCEDURE="no-train-test"
export PARAMS="/home/ec2-user/faiss-cohere-768-dp.json"
opensearch-benchmark execute-test --pipeline=benchmark-only  --target-host=$ENDPOINT --test-procedure $PROCEDURE --workload-params $PARAMS --kill-running-processes

The summary report will show Mean recall@k < 0.9.

If you rerun with search_clients set to 5 instead, recall@k will be 0.9–1, which is the expected behavior.
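
For reference, the edits in steps 5 and 6 produce a params file along these lines. This is an abbreviated sketch only: unrelated fields are omitted, and the exact key names should be taken from the actual faiss-cohere-768-dp.json rather than from this excerpt.

{
  "target_index_bulk_index_data_set_corpus": "cohere",
  "query_data_set_corpus": "cohere",
  "search_clients": 20
}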

@IanHoang (Collaborator) commented Oct 30, 2024

@VijayanB I've reproduced the setup and have run no-train-test with 20 search_clients and 5 search_clients.

Setup

LG Host: c5.2xlarge
Cluster: 3 data nodes of c5.2xlarge
Ensured that the params file has the proper replacements (cohere)

Results

Both show poor mean recall@k:

20 Search Clients from report

|                                                  Mean recall@k |         prod-queries |        0.29 |        |
|                                                  Mean recall@1 |         prod-queries |        0.16 |        |

20_search_clients.json

5 Search Clients from report

|                                                  Mean recall@k |         prod-queries |        0.48 |        |
|                                                  Mean recall@1 |         prod-queries |        0.38 |        |

5_search_clients.json

Both tests had 0% error rates. Confirmed that this is using the 1k documents.

I have attached both test execution JSON files for more info. Is there anything I am doing differently from you?

@IanHoang (Collaborator)

Synced with Vijayan and he experienced the same phenomenon. He switched to a c3.4xlarge with 16 cores and tried with 10, 16, and 20 search clients. He got recall values of 1 for 10 and 16 clients, but not for 20, which came in at 0.71.

Will try with an EC2 4xlarge instance with 16 cores. Will also spend some time debugging to see if there's a short-term solution.

@IanHoang (Collaborator)

Based on @VijayanB's suggestion, moved to a 16-core machine and created a script that reran the same test with various client counts (a sketch of such a script follows the results below). Recall does indeed decrease when there are more clients than cores. Will need to look into the architecture in worker_coordinator.py.

[ec2-user@ip-172-31-37-157 2024-11-15_17-44-41]$ grep -r "Mean recall@k"
16-5-clients-result:|                                                  Mean recall@k |         prod-queries |           1 |        |
16-10-clients-result:|                                                  Mean recall@k |         prod-queries |           1 |        |
16-16-clients-result:|                                                  Mean recall@k |         prod-queries |           1 |        |
16-20-clients-result:|                                                  Mean recall@k |         prod-queries |        0.77 |        |
16-30-clients-result:|                                                  Mean recall@k |         prod-queries |        0.46 |        |
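
The rerun script itself was not attached to the issue; a minimal sketch of what it might look like, reusing the reproduction command from above and assuming jq is available to patch the params file, is:

#!/usr/bin/env bash
# Rerun the no-train-test procedure with several search_clients values on a
# 16-core load generation host, saving each summary report to its own file.
# (Sketch only -- the script actually used was not shared in this issue.)
set -euo pipefail

ENDPOINT="http://<cluster_endpoint>:80"
PROCEDURE="no-train-test"
BASE_PARAMS="/home/ec2-user/faiss-cohere-768-dp.json"

for CLIENTS in 5 10 16 20 30; do
  # Write a per-run copy of the params file with the desired search_clients value.
  PARAMS="/tmp/params-${CLIENTS}-clients.json"
  jq --argjson c "$CLIENTS" '.search_clients = $c' "$BASE_PARAMS" > "$PARAMS"

  # Capture the summary report (including the Mean recall@k rows) per run.
  opensearch-benchmark execute-test \
    --pipeline=benchmark-only \
    --target-host="$ENDPOINT" \
    --test-procedure "$PROCEDURE" \
    --workload-params "$PARAMS" \
    --kill-running-processes \
    | tee "16-${CLIENTS}-clients-result"
done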

@IanHoang (Collaborator) commented Dec 6, 2024

Included a short-term fix in OSB and have now updated the vectorsearch README.
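
In practical terms, the guidance amounts to keeping search_clients at or below the core count of the load generation host (the README wording itself is not reproduced here). A hypothetical pre-flight check, not part of OSB or the README, could look like:

# Warn if search_clients in the params file exceeds this host's core count,
# since the results above show recall being understated in that situation.
# PARAMS points at the workload params file, as in the reproduction steps.
CORES=$(nproc)
CLIENTS=$(jq -r '.search_clients // 1' "$PARAMS")
if [ "$CLIENTS" -gt "$CORES" ]; then
  echo "WARNING: search_clients ($CLIENTS) > cores ($CORES); recall may be understated." >&2
fi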

@rishabh6788 (Collaborator)

@IanHoang Can this be closed, with the long-term fix tracked in an RFC or meta issue?

@IanHoang (Collaborator) commented Jan 6, 2025

Yes, this can be closed. An RFC will be more appropriate for tracking the long-term fix.

IanHoang closed this as completed on Jan 6, 2025
github-project-automation bot moved this from 🏗 In progress to ✅ Done in the Engineering Effectiveness Board on Jan 6, 2025