[BUG] Poor recall if search clients is greater than 40 for cohere-10m workload in vector search #347

Closed
VijayanB opened this issue Jul 18, 2024 · 15 comments
Labels: bug (Something isn't working)

Comments

@VijayanB (Member)

What is the bug?

When executing a vector search workload with a large number of search clients (i.e., when each client gets < 5% of the queries), recall is very poor. This is not a problem with the vector search algorithm, since recall for the same dataset is 0.9 when the number of search clients is substantially lower.

How can one reproduce the bug?

Execute the 10M-corpus vector search workload with more than 40 search clients.

What is the expected behavior?

Recall should not be impacted by the number of search clients.

What is your host/environment?

N/A

Do you have any screenshots?

N/A

Do you have any additional context?

N/A

VijayanB added the bug (Something isn't working) and untriaged labels on Jul 18, 2024
@layavadi

Up to 15 clients, recall values are all non-zero. Beyond 18 clients, zero values start to appear. This is with 3 nodes.

@IanHoang (Collaborator) commented Jul 24, 2024

There's a proposal to look into index/search client scaling in OSB (more info can be found here).

What's the load generation host configuration?

@layavadi

16 vCPUs and 64 GB of memory. CPU utilization on the load generator with 40 clients was less than 40%.

@IanHoang (Collaborator) commented Jul 25, 2024

> Up to 15 clients, recall values are all non-zero. Beyond 18 clients, zero values start to appear. This is with 3 nodes.

When you mention 3 nodes, are you saying that there are three LG hosts, or a single load generation host running OSB against a 3-node cluster?

@layavadi commented Jul 26, 2024 via email

@IanHoang (Collaborator)

@layavadi To help with the investigation, could you attach some charts associated with the tests you have been running? It would be good to include three charts:

  • LG Host CPU Utilization
  • Cluster CPU Utilization
  • Cluster Search Throughput

@IanHoang (Collaborator) commented Oct 2, 2024

Discussed this offline with @layavadi and @VijayanB. This issue occurs when the user specifies more clients than the number of CPU cores on the load generation host. After closer inspection, it might be related to the recall implementation.

Will work closely with @VijayanB to better understand the recall implementation in OSB and make improvements if necessary.

@VijayanB (Member, Author) commented Oct 25, 2024

Steps to reproduce:

  1. Create an EC2 instance of type c5.2xlarge
  2. Install OSB on that EC2 instance
  3. Have a 2.17 (or any 2.x) OpenSearch cluster endpoint
  4. Copy the file https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/main/vectorsearch/params/corpus/1million/faiss-cohere-768-dp.json to the local file system
  5. Update the param file by replacing "cohere-1m" with "cohere" in two places
  6. Add "search_clients" to the param file and set it to 20 (a sketch of the resulting edits appears at the end of this comment)
  7. Execute the no-train-test workload:
export ENDPOINT=http://<cluster_endpoint>:80
export PROCEDURE="no-train-test"
export PARAMS="/home/ec2-user/faiss-cohere-768-dp.json"
opensearch-benchmark execute-test --pipeline=benchmark-only  --target-host=$ENDPOINT --test-procedure $PROCEDURE --workload-params $PARAMS --kill-running-processes

The summary report will show Mean recall@k < 0.9.

If you rerun with search_clients set to 5 instead, recall@k will be 0.9–1, which is the expected behavior.
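
For reference, the edits in steps 5 and 6 produce a params file along these lines. This is an abbreviated sketch only: unrelated fields are omitted, and the exact key names should be taken from the actual faiss-cohere-768-dp.json rather than from this excerpt.

{
  "target_index_bulk_index_data_set_corpus": "cohere",
  "query_data_set_corpus": "cohere",
  "search_clients": 20
}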

@IanHoang (Collaborator) commented Oct 30, 2024

@VijayanB I've reproduced the setup and have run no-train-test with 20 search_clients and 5 search_clients.

Setup

LG Host: c5.2xlarge
Cluster: 3 data nodes of c5.2xlarge
Ensured that the params file has the proper replacements (cohere)

Results

Both show poor mean recall@k:

20 Search Clients from report

|                                                  Mean recall@k |         prod-queries |        0.29 |        |
|                                                  Mean recall@1 |         prod-queries |        0.16 |        |

20_search_clients.json

5 Search Clients from report

|                                                  Mean recall@k |         prod-queries |        0.48 |        |
|                                                  Mean recall@1 |         prod-queries |        0.38 |        |

5_search_clients.json

Both tests had 0% error rates. Confirmed that this is using the 1k documents.

I have attached both test execution JSON files for more info. Is there anything I am doing differently from you?

@IanHoang (Collaborator)

Synced with Vijayan and he experienced the same phenomenon. He switched to a c3.4xlarge with 16 cores and tried with 10, 16, and 20 search clients. He got recall values of 1 for 10 and 16 clients, but not for 20, which came in at 0.71.

Will try with an EC2 4xlarge instance with 16 cores. Will also spend some time debugging to see if there's a short-term solution.

@IanHoang (Collaborator)

Based on @VijayanB's suggestion, moved to a 16-core machine and created a script that reran the same test with various client counts (a sketch of such a script follows the results below). Recall does indeed decrease when there are more clients than cores. Will need to look into the architecture in worker_coordinator.py.

[ec2-user@ip-172-31-37-157 2024-11-15_17-44-41]$ grep -r "Mean recall@k"
16-5-clients-result:|                                                  Mean recall@k |         prod-queries |           1 |        |
16-10-clients-result:|                                                  Mean recall@k |         prod-queries |           1 |        |
16-16-clients-result:|                                                  Mean recall@k |         prod-queries |           1 |        |
16-20-clients-result:|                                                  Mean recall@k |         prod-queries |        0.77 |        |
16-30-clients-result:|                                                  Mean recall@k |         prod-queries |        0.46 |        |
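
The rerun script itself was not attached to the issue; a minimal sketch of what it might look like, reusing the reproduction command from above and assuming jq is available to patch the params file, is:

#!/usr/bin/env bash
# Rerun the no-train-test procedure with several search_clients values on a
# 16-core load generation host, saving each summary report to its own file.
# (Sketch only -- the script actually used was not shared in this issue.)
set -euo pipefail

ENDPOINT="http://<cluster_endpoint>:80"
PROCEDURE="no-train-test"
BASE_PARAMS="/home/ec2-user/faiss-cohere-768-dp.json"

for CLIENTS in 5 10 16 20 30; do
  # Write a per-run copy of the params file with the desired search_clients value.
  PARAMS="/tmp/params-${CLIENTS}-clients.json"
  jq --argjson c "$CLIENTS" '.search_clients = $c' "$BASE_PARAMS" > "$PARAMS"

  # Capture the summary report (including the Mean recall@k rows) per run.
  opensearch-benchmark execute-test \
    --pipeline=benchmark-only \
    --target-host="$ENDPOINT" \
    --test-procedure "$PROCEDURE" \
    --workload-params "$PARAMS" \
    --kill-running-processes \
    | tee "16-${CLIENTS}-clients-result"
done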

@IanHoang (Collaborator) commented Dec 6, 2024

Included a short-term fix in OSB and have now updated the vectorsearch README.
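
In practical terms, the guidance amounts to keeping search_clients at or below the core count of the load generation host (the README wording itself is not reproduced here). A hypothetical pre-flight check, not part of OSB or the README, could look like:

# Warn if search_clients in the params file exceeds this host's core count,
# since the results above show recall being understated in that situation.
# PARAMS points at the workload params file, as in the reproduction steps.
CORES=$(nproc)
CLIENTS=$(jq -r '.search_clients // 1' "$PARAMS")
if [ "$CLIENTS" -gt "$CORES" ]; then
  echo "WARNING: search_clients ($CLIENTS) > cores ($CORES); recall may be understated." >&2
fi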

@rishabh6788 (Collaborator)

@IanHoang Can this be closed, with the long-term fix tracked in an RFC or meta issue?

@IanHoang (Collaborator) commented Jan 6, 2025

Yes, this can be closed. An RFC will be more appropriate for tracking the long-term fix.

IanHoang closed this as completed on Jan 6, 2025
github-project-automation bot moved this from 🏗 In progress to ✅ Done in the Engineering Effectiveness Board on Jan 6, 2025