Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved Exact Search to return only K results and added client side latency metric for query Benchmarks #933

Conversation

navneet1v
Copy link
Collaborator

Description

Improved Exact Search to return only K results and added client side latency metric for query Benchmarks

Issues Resolved

#903

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -472,7 +475,7 @@ def _action(self):
return results

def _get_measures(self) -> List[str]:
measures = ['took', 'memory_kb']
measures = ['took', 'memory_kb', 'client_time']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be nice to have a unit as well. client_time_millis or client_time_seconds

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can we use constant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this I don't see constant will make a lot of sense. Plus these are benchmark scripts, so I don't see a big advantage here.

For time unit, we don't do it for other metrics. Hence I feel for consistency we should keep it. Everything is already in milliseconds only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as its the same unit as took, thats fine

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its the same unit.

float score = spaceType.getVectorSimilarityFunction().compare(queryVector, vector);
docToScore.put(docId, score);
if(score > topDoc.score) {
topDoc.score = score;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this logic.
You are updating topDoc with new score and new docId. Which means, the queue will replace current top document with new top document. Instead shouldn't we remove the bottom doc and add the new topDoc in the queue?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a min heap and not max heap. Plus we did the init of the heap. The init of the heap will set all the values as DocId: MAX_DOC_ID and Score as -INF. Now, everytime we update the top, the top element will have score of -INF.

I will add comments around this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. So we basically need to keep track of the worst result we have come across so far so that we know which one to replace.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is correct. This is what min heap does for you.

}
}
// If scores are negative we will remove them
while (queue.size() > 0 && queue.top().score < 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If queue.top() score is negative, doesn't that mean every score in the queue is negative? Then, we can simply empties it instead of looping through it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that heap is init with all values having score of -INF. So in case filterIds < k, some ids in the heap can have -INF value. Hence we need to remove them.

@navneet1v navneet1v force-pushed the feature/faiss-filtering branch from 7e67a34 to 93c7595 Compare June 9, 2023 22:09
if(score > topDoc.score) {
topDoc.score = score;
topDoc.doc = docId;
// As the HitQueue is min heap, updating top will bring the doc with -INF score value on top.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just -INF - the lowest score in the queue, correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the lowest score too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me update the documentation

Copy link
Member

@jmazanec15 jmazanec15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Might want to add a few comments around heap operations.

…latency metric for query Benchmarks

Signed-off-by: Navneet Verma <[email protected]>
@navneet1v navneet1v force-pushed the feature/faiss-filtering branch from 93c7595 to 825dbc9 Compare June 9, 2023 23:05
@navneet1v navneet1v merged commit f5ff953 into opensearch-project:feature/faiss-filtering Jun 9, 2023
@navneet1v
Copy link
Collaborator Author

Looks good to me. Might want to add a few comments around heap operations.

updated

navneet1v added a commit to navneet1v/k-NN that referenced this pull request Jun 14, 2023
…latency metric for query Benchmarks (opensearch-project#933)

Signed-off-by: Navneet Verma <[email protected]>
navneet1v added a commit to navneet1v/k-NN that referenced this pull request Jun 14, 2023
…latency metric for query Benchmarks (opensearch-project#933)

Signed-off-by: Navneet Verma <[email protected]>
navneet1v added a commit to navneet1v/k-NN that referenced this pull request Jun 14, 2023
…es include

 * Enabled the efficient filtering support for Faiss Engine (opensearch-project#907)
 * Fixed the ef_search default value for faiss HNSW with filters and updated the perf-tool to include Faiss HNSW tests (opensearch-project#926)
 * Added exact search for cases when filteredIds < k to improve the recall for exact search (opensearch-project#928)
 * Improved Exact Search to return only K results and added client side latency metric for query Benchmarks (opensearch-project#933)
 * Added Integration Tests and Unit test for Efficient Filtering for Faiss Engine (opensearch-project#934)

Signed-off-by: Navneet Verma <[email protected]>
navneet1v added a commit to navneet1v/k-NN that referenced this pull request Jun 14, 2023
…es include

 * Enabled the efficient filtering support for Faiss Engine (opensearch-project#907)
 * Fixed the ef_search default value for faiss HNSW with filters and updated the perf-tool to include Faiss HNSW tests (opensearch-project#926)
 * Added exact search for cases when filteredIds < k to improve the recall for exact search (opensearch-project#928)
 * Improved Exact Search to return only K results and added client side latency metric for query Benchmarks (opensearch-project#933)
 * Added Integration Tests and Unit test for Efficient Filtering for Faiss Engine (opensearch-project#934)

Signed-off-by: Navneet Verma <[email protected]>
navneet1v added a commit to navneet1v/k-NN that referenced this pull request Jun 14, 2023
…es include

 * Enabled the efficient filtering support for Faiss Engine (opensearch-project#907)
 * Fixed the ef_search default value for faiss HNSW with filters and updated the perf-tool to include Faiss HNSW tests (opensearch-project#926)
 * Added exact search for cases when filteredIds < k to improve the recall for exact search (opensearch-project#928)
 * Improved Exact Search to return only K results and added client side latency metric for query Benchmarks (opensearch-project#933)
 * Added Integration Tests and Unit test for Efficient Filtering for Faiss Engine (opensearch-project#934)

Signed-off-by: Navneet Verma <[email protected]>
navneet1v added a commit to navneet1v/k-NN that referenced this pull request Jun 14, 2023
…es include

 * Enabled the efficient filtering support for Faiss Engine (opensearch-project#907)
 * Fixed the ef_search default value for faiss HNSW with filters and updated the perf-tool to include Faiss HNSW tests (opensearch-project#926)
 * Added exact search for cases when filteredIds < k to improve the recall for exact search (opensearch-project#928)
 * Improved Exact Search to return only K results and added client side latency metric for query Benchmarks (opensearch-project#933)
 * Added Integration Tests and Unit test for Efficient Filtering for Faiss Engine (opensearch-project#934)

Signed-off-by: Navneet Verma <[email protected]>
navneet1v added a commit that referenced this pull request Jun 14, 2023
…es include (#936)

* Enabled the efficient filtering support for Faiss Engine (#907)
 * Fixed the ef_search default value for faiss HNSW with filters and updated the perf-tool to include Faiss HNSW tests (#926)
 * Added exact search for cases when filteredIds < k to improve the recall for exact search (#928)
 * Improved Exact Search to return only K results and added client side latency metric for query Benchmarks (#933)
 * Added Integration Tests and Unit test for Efficient Filtering for Faiss Engine (#934)

Signed-off-by: Navneet Verma <[email protected]>
navneet1v added a commit to navneet1v/k-NN that referenced this pull request Jun 14, 2023
…es include (opensearch-project#936)

* Enabled the efficient filtering support for Faiss Engine (opensearch-project#907)
 * Fixed the ef_search default value for faiss HNSW with filters and updated the perf-tool to include Faiss HNSW tests (opensearch-project#926)
 * Added exact search for cases when filteredIds < k to improve the recall for exact search (opensearch-project#928)
 * Improved Exact Search to return only K results and added client side latency metric for query Benchmarks (opensearch-project#933)
 * Added Integration Tests and Unit test for Efficient Filtering for Faiss Engine (opensearch-project#934)

Signed-off-by: Navneet Verma <[email protected]>
navneet1v added a commit to navneet1v/k-NN that referenced this pull request Jul 10, 2023
…es include (opensearch-project#936)

* Enabled the efficient filtering support for Faiss Engine (opensearch-project#907)
 * Fixed the ef_search default value for faiss HNSW with filters and updated the perf-tool to include Faiss HNSW tests (opensearch-project#926)
 * Added exact search for cases when filteredIds < k to improve the recall for exact search (opensearch-project#928)
 * Improved Exact Search to return only K results and added client side latency metric for query Benchmarks (opensearch-project#933)
 * Added Integration Tests and Unit test for Efficient Filtering for Faiss Engine (opensearch-project#934)

Signed-off-by: Navneet Verma <[email protected]>
navneet1v added a commit that referenced this pull request Jul 10, 2023
…es include (#936)

* Enabled the efficient filtering support for Faiss Engine (#907)
 * Fixed the ef_search default value for faiss HNSW with filters and updated the perf-tool to include Faiss HNSW tests (#926)
 * Added exact search for cases when filteredIds < k to improve the recall for exact search (#928)
 * Improved Exact Search to return only K results and added client side latency metric for query Benchmarks (#933)
 * Added Integration Tests and Unit test for Efficient Filtering for Faiss Engine (#934)

Signed-off-by: Navneet Verma <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants