fix issue with elasticsearch extension to respect limit and minReleva… #935

Open
wants to merge 14 commits into
base: main

basyonic

…nce parameters when calling GetSimilarListAsync API

Motivation and Context (Why the change? What's the scenario?)

Fixing issue #934

High level description (Approach, Design)

Use the Elasticsearch client's .Size(limit) clause in the query so that the limit parameter is respected.
Use .Similarity(min score) to enforce minRelevance, after adjusting the score value for cosine similarity as described in https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#knn-similarity-search.

@basyonic
Author

@microsoft-github-policy-service agree

@@ -218,15 +218,20 @@ public async Task<string> UpsertAsync(
Embedding embedding = await this._embeddingGenerator.GenerateEmbeddingAsync(text, cancellationToken).ConfigureAwait(false);
var coll = embedding.Data.ToArray();

//adjust min score for cosine similarity
//https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#knn-similarity-search
float adjustedMinSimilarityScore = (float)(2 * minRelevance - 1);
Collaborator

This formula converts a score to a cosine similarity; internally, however, ES works with scores.

I believe this should be the inverse, e.g. `float adjustedMinSimilarityScore = (float)((minRelevance + 1) / 2)`.

By the way, the formula `(2 * minRelevance - 1)` should be used on line 253, replacing `hit.Score` with the actual cosine similarity. The value currently returned is incorrect and doesn't match the similarity returned by other memory connectors.
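
The two formulas discussed in this thread are inverses of each other. The sketch below assumes the cosine mapping described in the linked Elastic documentation, `_score = (1 + cosine) / 2`, and shows both directions side by side. `CosineScoreConversion` and its method names are hypothetical helpers for illustration and are not part of the PR or of the Elasticsearch client. Which direction is needed depends on whether the value being passed is a score or a raw similarity, which is the point under review here.

```csharp
// Hypothetical helper, not part of the PR: converts between an Elasticsearch
// cosine _score in [0, 1] and a raw cosine similarity in [-1, 1].
public static class CosineScoreConversion
{
    // Assumed mapping from the linked documentation: _score = (1 + cosine) / 2.
    public static double ScoreFromCosine(double cosine) => (1.0 + cosine) / 2.0;

    // Inverse mapping: cosine = 2 * _score - 1.
    public static double CosineFromScore(double score) => (2.0 * score) - 1.0;
}
```

For example, `CosineScoreConversion.CosineFromScore(0.75)` evaluates to `0.5`, and `CosineScoreConversion.ScoreFromCosine(0.5)` evaluates back to `0.75`.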

Author

@basyonic Feb 5, 2025

As per https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#knn-similarity-search, the default similarity metric is cosine, and I didn't see anything in the plugin code that defines a different similarity metric.
In my tests, this score adjustment for cosine returned correct results.
Please let me know if I'm missing something here, or if there are other tests you would like me to include.

> A similarity value. This value determines the similarity metric used to score documents based on similarity between the query and document vector. For a list of available metrics, see the [similarity](https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html#dense-vector-similarity) parameter documentation. **The similarity setting defaults to cosine**.

var resp = await this._client.SearchAsync<ElasticsearchMemoryRecord>(s =>
- s.Index(index)
+ s.Index(index).Size(limit)
Collaborator

`qd.k(limit)` already sets a limit. Isn't this redundant? Should we remove the limit parameter from `qd.k(limit)`?

Author

kNN search adds k matching documents to the search, so with k=20 it finds 20 matches. The size parameter then takes the top-scoring documents from these and returns them.
In my tests on a real dataset, the query always returned 10 results regardless of the k(limit) value. I think 10 is the default value of the size parameter.
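
The interaction described above can be modeled without calling Elasticsearch at all. The sketch below is a simplified, hypothetical model (`KnnSizeModel` and `Search` are illustrative names, not client APIs): the kNN step keeps the `k` best candidates, and the response then returns at most `size` of them. It assumes the search API's default `size` of 10, which would explain the behavior observed in the tests.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of the k / size interaction; it does not call Elasticsearch.
public static class KnnSizeModel
{
    // Assumed default number of hits returned when size is not set explicitly.
    public const int DefaultSize = 10;

    // Step 1: the kNN clause keeps the k highest-scoring candidates.
    // Step 2: the search response returns at most `size` of those candidates.
    public static List<double> Search(IEnumerable<double> scores, int k, int? size = null)
    {
        return scores
            .OrderByDescending(s => s)
            .Take(k)
            .Take(size ?? DefaultSize)
            .ToList();
    }
}
```

With 50 candidate scores, `KnnSizeModel.Search(scores, 20)` returns 10 hits in this model, while `KnnSizeModel.Search(scores, 20, 20)` returns 20, so setting both values to `limit` makes the result count follow `limit`.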

@dluc added the "waiting for author" label (Waiting for author to reply or address comments) Dec 17, 2024
@dluc added the "stale" label (Inactive, possibly abandoned) Jan 29, 2025
@basyonic requested a review from dluc February 5, 2025 13:15