[Feature Request] Introduce a rescore search response processor #15631

Closed
mingshl opened this issue Sep 3, 2024 · 11 comments · Fixed by opensearch-project/neural-search#932
Labels
enhancement Enhancement or improvement to existing feature or request Search Search query, autocomplete ...etc v2.18.0 Issues and PRs related to version 2.18.0

Comments

@mingshl
Contributor

mingshl commented Sep 3, 2024

Is your feature request related to a problem? Please describe

When a response processor adds a field containing a score, we want to sort the response by that field.

The search response contains multiple documents in hits. A rescore search response processor can take the score from a field in each document and sort the search hits by that value. Users can opt to remove the field from the documents.

Describe the solution you'd like

The proposed solution is as follows:

PUT /_search/pipeline/my_pipeline
{
  "response_processors": [
    {
      "rescore": {
        "field": "similarity_score",
        "sort": "desc",
        "replace_score": true,
        "remove_sort_field": true
      }
    }
  ]
}

For example, given this search response:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "diary_index",
        "_id": "1",
        "_score": 1,
        "_source": {
          "diary": "how are you",
          "similarity_score": -11.055182
        }
      },
      {
        "_index": "diary_index",
        "_id": "2",
        "_score": 1,
        "_source": {
          "diary": "today is sunny",
          "similarity_score": 8.969885
        }
      },
      {
        "_index": "diary_index",
        "_id": "3",
        "_score": 1,
        "_source": {
          "diary": "today is july fifth",
          "similarity_score": -5.736348
        }
      },
      {
        "_index": "diary_index",
        "_id": "4",
        "_score": 1,
        "_source": {
          "diary": "it is winter",
          "similarity_score": -10.045217
        }
      }
    ]
  }
}

The response after processing:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "diary_index",
        "_id": "2",
        "_score": 8.969885,
        "_source": {
          "diary": "today is sunny"
        }
      },
      {
        "_index": "diary_index",
        "_id": "3",
        "_score": -5.736348,
        "_source": {
          "diary": "today is july fifth"
        }
      },
      {
        "_index": "diary_index",
        "_id": "4",
        "_score": -10.045217,
        "_source": {
          "diary": "it is winter"
        }
      },
      {
        "_index": "diary_index",
        "_id": "1",
        "_score": -11.055182,
        "_source": {
          "diary": "how are you"
        }
      }
    ]
  }
}
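
The transformation above can be sketched in a few lines of Python. This is only an illustration of the proposed behavior, not the processor's implementation: the function name `rescore_hits` and its keyword arguments are hypothetical and simply mirror the options in the proposed pipeline configuration. The example above leaves `max_score` unchanged, so the sketch does the same.

```python
import copy


def rescore_hits(response, field, sort="desc", replace_score=True, remove_sort_field=True):
    """Return a copy of `response` with its hits reordered by `_source[field]`.

    Hypothetical helper: the parameters mirror the proposed processor options.
    """
    result = copy.deepcopy(response)  # leave the caller's response untouched
    # sorted() is stable, so hits with equal values keep their incoming order.
    hits = sorted(
        result["hits"]["hits"],
        key=lambda hit: hit["_source"][field],
        reverse=(sort == "desc"),
    )
    for hit in hits:
        if replace_score:
            hit["_score"] = hit["_source"][field]
        if remove_sort_field:
            del hit["_source"][field]
    result["hits"]["hits"] = hits
    return result


# Trimmed-down version of the example response from this issue.
response = {
    "hits": {
        "max_score": 1,
        "hits": [
            {"_id": "1", "_score": 1, "_source": {"diary": "how are you", "similarity_score": -11.055182}},
            {"_id": "2", "_score": 1, "_source": {"diary": "today is sunny", "similarity_score": 8.969885}},
            {"_id": "3", "_score": 1, "_source": {"diary": "today is july fifth", "similarity_score": -5.736348}},
            {"_id": "4", "_score": 1, "_source": {"diary": "it is winter", "similarity_score": -10.045217}},
        ],
    }
}

rescored = rescore_hits(response, "similarity_score")
for hit in rescored["hits"]["hits"]:
    print(hit["_id"], hit["_score"], hit["_source"])
```

The sketch copies the response before changing it so that the original hits stay available for comparison; a real processor may well modify the response in place.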

Related component

Search

Describe alternatives you've considered

No response

Additional context

No response

@mingshl mingshl added enhancement Enhancement or improvement to existing feature or request untriaged labels Sep 3, 2024
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Sep 3, 2024
@mch2
Member

mch2 commented Sep 4, 2024

tagging @msfroh @martin-gaievski

@msfroh
Collaborator

msfroh commented Sep 12, 2024

Makes sense -- if a field was added by response processor, the only way that you can really do something with it (before sending back to the client) is with another response processor.

With the ever-present "naming is hard" challenge, I'm not sure I like the name rescore. Maybe sort? Or sort_by_field?

@msfroh
Collaborator

msfroh commented Sep 12, 2024

In theory, if we really wanted to get into the "do one thing" philosophy for search processors, then we could use the remove_field response processor to remove the field after sorting, rather than including it in this processor.

Similarly/alternatively, we could add a copy response processor that copies the field value into _score. (That could also be done with a shared copy document processor, if we ever get those.) Of course, the copy ingest processor already does two things by having the remove_source parameter.
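
To illustrate the "do one thing" idea, here is a rough Python sketch in which sorting, copying the value into `_score`, and removing the field are three separate steps chained together. All function names are hypothetical stand-ins written for this example; they are not the actual OpenSearch processors.

```python
def sort_by_field(hits, field, order="desc"):
    """Step 1: reorder hits by a field in _source (stable for ties)."""
    return sorted(hits, key=lambda h: h["_source"][field], reverse=(order == "desc"))


def copy_to_score(hits, field):
    """Step 2: copy the field value into _score, leaving _source as it is."""
    return [{**h, "_score": h["_source"][field]} for h in hits]


def remove_field(hits, field):
    """Step 3: drop the field from _source."""
    return [
        {**h, "_source": {k: v for k, v in h["_source"].items() if k != field}}
        for h in hits
    ]


def run_pipeline(hits, steps):
    """Apply each (function, kwargs) step in order, like a chain of processors."""
    for step, kwargs in steps:
        hits = step(hits, **kwargs)
    return hits


hits = [
    {"_id": "1", "_score": 1, "_source": {"diary": "how are you", "similarity_score": -11.055182}},
    {"_id": "2", "_score": 1, "_source": {"diary": "today is sunny", "similarity_score": 8.969885}},
]
steps = [
    (sort_by_field, {"field": "similarity_score"}),
    (copy_to_score, {"field": "similarity_score"}),
    (remove_field, {"field": "similarity_score"}),
]
processed = run_pipeline(hits, steps)
print(processed)
```

With this split, the order of the steps matters: removing the field has to come last, because the other two steps read it.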

@mingshl
Contributor Author

mingshl commented Sep 12, 2024

> Makes sense -- if a field was added by response processor, the only way that you can really do something with it (before sending back to the client) is with another response processor.
>
> With the ever-present "naming is hard" challenge, I'm not sure I like the name rescore. Maybe sort? Or sort_by_field?

There is a sort response processor already, but it sorts an array field within the document.

Yep, naming is hard. sort_by_field is straightforward but a bit long.

@brianf-aws

brianf-aws commented Sep 19, 2024

Hey y'all, I would like to work on this feature! I like Rescore, since it's taking an already scored value (e.g. "_score": 1) and changing it to something more relevant. The closest alternative I can think of is refineScore as a name.

I have some questions I want to bring to the table:

  1. What do we do with tiebreakers? Should the first item with a given score be shown earlier?
  2. How do we want to handle failure? Should we just return the values that were given beforehand?
    • One thing I was thinking about: when the provided field doesn't exist, do we return the previous response or stop the pipeline?
  3. What are some ways that users can mishandle the processor? (I'm new to crafting processors and would like some context.)
  4. What do we do when the field to sort by is not a single number?
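
One possible set of answers, sketched in Python purely as a basis for discussion: ties keep their incoming order (a stable sort), and a missing or non-numeric field raises an error instead of silently returning the previous response. Neither choice has been decided in this thread, and `sort_hits_strict` is a hypothetical name.

```python
from numbers import Real


def sort_hits_strict(hits, field):
    """Sort hits by `_source[field]` in descending order, failing fast on bad input."""
    for hit in hits:
        value = hit.get("_source", {}).get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            raise ValueError(
                f"hit {hit.get('_id')!r}: field {field!r} is missing or is not a single number"
            )
    # sorted() is stable even with reverse=True, so tied hits keep their original order.
    return sorted(hits, key=lambda h: h["_source"][field], reverse=True)


tied = [
    {"_id": "a", "_source": {"s": 1.0}},
    {"_id": "b", "_source": {"s": 2.0}},
    {"_id": "c", "_source": {"s": 1.0}},
]
print([h["_id"] for h in sort_hits_strict(tied, "s")])
```

Failing fast makes misconfiguration visible early; returning the previous response unchanged would be the more lenient alternative raised in question 2.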

@brianf-aws

brianf-aws commented Sep 24, 2024

Adding this META GH enhancement for ML Inference

@brianf-aws

brianf-aws commented Sep 28, 2024

I did some research, and I think we might be able to leave out the sorting functionality, since OpenSearch core already has a sort-results feature that lets you sort the search hits.

In essence, this allows you to define a search pipeline with the rescore processor and then, once it has run, sort by the _score field. I feel that more discussion is needed, though.

Here is an example:

GET /my-index/_search?search_pipeline=book_pipeline
{
  "sort": [
    {
      "rating": {
        "order": "desc"
      }
    }
  ]
}

@mingshl
Contributor Author

mingshl commented Sep 28, 2024

> I did some research and I think we might be able to leave out the sorting functionality as there exists a sort results functionality within OS core you can see here , this feature allows you to search within the search hits.
>
> In essence this allows you to define a search pipeline with the rescore processor and then after you are done you could sort by the _score field. I feel that more discussion though.
>
> Here is an example
>
> GET /my-index/_search?search_pipeline=book_pipeline
> {
>   "sort": [
>     {
>       "rating": {
>         "order": "desc"
>       }
>     }
>   ]
> }

Good that you found the sort option in the query. The intention behind rescore and sort is different, though: the new score (rather than the BM25 score) is added by a search response processor, so the new score field does not exist in the indexed document either. The rescore processor is then helpful for ranking the hits by that field and replacing the scores.

@mingshl mingshl moved this from 🆕 New to Now(This Quarter) in Search Project Board Sep 30, 2024
@mingshl mingshl added the v2.18.0 Issues and PRs related to version 2.18.0 label Sep 30, 2024
@mingshl mingshl moved this from New to In Progress in OpenSearch Roadmap Sep 30, 2024
@mingshl mingshl added the Roadmap:Vector Database/GenAI Project-wide roadmap label label Sep 30, 2024
@brianf-aws

After discussing offline, we have some suggestions for this processor:

  • Naming it rescore may be a bit misleading: some devs assumed it applied an algorithm, when in reality it does not. Here are some names they thought of: ScoreAdjustmentProcessor, ScoreModifierProcessor, ScoreUpdaterProcessor, RankOverProcessor, RefineScoreProcessor, ReorderProcessor.
  • Keep only field_to_sort_by as a mandatory field and allow the sort order to be null. This gives the user the flexibility to apply the sorting in the search query, as mentioned above.
    • Given that we are leaving the sort order optional, we will also have to think about renaming the fields, e.g. field_to_sort_by -> target_field.
  • Add an explicit check for (replace_score: false, remove_sort_field: true). In this scenario the results are sorted by a field that has been removed, so the user cannot tell why they are in that order. When this occurs we will throw an exception, since the user can turn on remove_sort_field (its default is false).
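
The check described in the last bullet could look roughly like this in Python. The function name `validate_config` is hypothetical. The option names follow the ones used in this thread (`target_field`, `sort`, `replace_score`, `remove_sort_field`). The default for `remove_sort_field` is false as stated above, while the default for `replace_score` is an assumption made for this sketch.

```python
def validate_config(config):
    """Validate the processor options and return them with defaults filled in."""
    target_field = config.get("target_field")
    if not target_field:
        raise ValueError("target_field is required")

    sort = config.get("sort")  # None means: leave the hit order unchanged
    if sort not in (None, "asc", "desc"):
        raise ValueError("sort must be 'asc', 'desc', or omitted")

    replace_score = config.get("replace_score", False)  # assumed default
    remove_sort_field = config.get("remove_sort_field", False)  # default per this thread

    # Removing the field without copying it into _score would leave the user
    # with reordered hits and no visible reason for the new order.
    if remove_sort_field and not replace_score:
        raise ValueError("remove_sort_field requires replace_score to be true")

    return {
        "target_field": target_field,
        "sort": sort,
        "replace_score": replace_score,
        "remove_sort_field": remove_sort_field,
    }


print(validate_config({"target_field": "similarity_score", "replace_score": True, "remove_sort_field": True}))
```

Rejecting the combination at configuration time, rather than when a search runs, surfaces the problem as soon as the pipeline is created.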

@ohltyler
Member

@brianf-aws / @mingshl can this be resolved?

@brianf-aws

Yes, we can close it, as this was implemented in opensearch-project/neural-search#932.

@mingshl mingshl closed this as completed Nov 19, 2024
@github-project-automation github-project-automation bot moved this from Now(This Quarter) to ✅ Done in Search Project Board Nov 19, 2024

5 participants