[PROPOSAL] Search Semantic Chaining Mechanisms #12

Closed
YANG-DB opened this issue Sep 2, 2022 · 9 comments

@YANG-DB
Member

YANG-DB commented Sep 2, 2022

Relevancy rewriters and rankers mechanism

The purpose of this mechanism is to provide a concise, standard way of defining search relevancy behavior on both
the query rewriting side and the results ranking side.

This proposal is the collaboration of the

The capability of chaining multiple search relevancy rewriters, and possibly results rerankers, would allow the following:

  • Combine different aspects of relevancy rewriting into a single chain
  • Create a common standard for search relevancy related plugin components
  • Easily allow comparing query results under different ranking solutions
  • Simplify integrating such plugins into the search-relevancy dashboard using dedicated API

Chain Components

Chain operators
Each chain element is an operator which transforms the query content and passes it on to the next operator - we will
call them Transformers.

A transformer is expected to have no side effects apart from the query transformation.

Chain payload
The chain's payload is the query itself. Each transformer is expected to transform the query in such a way that is
processable by the next transformer.

Chain termination step
The chain ends with a terminal step, which does not pass the query on to any further chain component.
This termination step is likely to be the actual execution of the query against the underlying search engine.

Chain footsteps
While a chain executes, each transformer that runs leaves a trail in the form of transformer-specific trail info.
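
To make the four pieces above concrete (transformers, payload, termination step, footsteps), here is a minimal sketch. It is not part of the proposal's API: `run_chain`, `add_synonym` and `fake_search` are hypothetical names, and the terminal step is a stand-in for real query execution.

```python
from typing import Any, Callable, Dict, List, Tuple

Query = Dict[str, Any]
Transformer = Callable[[Query], Query]


def run_chain(
    query: Query,
    transformers: List[Tuple[str, Transformer]],
    terminal: Callable[[Query], List[dict]],
) -> Tuple[List[dict], List[dict]]:
    """Apply each transformer in order, then hand the query to the terminal step."""
    trail = []  # one entry per transformer that ran (the "footsteps")
    for name, transform in transformers:
        query = transform(query)  # payload in, payload out
        trail.append({"name": name, "query": query})
    return terminal(query), trail


def add_synonym(query: Query) -> Query:
    """Hypothetical transformer: adds GET as a synonym of request."""
    terms = list(query["terms"])
    if "request" in terms:
        terms.append("GET")
    return {**query, "terms": terms}  # returns a new query, no side effects


def fake_search(query: Query) -> List[dict]:
    """Hypothetical terminal step standing in for query execution and ranking."""
    return [{"_id": str(i), "matched": t} for i, t in enumerate(query["terms"])]


results, trail = run_chain(
    {"terms": ["request"]}, [("synonyms", add_synonym)], fake_search
)
```

Because each transformer returns a new query instead of mutating its input, the trail keeps the query as it looked after every step.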

Chain execution
The chain order will be defined as part of the query extension. If no such definition is found under the query
extension, the fallback will be the rewriter definition in the queried index's mapping (under the mapping's metadata).
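
A minimal sketch of that fallback, assuming the query extension lives under `ext` and the mapping metadata under `_meta` as in the examples further down. The function name `resolve_chain` is hypothetical.

```python
from typing import Any, Dict, List


def resolve_chain(
    search_body: Dict[str, Any], index_mapping: Dict[str, Any], kind: str
) -> List[dict]:
    """Return the ordered plugin list for kind ("rewriters" or "rankers").

    The query extension wins; the index mapping _meta is only a fallback.
    """
    from_query = search_body.get("ext", {}).get(kind)
    if from_query is not None:
        return list(from_query)
    return list(index_mapping.get("_meta", {}).get(kind, []))


mapping = {"_meta": {"rankers": [{"name": "kendra"}]}}
body = {"query": {"match_all": {}}, "ext": {"rewriters": [{"name": "querqy"}]}}

rewriters = resolve_chain(body, mapping, "rewriters")  # taken from the query ext
rankers = resolve_chain(body, mapping, "rankers")      # falls back to the mapping
```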

Rewriter Transformations

The chain mechanism is actually a composition of query interceptors. The purpose of these query interceptors is to
chain the individual query rewriter plugins one to another in a sequential manner.

Rankers Transformations

The chain ends when a termination step is called; this termination step is the ranker operator.
The ranker operator takes the query as input, executes it against the database, and ranks the results
according to its own internal reasoning.

We currently don't support paging in the chaining termination step and therefore this step does not allow paging of
the results.

Configuration

Each transformation/operator may use the following levels of configuration:

  • Plugin level configuration
  • Index level configuration
  • Query level configuration

Plugin level configuration

This level of configuration is supported by the OpenSearch Plugin API and may be used for static configuration of
the component.
Implementation of this capability can make use of the BaseRestHandler endpoint extension mechanism.

For example, Querqy uses such an endpoint to define its rewrite rules:

PUT /_plugins/_querqy/rewriter/common_rules

{
  "class": "querqy.opensearch.rewriter.SimpleCommonRulesRewriterFactory",
  "config": {
      "rules" : "request =>\nSYNONYM: GET"
  }
}

Index level configuration

This level of configuration is supported by using the index mapping _meta DSL, which is an existing part of the
mapping DSL.
Example usage of the index mapping configuration:

New chain mapping DSL
For backwards compatibility we will use the index mapping _meta field to hold the configuration information
for both the rewriters and the rankers.

The chain parts will reside under two generic concepts:

  • rankers - list of ranker plugin configurations
  • rewriters - list of rewriter plugin configurations

Metadata under my_index/_mapping

{
  "_meta": {
    "rankers": [
      {
        "name": "kendra",
        "properties": {
          "title_fields": [
            "title"
          ],
          "body_fields": [
            "published",
            "description"
          ]
        }
      }
    ]
  }
}

The order of the ranker/rewriter is explicit and the chain will dispatch accordingly (unless another directive appears
under the query chain-directive )

Query level configuration

This level of configuration is supported by using the query extension DSL. This section will have a new chain DSL
structure. In a similar manner to the "_meta" section of the mapping DSL, the "ext" section will contain the rankers &
rewriters lists.

Extension under _search

{
  "query": {
  },
  "ext": {
    "rewriters": [
      {
        "name": "querqy",
        "properties": {
          "querqy": {
            "matching_query": {
              "must_match": {
                "query": "rambo"
              },
              "multi_match": {
                "query": "rambo",
                "fields": [
                  "field1",
                  "field2"
                ]
              }
            },
            "query_fields": [
              "title^3.0",
              "brand^2.1",
              "shortSummary"
            ]
          }
        }
      }
    ],
    "rankers": [
      {
        "name": "kendra",
        "properties": {
          "title_fields": [
            "title"
          ],
          "body_fields": [
            "published",
            "description"
          ]
        }
      }
    ]
  }
}

The order of the ranker/rewriter is explicit and the chain will dispatch accordingly (unless another directive appears
under the query chain-directive)

This is a flow chart visualization of the chain steps:

############                 ############             #############           #############
# _Search  #                 #  querqy  #             #  kendra   #           #  Results  #
#   -query #                 #  -rewrite#             #  -execute #           #    -   1  #
#      ... #   --------->    #     query#  ---------> #    search # --------->#    -   2  #   
#          #                 #          #             #  -rank    #           #    -   3  #
############                 #          #             #   results #           #    -   4  #
                             ############             #############           #############
                                                           /\
                                                           ||
                                                           || 
                                                           || 
                                                           || 
                                                           \/ 
                                                      ###############
                                                      # opensearch  #  
                                                      #  -run-query #   
                                                      ###############
                                                      

Chain Context

Search Relevancy Context Information
For the rewriter and ranker chain to track and be informed of the modifications each step performs, an execution
context is needed.

This context will have the following fields, which can be applied to any future plugin that needs to perform rewrites
or ranking:

  • context (information about the current execution parameters)
    • params - input to every ranker and rewriter, which each may use for its own needs
      • query - the original query, carried forward down the chain
    • execution (execution-related content that is generated throughout the pipeline)
      • id - auto-generated unique id identifying the chain instance
      • rewriters - list of rewriter plugin query configurations
      • rankers - list of ranker plugin query configurations
      • exclude - rewriters/rankers from the default index configuration to remove from the chain

The execution section may have additional internal fields which are related to the execution flow itself and are
subject to future changes.


This context will be attached to the query DSL under the ext section.

POST my_index/_search

{
  "query": {
    "match_all": {}
  },
  "ext": {
    "context": {
      "params": {
        "query": {
          "match_all": {}
        }
      },
      "execution": {
        "id": "ABC123",
        "rewriters": [
          {
            "name": "querqy",
            "properties": {
              "querqy": {
                "matching_query": {
                  "must_match": {
                    "query": "rambo"
                  },
                  "multi_match": {
                    "query": "rambo",
                    "fields": [
                      "field1",
                      "field2"
                    ]
                  }
                },
                "query_fields": [
                  "title^3.0",
                  "brand^2.1",
                  "shortSummary"
                ]
              }
            }
          }
        ],
        "rankers": [
          {
            "name": "kendra",
            "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          }
        ]
      }
    }
  }
}
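
As a rough illustration of how a chain flow control might create this context when the caller did not supply one, here is a sketch. It assumes `execution` is nested under `context`, as in the field list above; the helper name `with_chain_context` and the use of a UUID for the id are assumptions, not part of the proposal.

```python
import copy
import uuid
from typing import Any, Dict, List, Optional


def with_chain_context(
    search_body: Dict[str, Any],
    rewriters: Optional[List[dict]] = None,
    rankers: Optional[List[dict]] = None,
) -> Dict[str, Any]:
    """Return a copy of search_body with ext.context filled in where missing."""
    body = copy.deepcopy(search_body)
    context = body.setdefault("ext", {}).setdefault("context", {})
    # params.query preserves the original query for every step of the chain
    context.setdefault("params", {"query": copy.deepcopy(body.get("query", {}))})
    execution = context.setdefault("execution", {})
    execution.setdefault("id", uuid.uuid4().hex)  # assumed id format
    execution.setdefault("rewriters", list(rewriters or []))
    execution.setdefault("rankers", list(rankers or []))
    return body


original = {"query": {"match_all": {}}}
prepared = with_chain_context(original, rankers=[{"name": "kendra"}])
again = with_chain_context(prepared)  # an existing context is reused, not replaced
```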

Activating Query rewriter / rerankers

Whenever a query runs against an index, the following steps will occur:

  1. verify whether search-relevancy is activated for the index

    1. create a chain flow control component which will drive the chain of rewriters & rerankers
    2. create the search-relevancy context information (or use the existing one if it was already created)
  2. for each rewrite step in the rewriters list :

    1. dispatch execution to the plugin
    2. plugin receives the params section as parameters
    3. plugin changes the query
    4. plugin may add additional information on its execution step under ext->context->rewriters->$name$->info
    5. returns execution to the chain flow control
  3. for each semantic-ranker step in the rankers list:

    1. dispatch execution to the plugin
    2. plugin receives the params section as parameters
    3. plugin performs the ranking logic
    4. returns newly ranked results to the caller

If a rewriter/ranker doesn't appear in the query ext section but does appear in the relevant index mapping section,
its configuration details will be copied from the index mapping section into the relevant query ext section.

To prevent a rewriter/ranker from being activated on a query when the index mapping indicates it is part of the
chain, add its name to the exclude list under the execution section.
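
The copy-from-mapping and exclude rules could be sketched as below. Two points are assumptions on my side since the text does not pin them down: mapping defaults are appended after the plugins given in the query, and exclude only filters entries that come from the mapping. The name `effective_plugins` is hypothetical.

```python
import copy
from typing import Any, Dict, List


def effective_plugins(
    execution: Dict[str, Any], index_meta: Dict[str, Any], kind: str
) -> List[dict]:
    """Merge index mapping defaults into the query-level plugin list.

    kind is "rewriters" or "rankers".
    """
    excluded = set(execution.get("exclude", []))
    plugins = list(execution.get(kind, []))
    seen = {p["name"] for p in plugins}
    for default in index_meta.get(kind, []):
        name = default["name"]
        if name in seen or name in excluded:
            continue  # the query-level config wins; excluded defaults are skipped
        plugins.append(copy.deepcopy(default))
        seen.add(name)
    return plugins


meta = {"rankers": [{"name": "kendra", "properties": {"title_fields": ["title"]}}]}

merged = effective_plugins({"rankers": []}, meta, "rankers")
disabled = effective_plugins({"rankers": [], "exclude": ["kendra"]}, meta, "rankers")
```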

Example

Configuration Stage

Step 0: Create plugins configuration settings

PUT /_plugins/_querqy/rewriter

{
  "common_rules": [
    {
      "class": "querqy.opensearch.rewriter.SimpleCommonRulesRewriterFactory",
      "config": {
        "rules": "request =>\nSYNONYM: GET"
      }
    }
  ]
}

PUT /_plugins/_kendra

{
  "config": {
    "endpoint": [
      "127.0.0.1",
      "0.0.0.0"
    ]
  }
}

Step 1: Create mapping for index my_index

PUT my_index/_mapping

{
  "_meta": {
    "rankers": [
      {
        "name": "kendra",
        "properties": {
          "title_fields": [
            "title"
          ],
          "body_fields": [
            "published",
            "description"
          ]
        }
      }
    ]
  }
}

Query Stage

Step 2: original request from user : “rambo”

Step 2.1: Structured query from application coming to OpenSearch (this is done by the customer’s application)

POST my_index/_search

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "topic": "hobby"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "dateField": {
              "gte": "now-12d",
              "lte": "now-10d"
            }
          }
        }
      ]
    }
  }
}

The chain flow control intercepts the index search request and will dispatch the request to each query rewriter in turn

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "topic": "hobby"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "dateField": {
              "gte": "now-12d",
              "lte": "now-10d"
            }
          }
        }
      ]
    }
  },
  "ext": {
    "context": {
      "params": {
        "query": {
          "bool": {
            "must": [
              {
                "match": {
                  "topic": "hobby"
                }
              }
            ],
            "filter": [
              {
                "range": {
                  "dateField": {
                    "gte": "now-12d",
                    "lte": "now-10d"
                  }
                }
              }
            ]
          }
        }
      },
      // this section is generated for the chain if not given by user 
      "execution": { 
        "id": "A1b2c", 
        "rankers": [
          {
            "name": "kendra",
            "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          }
        ],
        "rewriters": [
          {
            "name": "querqy",
            "properties": {
              "query": {
                "querqy": {
                  "matching_query": {
                    "query": "notebook"
                  },
                  "query_fields": [
                    "title^3.0",
                    "brand^2.1",
                    "shortSummary"
                  ]
                }
              }
            }
          }
        ]
      }
    }
  }
}

Step 3: First rewriter (Querqy) is dispatched and generates the new query (query rewrite)

{
  "query": {
    //todo - put here the query after being re-written by querqy    
  },
  "ext": {
    "context": {
      "params": {
        "query": {
          "bool": {
            "must": [
              {
                "match": {
                  "topic": "hobby"
                }
              }
            ],
            "filter": [
              {
                "range": {
                  "dateField": {
                    "gte": "now-12d",
                    "lte": "now-10d"
                  }
                }
              }
            ]
          }
        }
      },
      "execution": {
        "id": "A1b2c",
        "rankers": [
          {
            "name": "kendra",
            "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          }
        ],
        "rewriters": [
          {
            "name": "querqy",
            "properties": {
              "query": {
                "querqy": {
                  "matching_query": {
                    "query": "notebook"
                  },
                  "query_fields": [
                    "title^3.0",
                    "brand^2.1",
                    "shortSummary"
                  ]
                }
              },
              "info" : { } // additional info that querqy may add after query rewrite
            }
          }
        ]
      }
    }
  }
}

Step 3.1: The chain flow control has no additional rewriters to dispatch, so it dispatches to the rankers. The first ranker in the chain reviews the context params and takes the information it needs.

Once it completes its action, the results are ranked according to its internal reasoning

{
  "query": {
    //todo - put here the query after being re-written by querqy    
  },
  "ext": {
    "context": {
      "params": {
        "query": {
          "bool": {
            "must": [
              {
                "match": {
                  "topic": "hobby"
                }
              }
            ],
            "filter": [
              {
                "range": {
                  "dateField": {
                    "gte": "now-12d",
                    "lte": "now-10d"
                  }
                }
              }
            ]
          }
        }
      },
      "execution": {
        "id": "A1b2c",
        "rankers": [
          {
            "name": "kendra",
            "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          }
        ],
        "rewriters": [
          {
            "name": "querqy",
            "properties": {
              "query": {
                "querqy": {
                  "matching_query": {
                    "query": "notebook"
                  },
                  "query_fields": [
                    "title^3.0",
                    "brand^2.1",
                    "shortSummary"
                  ]
                }
              },
              "info" : { } 
            }
          }
        ]
      }
    }
  }
}

Response Stage


Step 4: Once the rewrite chain is completed, the ranker reranks the results and returns them to the original calling service

ranker search results json

{
  "took" : 0,
  "timed_out" : false,
   "ext": {  // this ext section is suggested to be added here as part of the results.
     "context": {
       "params": {
         "query": {
           "bool": {
             "must": [
               {
                 "match": {
                   "topic": "hobby"
                 }
               }
             ],
             "filter": [
               {
                 "range": {
                   "dateField": {
                     "gte": "now-12d",
                     "lte": "now-10d"
                   }
                 }
               }
             ]
           }
         }
       },
       "execution": {
         "id": "A1b2c",
         "rankers": [
           {
             "name": "kendra",
             "properties": {
               "title_fields": [
                 "title"
               ],
               "body_fields": [
                 "published",
                 "description"
               ]
             }
           }
         ],
         "rewriters": [
           {
             "name": "querqy",
             "properties": {
               "query": {
                 "querqy": {
                   "matching_query": {
                     "query": "notebook"
                   },
                   "query_fields": [
                     "title^3.0",
                     "brand^2.1",
                     "shortSummary"
                   ]
                 }
               },
               "info" : { }
             }
           }
         ]
       }
     }
   },
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.8773359,
    "hits" : [
      {
        "_index" : "employees",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.8773359,
        "_source" : {
          "id" : 4,
          "name" : "Alan Thomas",
          "email" : "[email protected]",
          "gender" : "male",
          "ip_address" : "200.47.210.95",
          "date_of_birth" : "11/12/1985",
          "company" : "Yamaha",
          "position" : "Resources Manager",
          "experience" : 12,
          "country" : "China",
          "phrase" : "Emulation of roots heuristic coherent systems",
          "salary" : 300000
        }
      }
    ]
  }
}

The response DSL doesn't currently contain such an ext part; this RFC suggests adding such a section to the results.
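
A small sketch of what this would enable, assuming the response carries `ext.context` exactly as shown above and that each rewriter's `info` sits under its `properties`, as in the step 3 example. Both helper names are hypothetical.

```python
import copy
from typing import Any, Dict


def attach_context(
    response: Dict[str, Any], request_body: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of the response carrying the request's chain context."""
    out = dict(response)
    context = request_body.get("ext", {}).get("context")
    if context is not None:
        out["ext"] = {"context": copy.deepcopy(context)}
    return out


def rewriter_info(response: Dict[str, Any]) -> Dict[str, Any]:
    """Map each rewriter name to the info it left during the chain run."""
    execution = response.get("ext", {}).get("context", {}).get("execution", {})
    return {
        r["name"]: r.get("properties", {}).get("info", {})
        for r in execution.get("rewriters", [])
    }


request = {
    "ext": {
        "context": {
            "execution": {
                "id": "A1b2c",
                "rewriters": [{"name": "querqy", "properties": {"info": {}}}],
            }
        }
    }
}
response = attach_context({"took": 0, "timed_out": False}, request)
info = rewriter_info(response)
```

A caller comparing two ranking setups could then log `info` next to the hits without re-deriving what the chain did.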

@macohen
Collaborator

macohen commented Sep 27, 2022

Can you provide some examples of the problems this would solve at a high level in the summary? Some examples for what is described above the first horizontal line would help in attracting the right people to comment on this.

@macohen
Collaborator

macohen commented Oct 3, 2022

In query stage 2.1, it says the user entered "rambo," but "rambo" is not mentioned again.

For this comment "// this section is generated for the chain if not given by user," when would the chain be given by the user other than the initial query?

How does this all compare to how search works today as opensearch passes through analyzers? "We currently don't support paging in the chaining termination step and therefore this step does not allow paging of the results." Can you provide a reference to what is doing this today?

@anirudha
Contributor

anirudha commented Oct 3, 2022

Client and server-side log tracing
#7
#8

@mashah

mashah commented Oct 6, 2022

I'm a bit lost as I'm picking this back up again. I've now seen multiple examples of chaining in both query rewriting and ranking. So, I've recanted some of my earlier complaints.

With that said, I would like to understand where we are in staging the work here, so that we can push items out incrementally.

@macohen
Collaborator

macohen commented Oct 6, 2022

I'm not sure what the previous complaints were so I may be missing some context. This is not yet scheduled for development. We're working on the roadmap for search relevance now and could use help from the community in prioritization. One piece of the chain that could be useful sooner rather than later would be to allow the owners of the search application to pass the original user query without any rewriting through to OpenSearch. This could feed logging and inform internal search analytics (top queries, zero results queries, etc.). We think working on that as a first piece along with the remote ranker plug-in would be good progress. Are you considering working on any of this/looking for a breakdown to pick up something?

@msfroh
Collaborator

msfroh commented Oct 27, 2022

I was chatting w/ @mahitamahesh about what this might look like in terms of transforming both requests and results (which I think is the appropriate generalization of rewriters/rerankers), and how we might incorporate an idea of "stored, named chains" to simplify e.g. A/B testing between two chains before making one the index default chain.

Here are some example calls that we discussed:

PUT /search_configurations/my_new_awesome_config
{
  "request_transformers": [
      ...
  ],
  "result_transformers": [
      ...
  ]
}

POST /my-index/_search
{
  "query": {
     "match" : {
        "text": "matching on some text"
     }
  },
  "ext" : {
    // Use a named search config
    "search_configuration" : "my_new_awesome_config"
  }
}

POST /my-index/_search
{
  "query": {
     "match" : {
        "text": "matching on some text"
     }
  },
  "ext" : {
    "search_configuration" : {
        // ... use an inline search config ...
        "request_transformers" : [
            ...
        ],
        "result_transformers" : [
           ...
        ]
    }
  }
}

PUT /my-index/_settings
{
  // Not constrained by limitations of index settings API,
  // because we're just pointing to a named search config.
  "index.search_configuration.default" : "my_new_awesome_config"
}
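
One way the three forms above (named, inline, index default) might be resolved at search time is sketched below. The precedence of the request over the index default is an assumption, and `resolve_search_configuration` and the in-memory `stored` dict are hypothetical stand-ins for whatever would back `/search_configurations`.

```python
from typing import Any, Dict

EMPTY_CONFIG: Dict[str, Any] = {"request_transformers": [], "result_transformers": []}


def resolve_search_configuration(
    search_body: Dict[str, Any],
    index_settings: Dict[str, Any],
    stored_configs: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Pick the search configuration that applies to one request."""
    ref = search_body.get("ext", {}).get("search_configuration")
    if ref is None:
        # Nothing in the request: fall back to the index default, if any
        ref = index_settings.get("index.search_configuration.default")
    if ref is None:
        return dict(EMPTY_CONFIG)
    if isinstance(ref, dict):
        return ref  # inline configuration
    return stored_configs[ref]  # named configuration; KeyError if unknown


stored = {
    "my_new_awesome_config": {
        "request_transformers": ["a"],
        "result_transformers": ["b"],
    }
}
settings = {"index.search_configuration.default": "my_new_awesome_config"}

by_name = resolve_search_configuration(
    {"ext": {"search_configuration": "my_new_awesome_config"}}, {}, stored
)
by_default = resolve_search_configuration({"query": {}}, settings, stored)
```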

@jmazanec15
Member

Hi @msfroh, search configurations seem like they could be a very useful generalization. I am wondering how general search_configurations would be, or if they are meant to specifically store information for chaining only.

Specifically, I am working on opensearch-project/neural-search#70 for the neural search plugin where we want to associate model_id's with fields so that users do not have to pass in the model ids for each search request - rather, the information is associated with the index instead. In other words, I want to store a map like this with the index to be used at search time:

{
 "neural_search.model_ids": {
    "field_1": "model_id_1",
    "field_2": "model_id_2",
    "field_3": "model_id_2",
    ...
  }
}

I thought about storing this with the _meta field, as was done in the original proposal of this, however, I worry this would potentially conflict with users storing their own application specific metadata in this field. Alternative to this, I thought about a system index, but this seems like it would be pretty heavy to just store a map.

That being said, it seems like a search configuration might be a good place to store a mapping like this and associate it with an index via index setting.

Would it make sense to make search configurations extensible to store information other than chains that could be used at different stages throughout search phases?

@navneet1v

Would it make sense to make search configurations extensible to store information other than chains that could be used at different stages throughout search phases?

This seems like good extensibility that could be used by other plugins such as neural search. +1 to Jack's comment. @msfroh, can we make it extensible so that it can be used outside this plugin?

@msfroh msfroh removed the untriaged label Jan 9, 2023
@msfroh msfroh self-assigned this Jan 9, 2023
@msfroh msfroh mentioned this issue Jan 11, 2023
1 task
@macohen macohen added the Search label Mar 10, 2023
@macohen
Collaborator

macohen commented Sep 1, 2023

Closing as Search Pipelines has gone GA. Thanks, @YANG-DB!

@macohen macohen closed this as completed Sep 1, 2023