
[FEATURE] inner_hits in nested neural query should return all the chunks #2113

Open

yuye-aws opened this issue Sep 11, 2024 · 13 comments

Labels: Features (Introduces a new unit of functionality that satisfies a requirement), Roadmap:Vector Database/GenAI (Project-wide roadmap label)

yuye-aws (Member) commented Sep 11, 2024

What is the bug?

I am using the text_chunking and text_embedding processors to ingest documents into an index. The text chunking search example works well, but inner_hits returns only a single element from the chunked string list. This happens regardless of whether score_mode is set to max or avg.

How can one reproduce the bug?

  1. Register a text embedding model.
  2. Create a text chunking and embedding pipeline:
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "6ipW4JEBXVV1cW1lcFvy",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
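As a sanity check on the pipeline settings, here is a minimal sketch (in Python, not the plugin's actual implementation) of how fixed_token_length chunking with token_limit=10 and overlap_rate=0.2 splits a 24-token passage: the chunker advances token_limit minus overlap tokens per step, producing three overlapping chunks.

```python
def fixed_token_length_chunks(tokens, token_limit=10, overlap_rate=0.2):
    """Sketch of fixed-token-length chunking with overlap (illustrative only)."""
    overlap = round(token_limit * overlap_rate)  # 10 * 0.2 = 2 overlapping tokens
    step = token_limit - overlap                 # advance 8 tokens per chunk
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + token_limit])
        if start + token_limit >= len(tokens):
            break
        start += step
    return chunks

text = ("This is an example document to be chunked. The document contains "
        "a single paragraph, two sentences and 24 tokens by standard "
        "tokenizer in OpenSearch.")
chunks = fixed_token_length_chunks(text.split())
print(len(chunks))  # 3 chunks for the 24-token passage
```

Splitting on whitespace is only a rough stand-in for the standard tokenizer, but for this passage it yields the same three chunks that appear in the search result below.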
  3. Create an index with the following mapping:
PUT testindex
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
              "name": "hnsw",
              "engine": "lucene"
            }
          }
        }
      }
    }
  }
}
  4. Ingest sample documents into the index (run the following command twice):
POST testindex/_doc?pipeline=text-chunking-embedding-ingest-pipeline
{
  "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
}
  5. Search the index with a nested neural query:
GET testindex/_search
{
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "6ipW4JEBXVV1cW1lcFvy"
          }
        }
      },
      "inner_hits": {}
    }
  }
}
  6. Receive the search result:
{
    "took": 1361,
    "timed_out": false,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": {
        "value": 2,
        "relation": "eq"
      },
      "max_score": 0.02276505,
      "hits": [
        {
          "_index": "testindex",
          "_id": "7SqB4JEBXVV1cW1lKVvd",
          "_score": 0.02276505,
          "_source": {
            "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.",
            "passage_chunk": [
              "This is an example document to be chunked. The document ",
              "The document contains a single paragraph, two sentences and 24 ",
              "and 24 tokens by standard tokenizer in OpenSearch."
            ],
            "passage_chunk_embedding": [
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              }
            ]
          },
          "inner_hits": {
            "passage_chunk_embedding": {
              "hits": {
                "total": {
                  "value": 1,
                  "relation": "eq"
                },
                "max_score": 0.02276505,
                "hits": [
                  {
                    "_index": "testindex",
                    "_id": "7SqB4JEBXVV1cW1lKVvd",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 1
                    },
                    "_score": 0.02276505,
                    "_source": {
                      "knn": [ ... ]
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "_index": "testindex",
          "_id": "7iqB4JEBXVV1cW1l5lv_",
          "_score": 0.02276505,
          "_source": {
            "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.",
            "passage_chunk": [
              "This is an example document to be chunked. The document ",
              "The document contains a single paragraph, two sentences and 24 ",
              "and 24 tokens by standard tokenizer in OpenSearch."
            ],
            "passage_chunk_embedding": [
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              }
            ]
          },
          "inner_hits": {
            "passage_chunk_embedding": {
              "hits": {
                "total": {
                  "value": 1,
                  "relation": "eq"
                },
                "max_score": 0.02276505,
                "hits": [
                  {
                    "_index": "testindex",
                    "_id": "7iqB4JEBXVV1cW1l5lv_",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 1
                    },
                    "_score": 0.02276505,
                    "_source": {
                      "knn": [ ... ]
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }

What is the expected behavior?

The inner_hits should return the matching score and offset of all retrieved chunks, not just the single best one.
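For illustration, with three chunks per document the expected inner_hits section would look roughly like the following (the scores here are hypothetical; the point is three hits with offsets 0, 1, and 2 rather than one):

```json
"inner_hits": {
  "passage_chunk_embedding": {
    "hits": {
      "total": { "value": 3, "relation": "eq" },
      "max_score": 0.0228,
      "hits": [
        { "_nested": { "field": "passage_chunk_embedding", "offset": 0 }, "_score": 0.010 },
        { "_nested": { "field": "passage_chunk_embedding", "offset": 1 }, "_score": 0.0228 },
        { "_nested": { "field": "passage_chunk_embedding", "offset": 2 }, "_score": 0.015 }
      ]
    }
  }
}
```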

What is your host/environment?

macOS


@yuye-aws yuye-aws added bug Something isn't working untriaged labels Sep 11, 2024
@yuye-aws yuye-aws changed the title [BUG] inner_hits in nested neural query only returns one element [BUG] inner_hits in nested neural query only returns one chunk Sep 11, 2024
yuye-aws (Member, Author) commented Sep 11, 2024

Neural search with explain is not working. I could not find a workaround.

martin-gaievski (Member) commented

@yuye-aws Inner hits are not supported in the hybrid query. There is a feature request for this (opensearch-project/neural-search#718), but at the moment there is no path forward.

yuye-aws (Member, Author) commented Sep 12, 2024

I'm not using hybrid query, just a plain neural query.

yuye-aws (Member, Author) commented

Are both features not supported due to the same blocking issue?

martin-gaievski (Member) commented Sep 12, 2024

Sorry, my bad. The neural query is different. I'm not sure why nested doesn't work; in the neural query code we delegate execution to the knn query, so you may want to check how it's done in knn. An easy test would be to try whether a plain knn query supports the "nested" clause.

yuye-aws (Member, Author) commented

> An easy test would be to try whether a plain knn query supports the "nested" clause.

Already tried in my fifth step.

martin-gaievski (Member) commented

> > An easy test would be to try whether a plain knn query supports the "nested" clause.
>
> Already tried in my fifth step.

In step 5 you have a neural query. I mean a plain knn query, something like the following example but with a nested clause:

"query": {
  "knn": {
    "embedding_field": {
      "vector": [
        5.0,
        4.0,
        ....
        3.8
      ],
      "k": 12
    }
  }
}
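For concreteness, a nested version of such a knn query against the index from step 3 might look like the sketch below. The vector values are placeholders; a real query would need a vector matching the mapped dimension (768 here).

```json
GET testindex/_search
{
  "query": {
    "nested": {
      "path": "passage_chunk_embedding",
      "score_mode": "max",
      "query": {
        "knn": {
          "passage_chunk_embedding.knn": {
            "vector": [5.0, 4.0, 3.8],
            "k": 12
          }
        }
      },
      "inner_hits": {}
    }
  }
}
```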

martin-gaievski (Member) commented

@yuye-aws I found this change in knn #1182. The essence of it: in the case of nested documents, only the one that gave the max score is returned and the others are dropped. This became the new default behavior, replacing the old behavior where all nested docs (meaning all inner hits) are returned. The neural query inherits this from knn.

yuye-aws (Member, Author) commented

> in case of nested documents we need to return only one that gave the max score, and drop others. It became new default behavior instead of old one where all nested docs (meaning inner hits) are returned.

This does not make sense, because score_mode can also be avg, in which case we expect to see all the scores.
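To illustrate the point with hypothetical per-chunk scores (not the plugin's code): computing the parent score with max needs only the single best chunk, while avg needs every chunk's score, so discarding all but the top-scoring nested doc loses the information avg depends on.

```python
# Hypothetical relevance scores for the three chunks of one parent document.
chunk_scores = [0.010, 0.0228, 0.015]

# score_mode=max: only the single best nested doc is needed.
max_score = max(chunk_scores)

# score_mode=avg: every nested doc's score contributes to the parent score.
avg_score = sum(chunk_scores) / len(chunk_scores)

print(max_score)             # 0.0228
print(round(avg_score, 4))   # 0.0159
```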

yuye-aws (Member, Author) commented

> From knn it's inherited by neural query.

Shall we make a PR to the knn repo? After all, the nested k-NN query also needs the avg score mode.

heemin32 (Collaborator) commented

@yuye-aws Please add your use case, and any suggestions you have regarding avg score mode support in knn, to #1743.

yuye-aws (Member, Author) commented

Replied in #1743 (comment). Also, resolving this issue can help resolve a user issue: opensearch-project/ml-commons#2612. I was considering implementing a new search response processor to retrieve the most relevant chunks, but it is unfortunately blocked by the current issue: opensearch-project/ml-commons#2612 (comment)

@naveentatikonda naveentatikonda transferred this issue from opensearch-project/neural-search Sep 18, 2024
@naveentatikonda naveentatikonda added Features Introduces a new unit of functionality that satisfies a requirement and removed untriaged bug Something isn't working labels Sep 18, 2024
@naveentatikonda naveentatikonda changed the title [BUG] inner_hits in nested neural query only returns one chunk [FEATURE] inner_hits in nested neural query only returns one chunk Sep 18, 2024
@yuye-aws yuye-aws changed the title [FEATURE] inner_hits in nested neural query only returns one chunk [FEATURE] inner_hits in nested neural query should return all the chunks Sep 18, 2024
hagen6835 commented

Would love this!
