[FEATURE]Efficient filtering on parent document with nested field #1356

Closed
heemin32 opened this issue Dec 20, 2023 · 4 comments
heemin32 commented Dec 20, 2023

Is your feature request related to a problem?
When efficient filtering runs against a nested k-NN field, the filter is applied to fields in the nested documents but not to fields in the parent document. Most filtering use cases target parent fields rather than nested fields.

What solution would you like?
I would like to filter documents on a field in the parent document rather than on a field inside the nested field.

What alternatives have you considered?
Post filtering on parent document

Do you have any additional context?

Create knn index with nested field.

PUT /knn1
{
	"settings": {
		"index": {
			"knn": true,
			"knn.algo_param.ef_search": 100
		}
	},
	"mappings": {
		"properties": {
			"nested_field": {
				"type": "nested",
				"properties": {
					"my_vector1": {
						"type": "knn_vector",
						"dimension": 3,
						"method": {
							"name": "hnsw",
							"space_type": "l2",
							"engine": "faiss",
							"parameters": {
								"ef_construction": 128,
								"m": 24
							}
						}
					}
				}
			}
		}
	}
}

Ingest sample data

PUT /_bulk?refresh=true
{ "index": { "_index": "knn1", "_id": "1" } }
{"nested_field":[{"my_vector1":[1,1,1]},{"my_vector1":[2,2,2]},{"my_vector1":[3,3,3]}], "parking": "false"}
{ "index": { "_index": "knn1", "_id": "2" } }
{"nested_field":[{"my_vector1":[10,10,10]},{"my_vector1":[11,11,11]},{"my_vector1":[12,12,12]}], "parking": "true"}
{ "index": { "_index": "knn1", "_id": "3" } }
{"nested_field":[{"my_vector1":[1,1,1], "parking": "false"},{"my_vector1":[2,2,2]},{"my_vector1":[3,3,3]}]}
{ "index": { "_index": "knn1", "_id": "4" } }
{"nested_field":[{"my_vector1":[10,10,10], "parking": "true"},{"my_vector1":[11,11,11]},{"my_vector1":[12,12,12]}]}

Filter on field inside nested field.

GET knn1/_search
{
	"query": {
		"nested": {
			"path": "nested_field",
			"query": {
				"knn": {
					"nested_field.my_vector1": {
						"vector": [
							1,
							1,
							1
						],
						"k": 2,
						"filter": {
							"bool": {
								"should": [
									{
										"term": {
											"nested_field.parking": "false"
										}
									}
								]
							}
						}
					}
				}
			}
		}
	}
}

Response

{
	"took": 17,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 1,
			"relation": "eq"
		},
		"max_score": 0.0040983604,
		"hits": [
			{
				"_index": "knn1",
				"_id": "4",
				"_score": 0.0040983604,
				"_source": {
					"nested_field": [
						{
							"my_vector1": [
								10,
								10,
								10
							],
							"parking": "true"
						},
						{
							"my_vector1": [
								11,
								11,
								11
							]
						},
						{
							"my_vector1": [
								12,
								12,
								12
							]
						}
					]
				}
			}
		]
	}
}

Filter on field in top level

GET knn1/_search
{
	"query": {
		"nested": {
			"path": "nested_field",
			"query": {
				"knn": {
					"nested_field.my_vector1": {
						"vector": [
							1,
							1,
							1
						],
						"k": 2,
						"filter": {
							"bool": {
								"should": [
									{
										"term": {
											"parking": "false"
										}
									}
								]
							}
						}
					}
				}
			}
		}
	}
}

Response

{
	"took": 7,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 0,
			"relation": "eq"
		},
		"max_score": null,
		"hits": []
	}
}
@navneet1v

Will try to reproduce the issue and see what is causing this behavior

@navneet1v

Was able to reproduce the issue and will add more details soon on the issue.

@navneet1v

navneet1v commented Jan 4, 2024

Root Cause Analysis

To understand the behavior described above, we first need to understand how nested fields are indexed and how nested queries work in OpenSearch/Lucene.

OpenSearch splits a document with a nested field into two parts: a parent document and child documents. The parent document contains all the top-level fields, and each entry of the nested field becomes a child document.
As per the official Lucene documentation (ref), parent and child documents are indexed as a single block. The block contains the child documents first, followed by the parent document at the end. To query the child documents we need to use ToParentBlockJoinQuery, which ensures that if one or more child documents match the query, the docId of the parent document is returned.

During query execution, if OpenSearch identifies that a query targets child documents, or might match child documents, it wraps the whole query in ToParentBlockJoinQuery. Ref1, Ref2
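
The block layout and the child-to-parent join described above can be sketched in plain Java. This is a simplified model and not the Lucene API: the class name ToParentJoinSketch, the method joinToParents, and the doc IDs are made up for illustration. It assumes doc IDs are assigned in block order, with a bit set marking which doc IDs are parents.

```java
import java.util.BitSet;

// Simplified model of block indexing; hypothetical names, not the real Lucene classes.
final class ToParentJoinSketch {

    // Maps every matching child doc ID to the doc ID of its parent. Within a block
    // the children come first and the parent is the last document, so the parent of
    // child c is the first parent doc ID at or after c.
    static BitSet joinToParents(BitSet matchingChildren, BitSet parentBits) {
        BitSet parents = new BitSet();
        for (int child = matchingChildren.nextSetBit(0); child >= 0;
                child = matchingChildren.nextSetBit(child + 1)) {
            int parent = parentBits.nextSetBit(child);
            if (parent >= 0) {
                parents.set(parent);
            }
        }
        return parents;
    }

    public static void main(String[] args) {
        // Two blocks of three children plus one parent:
        // docs 0-2 belong to parent 3, docs 4-6 belong to parent 7.
        BitSet parentBits = new BitSet();
        parentBits.set(3);
        parentBits.set(7);

        // Children matched by some child-level query.
        BitSet matchingChildren = new BitSet();
        matchingChildren.set(1);
        matchingChildren.set(2);
        matchingChildren.set(5);

        System.out.println(joinToParents(matchingChildren, parentBits)); // prints {3, 7}
    }
}
```

Children 1 and 2 collapse into the single parent 3, which mirrors the "one or more child documents match, the parent docId is returned" behavior.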

In efficient filtering (for both Lucene and Faiss), to get the filtered IDs we create a new filter query that has 2 conditions:

  1. A field-exists query for the vector field name.
  2. The filter query provided by the user.

Hence, when this updated filter query runs with a filter on a top-level field, condition 1 fails because the vector field does not exist on the parent documents; it exists only in the child documents.

But when the user-provided filter query is on nested documents, it gets converted to ToParentBlockJoinQuery, which ensures that the right filtered documents are returned for the subsequent vector search.
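
To make the failing case concrete, here is a minimal Java sketch that models the two filter conditions as bit sets over doc IDs and intersects them. It is only an illustration under assumed doc IDs (one block shaped like sample document 1: children 0-2, parent 3); the class FilterIntersectionSketch and its method are hypothetical and are not part of the k-NN plugin.

```java
import java.util.BitSet;

// Hypothetical model of the combined filter; not the k-NN plugin implementation.
final class FilterIntersectionSketch {

    // Conjunction of two filters: only doc IDs matched by both survive.
    static BitSet and(BitSet first, BitSet second) {
        BitSet result = (BitSet) first.clone();
        result.and(second);
        return result;
    }

    public static void main(String[] args) {
        // Condition 1: the vector field exists only on the child docs 0, 1, 2.
        BitSet vectorFieldExists = new BitSet();
        vectorFieldExists.set(0, 3);

        // Condition 2: the top-level field "parking" lives on the parent doc 3.
        BitSet parkingIsFalse = new BitSet();
        parkingIsFalse.set(3);

        // No doc ID satisfies both conditions, so the vector search has nothing to search.
        System.out.println(and(vectorFieldExists, parkingIsFalse)); // prints {}
    }
}
```

The empty intersection corresponds to the zero-hit response shown for the top-level filter.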

Solution

The solution that we will be moving towards is:

We will identify whether the vector field provided in the query is nested or not.

  1. If it is nested, we will check whether the filter clause might match any nested documents.
    1. If yes, we just return the filter query as is, because OpenSearch core will wrap the query correctly.
    2. If no, which means the filter is on top-level fields, we wrap the filter query with ToChildBlockJoinQuery. This query takes a user-provided query that matches parent documents and joins it down to the child documents, which is exactly what we need here: the filter is on top-level fields and the vectors are at the child document level.
  2. If the vector field is not nested, we do nothing and follow the old flow.

final Query filterQuery = createQueryRequest.getFilter().get().toQuery(queryShardContext);
// If the k-NN field is a nested field, parentFilter will not be null. The parentFilter is set by
// OpenSearch core. Ref PR: https://github.com/opensearch-project/OpenSearch/pull/10246
if (queryShardContext.getParentFilter() != null) {
    // If the filter is also a nested query clause, return the same query without
    // joining it with the parent documents.
    if (new NestedHelper(queryShardContext.getMapperService()).mightMatchNestedDocs(filterQuery)) {
        return filterQuery;
    }
    // This branch is hit when the filter applies to top-level fields and the k-NN
    // query field is a nested field. In this case we wrap the filter query with
    // ToChildBlockJoinQuery so that the parent documents matched by the filter are
    // joined with the child documents containing the vector field.
    return new ToChildBlockJoinQuery(filterQuery, queryShardContext.getParentFilter());
}
return filterQuery;
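
The effect of the parent-to-child join can be sketched without Lucene, using the same block layout assumption (children first, parent last). The class ToChildJoinSketch, the method joinToChildren, and the doc IDs are hypothetical and only illustrate the idea; the real ToChildBlockJoinQuery works on Lucene scorers rather than on materialized bit sets.

```java
import java.util.BitSet;

// Simplified model of a parent-to-child join; hypothetical names, not the Lucene API.
final class ToChildJoinSketch {

    // Expands each matching parent doc ID into the doc IDs of its children.
    // The children of parent p are the docs after the previous parent and before p.
    static BitSet joinToChildren(BitSet matchingParents, BitSet parentBits) {
        BitSet children = new BitSet();
        for (int parent = matchingParents.nextSetBit(0); parent >= 0;
                parent = matchingParents.nextSetBit(parent + 1)) {
            // previousSetBit returns -1 when there is no earlier parent,
            // so the first block starts at doc ID 0.
            int firstChild = parentBits.previousSetBit(parent - 1) + 1;
            children.set(firstChild, parent); // end index is exclusive
        }
        return children;
    }

    public static void main(String[] args) {
        // Two blocks: docs 0-2 belong to parent 3, docs 4-6 belong to parent 7.
        BitSet parentBits = new BitSet();
        parentBits.set(3);
        parentBits.set(7);

        // The top-level filter matched only parent 3.
        BitSet matchingParents = new BitSet();
        matchingParents.set(3);

        // The filter now selects child docs, which is where the vectors live.
        System.out.println(joinToChildren(matchingParents, parentBits)); // prints {0, 1, 2}
    }
}
```

Because the joined filter now yields child doc IDs, intersecting it with the field-exists condition on the vector field no longer produces an empty set.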

PR: #1372

Test Plan

To make sure that the changes are backward compatible (BWC) and that all the different permutations and combinations are taken care of, we should test all these cases:

| Sr No. | Case Type | KNN engine | KNN Field | Meta data field | KNN Query | Filter Query | Manual Testing Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Base Case | Faiss | Nested | Nested | Nested | Query contains the nested field, but filter query has no nested field context | Working |
| 2 | | Faiss | Nested | Nested | Nested | Nested Query clause wrapped in Bool query | Working |
| 3 | | Faiss | Nested | Nested | Nested | nested query clause | Working |
| 4 | Reported which was not working | Faiss | Nested | Non Nested | Nested | Non Nested | Working |
| 5 | | Faiss | Non Nested | Nested | Non Nested | nested query clause | Working |
| 6 | | Faiss | Non Nested | Nested | Non Nested | Nested Query clause wrapped in Bool query | Working |
| 7 | Base Case | Faiss | Non Nested | Non Nested | Non Nested | Non Nested | Working |
| 8 | Base Case | Lucene | Nested | Nested | Nested | Query contains the nested field, but filter query has no nested field context | Working |
| 9 | | Lucene | Nested | Nested | Nested | Nested Query clause wrapped in Bool query | Working |
| 10 | | Lucene | Nested | Nested | Nested | nested query clause | Working |
| 11 | Reported which was not working | Lucene | Nested | Non Nested | Nested | Non Nested | Working |
| 12 | | Lucene | Non Nested | Nested | Non Nested | nested query clause | Working |
| 13 | | Lucene | Non Nested | Nested | Non Nested | Nested Query clause wrapped in Bool query | Working |
| 14 | Base Case | Lucene | Non Nested | Non Nested | Non Nested | Non Nested | Working |

Case 1

{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 2,
                        "filter": {
                            "term": {
                                "nested_field.parking": "false"
                            }
                        }
                    }
                }
            }
        }
    }
}

{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 2,
                        "filter": {
                            "bool": {
                                "should": [
                                    {
                                        "term": {
                                            "nested_field.parking": "false"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }
}

Case 2

{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 2,
                        "filter": {
                            "bool": {
                                "should": [
                                    {
                                        "nested": {
                                            "path": "nested_field",
                                            "query": {
                                                "term": {
                                                    "nested_field.parking": "false"
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }
}

Case 3

{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 2,
                        "filter": {
                            "nested": {
                                "path": "nested_field",
                                "query": {
                                    "term": {
                                        "nested_field.parking": "false"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

Case 4

{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 1,
                        "filter": {
                            "bool": {
                                "should": [
                                    {
                                        "term": {
                                            "parking": "false"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }
}

Case 5

{
    "query": {
        "knn": {
            "my_vector1": {
                "vector": [
                    1,
                    1,
                    1
                ],
                "k": 2,
                "filter": {
                    "nested": {
                        "path": "nested_field",
                        "query": {
                            "term": {
                                "nested_field.parking": "false"
                            }
                        }
                    }
                }
            }
        }
    }
}

Case 6

{
    "query": {
        "knn": {
            "my_vector1": {
                "vector": [
                    1,
                    1,
                    1
                ],
                "k": 1,
                "filter": {
                    "bool": {
                        "should": [
                            {
                                "nested": {
                                    "path": "nested_field",
                                    "query": {
                                        "term": {
                                            "nested_field.parking": "false"
                                        }
                                    }
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}

Case 7

{
    "query": {
        "knn": {
            "my_vector1": {
                "vector": [
                    1,
                    1,
                    1
                ],
                "k": 2,
                "filter": {
                    "term": {
                        "parking": "false"
                    }
                }
            }
        }
    }
}

Case 8

{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 2,
                        "filter": {
                            "term": {
                                "nested_field.parking": "false"
                            }
                        }
                    }
                }
            }
        }
    }
}

{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 2,
                        "filter": {
                            "bool": {
                                "should": [
                                    {
                                        "term": {
                                            "nested_field.parking": "false"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }
}

Case 9

{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 2,
                        "filter": {
                            "bool": {
                                "should": [
                                    {
                                        "nested": {
                                            "path": "nested_field",
                                            "query": {
                                                "term": {
                                                    "nested_field.parking": "false"
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }
}

Case 10

{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 2,
                        "filter": {
                            "nested": {
                                "path": "nested_field",
                                "query": {
                                    "term": {
                                        "nested_field.parking": "false"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

Case 11

{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 1,
                        "filter": {
                            "bool": {
                                "should": [
                                    {
                                        "term": {
                                            "parking": "false"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }
}

Case 12

{
    "query": {
        "knn": {
            "my_vector1": {
                "vector": [
                    1,
                    1,
                    1
                ],
                "k": 2,
                "filter": {
                    "nested": {
                        "path": "nested_field",
                        "query": {
                            "term": {
                                "nested_field.parking": "false"
                            }
                        }
                    }
                }
            }
        }
    }
}

Case 13

{
    "query": {
        "knn": {
            "my_vector1": {
                "vector": [
                    1,
                    1,
                    1
                ],
                "k": 1,
                "filter": {
                    "bool": {
                        "should": [
                            {
                                "nested": {
                                    "path": "nested_field",
                                    "query": {
                                        "term": {
                                            "nested_field.parking": "false"
                                        }
                                    }
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}

Case 14

{
    "query": {
        "knn": {
            "my_vector1": {
                "vector": [
                    1,
                    1,
                    1
                ],
                "k": 2,
                "filter": {
                    "term": {
                        "parking": "false"
                    }
                }
            }
        }
    }
}

{
    "properties": {
        "test_nested": {
            "type": "nested",
            "properties": {
                "test_vector": {
                    "type": "knn_vector",
                    "dimension": 1,
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": "lucene"
                    }
                }
            }
        }
    }
}


@navneet1v

Resolving this issue as the feature is merged and will be released in 2.12.
