[BUG] reranking does not work for nested fields #657
Comments
@HenryL27 can you look into this? I am unable to assign this issue to you.
Following the same steps, I get these responses:
// query_text = "text"
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": -4.130314,
"hits": [
{
"_index": "test_nested",
"_id": "1",
"_score": -4.130314,
"_source": {
"content": [
{
"text": "good morning",
"page": 1
},
{
"text": "good evening",
"page": 2
}
]
}
},
{
"_index": "test_nested",
"_id": "2",
"_score": -4.130314,
"_source": {
"content": [
{
"text": "studying math in my room",
"page": 1
},
{
"text": "playing playstation with my family",
"page": 2
}
]
}
}
]
},
"profile": {
"shards": []
}
}
// query_text = "math"
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": -1.7839141,
"hits": [
{
"_index": "test_nested",
"_id": "1",
"_score": -1.7839141,
"_source": {
"content": [
{
"text": "good morning",
"page": 1
},
{
"text": "good evening",
"page": 2
}
]
}
},
{
"_index": "test_nested",
"_id": "2",
"_score": -1.7839141,
"_source": {
"content": [
{
"text": "studying math in my room",
"page": 1
},
{
"text": "playing playstation with my family",
"page": 2
}
]
}
}
]
},
"profile": {
"shards": []
}
}
So it would appear that it is doing something, although since the scores are the same for all the docs, it's probably not doing the right thing. Nested fields are weird. @asfoorial what is the expected behavior in this case?
Sorry, I had a typo: the query should be query_text="hello", which should return the document with id=1 first. query_text="math" should return the document with id=2 first. Remember, rerank is expected to rerank documents based on the "content.text" nested field with respect to query_text. So rerank should take the documents returned by the kNN search, score their content.text values, and reflect those scores in the corresponding parent documents.
Rerank is a crucial operation that brings a drastic improvement to search accuracy. Indexing each sentence as a separate document can take more storage space (especially if there is a lot of metadata per document) and longer indexing time (can be more than 20%) compared to nested kNN. So having rerank work for nested fields is expected to bring a major improvement to the overall experience in OpenSearch.
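To make the expected behavior concrete, here is a minimal Python sketch (not the plugin's code). It assumes the parent score is the maximum score over the document's nested `content.text` values, which is one possible reading of "reflected in the parent document". `token_overlap` is a hypothetical stand-in for the cross-encoder, used only so the example runs without a model.

```python
def token_overlap(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query tokens present in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank_by_nested_text(hits, query, score_fn=token_overlap):
    """Rescore each hit by its best nested passage, then sort hits by the new score."""
    rescored = []
    for hit in hits:
        passages = [chunk["text"] for chunk in hit["_source"].get("content", [])]
        # Assumption: a parent document is as relevant as its most relevant chunk.
        best = max((score_fn(query, p) for p in passages), default=0.0)
        rescored.append({**hit, "_score": best})
    return sorted(rescored, key=lambda h: h["_score"], reverse=True)


hits = [
    {"_id": "1", "_score": 1.0, "_source": {"content": [
        {"text": "good morning", "page": 1},
        {"text": "good evening", "page": 2},
    ]}},
    {"_id": "2", "_score": 1.0, "_source": {"content": [
        {"text": "studying math in my room", "page": 1},
        {"text": "playing playstation with my family", "page": 2},
    ]}},
]

reranked = rerank_by_nested_text(hits, "math")
print([h["_id"] for h in reranked])
```

With the two test documents from this issue and query_text="math", document 2 moves to the top because one of its chunks contains the query term.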
It appears that reranking also doesn't work on fields inside of objects, e.g.
I will give it a try. If this is a bug, is there any chance it will be fixed in OS 2.13.0?
Unlikely. The fix seems nontrivial, and 2.13 releases tomorrow.
Yes, I am getting the same error.
Looking at the code, I noticed that the method below checks fields and source but does not check the source in inner hits, which is what we need. In fact, rerank should be applied to inner hits, so that if we return inner_hits they come back in reranked order and the rerank score is reflected in the parent document.
In a practical scenario, I would have PDF documents that I index into OpenSearch documents. The content of each document is converted into text chunks (sentences or paragraphs) and saved into a nested kNN field in the document. The user then wants to find the most relevant PDF document and also the most relevant chunk in that document for an input query. So if I search for "math", the result should have document id=2 first, and its inner hits should have { "text": "studying math in my room", "page": 1 } as the first element, since it is the most relevant chunk.
private String contextFromSearchHit(final SearchHit hit, final String field) {
Hey @asfoorial, sorry it's taken me so long to get to this. I'm starting work on implementing a nested context source fetcher and I wanted to run my design by you to make sure it satisfies your requirements.
{
"rerank": {
"ml_opensearch": {
"model_id": "ael;rjthnerotphiwntr"
},
"context": {
"nested_document_field": "content.text"
}
}
}
Another alternative is to allow the nested fetcher to accept a list of fields and concatenate them, as with the regular document field fetcher. This gets complicated really quickly when you add more than one nested field (Cartesian products are hard to sort), so we would limit it to one nested field in the list of fields.
How does that sound? Also, do you anticipate a need to rerank based on nestings deeper than one layer, or can I impose another limit there? At a certain point I worry about performance.
@HenryL27 can we use the same document context source, and treat a field name that contains a . as a nested field? That is how OpenSearch also handles nested fields. The only limitation I know of is when it's an array field.
@navneet1v certainly, fixing general xpaths was the first thing I did yesterday.
{
"_source": {
"inner": [
{ "inner2": [
{ "text": "hello" },
{ "text": "world" }
]},
{ "inner2": [
{ "text": "goodbye" },
{ "text": "moon" }
]}
]
}
}
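For readers following along, the dotted-path idea can be sketched in Python as below. This is an illustration only, not the plugin's fetcher: `collect_values` is a hypothetical helper that walks a dotted path through `_source` and fans out whenever it meets an array, which is the case that makes nested fields awkward.

```python
from typing import Any


def collect_values(source: Any, path: str) -> list:
    """Collect every value reachable via a dotted path, fanning out over arrays."""
    nodes = [source]
    for key in path.split("."):
        next_nodes = []
        for node in nodes:
            # An array of objects contributes each element as a lookup candidate.
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_nodes.append(candidate[key])
        nodes = next_nodes
    # The final values may themselves be arrays; flatten one level.
    leaves = []
    for node in nodes:
        if isinstance(node, list):
            leaves.extend(node)
        else:
            leaves.append(node)
    return leaves


source = {
    "inner": [
        {"inner2": [{"text": "hello"}, {"text": "world"}]},
        {"inner2": [{"text": "goodbye"}, {"text": "moon"}]},
    ]
}

texts = collect_values(source, "inner.inner2.text")
print(texts)
```

On the doubly nested example above, the path `inner.inner2.text` yields four strings in document order. The sketch loses which parent element each string came from, which is exactly the information a nested reranker would need to keep in order to sort inner hits.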
@HenryL27 thank you and the team for getting back to me on this issue. Yes, the first option will satisfy my requirement. I also recommend keeping it simple, with one nested field and one nesting level. In addition, it would be good to allow returning inner hits with sub-documents sorted by the reranking result. One question about this, though: will there be multiple calls to the cross-encoder model (a call per document)? I am only concerned about the performance here. One last thing: currently I use a truncate processor in the search pipeline to prevent users from abusing ML nodes by reranking a large number of hits at a time. This works well for normal fields, but how would it work for nested fields? And is there a way to get a similar truncate behavior on nested fields? Thanks
Music to my ears! We'll try to batch the inferences, although your main source of latency is gonna be based on the number of things you're reranking, not the number of inference invocations/network calls. If we want to re-score all the nests then we need to re-score all the nests. I don't think the truncation processor works on nested fields, but I could be wrong. And as far as I'm aware there isn't one that works specifically on nests... so this will potentially make your system a little abusable, although I'm not sure of the utility of this kind of abuse apart from a malicious attack. But maybe a nested truncation feature request is in order? And yes, inner sorting should be easy.
@HenryL27 are there any plans to pick this up?
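A rough Python sketch of the batching and inner-sorting idea follows. It is not the plugin's design: `score_batch` stands in for a single batched cross-encoder call, `overlap_batch` is a hypothetical toy scorer, and `max_passages` is a hypothetical cap illustrating what a nested truncation feature might do. The parent score is assumed to be the best chunk score.

```python
def batch_rerank_nested(hits, query, score_batch, max_passages=None):
    """Score all nested chunks in one batched call, then regroup them per hit."""
    # Flatten every chunk into (hit index, chunk index, text).
    pairs = []
    for hi, hit in enumerate(hits):
        for ci, chunk in enumerate(hit["_source"].get("content", [])):
            pairs.append((hi, ci, chunk["text"]))
    if max_passages is not None:
        pairs = pairs[:max_passages]  # hypothetical cap on total chunks scored

    scores = score_batch(query, [text for _, _, text in pairs])  # one model call

    per_hit = {hi: [] for hi in range(len(hits))}
    for (hi, ci, _), score in zip(pairs, scores):
        per_hit[hi].append((score, ci))

    results = []
    for hi, hit in enumerate(hits):
        chunks = hit["_source"].get("content", [])
        ranked = sorted(per_hit[hi], key=lambda sc: sc[0], reverse=True)
        results.append({
            "_id": hit["_id"],
            # Assumption: parent score is the best chunk score.
            "_score": ranked[0][0] if ranked else float("-inf"),
            "inner_hits": [chunks[ci] for _, ci in ranked],
        })
    return sorted(results, key=lambda r: r["_score"], reverse=True)


calls = []  # records the batch size of each scorer invocation


def overlap_batch(query, passages):
    """Hypothetical toy scorer: share of query tokens present in each passage."""
    calls.append(len(passages))
    q = set(query.lower().split())
    return [len(q & set(p.lower().split())) / max(len(q), 1) for p in passages]


hits = [
    {"_id": "1", "_source": {"content": [
        {"text": "good morning", "page": 1},
        {"text": "good evening", "page": 2},
    ]}},
    {"_id": "2", "_source": {"content": [
        {"text": "studying math in my room", "page": 1},
        {"text": "playing playstation with my family", "page": 2},
    ]}},
]

results = batch_rerank_nested(hits, "math", overlap_batch)
print(results[0]["_id"], results[0]["inner_hits"][0]["text"])
```

All four chunks go through one scorer call here, which matches the point above: the number of chunks scored drives the cost, not the number of calls.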
Hi all,
How do I use a cross-encoder model to rerank a nested text field?
I tried the steps below, but rerank does not seem to have any effect!
PUT /_search/pipeline/ms-marco-MiniLM-L-12-v2_rerank
{
"response_processors": [
{
"rerank": {
"ml_opensearch": {
"model_id": "zyfmI44BdjEy2-xVoJ0V"
},
"context": {
"document_fields": [ "content.text"]
}
}
}
]
}
PUT test_nested
{
"mappings": {
"properties": {
"content": {
"type": "nested",
"properties": {
"text":{
"type":"text"
},
"page":{
"type":"long"
}
}
}
}
}
}
PUT test_nested/_doc/1
{
"content":[{"text":"good morning","page": 1},
{"text":"good evening","page": 2}
]
}
PUT test_nested/_doc/2
{
"content":[{"text":"studying math in my room","page": 1},
{"text":"playing playstation with my family","page": 2}
]
}
GET test_nested/_search?search_pipeline=ms-marco-MiniLM-L-12-v2_rerank
{
"query": {
"match_all": {}
},
"ext": {
"rerank": {
"query_context": {
"query_text": "text"
}
}
}
}