Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] reranking does not work for nested fields #657

Open
asfoorial opened this issue Mar 29, 2024 · 15 comments
Open

[BUG] reranking does not work for nested fields #657

asfoorial opened this issue Mar 29, 2024 · 15 comments
Assignees
Labels
bug Something isn't working

Comments

@asfoorial
Copy link

asfoorial commented Mar 29, 2024

Hi all,

How to use cross-encoder model to rerank a nested text field?

I tried the below but it seems like rerank does not have any effect!

PUT /_search/pipeline/ms-marco-MiniLM-L-12-v2_rerank
{
“response_processors”: [
{
“rerank”: {
“ml_opensearch”: {
“model_id”: “zyfmI44BdjEy2-xVoJ0V”
},
“context”: {
“document_fields”: [ “content.text”]
}
}
}
]
}

PUT test_nested
{
“mappings”: {
“properties”: {
“content”: {
“type”: “nested”,
“properties”: {
“text”:{
“type”:“text”
},
“page”:{
“type”:“long”
}
}
}
}
}
}

PUT test_nested/_doc/1
{
“content”:[{“text”:“good morning”,“page”: 1},
{“text”:“good evening”,“page”: 2}
]
}

PUT test_nested/_doc/2
{
“content”:[{“text”:“studying math in my room”,“page”: 1},
{“text”:“playing playstation with my family”,“page”: 2}
]
}

GET test_nested/_search?search_pipeline=ms-marco-MiniLM-L-12-v2_rerank
{
“query”: {
“match_all”: {}
},
“ext”: {
“rerank”: {
“query_context”: {
“query_text”: “text"
}
}
}
}

@asfoorial asfoorial added bug Something isn't working untriaged labels Mar 29, 2024
@navneet1v
Copy link
Collaborator

navneet1v commented Apr 1, 2024

@HenryL27 can you look into this? I am unable to assign this issue to you.

@HenryL27
Copy link
Contributor

HenryL27 commented Apr 1, 2024

Following the same steps, I get responses:

// query_text = "text"
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": -4.130314,
    "hits": [
      {
        "_index": "test_nested",
        "_id": "1",
        "_score": -4.130314,
        "_source": {
          "content": [
            {
              "text": "good morning",
              "page": 1
            },
            {
              "text": "good evening",
              "page": 2
            }
          ]
        }
      },
      {
        "_index": "test_nested",
        "_id": "2",
        "_score": -4.130314,
        "_source": {
          "content": [
            {
              "text": "studying math in my room",
              "page": 1
            },
            {
              "text": "playing playstation with my family",
              "page": 2
            }
          ]
        }
      }
    ]
  },
  "profile": {
    "shards": []
  }
}
// query_text = "math"
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": -1.7839141,
    "hits": [
      {
        "_index": "test_nested",
        "_id": "1",
        "_score": -1.7839141,
        "_source": {
          "content": [
            {
              "text": "good morning",
              "page": 1
            },
            {
              "text": "good evening",
              "page": 2
            }
          ]
        }
      },
      {
        "_index": "test_nested",
        "_id": "2",
        "_score": -1.7839141,
        "_source": {
          "content": [
            {
              "text": "studying math in my room",
              "page": 1
            },
            {
              "text": "playing playstation with my family",
              "page": 2
            }
          ]
        }
      }
    ]
  },
  "profile": {
    "shards": []
  }
}

So it would appear that it is doing something, although since the scores are the same for all the docs it's probably not doing the right thing. Nested fields are weird. @asfoorial what is the expected behavior in this case?

@asfoorial
Copy link
Author

Sorry, I had a typo query_text="hello". It should return the document with id=1 as the first.

query_text="math" should return document with id=2 as the first.

Remember, rerank is expected to rerank documents based on "content.text" nested field with respect to query_text. So rerank should basically take the resulting documents from kNN search and rerank their content.text values. The scores should reflected to their corresponding documents.

@asfoorial
Copy link
Author

asfoorial commented Apr 1, 2024

Rerank is a crucial operation that brings a drastic improvement to search accuracy. Indexing each sentence as a document can occupy more storage space (especially if there is a lot of metadata per document) and takes longer indexing time (can be more than 20%) compared to nested kNN. So, having rerank working for nested fields is expected to bring major improvement to the overall experience in OpenSearch.

@HenryL27
Copy link
Contributor

HenryL27 commented Apr 1, 2024

It appears that reranking also doesn't work on fields inside of objects, e.g. properties.title. Pretty sure this code is the culprit. Could you confirm that you have this warning message in your logs?

Could not find field content.text in document 1 for reranking! Using the empty string instead.

@asfoorial
Copy link
Author

I will give it a try. If this is a bug, is there any chance it will be fixed in OS 2.13.0?

@HenryL27
Copy link
Contributor

HenryL27 commented Apr 1, 2024

Unlikely. the fix seems nontrivial and 2.13 releases tomorrow.

@asfoorial
Copy link
Author

It appears that reranking also doesn't work on fields inside of objects, e.g. properties.title. Pretty sure this code is the culprit. Could you confirm that you have this warning message in your logs?

Could not find field content.text in document 1 for reranking! Using the empty string instead.

Yes I am getting the same error.

@asfoorial
Copy link
Author

Looking at the code, I noticed that the below method is checking fields and source but does not check the source in inner hits which is what we need. In fact rerank should be applied to inner hits so that if we return inner_hits they should be returned in a reranked order and the rerank score should be reflected in the parent document. In a practical scenario, I would have PDF documents that I index into OpenSearch documents. The content of each document will be converted into text chunks (sentences or paragraphs) and then saved into nested kNN in the document. Now, the user would be interested to find the most relevant PDF document and also the most relevant chunk in that document to an input query. So if I search for "math" then the result should have document id=2 as the first result and in the inner hits should have the { "text": "studying math in my room", "page": 1 } as the first element in the inner hits since it is the most relevant chunk.

private String contextFromSearchHit(final SearchHit hit, final String field) {
if (hit.getFields().containsKey(field)) {
Object fieldValue = hit.field(field).getValue();
return String.valueOf(fieldValue);
} else if (hit.hasSource() && hit.getSourceAsMap().containsKey(field)) {
Object sourceValue = ObjectPath.eval(field, hit.getSourceAsMap());
return String.valueOf(sourceValue);
} else {
log.warn(
String.format(
Locale.ROOT,
"Could not find field %s in document %s for reranking! Using the empty string instead.",
field,
hit.getId()
)
);
return "";
}
}

@HenryL27
Copy link
Contributor

HenryL27 commented May 7, 2024

Hey @asfoorial, sorry it's taken me so long to get to this. I'm starting work on implementing a nested context source fetcher and I wanted to run my design by you to make sure it satisfies your requirements.

  1. We will implement a new context source fetcher, nested_document_field that does nested stuff:
{
  "rerank": {
    "ml_opensearch": {
      "model_id": "ael;rjthnerotphiwntr"
    },
    "context": {
      "nested_document_field": "content.text"
    }
  }
}

Another alternative is to allow the nested fetcher to accept a list of fields and concatenate them, as with the regular document field fetcher. This gets complicated really quickly when you add more than one nested field, (cartesian products are hard to sort), so we would limit it to one nested field in the list of fields.

  1. At rerank time, we'll rescore every sub-document, sort within each document, and then take the max sub-score as the overall document score and re-sort the top level docs.

How does that sound?

Also do you anticipate a need to rerank based on deeper nestings than one layer or can I impose another limit there? At a certain point I worry about performance

@navneet1v
Copy link
Collaborator

@HenryL27 can we use the same document context source and if string contains . in field name then its nested field. That us how opensearch also handle nested fields.

Only limitation I know for this is if its an array field.

@HenryL27
Copy link
Contributor

HenryL27 commented May 7, 2024

@navneet1v certainly, fixing general xpaths was the first thing I did yesterday.
I this issue is specifically about nested arrays - see the index mappings at the beginning of this thread - so the array stuff is precisely what we need to solve, right?
I suppose there is a solution where the logic to deal with nested arrays is just a part of the document context fetcher. You have to impose the limitation of one nested field as in the "alternative solution" above, or just forgo inner-hit sorting altogether. And we still have the question of doubly-/triply-/etc nested documents. e.g.

{
  "_source": {
    "inner": [
      { "inner2": [
        { "text": "hello" },
        { "text": "world" }
      ]},
      { "inner2": [
        { "text": "goodbye" },
        { "text": "moon" }
      ]}
    ]
  }
}

@asfoorial
Copy link
Author

@HenryL27 thank you and the team for returning back to me on this issue. Yes the first open will satisfy my requirement. I also recommend keeping it simple with one nested field and one nesting level.

In addition, it would be also good to allow returning inner hits with sub-documents sorted based on the reranking result. One question about this though, will there be multiple calls to the cross encoder model (a call per document?) I am only concerned about the performance here.

One last thing, current I use a truncate processor in the search pipeline to prevent users from abusing ml nodes with reranking over large number of hits at a time. This works well for normal fields. But how would work for nested fields? And is there a way have a similar truncate behavior done on nested fields?

Thanks

@HenryL27
Copy link
Contributor

HenryL27 commented May 7, 2024

I also recommend keeping it simple with one nested field and one nesting level.

Music to my ears!

We'll try to batch the inferences, although your main source of latency is gonna be based on the number of things you're reranking, not the number of inference invocations/network calls. If we want to re-score all the nests then we need to re-score all the nests.

I don't think the truncation processor works on nested fields, but I could be wrong. And as far as I'm aware there isn't one that works specifically on nests... so this will potentially make your system a little abuseable, although I'm not sure of the utility of this kind of abuse apart from a malicious attack. But maybe a nested truncation feature request is in order?

And yes, inner sorting should be easy.

@vamshin vamshin assigned navneet1v and unassigned navneet1v Jul 2, 2024
@vamshin
Copy link
Member

vamshin commented Jul 2, 2024

@HenryL27 are there any plans to pick this up?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Backlog (Hot)
Development

No branches or pull requests

5 participants