[FEATURE] Implement pagination for Hybrid Search #280

martin-gaievski · 2023-09-04T22:54:39Z

Is your feature request related to a problem?

The current implementation of hybrid search doesn't support pagination, meaning all results are returned "at once". Pagination is standard for many queries in OpenSearch, and it's expected that hybrid search supports it: https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/.

What solution would you like?

It should be possible to define "from" and "size" as part of a hybrid query request, and the results should contain up to "size" records, starting from the "from" position. The standard syntax should be fine in this case:

{
   "from": 20,
   "size": 10,
   "query": {
       "hybrid": {
           "queries": [
               {}, // First Query
               {}  // Second Query
               // ..... Other Queries
           ]
       }
   }
}

What alternatives have you considered?

It's possible to set a higher size and then discard the first X elements, but that requires extra processing logic on the client side and is not optimal.
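The workaround described above can be sketched roughly as follows. This is only an illustration, not code from the plugin: the helper names `build_overfetch_request` and `slice_page` are made up for this example, and the request is assumed to be sent through a search pipeline that is already configured for hybrid search.

```python
from typing import Any


def build_overfetch_request(
    queries: list[dict[str, Any]], offset: int, page_size: int
) -> dict[str, Any]:
    """Build a hybrid search body that fetches every hit up to the end of the page.

    No "from" is sent; the client asks for offset + page_size hits instead.
    """
    if offset < 0 or page_size <= 0:
        raise ValueError("offset must be >= 0 and page_size must be > 0")
    return {
        "size": offset + page_size,
        "query": {"hybrid": {"queries": queries}},
    }


def slice_page(
    hits: list[dict[str, Any]], offset: int, page_size: int
) -> list[dict[str, Any]]:
    """Discard the first `offset` hits on the client and keep a single page."""
    return hits[offset:offset + page_size]


# Example with stand-in hits instead of a real search response.
fake_hits = [{"_id": str(i)} for i in range(30)]
page = slice_page(fake_hits, offset=20, page_size=10)
```

The cost grows with the page number: page N transfers and discards all hits from the previous pages, which is why this is not a good long-term substitute for server-side pagination.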

ankitas3 · 2024-04-01T08:26:11Z

When can we expect this in OpenSearch?

jackh-ncl · 2024-04-18T10:31:47Z

@martin-gaievski @ankitas3 Just double-checking, but from and size already seem to be working for me with hybrid search on OpenSearch 2.11. Has this been implemented since the issue was raised?

martin-gaievski · 2024-04-18T16:07:33Z

@jackh-ncl it has not been implemented yet. With existing code it may work, but not for every scenario and when it works it does it not in an optimal way.

jackh-ncl · 2024-04-18T18:03:49Z

Aha thank you for confirming, yeah I have now noticed that I seem to consistently end up with a total hits value of 5 * the size parameter with the query I'm running.

benmcginnis · 2024-04-26T18:08:15Z

Any chance we can get this in 2.14 or the version after that?

brandon-carag · 2024-05-07T17:17:27Z

Is anyone from OpenSearch able to provide any broad details around when pagination will be made available?

qmauret · 2024-05-20T12:33:42Z

+1 for this feature request

mkerimyilmaz · 2024-05-30T06:50:56Z

@martin-gaievski Can you please clarify why it doesn't work optimally?

Quoted from @martin-gaievski: "@jackh-ncl it has not been implemented yet. With existing code it may work, but not for every scenario and when it works it does it not in an optimal way."

JPSoteloSilva · 2024-06-04T11:08:50Z

@martin-gaievski Any idea on when this feature will be available? Thanks

brandon-carag · 2024-07-16T17:26:37Z

Hi @vamshin and @martin-gaievski,

Just to briefly restate the case here: pagination is a critical feature for many users adopting hybrid search. There has been significant community interest in this thread (at least 20 distinct users), which also points to strong user demand.

I'm sure there are several competing dev priorities here, but this seems like a core item that shouldn't fall by the wayside. At the very least, can we get a very rough timeline here? I suspect there are many people on this thread whose downstream roadmaps are reliant on what happens here.

Thanks in advance

vamshin · 2024-07-16T19:04:18Z

@brandon-carag Thank you for your interest in pagination functionality. We understand the importance of this feature, and it's on our roadmap. However, at the moment, we're focusing our efforts on enhancing the sorting and explain (raw scores for debugging) capabilities.

While we don't have a definite timeline for implementing pagination, we estimate it could be available towards the end of the year. Please note that this is a rough estimate, and the actual timeline may vary depending on our priorities and resource availability.

We're always open to collaboration and would be delighted if someone from the community is willing to contribute to this feature. If you or anyone else is interested in contributing, please feel free to reach out to us. We can provide guidance and support to ensure a smooth integration of the pagination functionality.

As a workaround, is it possible to use a "size" parameter with a large value? I think we can get up to 10K results.

brandon-carag · 2024-07-16T19:43:51Z

@vamshin Thanks for the prompt response--that's helpful info, the rough timeline you mentioned is useful to know. I believe martin mentioned above "it has not been implemented yet. With existing code it may work, but not for every scenario and when it works it does it not in an optimal way." I'm not exactly clear on the boundaries where the existing pagination logic breaks down in the existing implementation.

For even a relatively small corpus, it seems like specifying a large size and processing client-side won't scale for even a relatively small index, particularly if OpenSearch is being used to service API requests and apply pagination. As such, hoping this feature can emerge sooner rather than later.

sonic182 · 2024-08-22T09:33:34Z

Has anyone tried the scroll API as an alternative? Does it work?

brandon-carag · 2024-08-22T17:20:24Z

My understanding is that the scroll API won't solve this issue. Documentation states, "Because search contexts consume a lot of memory, we suggest you don’t use the scroll operation for frequent user queries. Instead, use the sort parameter with the search_after parameter to scroll responses for user queries." https://opensearch.org/docs/latest/api-reference/scroll/
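For readers unfamiliar with the sort plus search_after pattern that the quoted documentation recommends, here is a minimal sketch of the request bodies involved. It uses a plain match query; whether a hybrid query accepts "sort" and "search_after" in a given OpenSearch version is not established in this thread. The field names `published_at` and `doc_id` and both helper names are hypothetical.

```python
from typing import Any, Optional


def build_search_after_request(
    query: dict[str, Any],
    sort: list[dict[str, str]],
    page_size: int,
    last_sort_values: Optional[list[Any]] = None,
) -> dict[str, Any]:
    """Build one page request; pass the previous page's cursor to continue."""
    body: dict[str, Any] = {"size": page_size, "query": query, "sort": sort}
    if last_sort_values is not None:
        # Resume right after the last hit of the previous page.
        body["search_after"] = last_sort_values
    return body


def next_cursor(hits: list[dict[str, Any]]) -> Optional[list[Any]]:
    """Return the sort values of the last hit, or None when the page is empty."""
    return hits[-1]["sort"] if hits else None


# Hypothetical fields: a date field plus a unique keyword field as a tiebreaker.
sort_spec = [{"published_at": "desc"}, {"doc_id": "asc"}]
query = {"match": {"title": "hurricane"}}

first_request = build_search_after_request(query, sort_spec, page_size=20)

# Stand-in for the "hits" array of the first response.
first_hits = [
    {"_id": "a", "sort": [1700000000000, "a"]},
    {"_id": "b", "sort": [1690000000000, "b"]},
]
second_request = build_search_after_request(
    query, sort_spec, page_size=20, last_sort_values=next_cursor(first_hits)
)
```

The sort needs a unique tiebreaker field so that no two hits share the same cursor values; otherwise hits can be skipped or repeated between pages.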

Romasato · 2024-09-06T17:56:59Z

While we wait for pagination support in Hybrid queries, has anyone found some other way (even if it takes 2+ separate queries) to normalize results of neural+lexical query combo?

Really struggling to figure out the best way to combine semantic with lexical search, as the scores of neural queries fall in [0..1] and don't influence the results much, if at all.

If someone could point me in the right direction, I would really appreciate it.
Thank you
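One generic way to combine two separately executed queries on the client is to rescale each result list to [0, 1] and then take a weighted average per document. The sketch below shows min-max normalization followed by a weighted arithmetic mean. It is an illustration of the general technique under stated assumptions, not the plugin's implementation: the function names, the default weight, and the choice to score a document as 0 in a list where it does not appear are all assumptions of this example.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them all to 1.0."""
    if not scores:
        return {}
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {doc_id: 1.0 for doc_id in scores}
    span = highest - lowest
    return {doc_id: (score - lowest) / span for doc_id, score in scores.items()}


def combine_scores(
    lexical: dict[str, float],
    neural: dict[str, float],
    lexical_weight: float = 0.5,
) -> list[tuple[str, float]]:
    """Weighted arithmetic mean of the normalized scores, best document first."""
    lex = min_max_normalize(lexical)
    neu = min_max_normalize(neural)
    combined = {
        doc_id: lexical_weight * lex.get(doc_id, 0.0)
        + (1.0 - lexical_weight) * neu.get(doc_id, 0.0)
        for doc_id in lex.keys() | neu.keys()
    }
    # Sort by score descending, then by id so that ties are deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores: BM25 scores are unbounded, neural scores fall in [0, 1].
bm25_scores = {"a": 12.0, "b": 6.0, "c": 3.0}
neural_scores = {"b": 0.9, "c": 0.8, "d": 0.5}
ranking = combine_scores(bm25_scores, neural_scores)
```

Because both lists are rescaled before they are mixed, the unbounded lexical scores no longer drown out the neural scores. Changing `lexical_weight` shifts the balance between the two.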

safakkbilici · 2024-09-24T14:24:40Z

Is there any update on this feature? It’s currently listed in the 'Upcoming release (TBD)' section, but I’d like to know which version it will be released under.

heemin32 · 2024-10-09T19:22:45Z

@brandon-carag

Quoted from @brandon-carag: "For even a relatively small corpus, it seems like specifying a large size and processing client-side won't scale for even a relatively small index, particularly if OpenSearch is being used to service API requests and apply pagination. As such, hoping this feature can emerge sooner rather than later."

Could you share some info on the number of documents per page, the maximum number of pages, a sample query, and the acceptable latency?

When you say that specifying a large size and processing client-side won't scale, is the issue on the OpenSearch side or the client side?

Could you do something like this: 1. retrieve only the document IDs and cache them on the client side, 2. get the documents for each page using the doc IDs?
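The two-step approach suggested above could look roughly like this on the client. It is a sketch only: all helper names are hypothetical, the caching layer itself is left out, and the hybrid request is assumed to go through an already configured search pipeline.

```python
from typing import Any


def build_id_only_request(
    queries: list[dict[str, Any]], max_results: int
) -> dict[str, Any]:
    """Step 1: run the hybrid query once, returning hits without their _source."""
    return {
        "size": max_results,
        "_source": False,
        "query": {"hybrid": {"queries": queries}},
    }


def page_of_ids(cached_ids: list[str], page: int, page_size: int) -> list[str]:
    """Pick the IDs for a zero-based page from the cached, ranked ID list."""
    start = page * page_size
    return cached_ids[start:start + page_size]


def build_fetch_request(ids: list[str]) -> dict[str, Any]:
    """Step 2: fetch the full documents for one page with an ids query."""
    return {"size": len(ids), "query": {"ids": {"values": ids}}}


def restore_rank_order(
    hits: list[dict[str, Any]], ids: list[str]
) -> list[dict[str, Any]]:
    """An ids query does not keep the requested order, so re-sort on the client."""
    by_id = {hit["_id"]: hit for hit in hits}
    return [by_id[doc_id] for doc_id in ids if doc_id in by_id]


# Stand-in for the ranked IDs cached after step 1.
cached = [f"doc-{i}" for i in range(45)]
third_page_ids = page_of_ids(cached, page=2, page_size=20)
fetch_body = build_fetch_request(third_page_ids)
```

The ranked ID list has to be cached per distinct query, which is where the extra infrastructure cost of this approach comes from.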

brandon-carag · 2024-10-09T23:48:44Z

Heemin, we appreciate you reaching out for our feedback. We use OpenSearch to power both our web UI and API responses. We limit the max results to 10k documents with a max per page size of 2k. However, our default response size is 20 results with the ability to paginate up to 50 pages (page size is not a requirement). This allows us to keep the typical query responsive while supporting use cases that need larger data sets.

We'd like to move from our current use of the standard lexical BM25 search to hybrid search in order to improve the quality of our responses. Without pagination, this transition is more difficult as it would require our API users to make changes to their requests in order to maintain the same behavior.

While we could implement a cache and retrieval mechanism outside of OpenSearch as you mentioned, it would add a lot of infrastructure overhead as we serve a very diverse set of queries with varying response sizes and a long tail.

Our document corpus is roughly 1 million documents ranging from 1 to ~400 pages. Large response size/pagination latency is less of a concern as these are generally automated API requests in our system. Additionally new documents are only indexed a few times a day so we are also less affected by inconsistent ordering due to index changes (we are aware the OpenSearch docs warn about results not being frozen in time).

A typical query might look like:
UI generated: https://www.federalregister.gov/api/v1/documents.json?per_page=20&conditions%5Bterm%5D=hurricane
API: https://www.federalregister.gov/api/v1/documents.json?per_page=1000&conditions%5Bterm%5D=hurricane

We're happy to answer any additional questions as needed. Thanks!

heemin32 · 2024-10-10T00:18:23Z

@brandon-carag Thanks for the reply. Just to confirm: when you say latency is less of a concern, do you mean that it is okay for your use case even if the latency of the first page is the same as the latency of the last page?

brandon-carag · 2024-10-10T00:44:08Z

@heemin32 Sure--could you give us a very rough sense of what the latency difference might be for BM25-based pagination vs. hybrid-pagination? Additionally, if you could clarify a bit whether that would mean we'd see a substantial increase in latency for the first page of results for queries returning larger sets of results (eg > 10k) that would be helpful. If the overall latency increase wasn't substantial that would probably be fine for our use case. Thanks!

heemin32 · 2024-10-10T00:50:45Z

@brandon-carag Regardless of which page it is, the latency could be the same as loading the last page. For example, with your BM25-based pagination, the latency for the first page could be the same as the latency for the last page.

vibrantvarun · 2024-10-15T04:03:39Z

Please check out the RFC on pagination, #933.

martin-gaievski added Enhancements Increases software capabilities beyond original client specifications untriaged labels Sep 4, 2023

navneet1v added this to Vector Search RoadMap Sep 6, 2023

github-project-automation bot moved this to Backlog in Vector Search RoadMap Sep 6, 2023

navneet1v added backlog All the backlog features should be marked with this label and removed untriaged labels Sep 6, 2023

martin-gaievski mentioned this issue Sep 25, 2023

Add hybrid search blog opensearch-project/project-website#2182

Merged

1 task

vamshin assigned vamshin, martin-gaievski and vibrantvarun Jul 16, 2024

vamshin moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Jul 16, 2024

vamshin added the hybrid search label Jul 16, 2024

sonic182 mentioned this issue Aug 22, 2024

[BUG] Error doing pagination with hybrid searches opensearch-project/OpenSearch#15350

Open

github-project-automation bot added this to OpenSearch Project Roadmap Aug 30, 2024

github-project-automation bot moved this to Upcoming release (TBD) in OpenSearch Project Roadmap Aug 30, 2024

vamshin added v2.19.0 Roadmap:Vector Database/GenAI Project-wide roadmap label and removed hybrid search labels Sep 24, 2024

opensearch-infra bot added this to OpenSearch Roadmap Sep 30, 2024

github-project-automation bot moved this to New in OpenSearch Roadmap Sep 30, 2024

vamshin moved this from Backlog (Hot) to 2.19.0 in Vector Search RoadMap Sep 30, 2024

minalsha unassigned vamshin and martin-gaievski Nov 4, 2024

martin-gaievski mentioned this issue Nov 15, 2024

[Feature] Pagination in hybrid query #963

Open

5 tasks

martin-gaievski mentioned this issue Dec 12, 2024

[DOC] Adding pagination in Hybrid query opensearch-project/documentation-website#8952

Open

4 tasks
