[FEATURE] Implement pagination for Hybrid Search #280

Open
martin-gaievski opened this issue Sep 4, 2023 · 22 comments
Assignees
Labels
backlog (All the backlog features should be marked with this label), Enhancements (Increases software capabilities beyond original client specifications), Roadmap:Vector Database/GenAI (Project-wide roadmap label), v2.19.0

Comments

@martin-gaievski
Member

Is your feature request related to a problem?

The current implementation of hybrid search does not support pagination, meaning all results are returned at once. Pagination is standard for many queries in OpenSearch, and hybrid search is expected to support it: https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/.

What solution would you like?

It should be possible to define "from" and "size" as part of the hybrid query, and the results should contain "size" records starting from the "from" position. The standard syntax should be fine in this case:

{
   "from": 20,
   "size": 10,
   "query": {
       "hybrid": {
           "queries": [
               {}, // First query
               {}, // Second query
               // ... other queries
           ]
       }
   }
}
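
As an illustration, a request of this shape could be assembled programmatically. This is a minimal sketch under assumptions: the helper name `build_hybrid_page_request`, the field names `text` and `embedding`, and the `<model-id>` placeholder are hypothetical, and the sub-queries are nested under `hybrid.queries`. It only builds the request body; it does not show that the server honors "from" for hybrid queries.

```python
import json
from typing import Any


def build_hybrid_page_request(
    sub_queries: list[dict[str, Any]], from_: int, size: int
) -> dict[str, Any]:
    """Build a search body asking for `size` hits starting at offset `from_`."""
    if from_ < 0 or size < 0:
        raise ValueError("from_ and size must be non-negative")
    if not sub_queries:
        raise ValueError("at least one sub-query is required")
    return {
        "from": from_,
        "size": size,
        "query": {"hybrid": {"queries": list(sub_queries)}},
    }


# Hypothetical sub-queries: one lexical match and one neural query.
body = build_hybrid_page_request(
    [
        {"match": {"text": "hurricane"}},
        {
            "neural": {
                "embedding": {
                    "query_text": "hurricane",
                    "model_id": "<model-id>",  # placeholder, not a real model
                    "k": 100,
                }
            }
        },
    ],
    from_=20,
    size=10,
)
print(json.dumps(body, indent=2))
```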

What alternatives have you considered?

It's possible to set a higher size and then discard the first X elements, but that requires extra processing logic on the client side and is not optimal.
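
The client-side alternative described here can be sketched as follows. The function name is hypothetical, and the `hits` list stands in for the "hits.hits" array of a response that was requested with size set to from + size.

```python
from typing import Any


def page_from_oversized_response(
    hits: list[dict[str, Any]], from_: int, size: int
) -> list[dict[str, Any]]:
    """Keep only one page out of a response fetched with size = from_ + size."""
    if from_ < 0 or size < 0:
        raise ValueError("from_ and size must be non-negative")
    # Everything before `from_` was transferred and is thrown away here,
    # which is the extra client-side cost of this workaround.
    return hits[from_ : from_ + size]


# Stand-in for a response fetched with size=30 (from=20 plus size=10).
hits = [{"_id": str(i)} for i in range(30)]
page = page_from_oversized_response(hits, from_=20, size=10)
print([hit["_id"] for hit in page])
```

The cost grows with page depth: the deeper the page, the more hits are fetched only to be dropped.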

martin-gaievski added the Enhancements and untriaged labels on Sep 4, 2023
navneet1v added the backlog label and removed the untriaged label on Sep 6, 2023
@ankitas3

ankitas3 commented Apr 1, 2024

When can we expect this in OpenSearch?

@jackh-ncl

@martin-gaievski @ankitas3 Just double-checking, but from and size already seem to be working for me with hybrid search on OpenSearch 2.11. Has it been implemented since this issue was raised?

@martin-gaievski
Member Author

@jackh-ncl It has not been implemented yet. With the existing code it may work, but not in every scenario, and when it does work, it is not optimal.

@jackh-ncl

Aha thank you for confirming, yeah I have now noticed that I seem to consistently end up with a total hits value of 5 * the size parameter with the query I'm running.

@benmcginnis

benmcginnis commented Apr 26, 2024

any chance we can get this in 2.14 or the version after that?

@brandon-carag

Is anyone from OpenSearch able to provide any broad details about when pagination will be made available?

@qmauret

qmauret commented May 20, 2024

+1 for this feature request

@mkerimyilmaz

@martin-gaievski Can you please clarify why it doesn't work optimally?

@jackh-ncl It has not been implemented yet. With the existing code it may work, but not in every scenario, and when it does work, it is not optimal.

@JPSoteloSilva

@martin-gaievski Any idea on when this feature will be available? Thanks

@brandon-carag

Hi @vamshin and @martin-gaievski,

Just to briefly restate the case here, pagination is a critical feature for many users adopting hybrid search. There has been significant community interest in this thread (at least 20 distinct users), which also points to strong user demand.

I'm sure there are several competing dev priorities here, but this seems like a core item that shouldn't fall by the wayside. At the very least, can we get a rough timeline? I suspect there are many people on this thread whose downstream roadmaps depend on what happens here.

Thanks in advance

@vamshin
Member

vamshin commented Jul 16, 2024

@brandon-carag Thank you for your interest in pagination functionality. We understand the importance of this feature, and it's on our roadmap. However, at the moment, we're focusing our efforts on enhancing the sorting and explain (raw scores for debugging) capabilities.

While we don't have a definite timeline for implementing pagination, we estimate it could be available towards the end of the year. Please note that this is a rough estimate, and the actual timeline may vary depending on our priorities and resource availability.

We're always open to collaboration and would be delighted if someone from the community is willing to contribute to this feature. If you or anyone else is interested in contributing, please feel free to reach out to us. We can provide guidance and support to ensure a smooth integration of the pagination functionality.

As a workaround, is it possible to use a "size" parameter with a large value? I think we can get up to 10K results.
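
To put the 10K ceiling in terms of pages, here is a small sketch. It assumes the 10K figure corresponds to the default `index.max_result_window` of 10,000 (the comment does not name the setting), and both function names are hypothetical.

```python
def fits_in_window(from_: int, size: int, max_result_window: int = 10_000) -> bool:
    """Check that a from/size request stays within the result window."""
    return from_ >= 0 and size >= 0 and from_ + size <= max_result_window


def full_pages_in_window(page_size: int, max_result_window: int = 10_000) -> int:
    """Number of full pages reachable before from + size exceeds the window."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return max_result_window // page_size


print(full_pages_in_window(20))    # pages available at 20 results per page
print(fits_in_window(9_990, 10))   # last request that still fits
print(fits_in_window(9_991, 10))   # one hit past the window
```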

@brandon-carag

brandon-carag commented Jul 16, 2024

@vamshin Thanks for the prompt response. That's helpful info, and the rough timeline you mentioned is useful to know. I believe Martin mentioned above, "It has not been implemented yet. With the existing code it may work, but not in every scenario, and when it does work, it is not optimal." I'm not exactly clear on where the pagination logic breaks down in the existing implementation.

It seems like specifying a large size and processing client-side won't scale even for a relatively small index, particularly if OpenSearch is being used to service API requests and apply pagination. As such, I'm hoping this feature can emerge sooner rather than later.

@sonic182

Has anyone tried the scroll API as an alternative? Does it work?

@brandon-carag

My understanding is that the scroll API won't solve this issue. Documentation states, "Because search contexts consume a lot of memory, we suggest you don’t use the scroll operation for frequent user queries. Instead, use the sort parameter with the search_after parameter to scroll responses for user queries." https://opensearch.org/docs/latest/api-reference/scroll/
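
For reference, the `sort` plus `search_after` pattern that the quoted documentation recommends looks roughly like this for a standard, non-hybrid query. Whether hybrid queries accept it is not established in this thread. This is a sketch only: the helper names and the sort fields `published_at` and `doc_id` are hypothetical.

```python
from typing import Any, Optional


def build_search_after_request(
    query: dict[str, Any],
    sort: list[dict[str, str]],
    size: int,
    last_sort_values: Optional[list[Any]] = None,
) -> dict[str, Any]:
    """Build one page request; pass the previous page's last sort values to continue."""
    body: dict[str, Any] = {"size": size, "query": query, "sort": sort}
    if last_sort_values is not None:
        body["search_after"] = last_sort_values
    return body


def next_cursor(response: dict[str, Any]) -> Optional[list[Any]]:
    """Read the sort values of the last hit, or None when the page is empty."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None


# Hypothetical sort: newest first, with a unique field as tie-breaker.
sort = [{"published_at": "desc"}, {"doc_id": "asc"}]
first = build_search_after_request({"match": {"text": "hurricane"}}, sort, size=20)

# Stand-in for the response to `first`.
fake_response = {"hits": {"hits": [{"_id": "7", "sort": [1725580800000, "7"]}]}}
second = build_search_after_request(
    {"match": {"text": "hurricane"}}, sort, size=20,
    last_sort_values=next_cursor(fake_response),
)
print(second["search_after"])
```

Unlike from/size, this pattern only moves forward from the last seen hit, so it cannot jump straight to an arbitrary page number.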

github-project-automation (bot) moved this to Upcoming release (TBD) in OpenSearch Project Roadmap on Aug 30, 2024
@Romasato

Romasato commented Sep 6, 2024

While we wait for pagination support in Hybrid queries, has anyone found some other way (even if it takes 2+ separate queries) to normalize results of neural+lexical query combo?

Really struggling to figure out the best way to combine semantic with lexical search, as the scores of neural queries are in [0..1] and don't influence the results much, if at all.

If someone could point me in the right direction, I would really appreciate it.
Thank you
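
One generic way to put lexical and neural scores on a common scale with two separate queries is min-max normalization followed by a weighted mean. The sketch below illustrates that general technique only; it is not taken from this thread or from OpenSearch's code, and the function names, weights, and scores are made up.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant list maps every score to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def combine_scores(
    lexical: dict[str, float], neural: dict[str, float], lexical_weight: float = 0.5
) -> list[tuple[str, float]]:
    """Weighted mean of normalized scores; a doc missing from one list scores 0 there."""
    lex = min_max_normalize(lexical)
    neu = min_max_normalize(neural)
    combined = {
        doc_id: lexical_weight * lex.get(doc_id, 0.0)
        + (1.0 - lexical_weight) * neu.get(doc_id, 0.0)
        for doc_id in lex.keys() | neu.keys()
    }
    # Highest combined score first; ties broken by document ID for stable output.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores: BM25 values are unbounded, neural values lie in [0, 1].
lexical_scores = {"a": 12.0, "b": 8.0, "c": 4.0}
neural_scores = {"b": 0.9, "c": 0.7, "d": 0.5}
ranked = combine_scores(lexical_scores, neural_scores)
print([doc_id for doc_id, _ in ranked])
```

After normalization both score lists span the same range, so the neural scores are no longer swamped by larger BM25 values.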

@safakkbilici

Is there any update on this feature? It’s currently listed in the 'Upcoming release (TBD)' section, but I’d like to know which version it will be released under.

vamshin added the v2.19.0 and Roadmap:Vector Database/GenAI labels and removed the hybrid search label on Sep 24, 2024
vamshin moved this from Backlog (Hot) to 2.19.0 in Vector Search RoadMap on Sep 30, 2024
@heemin32
Collaborator

heemin32 commented Oct 9, 2024

@brandon-carag

It seems like specifying a large size and processing client-side won't scale even for a relatively small index, particularly if OpenSearch is being used to service API requests and apply pagination. As such, I'm hoping this feature can emerge sooner rather than later.

Could you share some info on number of documents per page, maximum number of pages, sample query, and acceptable latency?

When you say specifying a large size and processing client-side won't scale, is the issue on the OpenSearch side or the client side?

Could you (1) retrieve only the document IDs and cache them on the client side, and (2) get the documents for each page using those IDs?
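
The two-step approach suggested here can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, the first request disables `_source` so only IDs and scores come back, and the second uses an `ids` query. Because an `ids` query does not return documents in the cached ranking, the client restores the order itself.

```python
from typing import Any


def build_id_only_request(query: dict[str, Any], max_hits: int) -> dict[str, Any]:
    """Step 1: fetch the ranked hits without document bodies."""
    return {"size": max_hits, "_source": False, "query": query}


def page_ids(cached_ids: list[str], page: int, page_size: int) -> list[str]:
    """Slice the cached ranking for a zero-based page number."""
    start = page * page_size
    return cached_ids[start : start + page_size]


def build_page_fetch_request(ids: list[str]) -> dict[str, Any]:
    """Step 2: fetch the documents for one page by ID."""
    return {"size": len(ids), "query": {"ids": {"values": ids}}}


def restore_rank_order(docs: list[dict[str, Any]], ids: list[str]) -> list[dict[str, Any]]:
    """Reorder fetched documents to match the cached ranking."""
    by_id = {doc["_id"]: doc for doc in docs}
    return [by_id[doc_id] for doc_id in ids if doc_id in by_id]


# Stand-in for the IDs cached from step 1 (45 ranked hits).
cached_ids = [str(i) for i in range(45)]
ids = page_ids(cached_ids, page=2, page_size=20)
request = build_page_fetch_request(ids)

# Stand-in for the step 2 response, returned in arbitrary order.
fetched = [{"_id": doc_id} for doc_id in reversed(ids)]
print([doc["_id"] for doc in restore_rank_order(fetched, ids)])
```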

@brandon-carag

Heemin, we appreciate you reaching out for our feedback. We use OpenSearch to power both our web UI and API responses. We limit the max results to 10k documents with a max per page size of 2k. However, our default response size is 20 results with the ability to paginate up to 50 pages (page size is not a requirement). This allows us to keep the typical query responsive while supporting use cases that need larger data sets.

We'd like to move from our current use of the standard lexical BM25 search to hybrid search in order to improve the quality of our responses. Without pagination, this transition is more difficult as it would require our API users to make changes to their requests in order to maintain the same behavior.

While we could implement a cache and retrieval mechanism outside of OpenSearch as you mentioned, it would add a lot of infrastructure overhead as we serve a very diverse set of queries with varying response sizes and a long tail.

Our document corpus is roughly 1 million documents ranging from 1 to ~400 pages. Latency for large responses and deep pagination is less of a concern, as these are generally automated API requests in our system. Additionally, new documents are only indexed a few times a day, so we are also less affected by inconsistent ordering due to index changes (we are aware the OpenSearch docs warn about results not being frozen in time).

A typical query might look like:
UI generated: https://www.federalregister.gov/api/v1/documents.json?per_page=20&conditions%5Bterm%5D=hurricane
API: https://www.federalregister.gov/api/v1/documents.json?per_page=1000&conditions%5Bterm%5D=hurricane

We're happy to answer any additional questions as needed. Thanks!

@heemin32
Collaborator

@brandon-carag Thanks for the reply. Just to confirm: when you say latency is less of a concern, do you mean it is okay for your use case even if the latency of the first page is the same as the latency of the last page?

@brandon-carag

@heemin32 Sure--could you give us a very rough sense of what the latency difference might be for BM25-based pagination vs. hybrid-pagination? Additionally, if you could clarify a bit whether that would mean we'd see a substantial increase in latency for the first page of results for queries returning larger sets of results (eg > 10k) that would be helpful. If the overall latency increase wasn't substantial that would probably be fine for our use case. Thanks!

@heemin32
Collaborator

heemin32 commented Oct 10, 2024

@brandon-carag Regardless of which page it is, the latency could be the same as loading the last page. For example, with your BM25-based pagination, the latency for the first page could be the same as the latency for the last page.

@vibrantvarun
Member

Please check out the RFC on pagination 933

Projects
Status: Upcoming release (TBD)
Status: New
Status: 2.19.0
Development

No branches or pull requests