[FEATURE] Implement pagination for Hybrid Search #280
Comments
When can we expect this in OpenSearch? |
@martin-gaievski @ankitas3 Just double checking, but |
@jackh-ncl It has not been implemented yet. With the existing code it may work, but not for every scenario, and when it does work it is not optimal. |
Aha thank you for confirming, yeah I have now noticed that I seem to consistently end up with a total hits value of 5 * the size parameter with the query I'm running. |
any chance we can get this in 2.14 or the version after that? |
Is anyone from opensearch able to provide any broad details around when pagination will be made available? |
+1 for this feature request |
@martin-gaievski Can you please clarify why it doesn't work optimally? |
@martin-gaievski Any idea on when this feature will be available? Thanks |
Hi @vamshin and @martin-gaievski, Just to briefly restate the case here, pagination is a critical feature for many users adopting hybrid search. There has been significant community interest in this thread (at least 20 distinct users), which also points to strong user demand. I'm sure there are several competing dev priorities here, but this seems like a core item that shouldn't fall by the wayside. At the very least, can we get a very rough approximate timeline here? I suspect there are many people on this thread whose downstream roadmaps are reliant on what happens here. Thanks in advance |
@brandon-carag Thank you for your interest in pagination functionality. We understand the importance of this feature, and it's on our roadmap. However, at the moment, we're focusing our efforts on enhancing the sorting and explain (raw scores for debugging) capabilities. While we don't have a definite timeline for implementing pagination, we estimate it could be available towards the end of the year. Please note that this is a rough estimate, and the actual timeline may vary depending on our priorities and resource availability. We're always open to collaboration and would be delighted if someone from the community is willing to contribute to this feature. If you or anyone else is interested in contributing, please feel free to reach out to us. We can provide guidance and support to ensure a smooth integration of the pagination functionality. As a workaround, is it possible to use a "size" parameter with a large value? I think we can get up to 10K results. |
@vamshin Thanks for the prompt response--that's helpful info, the rough timeline you mentioned is useful to know. I believe martin mentioned above "it has not been implemented yet. With existing code it may work, but not for every scenario and when it works it does it not in an optimal way." I'm not exactly clear on the boundaries where the existing pagination logic breaks down in the existing implementation. For even a relatively small corpus, it seems like specifying a large |
Has anyone tried the scroll API as an alternative? Does it work? |
My understanding is that the scroll API won't solve this issue. Documentation states, "Because search contexts consume a lot of memory, we suggest you don’t use the scroll operation for frequent user queries. Instead, use the sort parameter with the search_after parameter to scroll responses for user queries." https://opensearch.org/docs/latest/api-reference/scroll/ |
While we wait for pagination support in Hybrid queries, has anyone found some other way (even if it takes 2+ separate queries) to normalize the results of a neural+lexical query combo? I'm really struggling to figure out the best way to combine semantic with lexical search, as the scores of neural queries are in [0..1] and don't influence the results much, if at all. If someone could point me in the right direction, I would really appreciate it. |
Is there any update on this feature? It’s currently listed in the 'Upcoming release (TBD)' section, but I’d like to know which version it will be released under. |
Could you share some info on the number of documents per page, the maximum number of pages, a sample query, and acceptable latency? When you say, Can you do something like: 1. retrieve only the document IDs and cache them on the client side, 2. get the documents for each page using those doc IDs? |
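The two-step idea above (cache document IDs on the client, then load each page by ID) could be sketched roughly as follows. This is only an illustration under assumptions: `ids_for_page`, `load_page`, and the `fetch_by_ids` callable are hypothetical names, and how the IDs are retrieved and how documents are fetched by ID is left to the caller.

```python
def ids_for_page(cached_ids, page, page_size):
    """Return the slice of cached document IDs for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    return cached_ids[start:start + page_size]


def load_page(cached_ids, page, page_size, fetch_by_ids):
    """Load one page of documents.

    fetch_by_ids is a hypothetical callable supplied by the caller; it takes
    a list of IDs and returns the matching documents in the same order.
    """
    ids = ids_for_page(cached_ids, page, page_size)
    # Skip the round trip entirely when the page is past the end of the cache.
    return fetch_by_ids(ids) if ids else []
```

The ranking is computed once, when the IDs are cached, so every page reflects the same ordering; the cost is that the client has to hold and expire that cache itself.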
Heemin, we appreciate you reaching out for our feedback. We use OpenSearch to power both our web UI and API responses. We limit the max results to 10k documents with a max per page size of 2k. However, our default response size is 20 results with the ability to paginate up to 50 pages (page size is not a requirement). This allows us to keep the typical query responsive while supporting use cases that need larger data sets. We'd like to move from our current use of the standard lexical BM25 search to hybrid search in order to improve the quality of our responses. Without pagination, this transition is more difficult as it would require our API users to make changes to their requests in order to maintain the same behavior. While we could implement a cache and retrieval mechanism outside of OpenSearch as you mentioned, it would add a lot of infrastructure overhead as we serve a very diverse set of queries with varying response sizes and a long tail. Our document corpus is roughly 1 million documents ranging from 1 to ~400 pages. Large response size/pagination latency is less of a concern as these are generally automated API requests in our system. Additionally new documents are only indexed a few times a day so we are also less affected by inconsistent ordering due to index changes (we are aware the OpenSearch docs warn about results not being frozen in time). A typical query might look like: We're happy to answer any additional questions as needed. Thanks! |
@brandon-carag Thanks for the reply. Just to confirm with |
@heemin32 Sure--could you give us a very rough sense of what the latency difference might be for BM25-based pagination vs. hybrid-pagination? Additionally, if you could clarify a bit whether that would mean we'd see a substantial increase in latency for the first page of results for queries returning larger sets of results (eg > 10k) that would be helpful. If the overall latency increase wasn't substantial that would probably be fine for our use case. Thanks! |
@brandon-carag Regardless of which page it is, the latency could be the same as loading the last page. For example, with your BM25-based pagination, the latency for the first page could be the same as the latency for the last page. |
Please check out the RFC on pagination: 933 |
Is your feature request related to a problem?
The current implementation of Hybrid search doesn't support pagination, meaning all results are returned "at once". Pagination is standard for many queries in OpenSearch, and Hybrid Search is expected to support it: https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/.
What solution would you like?
It should be possible to define "from" and "size" as part of a Hybrid query request, and the results should contain up to "size" records, starting from the "from" position. Standard syntax should be fine in this case:
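In standard OpenSearch pagination, `from` is the zero-based offset of the first hit and `size` is the maximum number of hits to return. As a rough sketch of what the requested syntax might look like, the helper below places those two parameters next to a `hybrid` query. The field names (`text`, `passage_embedding`), the `k` value, and the model ID are placeholders chosen for illustration, and the body shows the requested behaviour, not something the current implementation is confirmed to honour.

```python
import json


def build_paginated_hybrid_body(query_text, from_, size, model_id):
    """Build a search body that puts standard from/size next to a hybrid query."""
    if from_ < 0 or size < 0:
        raise ValueError("from_ and size must be non-negative")
    return {
        "from": from_,  # zero-based offset of the first hit to return
        "size": size,   # maximum number of hits to return
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical sub-query; "text" is a placeholder field name.
                    {"match": {"text": {"query": query_text}}},
                    # Neural sub-query; field name, k, and model_id are placeholders.
                    {
                        "neural": {
                            "passage_embedding": {
                                "query_text": query_text,
                                "model_id": model_id,
                                "k": 100,
                            }
                        }
                    },
                ]
            }
        },
    }


if __name__ == "__main__":
    body = build_paginated_hybrid_body("wild west", from_=20, size=10, model_id="<model-id>")
    print(json.dumps(body, indent=2))
```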
What alternatives have you considered?
It's possible to set a higher size and then discard the first X elements, but that requires extra processing logic on the client side and is not optimal.
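A minimal sketch of that client-side workaround, assuming the hits arrive as a plain list already in ranked order. `overfetch_size` and `slice_page` are hypothetical helper names, and the 10,000 cap is an assumption based on the "10K results" figure mentioned in the comments.

```python
MAX_RESULT_WINDOW = 10_000  # assumed cap, taken from the "10K results" remark in the thread


def overfetch_size(from_, size, max_window=MAX_RESULT_WINDOW):
    """Return the `size` to request so that the wanted page is included."""
    if from_ < 0 or size < 0:
        raise ValueError("from_ and size must be non-negative")
    needed = from_ + size
    if needed > max_window:
        raise ValueError("requested page lies beyond the result window")
    return needed


def slice_page(hits, from_, size):
    """Discard the first `from_` hits and keep at most `size` of the rest."""
    return hits[from_:from_ + size]
```

For example, to show hits 20 to 29, the client would request `overfetch_size(20, 10)` results (30 in total) and pass the returned list through `slice_page(hits, 20, 10)`. Every page re-runs the full query and transfers all preceding hits, which is the overhead this issue is asking to avoid.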