Optimising sorted scroll requests #23022
Comments
I think one way that we could make it work in the general case would be to start collecting with an open-ended range filter, and every X collected docs (X being a multiple of
In the worst case, where the index is sorted in the reverse order, this would make things slower than they are today, but in the average case (or in the best case, where the index is sorted in the same order) I think this could make things much faster?
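A conceptual sketch of how that could work, assuming (since the comment is cut off above) that "every X collected docs" the open-ended range filter is tightened to the worst value currently held in the top-N heap so that non-competitive docs can be skipped. This is illustrative Python, not the actual Lucene collector code, and all names are made up:

```python
import heapq

def collect_top_n(doc_values, n, update_every=1024):
    """doc_values: iterable of (doc_id, sort_value) pairs; returns the top n
    docs by ascending sort_value, periodically tightening a range bound."""
    heap = []            # max-heap via negated values: worst kept value on top
    upper_bound = None   # the range filter starts open-ended
    collected = 0
    for doc_id, value in doc_values:
        # Once a bound exists, docs that cannot compete are skipped entirely.
        if upper_bound is not None and value >= upper_bound:
            continue
        if len(heap) < n:
            heapq.heappush(heap, (-value, doc_id))
        elif -heap[0][0] > value:
            heapq.heapreplace(heap, (-value, doc_id))
        collected += 1
        # Every X collected docs, tighten the bound to the current worst
        # competitive value: with an index sorted in the same order this
        # quickly excludes almost everything that follows, while with a
        # reverse-sorted index it never helps and only adds overhead.
        if len(heap) == n and collected % update_every == 0:
            upper_bound = -heap[0][0]
    return sorted((-neg, doc_id) for neg, doc_id in heap)
```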
@elastic/es-search-aggs
Hi. Is this issue still open?
Yes. We have made progress for queries sorted by score when we introduced block-max WAND and we are about to make progress for queries sorted by field as well via #39770.
This issue is effectively implemented when sorting on
The performance of sorted scroll requests can be dominated by the time it takes to sort all documents on each tranche of hits. This can partially be amortised by increasing the `size` of the scroll request, but that strategy soon starts to fail for other reasons. Ultimately, the more documents you have, the longer it takes to sort them.

When sorting by e.g. date, it can be much more efficient to break a single scroll request up into chunks, so that each scroll request deals with a subset of docs within a certain date range. Anecdotal evidence on an index of 50M docs reports an improvement from 7h to 10 mins!
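For reference, a minimal sketch of that manual workaround against the plain REST API: one big sorted scroll is split into several smaller scrolls that each cover one date range, so each request only has to sort the docs inside its own range. The index name, field name, and month boundaries are placeholders, not anything from the issue:

```python
import requests

ES = "http://localhost:9200"
INDEX = "my-index"      # placeholder index name
FIELD = "@timestamp"    # placeholder date field used for sorting

def scroll_chunk(gte, lt, size=1000):
    """Scroll through the docs whose FIELD falls in [gte, lt), sorted ascending."""
    body = {
        "size": size,
        "sort": [{FIELD: "asc"}],
        "query": {"range": {FIELD: {"gte": gte, "lt": lt}}},
    }
    resp = requests.post(f"{ES}/{INDEX}/_search",
                         params={"scroll": "1m"}, json=body).json()
    scroll_id, hits = resp["_scroll_id"], resp["hits"]["hits"]
    while hits:
        yield from hits
        resp = requests.post(f"{ES}/_search/scroll",
                             json={"scroll": "1m", "scroll_id": scroll_id}).json()
        scroll_id, hits = resp["_scroll_id"], resp["hits"]["hits"]
    requests.delete(f"{ES}/_search/scroll", json={"scroll_id": [scroll_id]})

# Walk the data month by month instead of in one huge sorted scroll.
for gte, lt in [("2017-01-01", "2017-02-01"), ("2017-02-01", "2017-03-01")]:
    for hit in scroll_chunk(gte, lt):
        pass  # process hit
```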
It would be nice to be able to automate this internally within a single scroll request. The trickiest part is figuring out how big a chunk should be, given that data can be non-uniform. Simply asking the user wouldn't be sufficient: they might set a chunk of 1 hour, but an hour of missing data would simply return no results, falsely indicating the end of the scroll request.
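One way to avoid mistaking an empty chunk for the end of the data, sketched here purely as a client-side workaround rather than anything Elasticsearch does internally: fetch the min and max of the sort field first with an aggregation, then walk fixed-width windows between those two bounds, so a window that happens to contain no documents is simply skipped. This reuses the placeholder names and the `scroll_chunk` helper from the previous sketch:

```python
def field_bounds():
    """Min/max of FIELD as epoch milliseconds, via a min/max aggregation."""
    body = {
        "size": 0,
        "aggs": {
            "lo": {"min": {"field": FIELD}},
            "hi": {"max": {"field": FIELD}},
        },
    }
    aggs = requests.post(f"{ES}/{INDEX}/_search", json=body).json()["aggregations"]
    return aggs["lo"]["value"], aggs["hi"]["value"]

lo, hi = field_bounds()
window = 3_600_000          # one hour in epoch millis; picking this is the hard part
start = lo
while start <= hi:          # stop at the real end of the data,
    end = start + window    # not at the first window that returns nothing
    for hit in scroll_chunk(start, end):
        pass  # process hit
    start = end
```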
Here are a few possibilities:
- Add a range filter with `gt` set to the last sort value already returned, but not `lt` - the deeper you get, the fewer documents you would match.
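For that possibility, a rough sketch of what each successive tranche's request body could look like, reusing the placeholder field name from the earlier sketches; `last_seen` would be the sort value of the final hit returned by the previous tranche:

```python
def next_tranche_query(last_seen, field="@timestamp", size=1000):
    """Hypothetical helper: lower-bound-only filter, so each tranche deeper
    into the scroll matches fewer documents than the one before it."""
    return {
        "size": size,
        "sort": [{field: "asc"}],
        "query": {"range": {field: {"gt": last_seen}}},  # "gt" only, no "lt"
    }
```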