Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider range collection mode for aggs #1905

Closed
PSeitz opened this issue Feb 24, 2023 · 3 comments
Closed

Consider range collection mode for aggs #1905

PSeitz opened this issue Feb 24, 2023 · 3 comments

Comments

@PSeitz
Copy link
Contributor

PSeitz commented Feb 24, 2023

Problem Outline

Pseudo Code current state, this is cause considerable overhead

for docid in 0..5_000_000{
    collector.collect(docid);
}

Range Collection

We could optimize aggregations via collecting range of documents for use cases where aggregation is done over

  1. All documents
  2. A block of consecutive documents (e.g. time range + time sorted index)
pub trait SegmentCollector: 'static {
    ...
    fn collect_range(&mut self, doc: RangeInclusive<DocId>);
}

Enabled Optimizations

  1. If we know we aggregate over all values, we can preallocate or reserve the correct capacity on top level bucket aggs
  2. We can bypass the multi-value/optional index and scan the fast field values directly

Downside

SegmentCollector usually includes a score. This a rather special use case. collect_range could be completely optional like this

pub trait SegmentCollector: 'static {
    ...
    /// Only allowed to call collect_range, if `can_collect_range` returns true
    fn collect_range(&mut self, doc: RangeInclusive<DocId>){}
    fn can_collect_range(&self){
        false
    }
}

Alternative

The aggregations caching layer (caches blocks of docids) could recognize consecutive docids and pass that as metadata, maybe increase caching to bigger blocks. In that case we can't preallocate efficiently. That could be done via hints maybe.

@adamreichold
Copy link
Collaborator

adamreichold commented Feb 24, 2023

collect_range could be completely optional like this

Would it be feasible to provide a fallback implementation that implements the inefficient one-document-at-a-time approach instead doing of nothing?

@PSeitz
Copy link
Contributor Author

PSeitz commented Feb 24, 2023

The score is missing in that case, so either a panic or a fallback implementation with a default score

@PSeitz
Copy link
Contributor Author

PSeitz commented Mar 21, 2023

This is covered by #1937 by adding collect_block, which is more flexible than RangeInclusive<DocId>

@PSeitz PSeitz closed this as completed Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants