Consider range collection mode for aggs #1905

PSeitz · 2023-02-24T07:41:29Z

Problem Outline

Pseudocode for the current state, which causes considerable overhead:

for docid in 0..5_000_000{
    collector.collect(docid);
}

Range Collection

We could optimize aggregations by collecting a range of documents in use cases where the aggregation runs over:

All documents
A block of consecutive documents (e.g. time range + time sorted index)

pub trait SegmentCollector: 'static {
    ...
    fn collect_range(&mut self, doc: RangeInclusive<DocId>);
}

Enabled Optimizations

If we know we aggregate over all values, we can preallocate or reserve the correct capacity on top level bucket aggs
We can bypass the multi-value/optional index and scan the fast field values directly
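To illustrate the two points above, here is a minimal sketch, not tantivy code. `DenseColumn` and `BucketAgg` are hypothetical stand-ins, and the sketch assumes a fast field that stores exactly one value per document, so a doc id range maps directly onto a slice of values.

```rust
use std::ops::RangeInclusive;

type DocId = u32;

/// Hypothetical stand-in for a dense fast-field column: one value per document.
struct DenseColumn {
    values: Vec<u64>,
}

/// Hypothetical top-level bucket aggregation that keeps one entry per collected doc.
struct BucketAgg {
    column: DenseColumn,
    collected: Vec<u64>,
}

impl BucketAgg {
    /// Per-document path: one lookup and one push per call.
    fn collect(&mut self, doc: DocId) {
        self.collected.push(self.column.values[doc as usize]);
    }

    /// Range path: reserve capacity once, then copy the contiguous slice.
    fn collect_range(&mut self, docs: RangeInclusive<DocId>) {
        if docs.is_empty() {
            return;
        }
        let start = *docs.start() as usize;
        let end = *docs.end() as usize;
        // Panics if the range exceeds the column, like the per-document path would.
        let slice = &self.column.values[start..=end];
        self.collected.reserve(slice.len());
        self.collected.extend_from_slice(slice);
    }
}
```

The range path does a single capacity reservation and a single bulk copy, where the per-document path does one lookup and one push per doc id.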

Downside

SegmentCollector usually includes a score, so this is a rather special use case. collect_range could be completely optional, like this:

pub trait SegmentCollector: 'static {
    ...
    /// Only allowed to call `collect_range` if `can_collect_range` returns true.
    fn collect_range(&mut self, doc: RangeInclusive<DocId>) {}
    /// Collectors that support range collection override this to return true.
    fn can_collect_range(&self) -> bool {
        false
    }
}

Alternative

The aggregation caching layer (which caches blocks of docids) could recognize consecutive docids and pass that along as metadata, possibly with caching increased to bigger blocks. In that case we can't preallocate efficiently, although that could maybe be done via hints.
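A minimal sketch of how a cached block could be checked for consecutive docids. The helper name `as_consecutive_range` is hypothetical and not part of tantivy; it walks the block once and returns the covering range only when every id is exactly one greater than its predecessor.

```rust
use std::ops::RangeInclusive;

type DocId = u32;

/// Returns the covering range if `docs` is a non-empty run of consecutive,
/// ascending doc ids; otherwise `None`.
fn as_consecutive_range(docs: &[DocId]) -> Option<RangeInclusive<DocId>> {
    let first = *docs.first()?;
    let last = *docs.last()?;
    // Each id must be exactly one greater than the previous one.
    let consecutive = docs
        .windows(2)
        .all(|pair| pair[0].checked_add(1) == Some(pair[1]));
    consecutive.then(|| first..=last)
}
```

If blocks are already guaranteed to be strictly ascending, comparing `last - first + 1` against the block length would give the same answer without the scan.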


adamreichold · 2023-02-24T13:57:35Z

collect_range could be completely optional like this

Would it be feasible to provide a fallback implementation that implements the inefficient one-document-at-a-time approach instead of doing nothing?

PSeitz · 2023-02-24T14:58:03Z

The score is missing in that case, so the options are either a panic or a fallback implementation with a default score.
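A minimal sketch of the fallback option. The trait below is a simplified stand-in for the one discussed here, not the actual tantivy trait; the default score of `0.0` and `CountCollector` are assumptions made for illustration.

```rust
use std::ops::RangeInclusive;

type DocId = u32;
type Score = f32;

/// Simplified stand-in for the collector trait discussed in this thread.
trait SegmentCollector {
    fn collect(&mut self, doc: DocId, score: Score);

    /// Fallback: visit the range one document at a time with a placeholder
    /// score (0.0 is an assumed default). Collectors with a faster path override this.
    fn collect_range(&mut self, docs: RangeInclusive<DocId>) {
        for doc in docs {
            self.collect(doc, 0.0);
        }
    }
}

/// Hypothetical collector that only counts documents and relies on the fallback.
struct CountCollector {
    count: u64,
}

impl SegmentCollector for CountCollector {
    fn collect(&mut self, _doc: DocId, _score: Score) {
        self.count += 1;
    }
}
```

With a default like this, `can_collect_range` would not be needed for correctness, but a collector that depends on real scores would silently see the placeholder value.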

PSeitz · 2023-03-21T14:12:19Z

This is covered by #1937 by adding collect_block, which is more flexible than RangeInclusive<DocId>

PSeitz closed this as completed Mar 21, 2023
