seek_exact + cost based intersection #2538

PSeitz · 2024-11-07T09:09:03Z

Add seek_exact and cost to DocSet

Adds seek_exact and cost to DocSet for a more efficient intersection.

Unlike seek, seek_exact does not require the DocSet to advance to the next hit, if the target does not exist.

cost allows to address the different DocSet types and their cost
model and is used to determine the DocSet that drives the intersection.
E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit.
They both have a higher cost than their size_hint would suggest.

Improved size_hint heuristic

Improves size_hint estimation for intersection and union, by having a heuristic based on random distribution with a co-location factor.

Other

Refactor range query benchmark.

Closes #2531

Future Work

Implement seek_exact for BufferedUnionScorer and RangeDocSet (fast field range queries)

Evaluate replacing seek with seek_exact to reduce code complexity

Performance

Performance looks very good:

Especially the two phrase critic query is now properly executed:

Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection. Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist. `cost` allows to address the different DocSet types and their cost model and is used to determine the DocSet that drives the intersection. E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit. They both have a higher cost than their size_hint would suggest. Improves `size_hint` estimation for intersection and union, by having a estimation based on random distribution with a co-location factor. Refactor range query benchmark. Closes #2531 *Future Work* Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries) Evaluate replacing `seek` with `seek_exact` to reduce code complexity

src/docset.rs

fulmicoton · 2024-11-12T06:28:26Z

src/docset.rs

+    ///
+    /// ## API Behaviour
+    /// If `seek_exact` is returning true, a call to `doc()` has to return target.
+    /// If `seek_exact` is returning false, a call to `doc()` may return the previous doc,


can you clarify that the result should still be a document in the posting list or TERMINATED, and that it could be something before or after the target.

We also need to clarify what happens if the target is lower than the current doc.

actually it seem like your current implementation might return self.doc() that is not in the docset at all.

This is tricky, because right now we are saying that self.doc could return "anything".

This is not exactly true though.
We need to have an accurate definition of what can we done with docset after having called seek_exact once.

I added some comments to clarify the contract

src/docset.rs

fulmicoton · 2024-11-12T06:36:26Z

src/query/all_query.rs

+        self.doc = target;
+        if self.doc >= self.max_doc {
+            self.doc = TERMINATED;
+        }
+        self.doc


Suggested change

self.doc = target;

if self.doc >= self.max_doc {

self.doc = TERMINATED;

}

self.doc

self.doc = if target_doc >= self.max_doc {

TERMINATED

} else {

target

};

self.doc

I think the slightly less nested variant is a little bit more readable

fulmicoton · 2024-11-12T06:41:41Z

src/query/intersection.rs

                    break;
                }
+                // `left.advance().max(right.doc())` yielded a regression in the search game
+                // benchmark It may make sense in certain scenarios though.
+                candidate = left.advance();


if the regression is small I would sleep better with the max(right.doc()).
(There are a bunch of adversarial cases.)

The regression was not small, so it might be better to identify the cases in which max(right.doc()) is useful and have a separate loop for them

i can give you one adversarial case.

intersection between two terms.

The first term matches 0..1_000_000.
The second term is matching docs 1_000_000..10_000_000.

a more like scenario could be
first term matches 0..1_000_000
second term is random.

This kind of problem actually happen in reality when document are ordered in a certain way.
For instance, eBay and Indeed are sorting by location.
We often sort by time, which is correlated in strange ways (for instance you can think of status:ERROR).

Generally I think it would make sense for all queries that don't implement a specialized seek_exact to use the max(right.doc()) loop.
I don't think there's a way in rust to detect this?

For now we could just detect if the right docset is a TermScorer and then use the max(right.doc()) variant.

I created a issue here to add some missing benchmarks:
#2544

How large was the regression here? Do you have benchmarks?

Actually I think the benchmark was bogus

We are not allowed to call max(right.doc) in the general case.

If the candidate is coming from right, we would need to seek_exact on left.

It wasn't covered by a test then but it is now via proptest.

fulmicoton · 2024-11-12T06:45:33Z

src/query/intersection.rs

+    /// Some implementations may choose to advance past the target if beneficial for performance.
+    /// The return value is `true` if the target is in the docset, and `false` otherwise.
+    fn seek_exact(&mut self, target: DocId) -> bool {
+        self.left.seek_exact(target)


so with after calling seek_exact, we do not necessarily end up with self.doc() returning a valid doc. Is this already taking in account today?

Yes, the intersection code does not use self.doc if seek_exact returns true

src/query/intersection.rs

fulmicoton · 2024-11-12T06:54:00Z

src/query/range_query/fast_field_range_doc_set.rs

@@ -136,9 +139,6 @@ impl<T: Send + Sync + PartialOrd + Copy + Debug + 'static> DocSet for RangeDocSe
        if let Some(docid) = self.loaded_docs.next() {
            return docid;
        }
-        if self.next_fetch_start >= self.column.num_docs() {
-            return TERMINATED;


is this not required? can you explain when & why?

I moved it to fetch_block, so it is easier to use

fulmicoton · 2024-11-12T06:55:53Z

src/query/range_query/fast_field_range_doc_set.rs

+        // index currently (will change with an kd-tree)
+        // Since we use SIMD to scan the fast field range query we lower the cost a little bit.
+        // Ideally this would take the fast field codec into account
+        (self.column.num_docs() as f64 * 0.8) as u64


I would have expected something higher than self.column.num_docs()

It depends on the number of actual hits, if the column has many hits, the cost would be higher, otherwise it would be lower.

If it is higher than num_docs, we would never choose ff range query as the driver in a DocSet, when intersecting a term query with a fast field. But it's the faster choice when the term query has a lot of docids and the range query has not.

src/query/size_hint.rs

fulmicoton · 2024-11-12T06:59:32Z

src/query/size_hint.rs

+    // This factor is used to adjust the estimate.
+    let mut co_loc_factor: f64 = 1.3;
+
+    let mut intersection_estimate = match docset_sizes.next() {


nitpick
let Some else is ok here.

With the conversion to f64 I think match is a little more readable here

src/query/size_hint.rs

fulmicoton · 2024-11-12T07:15:53Z

src/docset.rs

@@ -87,6 +106,23 @@ pub trait DocSet: Send {
    /// length of the docset.
    fn size_hint(&self) -> u32;

+    /// Returns a best-effort hint of the


This is not accurate enough to be usable, and this is actually important here.

Is it the cost of calling advance? Is it the cost of consuming the entire docset? Or is it something else?

I added come comments to clarify

Co-authored-by: Paul Masurel <[email protected]>

fulmicoton · 2024-11-14T02:07:54Z

src/docset.rs

+    /// If `seek_exact` is returning true, a call to `doc()` has to return target.
+    /// If `seek_exact` is returning false, a call to `doc()` may return the previous doc,
+    /// which may be lower than target.
+    fn seek_exact(&mut self, target: DocId) -> bool {


let's find a better name

fulmicoton · 2024-12-02T10:54:45Z

src/query/union/buffered_union.rs

+        let is_hit = self
+            .docsets
+            .iter_mut()
+            .all(|docset| docset.seek_into_the_danger_zone(target));


why all? isn't that a union?

also are we ok with the lazy evaluation here?

PSeitz force-pushed the fix_size_hint branch from 8d7bc08 to 7235d4d Compare November 7, 2024 10:55

PSeitz requested a review from fulmicoton November 7, 2024 11:48