-
-
Notifications
You must be signed in to change notification settings - Fork 680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
seek_exact + cost based intersection #2538
base: main
Are you sure you want to change the base?
Conversation
Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection. Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist. `cost` allows to address the different DocSet types and their cost model and is used to determine the DocSet that drives the intersection. E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit. They both have a higher cost than their size_hint would suggest. Improves `size_hint` estimation for intersection and union, by having a estimation based on random distribution with a co-location factor. Refactor range query benchmark. Closes #2531 *Future Work* Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries) Evaluate replacing `seek` with `seek_exact` to reduce code complexity
src/docset.rs
Outdated
/// | ||
/// ## API Behaviour | ||
/// If `seek_exact` is returning true, a call to `doc()` has to return target. | ||
/// If `seek_exact` is returning false, a call to `doc()` may return the previous doc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you clarify that the result should still be a document in the posting list or TERMINATED, and that it could be something before or after the target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to clarify what happens if the target is lower than the current doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually it seem like your current implementation might return self.doc()
that is not in the docset at all.
This is tricky, because right now we are saying that self.doc could return "anything".
This is not exactly true though.
We need to have an accurate definition of what can we done with docset after having called seek_exact once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments to clarify the contract
self.doc = target; | ||
if self.doc >= self.max_doc { | ||
self.doc = TERMINATED; | ||
} | ||
self.doc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.doc = target; | |
if self.doc >= self.max_doc { | |
self.doc = TERMINATED; | |
} | |
self.doc | |
self.doc = if target_doc >= self.max_doc { | |
TERMINATED | |
} else { | |
target | |
}; | |
self.doc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the slightly less nested variant is a little bit more readable
src/query/intersection.rs
Outdated
break; | ||
} | ||
// `left.advance().max(right.doc())` yielded a regression in the search game | ||
// benchmark It may make sense in certain scenarios though. | ||
candidate = left.advance(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the regression is small I would sleep better with the max(right.doc())
.
(There are a bunch of adversarial cases.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regression was not small, so it might be better to identify the cases in which max(right.doc())
is useful and have a separate loop for them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i can give you one adversarial case.
intersection between two terms.
The first term matches 0..1_000_000.
The second term is matching docs 1_000_000..10_000_000.
a more like scenario could be
first term matches 0..1_000_000
second term is random.
This kind of problem actually happen in reality when document are ordered in a certain way.
For instance, eBay and Indeed are sorting by location.
We often sort by time, which is correlated in strange ways (for instance you can think of status:ERROR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I think it would make sense for all queries that don't implement a specialized seek_exact
to use the max(right.doc())
loop.
I don't think there's a way in rust to detect this?
For now we could just detect if the right docset is a TermScorer
and then use the max(right.doc())
variant.
I created a issue here to add some missing benchmarks:
#2544
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How large was the regression here? Do you have benchmarks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think the benchmark was bogus
- We are not allowed to call
max(right.doc)
in the general case. - If the candidate is coming from right, we would need to seek_exact on left.
It wasn't covered by a test then but it is now via proptest.
src/query/intersection.rs
Outdated
/// Some implementations may choose to advance past the target if beneficial for performance. | ||
/// The return value is `true` if the target is in the docset, and `false` otherwise. | ||
fn seek_exact(&mut self, target: DocId) -> bool { | ||
self.left.seek_exact(target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so with after calling seek_exact, we do not necessarily end up with self.doc()
returning a valid doc. Is this already taking in account today?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the intersection code does not use self.doc if seek_exact returns true
@@ -136,9 +139,6 @@ impl<T: Send + Sync + PartialOrd + Copy + Debug + 'static> DocSet for RangeDocSe | |||
if let Some(docid) = self.loaded_docs.next() { | |||
return docid; | |||
} | |||
if self.next_fetch_start >= self.column.num_docs() { | |||
return TERMINATED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this not required? can you explain when & why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to fetch_block
, so it is easier to use
// index currently (will change with an kd-tree) | ||
// Since we use SIMD to scan the fast field range query we lower the cost a little bit. | ||
// Ideally this would take the fast field codec into account | ||
(self.column.num_docs() as f64 * 0.8) as u64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have expected something higher than self.column.num_docs()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the number of actual hits, if the column has many hits, the cost would be higher, otherwise it would be lower.
If it is higher than num_docs, we would never choose ff range query as the driver in a DocSet, when intersecting a term query with a fast field. But it's the faster choice when the term query has a lot of docids and the range query has not.
// This factor is used to adjust the estimate. | ||
let mut co_loc_factor: f64 = 1.3; | ||
|
||
let mut intersection_estimate = match docset_sizes.next() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick
let Some else
is ok here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the conversion to f64 I think match
is a little more readable here
src/docset.rs
Outdated
@@ -87,6 +106,23 @@ pub trait DocSet: Send { | |||
/// length of the docset. | |||
fn size_hint(&self) -> u32; | |||
|
|||
/// Returns a best-effort hint of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not accurate enough to be usable, and this is actually important here.
Is it the cost of calling advance? Is it the cost of consuming the entire docset? Or is it something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added come comments to clarify
Co-authored-by: Paul Masurel <[email protected]>
4f26713
to
eff7353
Compare
eff7353
to
ec6f5b3
Compare
src/docset.rs
Outdated
/// If `seek_exact` is returning true, a call to `doc()` has to return target. | ||
/// If `seek_exact` is returning false, a call to `doc()` may return the previous doc, | ||
/// which may be lower than target. | ||
fn seek_exact(&mut self, target: DocId) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's find a better name
let is_hit = self | ||
.docsets | ||
.iter_mut() | ||
.all(|docset| docset.seek_into_the_danger_zone(target)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why all? isn't that a union?
also are we ok with the lazy evaluation here?
Add seek_exact and cost to DocSet
Adds
seek_exact
andcost
toDocSet
for a more efficient intersection.Unlike
seek
,seek_exact
does not require the DocSet to advance to the next hit, if the target does not exist.cost
allows to address the different DocSet types and their costmodel and is used to determine the DocSet that drives the intersection.
E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit.
They both have a higher cost than their size_hint would suggest.
Improved size_hint heuristic
Improves
size_hint
estimation for intersection and union, by having a heuristic based on random distribution with a co-location factor.Other
Refactor range query benchmark.
Closes #2531
Future Work
Implement
seek_exact
for BufferedUnionScorer and RangeDocSet (fast field range queries)Evaluate replacing
seek
withseek_exact
to reduce code complexityPerformance
Performance looks very good:
Especially the two phrase critic query is now properly executed: