Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seek_exact + cost based intersection #2538

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

seek_exact + cost based intersection #2538

wants to merge 5 commits into from

Conversation

PSeitz
Copy link
Contributor

@PSeitz PSeitz commented Nov 7, 2024

Add seek_exact and cost to DocSet

Adds seek_exact and cost to DocSet for a more efficient intersection.

Unlike seek, seek_exact does not require the DocSet to advance to the next hit, if the target does not exist.

cost allows to address the different DocSet types and their cost
model and is used to determine the DocSet that drives the intersection.
E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit.
They both have a higher cost than their size_hint would suggest.

Improved size_hint heuristic

Improves size_hint estimation for intersection and union, by having a heuristic based on random distribution with a co-location factor.

Other

Refactor range query benchmark.

Closes #2531

Future Work

Implement seek_exact for BufferedUnionScorer and RangeDocSet (fast field range queries)

Evaluate replacing seek with seek_exact to reduce code complexity

Performance

Performance looks very good:

seek_exact perf

Especially the two phrase critic query is now properly executed:

seek_exact perf2

Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection.
Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist.

`cost` allows to address the different DocSet types and their cost
model and is used to determine the DocSet that drives the intersection.
E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit.
They both have a higher cost than their size_hint would suggest.

Improves `size_hint` estimation for intersection and union, by having a
estimation based on random distribution with a co-location factor.

Refactor range query benchmark.

Closes #2531

*Future Work*

Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries)
Evaluate replacing `seek` with `seek_exact` to reduce code complexity
src/docset.rs Outdated Show resolved Hide resolved
src/docset.rs Outdated
///
/// ## API Behaviour
/// If `seek_exact` is returning true, a call to `doc()` has to return target.
/// If `seek_exact` is returning false, a call to `doc()` may return the previous doc,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you clarify that the result should still be a document in the posting list or TERMINATED, and that it could be something before or after the target.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to clarify what happens if the target is lower than the current doc.

Copy link
Collaborator

@fulmicoton fulmicoton Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it seem like your current implementation might return self.doc() that is not in the docset at all.

This is tricky, because right now we are saying that self.doc could return "anything".

This is not exactly true though.
We need to have an accurate definition of what can we done with docset after having called seek_exact once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments to clarify the contract

Comment on lines +63 to +67
self.doc = target;
if self.doc >= self.max_doc {
self.doc = TERMINATED;
}
self.doc
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.doc = target;
if self.doc >= self.max_doc {
self.doc = TERMINATED;
}
self.doc
self.doc = if target_doc >= self.max_doc {
TERMINATED
} else {
target
};
self.doc

Copy link
Contributor Author

@PSeitz PSeitz Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the slightly less nested variant is a little bit more readable

break;
}
// `left.advance().max(right.doc())` yielded a regression in the search game
// benchmark It may make sense in certain scenarios though.
candidate = left.advance();
Copy link
Collaborator

@fulmicoton fulmicoton Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the regression is small I would sleep better with the max(right.doc()).
(There are a bunch of adversarial cases.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regression was not small, so it might be better to identify the cases in which max(right.doc()) is useful and have a separate loop for them

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can give you one adversarial case.

intersection between two terms.

The first term matches 0..1_000_000.
The second term is matching docs 1_000_000..10_000_000.

a more like scenario could be
first term matches 0..1_000_000
second term is random.

This kind of problem actually happen in reality when document are ordered in a certain way.
For instance, eBay and Indeed are sorting by location.
We often sort by time, which is correlated in strange ways (for instance you can think of status:ERROR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I think it would make sense for all queries that don't implement a specialized seek_exact to use the max(right.doc()) loop.
I don't think there's a way in rust to detect this?

For now we could just detect if the right docset is a TermScorer and then use the max(right.doc()) variant.

I created a issue here to add some missing benchmarks:
#2544

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How large was the regression here? Do you have benchmarks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think the benchmark was bogus

  1. We are not allowed to call max(right.doc) in the general case.
  2. If the candidate is coming from right, we would need to seek_exact on left.

It wasn't covered by a test then but it is now via proptest.

/// Some implementations may choose to advance past the target if beneficial for performance.
/// The return value is `true` if the target is in the docset, and `false` otherwise.
fn seek_exact(&mut self, target: DocId) -> bool {
self.left.seek_exact(target)
Copy link
Collaborator

@fulmicoton fulmicoton Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so with after calling seek_exact, we do not necessarily end up with self.doc() returning a valid doc. Is this already taking in account today?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the intersection code does not use self.doc if seek_exact returns true

@@ -136,9 +139,6 @@ impl<T: Send + Sync + PartialOrd + Copy + Debug + 'static> DocSet for RangeDocSe
if let Some(docid) = self.loaded_docs.next() {
return docid;
}
if self.next_fetch_start >= self.column.num_docs() {
return TERMINATED;
Copy link
Collaborator

@fulmicoton fulmicoton Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not required? can you explain when & why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to fetch_block, so it is easier to use

// index currently (will change with an kd-tree)
// Since we use SIMD to scan the fast field range query we lower the cost a little bit.
// Ideally this would take the fast field codec into account
(self.column.num_docs() as f64 * 0.8) as u64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected something higher than self.column.num_docs()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on the number of actual hits, if the column has many hits, the cost would be higher, otherwise it would be lower.

If it is higher than num_docs, we would never choose ff range query as the driver in a DocSet, when intersecting a term query with a fast field. But it's the faster choice when the term query has a lot of docids and the range query has not.

// This factor is used to adjust the estimate.
let mut co_loc_factor: f64 = 1.3;

let mut intersection_estimate = match docset_sizes.next() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick
let Some else is ok here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the conversion to f64 I think match is a little more readable here

src/query/size_hint.rs Outdated Show resolved Hide resolved
src/docset.rs Outdated
@@ -87,6 +106,23 @@ pub trait DocSet: Send {
/// length of the docset.
fn size_hint(&self) -> u32;

/// Returns a best-effort hint of the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not accurate enough to be usable, and this is actually important here.

Is it the cost of calling advance? Is it the cost of consuming the entire docset? Or is it something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added come comments to clarify

@PSeitz PSeitz force-pushed the fix_size_hint branch 2 times, most recently from 4f26713 to eff7353 Compare November 13, 2024 04:25
src/docset.rs Outdated
/// If `seek_exact` is returning true, a call to `doc()` has to return target.
/// If `seek_exact` is returning false, a call to `doc()` may return the previous doc,
/// which may be lower than target.
fn seek_exact(&mut self, target: DocId) -> bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's find a better name

let is_hit = self
.docsets
.iter_mut()
.all(|docset| docset.seek_into_the_danger_zone(target));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why all? isn't that a union?

also are we ok with the lazy evaluation here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Intersection Performance
2 participants