Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals #1097

Merged
merged 8 commits into from
Feb 6, 2020

Conversation

romseygeek
Copy link
Contributor

If you have repeating intervals in an ordered or unordered interval source, you currently get somewhat confusing behaviour:

  • ORDERED(a, a, b) will return an extra interval over just a b if it first matches a a b, meaning that you can get incorrect results if used in a CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on the document a x a b y
  • UNORDERED(a, a) will match on documents that just containg a single a.

This commit adds a RepeatingIntervalsSource that correctly handles repeats within ordered and unordered sources. It also changes the way that gaps are calculated within ordered and unordered sources, by using a new width() method on IntervalIterator. The default implementation just returns end() - start() + 1, but RepeatingIntervalsSource instead returns the sum of the widths of its child iterators. This preserves maxgaps filtering on ordered and unordered sources that contain repeats.

In order to correctly handle matches in this scenario, IntervalsSource#matches now always returns an explicit IntervalsMatchesIterator rather than a plain MatchesIterator, which adds gaps() and width() methods so that submatches can be combined in the same way that subiterators are. Extra checks have been added to checkIntervals() to ensure that the same intervals are returned by both iterator and matches, and a fix to DisjunctionIntervalIterator#matches() is also included - DisjunctionIntervalIterator minimizes its intervals, while MatchesUtils.disjunction does not, so there was a discrepancy between the two methods.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the code becomes a bit difficult to follow but it's great that we can handle repeats correctly. This is another advantage over Span queries so +1 to merge.

@romseygeek romseygeek merged commit 7c1ba1a into apache:master Feb 6, 2020
@romseygeek romseygeek deleted the intervals-gaps branch February 6, 2020 14:44
asfgit pushed a commit that referenced this pull request Feb 6, 2020
…als (#1097)

If you have repeating intervals in an ordered or unordered interval source, you currently 
get somewhat confusing behaviour:

* `ORDERED(a, a, b)` will return an extra interval over just a b if it first matches a a b, meaning
that you can get incorrect results if used in a `CONTAINING` filter - 
`CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match on the document `a x a b y`
* `UNORDERED(a, a)` will match on documents that just containg a single a.

This commit adds a RepeatingIntervalsSource that correctly handles repeats within 
ordered and unordered sources. It also changes the way that gaps are calculated within 
ordered and unordered sources, by using a new width() method on IntervalIterator. The 
default implementation just returns end() - start() + 1, but RepeatingIntervalsSource 
instead returns the sum of the widths of its child iterators. This preserves maxgaps filtering 
on ordered and unordered sources that contain repeats.

In order to correctly handle matches in this scenario, IntervalsSource#matches now always 
returns an explicit IntervalsMatchesIterator rather than a plain MatchesIterator, which adds 
gaps() and width() methods so that submatches can be combined in the same way that 
subiterators are. Extra checks have been added to checkIntervals() to ensure that the same 
intervals are returned by both iterator and matches, and a fix to 
DisjunctionIntervalIterator#matches() is also included - DisjunctionIntervalIterator minimizes 
its intervals, while MatchesUtils.disjunction does not, so there was a discrepancy between 
the two methods.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants