Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize composite aggregation based on index sorting #48399

Merged
merged 26 commits into from
Dec 17, 2019

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Oct 23, 2019

This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases:

  • Multi-valued source, in such case early termination is not possible.
  • missing_bucket is set to true

I retained the commit from #48130 and merged the original idea that @howardhuanghua described in the pr to early terminate even when the sort is reversed in the leading source.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@jimczi
Copy link
Contributor Author

jimczi commented Oct 23, 2019

@howardhuanghua I couldn't push directly to your branch so I opened a new pr that retains your commit. I hope you don't mind if we continue to work here since we both have the right to push modifications. I merged the two approaches that we discussed in the original pr so I think this pr is ready for a review. We still need to update the documentation but I wanted to get your opinion first.

@howardhuanghua
Copy link
Contributor

Hi @jimczi , I am ok and thanks for your merging. It's my pleasure that I could contribute on this optimization with you together. I left a comment about the merged code, please help to check. Thank you.

@jimczi
Copy link
Contributor Author

jimczi commented Oct 31, 2019

Thanks for looking @jpountz and @howardhuanghua . I pushed some commits to address your comments and added a section in the docs to explain the early termination. Could you take another look ?

private void testSearchCase(List<Query> queries,
List<Map<String, List<Object>>> dataset,
Supplier<CompositeAggregationBuilder> create,
Consumer<InternalComposite> verify) throws IOException {
for (Query query : queries) {
executeTestCase(false, query, dataset, create, verify);
executeTestCase(true, query, dataset, create, verify);
executeTestCase(false, false, query, dataset, create, verify);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these two TestCase set useIndexSort as false, then it could not be early terminated. This would make testEarlyTermination() failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testEarlyTermination uses executeTestCase directly and sets useIndexSort to true so I think we're good here. However I found a test bug when the AssertingCodec is used, this codec wraps docvalues and that breaks DocValues#unwrapSingleton contract. So I pushed 1fa763b to suppress this code in the early termination tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even suppressed codecs, it seems still got the chance to use
keyword -> {AssertingDocValuesFormat$AssertingDocValuesProducer@4968}. And testEarlyTermination would failed occasionally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch thanks, I pushed b52d5c7 to force the suppression of the Asserting codec but that means that the test is ignored when the randomization choose this codec. I need to dig a bit more to understand how we can ensure that the test always runs.

Copy link
Contributor

@howardhuanghua howardhuanghua Nov 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jimczi , I have checked that system properties like tests.codec, tests.docvaluesformat and tests.postingsformat default value is random in LuceneTestCase.java, AssertingCodec would be selected.

It seems we could set default Lucene80 codec directly in executeTestCase method as follow to solve the problem:

if (indexSort != null) {
      config.setCodec(TestUtil.getDefaultCodec()); // set codec as default to avoid AssertingCodec
      config.setIndexSort(indexSort);
}


The `composite` aggregation can early terminate the collection under some circumstances:

* If the primary source extracts values from a numeric or a keyword field and the query matches all documents (`match_all` query).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early terminate optimization could also be used for non match_all query?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if the index is sorted, yes. I also wanted to mention that we early terminate if the index is not sorted but I see how it's confusing now. I pushed 1fa763b to reword the paragraph to emphasize on index sorting first.

@jimczi
Copy link
Contributor Author

jimczi commented Nov 12, 2019

I pushed another commit to address the remaining TODOs and the reviews. There is still one TODO left regarding how we could handle missingBucket but the pr is quite big already so I think we can handle it in a follow up.

@howardhuanghua
Copy link
Contributor

Hi @jimczi , LGTM. Thanks for your great efforts to fix these issues.

@howardhuanghua
Copy link
Contributor

Hi @jimczi, would you please help to merge this PR? Or we need further review? Thank you.

@iverase iverase self-requested a review December 9, 2019 14:36
Copy link
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first pass and left some questions. I will be doing a second pass and focus on the tests.

Copy link
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic makes sense to me. lgtm

@jimczi jimczi merged commit 804a504 into master Dec 17, 2019
@jimczi jimczi deleted the composite_sorted_index_optim branch December 17, 2019 13:02
@jimczi
Copy link
Contributor Author

jimczi commented Dec 17, 2019

Thanks @howardhuanghua and sorry for the delay. I merged in master and added you as co-author. I'll now backport to 7.x.

jimczi added a commit that referenced this pull request Dec 17, 2019
Co-authored-by: Daniel Huang <[email protected]>

This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases:
  * Multi-valued source, in such case early termination is not possible.
  * missing_bucket is set to true
@howardhuanghua
Copy link
Contributor

@jimczi It's my pleasure to co-work with you on this optimization. Thanks for for your help.

jimczi added a commit that referenced this pull request Dec 20, 2019
Co-authored-by: Daniel Huang <[email protected]>

This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases:
  * Multi-valued source, in such case early termination is not possible.
  * missing_bucket is set to true
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
Co-authored-by: Daniel Huang <[email protected]>

This is a spinoff of elastic#48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases:
  * Multi-valued source, in such case early termination is not possible.
  * missing_bucket is set to true
@@ -73,6 +86,8 @@
private RoaringDocIdSet.Builder docIdSetBuilder;
private BucketCollector deferredCollectors;

private boolean earlyTerminated;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

提前结束

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants