Subtask as a part of #8509
To Reproduce
This test is failing due to an unexpected result in the query response:
Expected:
Actual:
Specifically, the bg_count is wrong in the concurrent search case. In a significant terms aggregation, bg_count is the number of docs containing the term in the background set, while doc_count is the number of docs containing the term in the foreground set.
At the top level, the background set is the set of docs in the index that have the “class” field, so the top-level bg_count should simply be the number of docs in the index. Code ref:
OpenSearch/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/SignificanceLookup.java
Line 100 in 8eea7b9
Background Set == All documents in the index with “class” field
At the inner level, the bg_count therefore refers to all of the documents in the background set with the given key in the “text” field. See how the index is created here:
OpenSearch/server/src/test/java/org/opensearch/test/search/aggregations/bucket/SharedSignificantTermsTestMethods.java
Lines 94 to 115 in b9f0440
So for key “0” this is 5, and for key “1” this is 4. This value is retrieved by querying the index from the search thread. Code refs:
OpenSearch/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/NumericTermsAggregator.java
Lines 647 to 656 in 8eea7b9
OpenSearch/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
Line 897 in e354201
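To make that querying step concrete, here is a minimal, self-contained Lucene sketch (plain Lucene, not the actual SignificanceLookup or aggregator code) of how a background frequency and superset size can be obtained by counting matching documents. The field name and the 5/4 document split simply mirror the test data described above; the totals are otherwise illustrative.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;

// Sketch only: count docs in the background set (here the whole index) that contain a term.
public class BgCountSketch {
    public static void main(String[] args) throws Exception {
        ByteBuffersDirectory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // 5 docs with text "0" and 4 docs with text "1", mirroring the 5/4 split above.
            for (int i = 0; i < 5; i++) {
                Document doc = new Document();
                doc.add(new StringField("text", "0", Field.Store.NO));
                writer.addDocument(doc);
            }
            for (int i = 0; i < 4; i++) {
                Document doc = new Document();
                doc.add(new StringField("text", "1", Field.Store.NO));
                writer.addDocument(doc);
            }
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // supersetSize ~ size of the background set (all docs in this sketch)
            int supersetSize = searcher.count(new MatchAllDocsQuery());
            // supersetDf, i.e. bg_count ~ docs in the background set containing the term
            int bgCountKey0 = searcher.count(new TermQuery(new Term("text", "0")));
            int bgCountKey1 = searcher.count(new TermQuery(new Term("text", "1")));
            System.out.println("supersetSize=" + supersetSize);    // 9 in this sketch
            System.out.println("bg_count(text=0)=" + bgCountKey0); // 5
            System.out.println("bg_count(text=1)=" + bgCountKey1); // 4
        }
    }
}
```

The real implementation routes these counts through SignificanceLookup, but the underlying counting idea is the same.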
Just to crystallize the meanings of background and foreground sets, let’s break down the example in the Elasticsearch docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_single_set_analysis
Here the outer level bg_count refers to the number of documents with the field crime_type in the background set, which is the entire index. The query is for force == British Transport Police, so doc_count represents the number of documents where the force field matches British Transport Police, also known as our foreground set.
Background Set == Number of crimes in the index
Foreground Set == Number of crimes where force == British Transport Police
Looking at the inner level, bg_count is the number of documents where crime_type == Bicycle theft in the background set, while doc_count is the number of documents matching the same condition in the foreground set.
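As a purely illustrative numeric sketch (the numbers below are made up, not taken from the UK crime dataset in the docs), this is how the response fields map onto the four counts that feed the significance heuristics. The subsetDf/subsetSize names follow the supersetDf/supersetSize naming used in the code references in this issue and are meant as the foreground-side counterparts.

```java
// Made-up numbers, for illustration only: the four inputs behind a "Bicycle theft" bucket.
public class CrimeExampleCounts {
    public static void main(String[] args) {
        long supersetSize = 1_000_000; // background set: all crimes in the index
        long subsetSize = 50_000;      // foreground set: crimes where force == "British Transport Police"
        long supersetDf = 10_000;      // bg_count: bicycle thefts anywhere in the index
        long subsetDf = 3_000;         // doc_count: bicycle thefts reported to British Transport Police

        double foregroundShare = (double) subsetDf / subsetSize;     // 6% of the foreground set
        double backgroundShare = (double) supersetDf / supersetSize; // 1% of the background set
        // "Bicycle theft" scores as significant for this force because its share of the
        // foreground set is much larger than its share of the background set.
        System.out.printf("foreground share = %.2f%%, background share = %.2f%%%n",
                100 * foregroundShare, 100 * backgroundShare);
    }
}
```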
During the buildAggregations phase, the bg_count is gathered per collector. In the non-concurrent search case, that means it only happens once per shard, and the coordinator-level reduce can simply sum the bg_count across all of the shards to get the total. However, in the concurrent search case we have a collector per segment slice, and the shard-level reduce sums these up, so we end up with a bg_count of SEGMENT_SLICE_COUNT * BG_COUNT.
A small caveat: this only happens for slices that actually collect matching docs. In our test example, if a slice does not have any documents with “text” field “0”, then that collector will not contribute to the bg_count miscount.
Here is an example to illustrate this problem:
In a test run, if the buckets look like this
Then there are 5 buckets total, 2 for key “1” and 3 for key “0”, so if you look at the results below:
Expected:
Actual:
We see that key “1” has 2x the expected count while key “0” has 3x.
Single set case for reference:
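Here is a standalone sketch (plain Java, not the actual reduce code; the per-slice doc_count splits are illustrative) of why the slice-level reduce multiplies bg_count by the number of slices that collected the bucket, using the 5/4 background counts and the 3/2 slice split from the test run above:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Each slice that collects a bucket reports the full per-shard bg_count,
// and the shard-level reduce sums those reports.
public class SliceReduceSketch {
    record SliceBucket(String key, long docCount, long bgCount) {}

    public static void main(String[] args) {
        // Three slices collected key "0", two collected key "1"; each slice independently
        // queried the index and got the same full bg_count (5 and 4).
        List<SliceBucket> sliceResults = List.of(
                new SliceBucket("0", 2, 5), new SliceBucket("0", 2, 5), new SliceBucket("0", 1, 5),
                new SliceBucket("1", 2, 4), new SliceBucket("1", 2, 4));

        Map<String, List<SliceBucket>> byKey =
                sliceResults.stream().collect(Collectors.groupingBy(SliceBucket::key));
        byKey.forEach((key, buckets) -> {
            long docCount = buckets.stream().mapToLong(SliceBucket::docCount).sum();
            long bgCount = buckets.stream().mapToLong(SliceBucket::bgCount).sum();
            System.out.println("key \"" + key + "\": doc_count=" + docCount + " bg_count=" + bgCount);
        });
        // Prints bg_count=15 for key "0" (3 x 5) and bg_count=8 for key "1" (2 x 4),
        // instead of the expected 5 and 4.
    }
}
```

The doc_count sum stays correct because each slice only counts its own documents; it is only the background-side numbers that get double counted.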
Solutions
Without a deeper understanding of the scoring mechanism, it would be best for a solution to fix the scores at the slice level rather than at the coordinator level.
I have added a few possible solutions below, with solution 3 looking the most promising to me, and I will be working on a PR for that.
1. Divide by the bucket count
Due to the caveat above, in sparse data cases a bucket that is not collected in a segment slice won’t contribute to the bg_count. That means we can’t just blindly divide these numbers by the slice count; we would need some way to determine, per bucket, how many collectors actually contributed in order to know what to divide the bg_count by.
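A rough sketch of what solution 1 would require, under the assumption that we could track how many collectors actually contributed each bucket (something the current code does not hand us for free); every name below is hypothetical:

```java
import java.util.Map;

// Hypothetical sketch of solution 1: divide the summed bg_count by the number of slices
// that actually contributed the bucket. None of these names exist in OpenSearch; the point
// is only that a per-bucket contributing-slice count would be needed.
public class DivideBySliceCount {
    static long correctedBgCount(long summedBgCount, int contributingSlices) {
        // Cannot divide by the total slice count: slices that never saw the term
        // did not add to the sum, so the divisor must be tracked per bucket.
        return summedBgCount / contributingSlices;
    }

    public static void main(String[] args) {
        // key -> {summed bg_count, contributing slices}, using the test numbers above
        Map<String, int[]> perBucket = Map.of(
                "0", new int[] {15, 3},
                "1", new int[] {8, 2});
        perBucket.forEach((key, v) ->
                System.out.println("key \"" + key + "\": corrected bg_count=" + correctedBgCount(v[0], v[1])));
    }
}
```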
2. Separate out the shard-level vs slice-level reduce logic
Slice level reduce:
OpenSearch/server/src/main/java/org/opensearch/search/aggregations/AggregationCollectorManager.java
Lines 70 to 74 in 064f265
Add a flag to reduce context and then don’t sum up supersetDf here:
OpenSearch/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalSignificantTerms.java
Lines 286 to 299 in e354201
Nor supersetSize here:
OpenSearch/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalSignificantTerms.java
Lines 231 to 236 in e354201
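Here is a standalone model of that idea (not the real InternalSignificantTerms reduce; the isSliceLevel flag is a made-up stand-in for whatever flag the reduce context would carry): at the slice level, carry the background-side numbers through unchanged instead of summing them, and only sum at the coordinator level, where shards cover disjoint documents.

```java
import java.util.List;

// Model of a two-phase reduce for the significance counts.
public class TwoPhaseReduceSketch {
    record PartialAgg(long supersetSize, long supersetDf, long subsetDf) {}

    static PartialAgg reduce(List<PartialAgg> parts, boolean isSliceLevel) {
        long subsetDf = parts.stream().mapToLong(PartialAgg::subsetDf).sum(); // always summed
        long supersetSize;
        long supersetDf;
        if (isSliceLevel) {
            // Every slice reports the same shard-wide background counts, so take one copy.
            supersetSize = parts.get(0).supersetSize();
            supersetDf = parts.get(0).supersetDf();
        } else {
            // Coordinator-level reduce: shards cover disjoint documents, so summing is correct.
            supersetSize = parts.stream().mapToLong(PartialAgg::supersetSize).sum();
            supersetDf = parts.stream().mapToLong(PartialAgg::supersetDf).sum();
        }
        return new PartialAgg(supersetSize, supersetDf, subsetDf);
    }

    public static void main(String[] args) {
        List<PartialAgg> slices = List.of(new PartialAgg(9, 5, 2), new PartialAgg(9, 5, 2), new PartialAgg(9, 5, 1));
        System.out.println(reduce(slices, true));  // keeps supersetSize=9, supersetDf=5
        System.out.println(reduce(slices, false)); // would have produced 27 and 15
    }
}
```

The follow-up comment at the end of this issue points out that ReduceContext already exposes an isFinalReduce() method that is true for the coordinator-level reduce, which can play the role of this flag.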
3. Create a singleton SignificanceLookup that will only return backgroundFrequencies and supersetSize once per bucket
This seems to be the best solution to me because the summed bg_count will be correct going into the aggs-level reduce, so we won't need to modify that. Moreover, we're actually doing up to 2 * NUM_SLICES * NUM_BUCKETS extra queries to get the bg_count for every single bucket at the slice level, so this singleton solution will eliminate all but up to 2 of these extra queries.
Currently I am thinking we can use searchContext.isConcurrentSegmentSearchEnabled() here:
OpenSearch/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/SignificantTermsAggregatorFactory.java
Lines 290 to 295 in 9b9158e
and then further use the queryShardContext to determine when to reset the singleton.
Note that we will most likely also need to do this in SignificantTextAggregatorFactory here:
OpenSearch/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/SignificantTextAggregatorFactory.java
Line 157 in 8eea7b9
However, I'm seeing assert codec failures in that execution path, as mentioned in #8509 (comment). Nevertheless, the same solution should be applicable in both cases.
I still need to hash out some of the finer details here, such as how to reset the singleton class per bucket, but conceptually this solution is both the most intuitive and the least intrusive.
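To make the solution 3 proposal concrete, here is a simplified, standalone stand-in (not the real SignificanceLookup API; the class, constructor, and the way the background query is passed in are all invented for the sketch) for a per-shard lookup that hands out supersetSize and each backgroundFrequency exactly once, so the existing summing reduce still totals correctly and the duplicate per-slice background queries go away:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.ToLongFunction;

// Simplified stand-in for the solution-3 idea: one shared lookup per shard returns each
// background count exactly once, so summing across slices at reduce time counts it once.
public class SharedSignificanceLookupSketch {
    private final ToLongFunction<String> backgroundQuery; // e.g. a term-query count against the background set
    private final long supersetSize;                      // computed once for the shard
    private final ConcurrentHashMap<String, Boolean> returnedTerms = new ConcurrentHashMap<>();
    private final AtomicBoolean returnedSupersetSize = new AtomicBoolean(false);

    SharedSignificanceLookupSketch(long supersetSize, ToLongFunction<String> backgroundQuery) {
        this.supersetSize = supersetSize;
        this.backgroundQuery = backgroundQuery;
    }

    long supersetSize() {
        // Contribute supersetSize to exactly one slice-level aggregation.
        return returnedSupersetSize.compareAndSet(false, true) ? supersetSize : 0L;
    }

    long backgroundFrequency(String term) {
        // The first slice asking for a term runs the background query; later slices get 0,
        // so the unmodified summing reduce adds the background frequency exactly once.
        return returnedTerms.putIfAbsent(term, Boolean.TRUE) == null ? backgroundQuery.applyAsLong(term) : 0L;
    }

    public static void main(String[] args) {
        // Fake background query: term "0" -> 5 docs, term "1" -> 4 docs, as in the test above.
        SharedSignificanceLookupSketch lookup =
                new SharedSignificanceLookupSketch(9, term -> term.equals("0") ? 5L : 4L);
        System.out.println(lookup.backgroundFrequency("0")); // 5 (first slice)
        System.out.println(lookup.backgroundFrequency("0")); // 0 (second slice; summed total stays 5)
        System.out.println(lookup.supersetSize() + " " + lookup.supersetSize()); // 9 0
    }
}
```

How exactly the real singleton would be reset between buckets and requests is one of the finer details mentioned above and is not modeled here.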
Follow-up comment: After taking a look at this, I think solution 2 is actually much more straightforward. ReduceContext already has an isFinalReduce() method that is true for the coordinator-level reduce. We can use that method in the places mentioned above and then we can leave the slice-level query optimization for later. Will open a PR with this implementation shortly.