-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Calculate sum in Kahan summation algorithm in aggregations (#27807) #27848
Conversation
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments to the avg aggregation that applies to other aggregations as well. It looks good to me. I think we should also modify the InternalSum, InternalAvg, ... aggregations to store the compensation as well, so that merging sums that come from multiple shards retains more accuracy. Finally I think we need to modify the extended stats aggregation as well?
@@ -46,6 +46,8 @@ | |||
DoubleArray sums; | |||
DocValueFormat format; | |||
|
|||
private DoubleArray compensations; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please declare it next to sums
and using the same modifiers
double corrected = values.nextValue() - compensation; | ||
double newSum = sum + corrected; | ||
compensation = (newSum - sum) - corrected; | ||
sum = newSum; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment saying this is Kahan summation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
assertEquals(expectedSum, reduced.getValue(), 0.000d); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why I have to modify this test case is because if there is a Double.NEGATIVE_INFINITY
in the values to sum, Kahan summation
with get Double.NaN
, and naive summation will get Double.NEGATIVE_INFINITY
, I'm not sure if this use case should be supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We started rejecting this on new indices recently, but old indices might still have infinities. I think we should have special handling for these values in the aggregators?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 15cdec2
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this looks almost good to me. I left some minor comments on the avg
aggregation that apply to other aggs as well. Could you also add some tests for the Infinity/NaN corner-cases, including things like summing up a few finite doubles that are so large that the sum is infinite?
for (int i = 0; i < valueCount; i++) { | ||
sum += values.nextValue(); | ||
double value = values.nextValue(); | ||
if (Double.isNaN(value) || Double.isInfinite(value)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could test both at once by doing Double.isFinite(value) == false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's much better.
sum += value; | ||
if (Double.isNaN(sum)) | ||
break; | ||
} else if (Double.isFinite(sum)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is always true I think? So we could make it an else
block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check statement is needed. For example when summing up [Double.POSITIVE_INFINITY, 1, 2, 3]
, if no this check, we'll got NaN
, and we expect Double.POSITIVE_INFINITY
here.
double value = values.nextValue(); | ||
if (Double.isNaN(value) || Double.isInfinite(value)) { | ||
sum += value; | ||
if (Double.isNaN(sum)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not try to break once the sum is NaN, this doesn't bring much imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
sum += ((InternalAvg) aggregation).sum; | ||
InternalAvg avg = (InternalAvg) aggregation; | ||
count += avg.count; | ||
if (Double.isNaN(sum) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this if statement to keep things simple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense.
af772c4
to
1f07cca
Compare
Hi @jpountz Thanks for your comments. They're really useful to me. 👍 When I trying to add test case for some corner cases for
The result of |
Sorry for the long time with no response. I just fetched your code, and this is due to the fact that you use |
b6f5984
to
176be1f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found one corner case, but other than that, it looks good to me!
double newSum = sum + corrected; | ||
compensation = (newSum - sum) - corrected; | ||
sum = newSum; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one corner-case is not covered with this logic. Imagine that all values are finite but 2: -Inf and +Inf, for instance:
[+Inf, 4, -Inf]
. The expected result is NaN
but your logic will make it return +Inf
if I'm not mistaken. Maybe this should be something like that instead:
if (Double.isFinite(value) == false || Double.isFinite(sum) == false) {
sum += value;
} else {
// kahan summation
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jpountz Thanks a lot! I made a test for both of them:
double sum = 0;
double compensation = 0;
double[] values = new double[]{Double.POSITIVE_INFINITY, 4, Double.NEGATIVE_INFINITY};
for (int i = 0; i < values.length; i++) {
double value = values[i];
if (Double.isFinite(value) == false) {
sum += value;
} else if (Double.isFinite(sum)) {
double corrected = value - compensation;
double newSum = sum + corrected;
compensation = (newSum - sum) - corrected;
sum = newSum;
}
}
System.out.println(sum);
sum = 0;
compensation = 0;
for (int i = 0; i < values.length; i++) {
double value = values[i];
if (Double.isFinite(value) == false || Double.isFinite(sum) == false) {
sum += value;
} else {
double corrected = value - compensation;
double newSum = sum + corrected;
compensation = (newSum - sum) - corrected;
sum = newSum;
}
}
System.out.println(sum);
Both of their result are NaN
. Because in my code, for each value, if it's not finite, it will be summed up to sum
, no matter what sum
is. I can also made this change because I also think your way is more intuitive. I also have a random test case for this kind corner-case:
int n = randomIntBetween(5, 10);
values = new double[n];
double sum = 0;
for (int i = 0; i < n; i++) {
values[i] = frequently()
? randomFrom(Double.NaN, Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY)
: randomDoubleBetween(Double.MIN_VALUE, Double.MAX_VALUE, true);
sum += values[i];
}
verifyAvgOfDoubles(values, sum / n, 1e-10);
I can make it be regular for example:
double[] values = new double[]{Double.POSITIVE_INFINITY, 4, Double.NEGATIVE_INFINITY};
verifyAvgOfDoubles(values, NaN, 0d);
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had misread your code, sorry! I am fine either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep things the way they are.
@elasticmachine please test it |
* master: Trim down usages of `ShardOperationFailedException` interface (#28312) Do not return all indices if a specific alias is requested via get aliases api. [Test] Lower bwc version for rank-eval rest tests CountedBitSet doesn't need to extend BitSet. (#28239) Calculate sum in Kahan summation algorithm in aggregations (#27807) (#27848) Remove the `update_all_types` option. (#28288) Add information when master node left to DiscoveryNodes' shortSummary() (#28197) Provide explanation of dangling indices, fixes #26008 (#26999)
* 6.x: Trim down usages of `ShardOperationFailedException` interface (#28312) Clean up commits when global checkpoint advanced (#28140) Do not return all indices if a specific alias is requested via get aliases api. CountedBitSet doesn't need to extend BitSet. (#28239) Calculate sum in Kahan summation algorithm in aggregations (#27807) (#27848)
* master: (94 commits) Completely remove Painless Type from AnalyzerCaster in favor of Java Class. (elastic#28329) Fix spelling error Reindex: Wait for deletion in test Reindex: log more on rare test failure Ensure we protect Collections obtained from scripts from self-referencing (elastic#28335) [Docs] Fix asciidoc style in composite agg docs Adds the ability to specify a format on composite date_histogram source (elastic#28310) Provide a better error message for the case when all shards failed (elastic#28333) [Test] Re-Add integer_range and date_range field types for query builder tests (elastic#28171) Added Put Mapping API to high-level Rest client (elastic#27869) Revert change that does not return all indices if a specific alias is requested via get alias api. (elastic#28294) Painless: Replace Painless Type with Java Class during Casts (elastic#27847) Notify affixMap settings when any under the registered prefix matches (elastic#28317) Trim down usages of `ShardOperationFailedException` interface (elastic#28312) Do not return all indices if a specific alias is requested via get aliases api. [Test] Lower bwc version for rank-eval rest tests CountedBitSet doesn't need to extend BitSet. (elastic#28239) Calculate sum in Kahan summation algorithm in aggregations (elastic#27807) (elastic#27848) Remove the `update_all_types` option. (elastic#28288) Add information when master node left to DiscoveryNodes' shortSummary() (elastic#28197) ...
Currently, when computing sum of double values in aggregators, all values are summed up in naive summation. Kahan summation is a compensated summation algorithm which is more accurate for double values. However for NaN and infinities, Kahan summation will not get the same result as naive summation.
Closes #27807