-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure numeric matching respectes precisions as described in our documentation #166
Conversation
Rishi, should we have a close look now or do you want to polish some more first? |
hey @timbray this isn't ready yet to review. Still polishing. Didn't realize github sends out a notification even for draft PRs. |
No prob. One request: when you think it's stable, it would be useful to include any new language in README.md or wherever that states the constraints on numeric values. Formerly: +/-5B, 6 fractional digits. Or maybe it doesn't change? |
Alright this one is ready for some scrutin @timbray . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, with a few comments on comments.
In earlier versions of this PR, you had remarked that there was one controversial part, where you fell back to parsing hex versions of numbers in the data. I think that is now gone? I didn't see it.
One optimization that Quamina does makes a big difference.
- For each ByteMachine equivalent, there is a boolean field called hasNumbers, saying whether any values in the rules being represented contained a Comparable Number
- For each Field structure, each value has a boolean field called isNumber, saying whether the value in the event is a number that can be converted to a ComparableNumber.
Then when starting to evaluate Q's equivalent of a ByteMachine, we can say (this is Go syntax):
if vmFields.hasNumbers && eventField.IsNumber {
// proceed with generation of ComparableNumber from event for matching
Because the ComparableNumber generation is pretty expensive and in practice this lets you bypass a lot of them.
* Represents a number as a comparable string. | ||
* <br/> | ||
* Numbers are allowed in the range -5,000,000,000 to +5,000,000,000 (inclusive). | ||
* Comparisons are precise to 6 decimal places. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure about this language. We're not talking about 6 digit of precision, we're specifically saying 15 digits of precision, with six to the right of the decimal.
* Numbers are treated as floating-point values. | ||
* <br> | ||
* Numbers are converted to strings by: | ||
* 1. Multiplying by 1,000,000 to remove the decimal point and then adding 5,000,000,000 (to remove negatives), then m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo?
* <br/> | ||
* Hexadecimal representation is used because: | ||
* 1. It saves 3 bytes of memory per number compared to decimal representation. | ||
* 2. It aligns with the radix used for IP addresses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, lexical ordering is consistent with the underlying numeric ordering
* results show that only 5 decimal places of precision can be guaranteed when using doubles. | ||
* <br/> | ||
* CAVEAT: | ||
* The current maximum number of 5,000,000,000 is selected as a balance between maintaining the committed 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The currant range of +/- 5,000,000,000
BTW, Quamina probably won't follow this path, because unfortunately Go doesn't have built-in BigDecimal, and the benefits of having 6 rather than 5 decimal digits is smaller than the cost of accepting an uncontrolled external dependency. Would hope that some future version of Go gets good decimal support because I like the approach in this PR. |
// maybe it is a hex, fall back to using double where precision isn't guaranteed | ||
// we keep existing behaviour of ignore after 6 decimal to avoid breaking backward compatibility | ||
// as an acceptable trade-off https://github.com/aws/event-ruler/issues/163 | ||
return new BigDecimal(Double.parseDouble(str)).setScale(MAX_DECIMAL_PRECISON, RoundingMode.DOWN); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this bit, for Hexadecimal floating point literals, we fallback to the older way method of ignoring beyond 6 decimal places because these are extremely rare and there's a decent chance that decimal errors are coming from Java's Double.parseDouble()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand "maybe it is a hex". Hex numbers are not legal JSON. So that comment should probably change.
Quamina's approach is different. In the case where it can't be turned into a comparable number, we leave it exactly as is in the Event. An example are those numbers in CityLots; you only get a match when the rule contains the exact same representation of the number, byte for byte.
I'm not sure which approach I prefer because I'm not familiar with the kinds of applications where these kind of weird numbers come up. E.g. I have no idea what kind of logic I'd want if I were processing CityLots records. I can imagine a future feature where you say something like
{"numeric": [ "=", 5.0123456], "precision": 11}
But anyhow, LGTM.
Left a comment here https://github.com/aws/event-ruler/pull/166/files#r1668969004.
Its there but implemented as a counter
|
// maybe it is a hex, fall back to using double where precision isn't guaranteed | ||
// we keep existing behaviour of ignore after 6 decimal to avoid breaking backward compatibility | ||
// as an acceptable trade-off https://github.com/aws/event-ruler/issues/163 | ||
return new BigDecimal(Double.parseDouble(str)).setScale(MAX_DECIMAL_PRECISON, RoundingMode.DOWN); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand "maybe it is a hex". Hex numbers are not legal JSON. So that comment should probably change.
Quamina's approach is different. In the case where it can't be turned into a comparable number, we leave it exactly as is in the Event. An example are those numbers in CityLots; you only get a match when the rule contains the exact same representation of the number, byte for byte.
I'm not sure which approach I prefer because I'm not familiar with the kinds of applications where these kind of weird numbers come up. E.g. I have no idea what kind of logic I'd want if I were processing CityLots records. I can imagine a future feature where you say something like
{"numeric": [ "=", 5.0123456], "precision": 11}
But anyhow, LGTM.
I didn't realize hex numbers aren't legal JSON. I had only looked at the types of numbers Java supports but missed checking if they are part of JSON spec or not. Let me remove this bit and associated tests for now. |
Issue #, if available: #163
Description of changes:
As was reported in issue 163, ruler today ignores precision and causes false matches for numbers or rules with high precision numbers. This change moves away from using
double
for doing arithmetic adjustments withinComparableNumber
.Along the way the API to generate comparable numbers is changed from using
Strings
instead ofDouble
. This allows for more accurate rule matching for numbers with 6+ digits without compromising on performance.A bunch of our tests needed to be changed / fixed as a result of this change. These have been fixed. We're added additional test cases to help catch precision issues in future.
Benchmark / Performance (for source code changes):
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.