Ensure numeric matching respects precision as described in our documentation #166

Merged
merged 12 commits into from
Jul 8, 2024

Conversation

baldawar
Collaborator

@baldawar baldawar commented Jul 5, 2024

Issue #, if available: #163

Description of changes:

As reported in issue #163, Ruler currently ignores precision, which causes false matches for events or rules that contain high-precision numbers. This change stops using double for the arithmetic adjustments within ComparableNumber.
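
To illustrate why double arithmetic can produce false matches, consider the integer obtained by shifting a value by 5,000,000,000 and scaling it by 1,000,000 (both figures come from the documented limits). Near the top of the range that integer exceeds 2^53, so a double can no longer represent every value. The following is a minimal sketch; the class and method names are hypothetical and are not Ruler's code.

```java
import java.math.BigDecimal;

// Illustrative only: two distinct 6-decimal values can collapse to the same
// double once shifted and scaled, while BigDecimal keeps them apart.
class DoubleCollisionDemo {
    private static final BigDecimal SHIFT = new BigDecimal("5000000000");
    private static final BigDecimal SCALE = new BigDecimal("1000000");

    // Exact shifted-and-scaled integer, computed from the decimal string.
    static long exactScaled(String number) {
        return new BigDecimal(number).add(SHIFT).multiply(SCALE).longValueExact();
    }

    // True when both values round to the same double after shifting and scaling.
    static boolean collideAsDouble(String a, String b) {
        return (double) exactScaled(a) == (double) exactScaled(b);
    }

    public static void main(String[] args) {
        String a = "4500000000.000000";
        String b = "4500000000.000001";
        System.out.println(exactScaled(a) == exactScaled(b)); // false: exact values differ
        System.out.println(collideAsDouble(a, b));            // true: doubles coincide
    }
}
```

Values whose shifted and scaled form stays below 2^53 do not collide, which is why the problem only shows up for large magnitudes combined with six fractional digits.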

Along the way, the API for generating comparable numbers changes from taking a Double to taking a String. This allows more accurate rule matching for numbers with 6+ digits without compromising performance.
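
As a rough sketch of what a String-based conversion could look like, the snippet below parses the text with BigDecimal, truncates to six fractional digits, shifts by 5,000,000,000, scales by 1,000,000, and emits fixed-width hex. The class name, method name, and the 14-digit width are assumptions made for illustration; the actual ComparableNumber implementation may differ.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical sketch of a String-based comparable number; not Ruler's actual class.
class ComparableSketch {
    private static final BigDecimal LIMIT = new BigDecimal("5000000000"); // documented +/- range
    private static final BigDecimal SCALE = new BigDecimal("1000000");    // 6 fractional digits
    private static final int HEX_DIGITS = 14; // assumption: enough for 10^16, which is below 16^14

    static String toComparable(String number) {
        // Parse the text directly so no binary floating-point rounding occurs;
        // digits beyond the sixth decimal place are dropped.
        BigDecimal value = new BigDecimal(number).setScale(6, RoundingMode.DOWN);
        if (value.abs().compareTo(LIMIT) > 0) {
            throw new IllegalArgumentException("out of range: " + number);
        }
        // Shift to make the value non-negative, then scale it to an integer.
        long shifted = value.add(LIMIT).multiply(SCALE).longValueExact();
        // Fixed-width upper-case hex keeps lexical order equal to numeric order.
        return String.format("%0" + HEX_DIGITS + "X", shifted);
    }
}
```

With this sketch, `toComparable("4500000000.000000")` sorts strictly before `toComparable("4500000000.000001")`, a pair that a double-based computation can fail to distinguish.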

A number of our tests needed to be changed or fixed as a result of this change, and that has been done. We've also added test cases to help catch precision issues in the future.

Benchmark / Performance (for source code changes):

/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/bin/java -Dvisualvm.id=30211868140673 -ea -Didea.test.cyclic.buffer.size=1048576 -javaagent:/Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=50554:/Applications/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8 -classpath /Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar:/Applications/IntelliJ IDEA.app/Contents/plugins/junit/lib/junit5-rt.jar:/Applications/IntelliJ IDEA.app/Contents/plugins/junit/lib/junit-rt.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/cldrdata.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/jaccess.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/jfxrt.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/nashorn.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/jfxswt.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/management-agent.jar:/Library/Ja
va/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/javafx-mx.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/packager.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/tools.jar:/Volumes/Unix/workspaces/event-ruler/target/test-classes:/Volumes/Unix/workspaces/event-ruler/target/classes:/Users/baldawar/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.17.1/jackson-databind-2.17.1.jar:/Users/baldawar/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.17.1/jackson-annotations-2.17.1.jar:/Users/baldawar/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.17.1/jackson-core-2.17.1.jar:/Users/baldawar/.m2/repository/com/google/code/findbugs/jsr305/3.0.2/jsr305-3.0.2.jar:/Users/baldawar/.m2/repository/junit/junit/4.13.2/junit-4.13.2.jar:/Users/baldawar/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar com.intellij.rt.junit.JUnitStarter -ideVersion5 -junit4 software.amazon.event.ruler.Benchmarks
High NameState Reuse Memory Benchmark
Before: 254.8 (1)
After: 361.5 (223380)
Per rule: 139043 (290)
Reading citylots2
Read 213068 events
EXACT events/sec: 204676.3
WILDCARD events/sec: 138897.0
PREFIX events/sec: 226909.5
PREFIX_EQUALS_IGNORE_CASE_RULES events/sec: 216973.5
SUFFIX events/sec: 227880.2
SUFFIX_EQUALS_IGNORE_CASE_RULES events/sec: 227151.4
EQUALS_IGNORE_CASE events/sec: 189057.7
NUMERIC events/sec: 111905.5
ANYTHING-BUT events/sec: 127585.6
ANYTHING-BUT-IGNORE-CASE events/sec: 112913.6
ANYTHING-BUT-PREFIX events/sec: 128122.7
ANYTHING-BUT-SUFFIX events/sec: 111379.0
ANYTHING-BUT-WILDCARD events/sec: 137997.4
COMPLEX_ARRAYS events/sec: 35546.9
PARTIAL_COMBO events/sec: 51132.2
COMBO events/sec: 20205.6
Reading citylots2
Read 213068 events
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 13579
Events/sec: 15691.0
 Rules/sec: 109837.0
Low NameState Reuse Memory Benchmark
Before: 1779.7 (1)
After: 1239.9 (2625460)
Per rule: -702800 (3418)
Before: 1861.7 (1)
After: 985.6 (3254415)
Per rule: -2190 (8)
Turning JSON into field-lists...
Finding Rules...
Lines: 213068, Msec: 4100
Events/sec: 51967.8
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 1605
Events/sec: 132752.6
 Rules/sec: 483485143.9
Before: 2045.9 (1)
After: 662.6 (4469583)
Per rule: -3458 (11)
Reading citylots2
Read 213068 events
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Matched: 52527
Lines: 213068, Msec: 20022
Events/sec: 10641.7
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 12431
Events/sec: 17140.1
 Rules/sec: 119980.4
DEEP EXACT events/sec: 9090.9

Process finished with exit code 0


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@timbray
Collaborator

timbray commented Jul 5, 2024

Rishi, should we have a close look now or do you want to polish some more first?

@baldawar
Collaborator Author

baldawar commented Jul 5, 2024

Hey @timbray, this isn't ready for review yet. Still polishing. I didn't realize GitHub sends out a notification even for draft PRs.

@timbray
Collaborator

timbray commented Jul 5, 2024

No prob. One request: when you think it's stable, it would be useful to include any new language in README.md or wherever that states the constraints on numeric values. Formerly: +/-5B, 6 fractional digits. Or maybe it doesn't change?

@baldawar baldawar changed the title [WIP] Imprecision Ensure numeric matching respects precision as described in our documentation Jul 5, 2024
@baldawar baldawar marked this pull request as ready for review July 5, 2024 23:41
@baldawar
Collaborator Author

baldawar commented Jul 5, 2024

Alright, this one is ready for some scrutiny, @timbray.

Collaborator

@timbray timbray left a comment

Generally LGTM, with a few comments on comments.

In earlier versions of this PR, you had remarked that there was one controversial part, where you fell back to parsing hex versions of numbers in the data. I think that is now gone? I didn't see it.

One optimization that Quamina does makes a big difference.

  1. For each ByteMachine equivalent, there is a boolean field called hasNumbers, saying whether any values in the rules being represented contained a ComparableNumber.
  2. For each Field structure, each value has a boolean field called isNumber, saying whether the value in the event is a number that can be converted to a ComparableNumber.

Then when starting to evaluate Q's equivalent of a ByteMachine, we can say (this is Go syntax):

if vmFields.hasNumbers && eventField.IsNumber {
  // proceed with generation of ComparableNumber from event for matching
}

This matters because ComparableNumber generation is pretty expensive, and in practice this check lets you bypass a lot of it.
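
For readers on the Java side, the same guard might look roughly like the sketch below. It uses a counter of numeric rule values rather than a boolean; every name here is hypothetical, and the cheap first-character check is an assumption standing in for whatever test the parser actually provides.

```java
// Hypothetical sketch of the guard described above; names are illustrative.
class NumericGuardSketch {
    private int numericRuleValues = 0; // how many rule values are numeric
    private int conversions = 0;       // how many event values were converted

    void addNumericRuleValue() {
        numericRuleValues++;
    }

    void removeNumericRuleValue() {
        if (numericRuleValues > 0) {
            numericRuleValues--;
        }
    }

    // Cheap syntactic check: JSON numbers start with '-' or a digit.
    static boolean looksLikeNumber(String eventValue) {
        if (eventValue.isEmpty()) {
            return false;
        }
        char first = eventValue.charAt(0);
        return first == '-' || (first >= '0' && first <= '9');
    }

    // Returns true if the (expensive) conversion was actually performed.
    boolean maybeConvert(String eventValue) {
        if (numericRuleValues > 0 && looksLikeNumber(eventValue)) {
            conversions++; // stand-in for building the comparable number
            return true;
        }
        return false;
    }

    int conversions() {
        return conversions;
    }
}
```

When no rule contains a numeric value, `maybeConvert` returns immediately, so events full of numbers cost nothing extra.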

* Represents a number as a comparable string.
* <br/>
* Numbers are allowed in the range -5,000,000,000 to +5,000,000,000 (inclusive).
* Comparisons are precise to 6 decimal places.
Collaborator

Not sure about this language. We're not talking about 6 digits of precision; we're specifically saying 15 digits of precision, with six to the right of the decimal point.

* Numbers are treated as floating-point values.
* <br>
* Numbers are converted to strings by:
* 1. Multiplying by 1,000,000 to remove the decimal point and then adding 5,000,000,000 (to remove negatives), then m
Collaborator

typo?

* <br/>
* Hexadecimal representation is used because:
* 1. It saves 3 bytes of memory per number compared to decimal representation.
* 2. It aligns with the radix used for IP addresses.
Collaborator

Also, lexical ordering is consistent with the underlying numeric ordering

* results show that only 5 decimal places of precision can be guaranteed when using doubles.
* <br/>
* CAVEAT:
* The current maximum number of 5,000,000,000 is selected as a balance between maintaining the committed 6
Collaborator

The current range of +/- 5,000,000,000

@timbray
Collaborator

timbray commented Jul 8, 2024

BTW, Quamina probably won't follow this path, because unfortunately Go doesn't have a built-in BigDecimal, and the benefit of having 6 rather than 5 decimal digits is smaller than the cost of accepting an uncontrolled external dependency. I'd hope that some future version of Go gets good decimal support, because I like the approach in this PR.

Comment on lines 90 to 94
// maybe it is a hex, fall back to using double where precision isn't guaranteed
// we keep existing behaviour of ignore after 6 decimal to avoid breaking backward compatibility
// as an acceptable trade-off https://github.com/aws/event-ruler/issues/163
return new BigDecimal(Double.parseDouble(str)).setScale(MAX_DECIMAL_PRECISON, RoundingMode.DOWN);
}
Collaborator Author

In this bit, for hexadecimal floating-point literals, we fall back to the older method of ignoring everything beyond 6 decimal places, because these literals are extremely rare and there's a decent chance that decimal errors come from Java's Double.parseDouble().

Collaborator

I don't understand "maybe it is a hex". Hex numbers are not legal JSON. So that comment should probably change.

Quamina's approach is different. In the case where a value can't be turned into a comparable number, we leave it exactly as is in the Event. An example is those numbers in CityLots; you only get a match when the rule contains the exact same representation of the number, byte for byte.

I'm not sure which approach I prefer because I'm not familiar with the kinds of applications where these kinds of weird numbers come up. E.g. I have no idea what kind of logic I'd want if I were processing CityLots records. I can imagine a future feature where you say something like

{"numeric": [ "=", 5.0123456], "precision": 11}

But anyhow, LGTM.
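
The "leave it exactly as is" behaviour described above for Quamina could be sketched in Java along the following lines. This is only an illustration under stated assumptions: the class, the method, and the "N:" / "S:" key prefixes are invented here and belong to neither Quamina nor Ruler. It also shows that BigDecimal rejects hex floating-point literals such as 0x1.0p3, which Double.parseDouble would accept.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical sketch: canonicalize numbers when possible, otherwise keep the raw text.
class NumberFallbackSketch {
    private static final BigDecimal LIMIT = new BigDecimal("5000000000");

    static String matchKey(String raw) {
        try {
            BigDecimal value = new BigDecimal(raw); // rejects hex literals such as 0x1.0p3
            if (value.abs().compareTo(LIMIT) <= 0) {
                // "N:" marks a canonical numeric key (the prefix is an assumption of this sketch).
                return "N:" + value.setScale(6, RoundingMode.DOWN).toPlainString();
            }
        } catch (NumberFormatException e) {
            // Not a decimal number: fall through and keep the text unchanged.
        }
        // "S:" marks a key that only matches the identical character sequence.
        return "S:" + raw;
    }
}
```

Under this sketch, `1.0` and `1.000000` share a key, while an out-of-range value such as `1e100` only matches a rule that spells it the same way.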

@baldawar
Collaborator Author

baldawar commented Jul 8, 2024

> In earlier versions of this PR, you had remarked that there was one controversial part, where you fell back to parsing hex versions of numbers in the data. I think that is now gone? I didn't see it.

Left a comment here https://github.com/aws/event-ruler/pull/166/files#r1668969004.

> One optimization that Quamina does makes a big difference.

It's there, but implemented as a counter.

@baldawar
Collaborator Author

baldawar commented Jul 8, 2024

I didn't realize hex numbers aren't legal JSON. I had only looked at the types of numbers Java supports and missed checking whether they are part of the JSON spec.

Let me remove this bit and associated tests for now.

@baldawar baldawar merged commit 23e75a2 into main Jul 8, 2024
3 checks passed
@baldawar baldawar deleted the imprecision branch July 8, 2024 17:29