Ensure numeric matching respects precision as described in our documentation #166

baldawar · 2024-07-05T16:58:56Z

Issue #, if available: #163

Description of changes:

As reported in issue #163, Ruler today ignores its documented precision limits, which causes false matches for events or rules that contain high-precision numbers. This change moves away from using double for the arithmetic adjustments within ComparableNumber.

Along the way, the API for generating comparable numbers changes to take Strings instead of Doubles. This allows more accurate rule matching for numbers with 6+ digits without compromising performance.

A number of our tests needed to be changed or fixed as a result of this change; that has been done. We've also added test cases to help catch precision issues in the future.
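To illustrate why building the value from the String matters, here is a minimal sketch. It is not the code in this PR: the class and method names are hypothetical, and only `BigDecimal`, `RoundingMode.DOWN`, and the six-fractional-digit limit come from the discussion. It shows that a decimal literal routed through `double` can already differ from the written value before any arithmetic happens, while `new BigDecimal(String)` keeps it exact.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical illustration only; not the ComparableNumber implementation.
final class StringVersusDoubleSketch {
    // Six fractional digits, per the limits discussed in this PR.
    static final int MAX_DECIMAL_PLACES = 6;

    /** Parses the decimal text exactly, then truncates to six fractional digits. */
    static BigDecimal fromString(String text) {
        return new BigDecimal(text).setScale(MAX_DECIMAL_PLACES, RoundingMode.DOWN);
    }

    /** True when the nearest double holds exactly the value written in the text. */
    static boolean survivesDoubleRoundTrip(String text) {
        BigDecimal exact = new BigDecimal(text);
        BigDecimal viaDouble = new BigDecimal(Double.parseDouble(text));
        return exact.compareTo(viaDouble) == 0;
    }

    public static void main(String[] args) {
        System.out.println(survivesDoubleRoundTrip("0.5")); // true: 0.5 is a power of two
        System.out.println(survivesDoubleRoundTrip("0.1")); // false: 0.1 has no exact binary form
        System.out.println(fromString("5.0123456"));        // 5.012345
    }
}
```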

Benchmark / Performance (for source code changes):

/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/bin/java -Dvisualvm.id=30211868140673 -ea -Didea.test.cyclic.buffer.size=1048576 -javaagent:/Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=50554:/Applications/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8 -classpath /Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar:/Applications/IntelliJ IDEA.app/Contents/plugins/junit/lib/junit5-rt.jar:/Applications/IntelliJ IDEA.app/Contents/plugins/junit/lib/junit-rt.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/cldrdata.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/jaccess.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/jfxrt.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/nashorn.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/jfxswt.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/management-agent.jar:/Library/Ja
va/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/javafx-mx.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/packager.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/amazon-corretto-8.jdk/Contents/Home/lib/tools.jar:/Volumes/Unix/workspaces/event-ruler/target/test-classes:/Volumes/Unix/workspaces/event-ruler/target/classes:/Users/baldawar/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.17.1/jackson-databind-2.17.1.jar:/Users/baldawar/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.17.1/jackson-annotations-2.17.1.jar:/Users/baldawar/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.17.1/jackson-core-2.17.1.jar:/Users/baldawar/.m2/repository/com/google/code/findbugs/jsr305/3.0.2/jsr305-3.0.2.jar:/Users/baldawar/.m2/repository/junit/junit/4.13.2/junit-4.13.2.jar:/Users/baldawar/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar com.intellij.rt.junit.JUnitStarter -ideVersion5 -junit4 software.amazon.event.ruler.Benchmarks
High NameState Reuse Memory Benchmark
Before: 254.8 (1)
After: 361.5 (223380)
Per rule: 139043 (290)
Reading citylots2
Read 213068 events
EXACT events/sec: 204676.3
WILDCARD events/sec: 138897.0
PREFIX events/sec: 226909.5
PREFIX_EQUALS_IGNORE_CASE_RULES events/sec: 216973.5
SUFFIX events/sec: 227880.2
SUFFIX_EQUALS_IGNORE_CASE_RULES events/sec: 227151.4
EQUALS_IGNORE_CASE events/sec: 189057.7
NUMERIC events/sec: 111905.5
ANYTHING-BUT events/sec: 127585.6
ANYTHING-BUT-IGNORE-CASE events/sec: 112913.6
ANYTHING-BUT-PREFIX events/sec: 128122.7
ANYTHING-BUT-SUFFIX events/sec: 111379.0
ANYTHING-BUT-WILDCARD events/sec: 137997.4
COMPLEX_ARRAYS events/sec: 35546.9
PARTIAL_COMBO events/sec: 51132.2
COMBO events/sec: 20205.6
Reading citylots2
Read 213068 events
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 13579
Events/sec: 15691.0
 Rules/sec: 109837.0
Low NameState Reuse Memory Benchmark
Before: 1779.7 (1)
After: 1239.9 (2625460)
Per rule: -702800 (3418)
Before: 1861.7 (1)
After: 985.6 (3254415)
Per rule: -2190 (8)
Turning JSON into field-lists...
Finding Rules...
Lines: 213068, Msec: 4100
Events/sec: 51967.8
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 1605
Events/sec: 132752.6
 Rules/sec: 483485143.9
Before: 2045.9 (1)
After: 662.6 (4469583)
Per rule: -3458 (11)
Reading citylots2
Read 213068 events
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Matched: 52527
Lines: 213068, Msec: 20022
Events/sec: 10641.7
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 12431
Events/sec: 17140.1
 Rules/sec: 119980.4
DEEP EXACT events/sec: 9090.9

Process finished with exit code 0

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

timbray · 2024-07-05T18:14:06Z

Rishi, should we have a close look now or do you want to polish some more first?

baldawar · 2024-07-05T19:56:14Z

hey @timbray this isn't ready yet to review. Still polishing. Didn't realize github sends out a notification even for draft PRs.

timbray · 2024-07-05T20:04:02Z

No prob. One request: when you think it's stable, it would be useful to include any new language in README.md or wherever that states the constraints on numeric values. Formerly: +/-5B, 6 fractional digits. Or maybe it doesn't change?

baldawar · 2024-07-05T23:46:55Z

Alright, this one is ready for some scrutiny, @timbray.

timbray

Generally LGTM, with a few comments on comments.

In earlier versions of this PR, you had remarked that there was one controversial part, where you fell back to parsing hex versions of numbers in the data. I think that is now gone? I didn't see it.

One optimization that Quamina does makes a big difference.

For each ByteMachine equivalent, there is a boolean field called hasNumbers, saying whether any values in the rules being represented contained a ComparableNumber.
For each Field structure, each value has a boolean field called isNumber, saying whether the value in the event is a number that can be converted to a ComparableNumber.

Then, when starting to evaluate Quamina's equivalent of a ByteMachine, we can say (this is Go syntax):

if vmFields.hasNumbers && eventField.IsNumber {
  // proceed with generation of ComparableNumber from event for matching
}

This matters because ComparableNumber generation is pretty expensive, and in practice this check lets you bypass a lot of it.
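For readers on the Java side, the same guard could be sketched roughly as follows. All names here (`NumericGuardSketch`, `FieldValue`, `countConversions`) are hypothetical and exist only for illustration; the sketch counts how many expensive conversions would still be attempted once both flags are checked.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the two-flag guard; not Ruler or Quamina code.
final class NumericGuardSketch {
    /** Stand-in for an event field value that remembers whether it parses as a number. */
    static final class FieldValue {
        final String text;
        final boolean isNumber;

        FieldValue(String text) {
            this.text = text;
            this.isNumber = parsesAsNumber(text);
        }
    }

    static boolean parsesAsNumber(String text) {
        try {
            new BigDecimal(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Counts the conversions that would run; the expensive work is skipped otherwise. */
    static int countConversions(boolean machineHasNumbers, List<FieldValue> values) {
        int conversions = 0;
        for (FieldValue value : values) {
            if (machineHasNumbers && value.isNumber) {
                conversions++; // the costly comparable-number generation would happen here
            }
        }
        return conversions;
    }

    public static void main(String[] args) {
        List<FieldValue> values = Arrays.asList(
                new FieldValue("12.5"), new FieldValue("abc"), new FieldValue("7"));
        System.out.println(countConversions(true, values));  // 2
        System.out.println(countConversions(false, values)); // 0
    }
}
```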

timbray · 2024-07-08T16:26:17Z

src/main/software/amazon/event/ruler/ComparableNumber.java

+ * Represents a number as a comparable string.
+ * <br/>
+ * Numbers are allowed in the range -5,000,000,000 to +5,000,000,000 (inclusive).
+ * Comparisons are precise to 6 decimal places.


Not sure about this language. We're not talking about 6 digits of precision; we're specifically saying 15 digits of precision, with six to the right of the decimal point.

timbray · 2024-07-08T16:26:44Z

src/main/software/amazon/event/ruler/ComparableNumber.java

+ * Numbers are treated as floating-point values.
+ * <br>
+ * Numbers are converted to strings by:
+ * 1. Multiplying by 1,000,000 to remove the decimal point and then adding 5,000,000,000 (to remove negatives), then m


timbray · 2024-07-08T16:27:36Z

src/main/software/amazon/event/ruler/ComparableNumber.java

+ * <br/>
+ * Hexadecimal representation is used because:
+ * 1. It saves 3 bytes of memory per number compared to decimal representation.
+ * 2. It aligns with the radix used for IP addresses.


Also, lexical ordering is consistent with the underlying numeric ordering
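A rough sketch of the encoding these javadoc excerpts describe, and of why fixed-width hex keeps lexical and numeric order aligned. This is an assumption-laden illustration, not the ComparableNumber source: the class name, the 14-character width, and the choice to truncate before shifting are mine, and the real class may order or pad the steps differently.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Locale;

// Hypothetical illustration only; not the ComparableNumber implementation.
final class ComparableHexSketch {
    static final BigDecimal LIMIT = new BigDecimal("5000000000");
    static final int FRACTION_DIGITS = 6;
    // Assumed width: the largest shifted value, 10^16, needs 14 hex digits.
    static final int HEX_WIDTH = 14;

    static String encode(String number) {
        // Truncate to six fractional digits, as the thread describes.
        BigDecimal value = new BigDecimal(number).setScale(FRACTION_DIGITS, RoundingMode.DOWN);
        if (value.abs().compareTo(LIMIT) > 0) {
            throw new IllegalArgumentException("out of range: " + number);
        }
        // Shift into the non-negative range, then drop the decimal point.
        long shifted = value.add(LIMIT).movePointRight(FRACTION_DIGITS).longValueExact();
        String hex = Long.toHexString(shifted).toUpperCase(Locale.ROOT);
        StringBuilder padded = new StringBuilder();
        for (int i = hex.length(); i < HEX_WIDTH; i++) {
            padded.append('0'); // left-pad so every encoding has the same length
        }
        return padded.append(hex).toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("-5000000000")); // 00000000000000
        System.out.println(encode("5000000000"));  // 2386F26FC10000
        System.out.println(encode("-1.5").compareTo(encode("2")) < 0); // true
    }
}
```

Because every encoding has the same length and the hex digits sort in value order, plain string comparison gives the same answer as numeric comparison.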

timbray · 2024-07-08T16:29:11Z

src/main/software/amazon/event/ruler/ComparableNumber.java

+ * results show that only 5 decimal places of precision can be guaranteed when using doubles.
+ * <br/>
+ * CAVEAT:
+ * The current maximum number of 5,000,000,000 is selected as a balance between maintaining the committed 6


The current range of +/- 5,000,000,000

timbray · 2024-07-08T16:48:07Z

BTW, Quamina probably won't follow this path, because unfortunately Go doesn't have a built-in BigDecimal, and the benefit of having 6 rather than 5 decimal digits is smaller than the cost of accepting an uncontrolled external dependency. I'd hope that some future version of Go gets good decimal support, because I like the approach in this PR.

baldawar · 2024-07-08T16:49:19Z

src/main/software/amazon/event/ruler/ComparableNumber.java

+            // maybe it is a hex, fall back to using double where precision isn't guaranteed
+            // we keep existing behaviour of ignore after 6 decimal to avoid breaking backward compatibility
+            // as an acceptable trade-off https://github.com/aws/event-ruler/issues/163
+            return new BigDecimal(Double.parseDouble(str)).setScale(MAX_DECIMAL_PRECISON, RoundingMode.DOWN);
+        }


In this bit, for hexadecimal floating-point literals, we fall back to the older method of ignoring everything beyond 6 decimal places, because these literals are extremely rare and there's a decent chance that any decimal errors come from Java's Double.parseDouble().


baldawar · 2024-07-08T17:00:26Z

> In earlier versions of this PR, you had remarked that there was one controversial part, where you fell back to parsing hex versions of numbers in the data. I think that is now gone? I didn't see it.

Left a comment here https://github.com/aws/event-ruler/pull/166/files#r1668969004.

> One optimization that Quamina does makes a big difference.

It's there, but implemented as a counter:

event-ruler/src/main/software/amazon/event/ruler/ByteMachine.java

Line 115 in ccafd48

if (hasNumeric.get() > 0) {
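A small sketch of how a counter can play the same role as a boolean flag. The names below are hypothetical apart from `hasNumeric`, and the rationale (that a count stays correct when numeric patterns are removed as well as added) is my assumption rather than something stated in this thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not the ByteMachine implementation.
final class NumericRuleCounterSketch {
    private final AtomicInteger hasNumeric = new AtomicInteger(0);

    void addNumericPattern() {
        hasNumeric.incrementAndGet();
    }

    void deleteNumericPattern() {
        // Never drop below zero, even if delete is called more often than add.
        hasNumeric.updateAndGet(n -> n > 0 ? n - 1 : 0);
    }

    /** Mirrors the guard quoted above: only attempt numeric matching when a numeric pattern exists. */
    boolean shouldTryNumericMatch() {
        return hasNumeric.get() > 0;
    }

    public static void main(String[] args) {
        NumericRuleCounterSketch machine = new NumericRuleCounterSketch();
        machine.addNumericPattern();
        machine.addNumericPattern();
        machine.deleteNumericPattern();
        System.out.println(machine.shouldTryNumericMatch()); // true: one pattern remains
    }
}
```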

timbray · 2024-07-08T17:05:56Z

src/main/software/amazon/event/ruler/ComparableNumber.java

+            // maybe it is a hex, fall back to using double where precision isn't guaranteed
+            // we keep existing behaviour of ignore after 6 decimal to avoid breaking backward compatibility
+            // as an acceptable trade-off https://github.com/aws/event-ruler/issues/163
+            return new BigDecimal(Double.parseDouble(str)).setScale(MAX_DECIMAL_PRECISON, RoundingMode.DOWN);
+        }


I don't understand "maybe it is a hex". Hex numbers are not legal JSON. So that comment should probably change.

Quamina's approach is different. In the case where a value can't be turned into a comparable number, we leave it exactly as is in the Event. An example is the numbers in CityLots; you only get a match when the rule contains the exact same representation of the number, byte for byte.

I'm not sure which approach I prefer, because I'm not familiar with the kinds of applications where these kinds of weird numbers come up. E.g. I have no idea what kind of logic I'd want if I were processing CityLots records. I can imagine a future feature where you say something like

{"numeric": [ "=", 5.0123456], "precision": 11}

But anyhow, LGTM.
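To make the "leave it exactly as is" behaviour concrete, here is a hedged Java sketch. It is not Quamina's code (which is Go) and not Ruler's either; the class name, the `matchKey` helper, and the exact test for "can't be turned into a comparable number" (out of range, or more than six significant fractional digits) are assumptions for illustration.

```java
import java.math.BigDecimal;

// Hypothetical illustration of falling back to the raw text; not Quamina or Ruler code.
final class ExactFallbackSketch {
    static final BigDecimal LIMIT = new BigDecimal("5000000000");
    static final int FRACTION_DIGITS = 6;

    /** Returns a canonical numeric key when possible, otherwise the raw text unchanged. */
    static String matchKey(String raw) {
        BigDecimal value;
        try {
            value = new BigDecimal(raw);
        } catch (NumberFormatException e) {
            return raw; // not a number at all
        }
        boolean inRange = value.abs().compareTo(LIMIT) <= 0;
        boolean fitsPrecision = value.stripTrailingZeros().scale() <= FRACTION_DIGITS;
        if (!inRange || !fitsPrecision) {
            return raw; // only a byte-for-byte identical rule value can match this
        }
        // No rounding is needed here, so the default setScale is exact.
        return value.setScale(FRACTION_DIGITS).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(matchKey("5"));                  // 5.000000
        System.out.println(matchKey("5.0"));                // 5.000000
        System.out.println(matchKey("37.807807921694092")); // 37.807807921694092
    }
}
```

With this shape, `5` and `5.0` match each other, while a value with seventeen significant digits matches only its own literal spelling.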

baldawar · 2024-07-08T17:09:58Z

I didn't realize hex numbers aren't legal JSON. I had only looked at the types of numbers Java supports, but missed checking whether they are part of the JSON spec.

Let me remove this bit and associated tests for now.


baldawar added 5 commits July 3, 2024 14:26

backup functional precision checker

1e2761e

backing up v2 that uses bigdecimal instead

478508b

bkup w working tests and mostly cleaned up code

16a980c

Fix javadocs, variables, and tests

2397ea9

add a reminder to update pull -request

09d30ee

handle hex numbers which would stop matching out of the blue

fb4de9c

Update performance benchmarks

c1f8677

baldawar changed the title ~~[WIP] Imprecision~~ Ensure numeric matching respects precision as described in our documentation Jul 5, 2024

baldawar added 3 commits July 5, 2024 16:19

version bump

18d927e

checkstyle corrections

6ad601d

Fix broken tests

157d1c9

baldawar marked this pull request as ready for review July 5, 2024 23:41

baldawar mentioned this pull request Jul 5, 2024

Numeric matching ignores its own precision limitations. #163

Closed

timbray reviewed Jul 8, 2024


baldawar commented Jul 8, 2024


Fixing java doc errors within ComparableNumber

8cfa83d

timbray approved these changes Jul 8, 2024


Remove Hexadecimals. not part of JSON spec and so do not need to be supported

7c42b39

baldawar merged commit 23e75a2 into main Jul 8, 2024
3 checks passed

baldawar deleted the imprecision branch July 8, 2024 17:29
