refactor + impute #282

Merged 17 commits on Nov 4, 2021

Conversation

sudiptoguha (Contributor)

Description of changes: Refactors ThresholdedRCF for impute.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sudiptoguha requested a review from jotok, October 15, 2021 15:57
@kaituo (Collaborator) left a comment

Partial review; finished 7/17 files so far.

@Setter
public class RCFComputeDescriptor {

// sequence index (the number of updates to RCF) -- it is possible in imputation
kaituo (Collaborator):

Could you explain why the number of updates can be more than the number of input tuples seen by the overall program? Can the following be a cause?

Say I have points 1-10 and point 15, internal shingling is enabled, and my shingle size is 6. To get the ball rolling, I need to impute points 11-14. Then I need to update the RCF with points 11-14, and the total number of updates is 4 more than the number of input tuples seen by the overall program.

Also, can this happen (the number of updates exceeding the input tuples seen by the overall program) with external shingling?

sudiptoguha (Contributor, author):

One concern with any model is: would we update the model with values we just imputed from that same model? In some scenarios (especially when the number of imputations is low) that makes sense. But when there are more errors/missing entries, it may make sense to control that -- this is done by useImputedFraction (we only admit points where the ratio of imputed entries to total is below useImputedFraction).
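
A minimal sketch of that admission rule, with hypothetical names: numberOfImputed, shingleSize, and useImputedFraction approximate the PR's internals, and "total" is taken here to be the shingle size.

// Hypothetical sketch of the gating described above, not the PR's exact code.
// Admit a point into the model only while the imputed fraction stays low.
static boolean admitPoint(int numberOfImputed, int shingleSize, double useImputedFraction) {
    return (double) numberOfImputed / shingleSize < useImputedFraction;
}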

@kaituo (Collaborator) left a comment

Comments after reading a few files from 52b7e2a


// add explanation
preprocessor.postProcess(description, forest);
preprocessor.postProcess(description, lastAnomalyDescriptor, forest);

if (ifZero) { // turn caching off
kaituo (Collaborator):

What if an exception is thrown during the above steps and the program returns early, so you never change the cache fraction back?

sudiptoguha (Contributor, author):

During serialization the cache fraction is always set to 0 and ignored, so there is no issue with serialization. If an exception is thrown -- isn't there a bigger issue in what that exception is? If the fraction remains small then correctness is not affected, but speed is (and the next evaluation fixes it).

sudiptoguha (Contributor, author):

Revisiting this: it can be an issue for recovery strategies, since the fraction may remain set at 1 and that memory will continue to be used until the model is serialized. Maybe resolve in the next PR; it will need a small change to BoxCache.

double[] scaledInput = transformValues(result, factors);
updateShingle(result, scaledInput);
updateTimestamps(initialTimeStamps[i]);
numberOfImputed = numberOfImputed + 1;
kaituo (Collaborator):

Can we do numberOfImputed += numberToImpute at the beginning of the for loop?

sudiptoguha (Contributor, author):

Well, the decision to "allow" a tuple depends on the number of imputations (made-up information) so far. We could move this up if we did not want such control (but I think we do); see the sketch below.
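
A hypothetical sketch of why the increment stays inside the loop: each iteration's admission decision reads the running count, so hoisting numberOfImputed += numberToImpute to the top of the loop would change which points are admitted. This reuses the admitPoint sketch from earlier; imputeNext() is a stand-in, not the PR's code.

// Names approximated; imputeNext() is a hypothetical helper producing one imputed point.
for (int i = 0; i < numberToImpute; i++) {
    double[] point = imputeNext();
    if (admitPoint(numberOfImputed, shingleSize, useImputedFraction)) {
        forest.update(point); // admit only while the imputed ratio is still low
    }
    numberOfImputed = numberOfImputed + 1; // counted after this point's decision
}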

T answer = preprocessor.postProcess(core.apply(preprocessor.preProcess(input, lastAnomalyDescriptor, forest)),
lastAnomalyDescriptor, forest);
if (ifZero) { // turn caching off
forest.setBoundingBoxCacheFraction(0);
kaituo (Collaborator):

Re: what we discussed earlier:

  1. We need to make sure to demolish the bounding box in BoxCache.setCacheFraction.
  2. We need to make sure forest.setBoundingBoxCacheFraction(0) is called even in the presence of an exception; try..finally is one choice (a sketch follows).
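
A minimal sketch of the try..finally option, assuming the method names quoted in this PR; the Supplier wrapper is an illustration, not the actual refactor.

// Sketch only: guarantee the cache is turned back off even if a compute step throws.
static <T> T computeWithCacheRestored(RandomCutForest forest, boolean ifZero,
        java.util.function.Supplier<T> computeSteps) {
    try {
        return computeSteps.get(); // preProcess / core.apply / postProcess
    } finally {
        if (ifZero) {
            forest.setBoundingBoxCacheFraction(0); // restore regardless of outcome
        }
    }
}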

Comment on lines +50 to +51
// the inputlength; useful for standalone analysis
int inputLength;
kaituo (Collaborator):

Can you explain more about the shingle size part? You are referring to the following code in Preprocessor, right?

AnomalyDescriptor initialSetup(AnomalyDescriptor description, IRCFComputeDescriptor lastAnomalyDescriptor,
            RandomCutForest forest) {
...
description.setShingleSize(shingleSize);

shingleSize is a field of Preprocessor, and the preprocessor gets the shingle size when we construct the TRCF. If the preprocessor can get the correct shingle size, why can't the forest get it?

&& result.getTimestamp() == dataWithKeys.changeIndices[keyCounter]) {
System.out.println("timestamp " + (result.getTimestamp()) + " CHANGE");
&& result.getInternalTimeStamp() == dataWithKeys.changeIndices[keyCounter]) {
System.out.println("timestamp " + (result.getInputTimestamp()) + " CHANGE");
jotok (Contributor):

Are all these print statements (here and elsewhere in the PR) intended to be in the final code, or were they used for debugging?

sudiptoguha (Contributor, author):

Here (in the examples package) the print statements are for demonstration purposes: they show how to use the different fields, as a basis for explanation. Of course, they also serve a dual purpose for debugging -- if we change the algorithm and get a bad value for the "expected/likely value", they allow one to inspect what exactly happened.

double getRCFScore();

// the attribution of the entire shingled RCFPoint
DiVector getAttribution();
jotok (Contributor):

Does this method belong in the interface? It seems specific to the anomaly detection use case.

sudiptoguha (Contributor, author), Oct 29, 2021:

It could be useful for imputation (just as in AD we use the DiVector to identify the time slice). Score and attribution are central to RCF, so by that logic it makes sense to have them as basic options.

int getRelativeIndex();

// the score on RCFPoint
double getRCFScore();
jotok (Contributor):
Does this belong in the interface? What if the statistic that we want is not a scalar value? Would it make sense to make the interface generic?

public interface IRCFComputeDescriptor<Statistic> {
  Statistic getRCFStatistic();
}

sudiptoguha (Contributor, author):

See above: score and attribution are primitive constructs of RCF (given that they have dedicated visitors). We should, however, think of making this automatic, so that if a new visitor is defined it is automatically evaluated and shows up. But perhaps in a later PR?
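
A hypothetical sketch, explicitly not in this PR, of what such "automatic" evaluation could look like: register each visitor-backed statistic once, then evaluate everything that is registered. StatisticRegistry and the Function-based registration are invented names for illustration.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Invented illustration: a registry so that defining a new statistic (backed by
// a new visitor) automatically makes it show up in the computed results.
public final class StatisticRegistry<P> {
    private final Map<String, Function<P, Object>> computations = new LinkedHashMap<>();

    public void register(String name, Function<P, Object> computation) {
        computations.put(name, computation);
    }

    public Map<String, Object> evaluate(P point) {
        Map<String, Object> results = new LinkedHashMap<>();
        computations.forEach((name, f) -> results.put(name, f.apply(point)));
        return results;
    }
}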

this.inputTimestamp = inputTimestamp;
}

public void setCurrentInput(double[] currentValues) {
jotok (Contributor):

Does it make sense for this field to be settable? Is this method redundant with the @Setter annotation?

sudiptoguha (Contributor, author):

We are stuck with the 2.1 state classes and the mappers at the moment. We can remove this when we bump the version (likely soon, with more functionality in impute).
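
For context on the redundancy question above: Lombok's class-level @Setter generates a setter per field, so a handwritten setCurrentInput only matters if it adds behavior. The Example class below is illustrative, not the PR's class.

import lombok.Setter;

@Setter
public class Example {
    private double[] currentInput; // @Setter generates setCurrentInput(double[] currentInput)

    // Lombok skips generation when a method with the same signature already exists,
    // so a handwritten setter is only worth keeping if it differs, e.g. a defensive copy:
    // public void setCurrentInput(double[] v) { this.currentInput = (v == null) ? null : v.clone(); }
}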

@kaituo (Collaborator) commented Nov 1, 2021:

Ran a few manual integration tests using the commits. Looks good.
