refactor + impute #282
Conversation
Partial reviews. Finished 7/17 files.
@Setter
public class RCFComputeDescriptor {

    // sequence index (the number of updates to RCF) -- it is possible in imputation
Could you explain why the number of updates can be more than the number of input tuples seen by the overall program? Can the following be a cause?
Say I have points 1-10 and then point 15, internal shingling is enabled, and my shingle size is 6. To get the ball rolling, I need to impute points 11-14 and then update the RCF with them, so the total updates end up 4 more than the input tuples seen by the overall program.
Also, can this happen (the number of updates exceeding the input tuples seen by the overall program) with external shingling?
One of the concerns with any model is: would we update the model with values that we just imputed from that same model? There are some scenarios (especially when the number of imputations is low) where that makes sense. But if there are more errors/missing entries then it may make sense to control that -- this is done by useImputedFraction (we only admit points where the ratio of imputed to total is below useImputedFraction).
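To make the gating concrete, here is a minimal Java sketch of that check. Only numberOfImputed, shingleSize, and useImputedFraction are names from the PR; the method shape, and the assumption that the ratio is measured against the shingle size, are illustrative.

    // Hedged sketch, not the PR's implementation: admit a point into the
    // model only while the imputed-to-total ratio is below the threshold.
    static boolean admitPoint(int numberOfImputed, int shingleSize, double useImputedFraction) {
        // numberOfImputed counts made-up values currently in play; the
        // denominator (shingle size) is an assumption for this sketch
        return (double) numberOfImputed / shingleSize < useImputedFraction;
    }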
Comments after reading a few files from 52b7e2a
// add explanation
- preprocessor.postProcess(description, forest);
+ preprocessor.postProcess(description, lastAnomalyDescriptor, forest);
if (ifZero) { // turn caching off
What if an exception is thrown during the above steps and the program returns early -- then you won't be able to change the cache fraction back?
During serialization the cache is always set to 0 and ignored. So there is no issue with serialization. If there is an exception thrown -- isn't there a bigger issue of what that exception is? If the fraction remains small then correctness is not affected, but speed is (and the next evaluation fixes it).
Revisiting this: it can be an issue for recovery strategies -- the fraction may remain set at 1, so the extra memory will continue to be used until the model is serialized. Maybe resolve in the next PR; it will need a small change to BoxCache.
double[] scaledInput = transformValues(result, factors);
updateShingle(result, scaledInput);
updateTimestamps(initialTimeStamps[i]);
numberOfImputed = numberOfImputed + 1;
Can we do numberOfImputed += numberToImpute at the beginning of the for loop?
Well, the decision to "allow" a tuple depends on the current number of imputations (made-up information). We could move the increment up if we did not want such control (but I think we do). A sketch of the ordering issue follows.
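To illustrate the ordering, a hedged sketch: imputeNext is a hypothetical helper, and only numberOfImputed, numberToImpute, shingleSize, and useImputedFraction mirror the discussion above; the rest is assumption, not the PR's code.

    // Sketch only: the admission decision reads the counter's current value,
    // so hoisting "numberOfImputed += numberToImpute" above the loop would
    // change which imputed points the forest is updated with.
    for (int i = 0; i < numberToImpute; i++) {
        double[] imputed = imputeNext(); // hypothetical helper producing one imputed point
        if ((double) numberOfImputed / shingleSize < useImputedFraction) {
            forest.update(imputed);      // only admit while the imputed fraction is low
        }
        numberOfImputed++;               // per point: the next decision depends on it
    }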
T answer = preprocessor.postProcess(core.apply(preprocessor.preProcess(input, lastAnomalyDescriptor, forest)),
        lastAnomalyDescriptor, forest);
if (ifZero) { // turn caching off
    forest.setBoundingBoxCacheFraction(0);
Re what we discussed earlier:
- We need to make sure to demolish the bounding box in BoxCache.setCacheFraction.
- We need to make sure to call forest.setBoundingBoxCacheFraction(0) in the presence of an exception; try..finally is one choice (see the sketch below).
// the input length; useful for standalone analysis
int inputLength;
Can you explain more about the shingle size part? You are referring to the following code in Preprocessor, right?

    AnomalyDescriptor initialSetup(AnomalyDescriptor description, IRCFComputeDescriptor lastAnomalyDescriptor,
            RandomCutForest forest) {
        ...
        description.setShingleSize(shingleSize);

shingleSize is a field of Preprocessor, and the preprocessor gets the shingle size when we construct the TRCF. If the preprocessor can get the correct shingle size, why can't the forest get it?
- && result.getTimestamp() == dataWithKeys.changeIndices[keyCounter]) {
-     System.out.println("timestamp " + (result.getTimestamp()) + " CHANGE");
+ && result.getInternalTimeStamp() == dataWithKeys.changeIndices[keyCounter]) {
+     System.out.println("timestamp " + (result.getInputTimestamp()) + " CHANGE");
Are all these print statements (here and elsewhere in the PR) intended to be in the final code, or were they used for debugging?
Here (in the examples package) the print statements are for purposes of demonstration -- they show how to use the different fields, and serve as a basis of explanation. Of course, they also serve a dual purpose in debugging: if we change the algorithm and get a bad value for the "expected/likely value", they allow one to inspect what exactly happened.
double getRCFScore();

// the attribution of the entire shingled RCFPoint
DiVector getAttribution();
Does this method belong in the interface? It seems specific to the anomaly detection use case.
It could be useful for imputation (just as in AD we use the DiVector to identify the time slice). Score and attribution are central to RCF, so by that logic it makes sense to have them as basic options. A sketch of the time-slice use follows.
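For context, a hedged sketch of using an attribution to locate the most anomalous time slice of a shingled point. It assumes DiVector exposes parallel high/low arrays with one entry per coordinate of the shingled point; the helper itself is illustrative, not library code.

    // Illustrative helper: return the shingle position whose directional
    // attribution contributes most to the score.
    static int mostAnomalousSlice(DiVector attribution, int shingleSize, int inputLength) {
        int best = 0;
        double bestSum = -1;
        for (int slice = 0; slice < shingleSize; slice++) {
            double sum = 0;
            for (int j = 0; j < inputLength; j++) {
                int idx = slice * inputLength + j; // coordinate within the shingled point
                sum += attribution.high[idx] + attribution.low[idx];
            }
            if (sum > bestSum) {
                bestSum = sum;
                best = slice;
            }
        }
        return best; // cf. getRelativeIndex() on the descriptor
    }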
int getRelativeIndex();

// the score on RCFPoint
double getRCFScore();
Does this belong in the interface? What if the statistic that we want is not a scalar value? Would it make sense to make the interface generic?

    public interface IRCFComputeDescriptor<Statistic> {
        Statistic getRCFStatistic();
    }
See above -- score and attribution are primitive constructs of RCF (they have dedicated visitors). We should, however, think about making this automatic: if a new visitor is defined, it is automatically evaluated and shows up. But perhaps in a later PR?
this.inputTimestamp = inputTimestamp;
}

public void setCurrentInput(double[] currentValues) {
Does it make sense for this field to be settable? Is this method redundant with the @Setter annotation?
We are stuck with the 2.1 state classes and the mappers at the moment. We can remove this as we bump up the version (likely soon, with more functionality in Impute).
Ran a few manual integration tests using the commits. Looks good.
Description of changes: Refactors ThresholdedRCF for impute.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.