refactor + impute #282

Merged 17 commits on Nov 4, 2021

Conversation

sudiptoguha (Contributor)

Description of changes: Refactors ThresholdedRCF for impute.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sudiptoguha requested a review from jotok, October 15, 2021 15:57
@kaituo (Collaborator) left a comment

Partial review; finished 7/17 files so far.

@Setter
public class RCFComputeDescriptor {

// sequence index (the number of updates to RCF) -- it is possible in imputation
kaituo (Collaborator):

Could you explain why the number of updates can be more than the number of input tuples seen by the overall program? Can the following be a cause?

Say I have points 1-10 and point 15, internal shingling is enabled, and my shingle size is 6. To get the ball rolling, I need to impute points 11-14. Then I need to update the RCF with points 11-14, and the total number of updates is 4 more than the number of input tuples seen by the overall program.

Also, can this happen (the number of updates exceeding the input tuples seen by the overall program) with external shingling?

sudiptoguha (Contributor, author):

One concern with any model is: would we update the model with values we just imputed from that same model? In some scenarios (especially when the number of imputations is low) that makes sense. But when there are more errors/missing entries, it may make sense to control that -- this is done by useImputedFraction (we only admit points where the ratio of imputed entries to total is below useImputedFraction).
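
A minimal sketch of that admission rule, with hypothetical names: numberOfImputed, shingleSize, and useImputedFraction approximate the PR's internals, and "total" is taken here to be the shingle size.

// Hypothetical sketch of the gating described above, not the PR's exact code.
// Admit a point into the model only while the imputed fraction stays low.
static boolean admitPoint(int numberOfImputed, int shingleSize, double useImputedFraction) {
    return (double) numberOfImputed / shingleSize < useImputedFraction;
}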

@kaituo (Collaborator) left a comment

Comments after reading a few files from 52b7e2a


// add explanation
preprocessor.postProcess(description, forest);
preprocessor.postProcess(description, lastAnomalyDescriptor, forest);

if (ifZero) { // turn caching off
kaituo (Collaborator):

What if an exception is thrown during the above steps and the program returns early, so you never change the cache fraction back?

sudiptoguha (Contributor, author):

During serialization the cache fraction is always set to 0 and ignored, so there is no issue with serialization. If an exception is thrown -- isn't there a bigger issue in what that exception is? If the fraction remains small then correctness is not affected, but speed is (and the next evaluation fixes it).

sudiptoguha (Contributor, author):

Revisiting this: it can be an issue for recovery strategies, since the fraction may remain set at 1 and that memory will continue to be used until the model is serialized. Maybe resolve in the next PR; it will need a small change to BoxCache.

double[] scaledInput = transformValues(result, factors);
updateShingle(result, scaledInput);
updateTimestamps(initialTimeStamps[i]);
numberOfImputed = numberOfImputed + 1;
kaituo (Collaborator):

Can we do numberOfImputed += numberToImpute at the beginning of the for loop?

sudiptoguha (Contributor, author):

Well, the decision to "allow" a tuple depends on the number of imputations (made-up information) so far. We could move this up if we did not want such control (but I think we do); see the sketch below.
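
A hypothetical sketch of why the increment stays inside the loop: each iteration's admission decision reads the running count, so hoisting numberOfImputed += numberToImpute to the top of the loop would change which points are admitted. This reuses the admitPoint sketch from earlier; imputeNext() is a stand-in, not the PR's code.

// Names approximated; imputeNext() is a hypothetical helper producing one imputed point.
for (int i = 0; i < numberToImpute; i++) {
    double[] point = imputeNext();
    if (admitPoint(numberOfImputed, shingleSize, useImputedFraction)) {
        forest.update(point); // admit only while the imputed ratio is still low
    }
    numberOfImputed = numberOfImputed + 1; // counted after this point's decision
}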

T answer = preprocessor.postProcess(core.apply(preprocessor.preProcess(input, lastAnomalyDescriptor, forest)),
lastAnomalyDescriptor, forest);
if (ifZero) { // turn caching off
forest.setBoundingBoxCacheFraction(0);
kaituo (Collaborator):

Re: what we discussed earlier:

  1. We need to make sure to demolish the bounding box in BoxCache.setCacheFraction.
  2. We need to make sure forest.setBoundingBoxCacheFraction(0) is called even in the presence of an exception; try..finally is one choice (a sketch follows).
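
A minimal sketch of the try..finally option, assuming the method names quoted in this PR; the Supplier wrapper is an illustration, not the actual refactor.

// Sketch only: guarantee the cache is turned back off even if a compute step throws.
static <T> T computeWithCacheRestored(RandomCutForest forest, boolean ifZero,
        java.util.function.Supplier<T> computeSteps) {
    try {
        return computeSteps.get(); // preProcess / core.apply / postProcess
    } finally {
        if (ifZero) {
            forest.setBoundingBoxCacheFraction(0); // restore regardless of outcome
        }
    }
}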

Comment on lines +50 to +51
// the inputlength; useful for standalone analysis
int inputLength;
kaituo (Collaborator):

Can you explain more about the shingle size part? You are referring to the following code in Preprocessor, right?

AnomalyDescriptor initialSetup(AnomalyDescriptor description, IRCFComputeDescriptor lastAnomalyDescriptor,
            RandomCutForest forest) {
...
description.setShingleSize(shingleSize);

shingleSize is a field of Preprocessor, and the preprocessor gets the shingle size when we construct the TRCF. If the preprocessor can get the correct shingle size, why can't the forest get it?

&& result.getTimestamp() == dataWithKeys.changeIndices[keyCounter]) {
System.out.println("timestamp " + (result.getTimestamp()) + " CHANGE");
&& result.getInternalTimeStamp() == dataWithKeys.changeIndices[keyCounter]) {
System.out.println("timestamp " + (result.getInputTimestamp()) + " CHANGE");
jotok (Contributor):

Are all these print statements (here and elsewhere in the PR) intended to be in the final code, or were they used for debugging?

sudiptoguha (Contributor, author):

Here (in the examples package) the print statements are for demonstration purposes: they show how to use the different fields, as a basis for explanation. Of course, they also serve a dual purpose for debugging -- if we change the algorithm and get a bad value for the "expected/likely value", they allow one to inspect what exactly happened.

double getRCFScore();

// the attribution of the entire shingled RCFPoint
DiVector getAttribution();
jotok (Contributor):

Does this method belong in the interface? It seems specific to the anomaly detection use case.

sudiptoguha (Contributor, author), Oct 29, 2021:

It could be useful for imputation (just as in AD we use the DiVector to identify the time slice). Score and attribution are central to RCF, so by that logic it makes sense to have them as basic options.

int getRelativeIndex();

// the score on RCFPoint
double getRCFScore();
jotok (Contributor):
Does this belong in the interface? What if the statistic that we want is not a scalar value? Would it make sense to make the interface generic?

public interface IRCFComputeDescriptor<Statistic> {
  Statistic getRCFStatistic();
}

sudiptoguha (Contributor, author):

See above: score and attribution are primitive constructs of RCF (given that they have dedicated visitors). We should, however, think of making this automatic, so that if a new visitor is defined it is automatically evaluated and shows up. But perhaps in a later PR?
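
A hypothetical sketch, explicitly not in this PR, of what such "automatic" evaluation could look like: register each visitor-backed statistic once, then evaluate everything that is registered. StatisticRegistry and the Function-based registration are invented names for illustration.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Invented illustration: a registry so that defining a new statistic (backed by
// a new visitor) automatically makes it show up in the computed results.
public final class StatisticRegistry<P> {
    private final Map<String, Function<P, Object>> computations = new LinkedHashMap<>();

    public void register(String name, Function<P, Object> computation) {
        computations.put(name, computation);
    }

    public Map<String, Object> evaluate(P point) {
        Map<String, Object> results = new LinkedHashMap<>();
        computations.forEach((name, f) -> results.put(name, f.apply(point)));
        return results;
    }
}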

this.inputTimestamp = inputTimestamp;
}

public void setCurrentInput(double[] currentValues) {
jotok (Contributor):

Does it make sense for this field to be settable? Is this method redundant with the @Setter annotation?

sudiptoguha (Contributor, author):

We are stuck with the 2.1 state classes and the mappers at the moment. We can remove this when we bump the version (likely soon, with more functionality in impute).
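
For context on the redundancy question above: Lombok's class-level @Setter generates a setter per field, so a handwritten setCurrentInput only matters if it adds behavior. The Example class below is illustrative, not the PR's class.

import lombok.Setter;

@Setter
public class Example {
    private double[] currentInput; // @Setter generates setCurrentInput(double[] currentInput)

    // Lombok skips generation when a method with the same signature already exists,
    // so a handwritten setter is only worth keeping if it differs, e.g. a defensive copy:
    // public void setCurrentInput(double[] v) { this.currentInput = (v == null) ? null : v.clone(); }
}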

@kaituo (Collaborator) commented Nov 1, 2021:

Ran a few manual integration tests using the commits. Looks good.
