Use current time as training data end time #547
The bug happens because we use the job enabled time as the training data end time. If the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses the current time as the training data end time so that cold start has a chance to succeed later.

This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker, because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:
1. Manually verified the bug is fixed.
2. Fixed all related unit tests.

Signed-off-by: Kaituo Li <[email protected]>
Codecov Report
@@            Coverage Diff             @@
##               1.2     #547     +/-   ##
============================================
+ Coverage    76.23%   76.27%   +0.04%
- Complexity    3963     3966       +3
============================================
  Files          295      295
  Lines        17180    17178       -2
  Branches      1812     1814       +2
============================================
+ Hits         13097    13103       +6
+ Misses        3258     3248      -10
- Partials       825      827       +2
@@ -220,6 +220,22 @@ private void coldStart(
    ) {
        logger.debug("Trigger cold start for {}", modelId);

        if (modelState == null || entity == null) {
If data isn't present currently but will be ingested in the future, and we are expecting a long initialization, would either of these conditions be met until then?
This is a method invariant, and we don't expect either condition to be true, regardless of whether data is present.
@@ -404,42 +419,23 @@ private void getEntityColdStartData(String detectorId, Entity entity, ActionList
    ActionListener<Optional<Long>> minTimeListener = ActionListener.wrap(earliest -> {
        if (earliest.isPresent()) {
            long startTimeMs = earliest.get().longValue();
            nodeStateManager.getAnomalyDetectorJob(detectorId, ActionListener.wrap(jobOp -> {
                if (!jobOp.isPresent()) {
                    listener.onFailure(new EndRunException(detectorId, "AnomalyDetector job is not available.", false));
Previously, it seems we used both cold start data and samples as training data. Can you explain further why we are moving away from this? Or was no sample data ever actually used, because we didn't add samples until cold start succeeded anyway (as you mention in the description)?
No sample data was ever actually used, because we didn't add samples until cold start succeeded.
-    private void combineTrainSamples(List<double[][]> coldstartDatapoints, String modelId, ModelState<EntityModel> entityState) {
-        if (coldstartDatapoints == null || coldstartDatapoints.size() == 0) {
+    private void extractTrainSamples(List<double[][]> coldstartDatapoints, String modelId, ModelState<EntityModel> entityState) {
+        if (coldstartDatapoints == null || coldstartDatapoints.size() == 0 || entityState == null) {
Is this needed as extra safety for the later call to .getModel? We know that if modelState were null earlier, an exception would have been thrown, so this check might not be needed. Also, should we keep the naming consistent, or indicate the reason for the change from the modelState name to entityState?
Yes, it is for extra safety. It is not needed now, but it helps prevent unintended bugs in case we forget the invariant.
Yes, let me change it to use modelState.
            return;
        }

        EntityModel model = entityState.getModel();
        if (model == null) {
            model = new EntityModel(null, new ArrayDeque<>(), null);
            entityState.setModel(model);
What is the reason for setting an empty model now, when we didn't do this previously?
Not setting an empty model is a bug. But because the model is never empty in the current workflow, the bug does not manifest itself.
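The null guard and the lazy creation of an empty model discussed in the two threads above can be sketched in isolation. This is a minimal sketch under stated assumptions, not the plugin's code: `Model` and `State` are hypothetical, simplified stand-ins for `EntityModel` and `ModelState<EntityModel>`, and the final step of flattening every cold start segment into the sample queue is an assumption made only to give the method something observable to do.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

final class ExtractSamplesSketch {
    /** Hypothetical stand-in for EntityModel: it only holds a sample queue. */
    static final class Model {
        final Queue<double[]> samples = new ArrayDeque<>();
    }

    /** Hypothetical stand-in for ModelState<EntityModel>. */
    static final class State {
        private Model model;

        Model getModel() {
            return model;
        }

        void setModel(Model model) {
            this.model = model;
        }
    }

    static void extractTrainSamples(List<double[][]> coldStartDatapoints, State state) {
        // Defensive guard: return early instead of relying on callers to uphold the invariant.
        if (coldStartDatapoints == null || coldStartDatapoints.isEmpty() || state == null) {
            return;
        }
        Model model = state.getModel();
        if (model == null) {
            // Create and register an empty model so the samples have somewhere to go.
            model = new Model();
            state.setModel(model);
        }
        // Assumption for illustration: keep every data point of every segment.
        for (double[][] segment : coldStartDatapoints) {
            for (double[] point : segment) {
                model.samples.add(point);
            }
        }
    }

    public static void main(String[] args) {
        State state = new State();
        extractTrainSamples(List.of(new double[][] { { 1.0 }, { 2.0 } }), state);
        System.out.println("samples stored: " + state.getModel().samples.size());
    }
}
```

Without the `setModel` call, a state that starts with no model would silently drop the extracted samples, which is the latent bug described in the reply above.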
To clarify with an example: is the bug here that we might have historical data from 1pm to 6pm, and the current time is 6:10pm? Could the job enabled time be something like 4pm, after cold start has already run once or more? And then we won't use data past 4pm for training? With this PR, will we use the current time of 6:10pm as the training data end time? Will this have implications for long initializations that require more data to be ingested before initialization can end anyway?
The example is right, except the timing numbers may need to change. Depending on the interval, the data fetching algorithm is different (check 2ce24a0). In one scenario, we use the last 40 samples with two tries, so at most we look back 80 intervals. If the interval is at least 2 minutes, we will therefore use some of the data older than 4pm. What you said can happen. If we keep the current code, users will have to either not delete data before the enabled time or manipulate data timestamps to make the data look old, so that a following cold start can succeed. After the fix, they can keep ingesting new data and cold start will succeed eventually. The latter UX seems better to me.
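The arithmetic in this reply can be checked with a short calculation. The sketch below assumes only what the comment states (40 samples per try, at most two tries, so at most 80 intervals of lookback) and uses the 6:10pm and 4pm times from the question; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalTime;

final class LookbackWindowSketch {
    // Numbers taken from the discussion: 40 samples per try, at most two tries.
    static final int SAMPLES_PER_TRY = 40;
    static final int MAX_TRIES = 2;

    /** Earliest time the fetch can reach when the training window ends at endTime. */
    static LocalTime earliestLookback(LocalTime endTime, Duration interval) {
        long maxIntervals = (long) SAMPLES_PER_TRY * MAX_TRIES;
        return endTime.minus(interval.multipliedBy(maxIntervals));
    }

    public static void main(String[] args) {
        LocalTime now = LocalTime.of(18, 10);    // current time in the example
        LocalTime enabled = LocalTime.of(16, 0); // job enabled time in the example

        for (int minutes = 1; minutes <= 2; minutes++) {
            LocalTime earliest = earliestLookback(now, Duration.ofMinutes(minutes));
            System.out.println(minutes + "-minute interval reaches back to " + earliest
                + ", before enabled time: " + earliest.isBefore(enabled));
        }
    }
}
```

With a 2-minute interval, 80 intervals span 160 minutes, so a window ending at 6:10pm reaches back to 3:30pm and includes data older than 4pm. With a 1-minute interval it reaches back only to 4:50pm.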
    // cold start data as existing samples all happen after job enabled time. There might
    // be some gaps in between the last cold start sample and the first accumulated sample.
    // We will need to accept that precision loss in current solution.
    long endTimeMs = job.getEnabledTime().toEpochMilli();
If the HC detector's realtime job is not restarted, the enabled time won't change. If there is not enough data before the job enabled time and the user doesn't backfill historical data, there is no chance to pass cold start, right? That seems like a critical bug if true. We'd better backport to 1.x too.
Right. Yes, will backport.
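The failure mode confirmed in this thread can be reproduced with a small simulation. This is a minimal sketch, not the plugin's code: `samplesInWindow`, `endTimeOld`, and `endTimeNew` are hypothetical helpers, and reading the current time from a `java.time.Clock` is an assumption about how the end time could be obtained. It models a detector whose data all arrives after the job was enabled.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.List;

final class TrainingWindowSketch {
    /** Counts samples whose timestamps fall in [startMs, endMs). */
    static long samplesInWindow(List<Long> timestampsMs, long startMs, long endMs) {
        return timestampsMs.stream().filter(t -> t >= startMs && t < endMs).count();
    }

    /** Behavior described as buggy: the window ends at the job enabled time. */
    static long endTimeOld(Instant jobEnabledTime) {
        return jobEnabledTime.toEpochMilli();
    }

    /** Behavior after the fix: the window ends at the current time. */
    static long endTimeNew(Clock clock) {
        return clock.millis();
    }

    public static void main(String[] args) {
        Instant enabled = Instant.ofEpochMilli(1_000L);
        // All data was ingested after the job was enabled.
        List<Long> data = List.of(2_000L, 3_000L, 4_000L);
        Clock clock = Clock.fixed(Instant.ofEpochMilli(5_000L), ZoneOffset.UTC);

        System.out.println("old end time sees " + samplesInWindow(data, 0L, endTimeOld(enabled)) + " samples");
        System.out.println("new end time sees " + samplesInWindow(data, 0L, endTimeNew(clock)) + " samples");
    }
}
```

Because the enabled time stays fixed until the job is restarted, the old window never grows, however much data is ingested later. The window that ends at the current time picks up new samples on each retry.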
LGTM. Thanks for fixing!
LGTM, thanks for making the fix!
… (#556)
* Use current time as training data end time (#547)
Description
The bug happens because we use the job enabled time as the training data end time. If the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses the current time as the training data end time so that cold start has a chance to succeed later.

This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker, because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.
Testing done:
1. Manually verified the bug is fixed.
2. Fixed all related unit tests.
Signed-off-by: Kaituo Li [email protected]
Note: I found the bug in 1.2, so I started with the 1.2 branch. Will forward-port to 1.3 and 2.x later.
Issues Resolved
#540
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.