Use current time as training data end time #547

kaituo · 2022-05-13T23:02:16Z

Description

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:

manually verified the bug is fixed.
fixed all related unit tests.

Signed-off-by: Kaituo Li [email protected]

Note: I found the bug in 1.2, so I started with 1.2 branch. Will forward push to 1.3 and 2.x later.

Issues Resolved

#540

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary. Testing done: 1. manually verified the bug is fixed. 2. fixed all related unit tests. Signed-off-by: Kaituo Li <[email protected]>

codecov-commenter · 2022-05-13T23:35:43Z

Codecov Report

Merging #547 (f78e767) into 1.2 (e2f3497) will increase coverage by 0.04%.
The diff coverage is 50.00%.

@@             Coverage Diff              @@
##                1.2     #547      +/-   ##
============================================
+ Coverage     76.23%   76.27%   +0.04%     
- Complexity     3963     3966       +3     
============================================
  Files           295      295              
  Lines         17180    17178       -2     
  Branches       1812     1814       +2     
============================================
+ Hits          13097    13103       +6     
+ Misses         3258     3248      -10     
- Partials        825      827       +2

Flag	Coverage Δ
plugin	`76.27% <50.00%> (+0.04%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...n/java/org/opensearch/ad/ml/EntityColdStarter.java	`79.18% <43.75%> (-1.97%)`	⬇️
.../opensearch/ad/ratelimit/CheckpointReadWorker.java	`87.68% <100.00%> (+0.44%)`	⬆️
...opensearch/ad/transport/AnomalyResultResponse.java	`71.96% <0.00%> (-12.13%)`	⬇️
...ansport/handler/AnomalyResultBulkIndexHandler.java	`67.74% <0.00%> (-3.23%)`	⬇️
...va/org/opensearch/ad/feature/SearchFeatureDao.java	`84.56% <0.00%> (-2.08%)`	⬇️
...ain/java/org/opensearch/ad/task/ADTaskManager.java	`76.90% <0.00%> (+0.30%)`	⬆️
.../main/java/org/opensearch/ad/NodeStateManager.java	`71.89% <0.00%> (+0.65%)`	⬆️
...rch/ad/transport/AnomalyResultTransportAction.java	`80.86% <0.00%> (+0.68%)`	⬆️
...ava/org/opensearch/ad/task/ADHCBatchTaskCache.java	`90.12% <0.00%> (+1.23%)`	⬆️
... and 2 more

amitgalitz · 2022-05-17T19:05:58Z

src/main/java/org/opensearch/ad/ml/EntityColdStarter.java

@@ -220,6 +220,22 @@ private void coldStart(
    ) {
        logger.debug("Trigger cold start for {}", modelId);

+        if (modelState == null || entity == null) {


If data isn't present currently but will get ingested in the future and we are expecting a long initialization would either of this condition be met until then?

This is method invariant and we don't expect any of them to be true, regardless of whether data is present or not.

amitgalitz · 2022-05-17T19:09:29Z

src/main/java/org/opensearch/ad/ml/EntityColdStarter.java

@@ -404,42 +419,23 @@ private void getEntityColdStartData(String detectorId, Entity entity, ActionList
            ActionListener<Optional<Long>> minTimeListener = ActionListener.wrap(earliest -> {
                if (earliest.isPresent()) {
                    long startTimeMs = earliest.get().longValue();
-                    nodeStateManager.getAnomalyDetectorJob(detectorId, ActionListener.wrap(jobOp -> {
-                        if (!jobOp.isPresent()) {
-                            listener.onFailure(new EndRunException(detectorId, "AnomalyDetector job is not available.", false));


Before it seems like we used cold start data and samples for the training data. Can you further explain why we are moving away from this or was no sample data actually ever being used because we didn't add samples until success anyways (as you mention in description)?

No sample data actually ever being used because we didn't add samples until success.

amitgalitz · 2022-05-17T19:10:28Z

src/main/java/org/opensearch/ad/ml/EntityColdStarter.java

-    private void combineTrainSamples(List<double[][]> coldstartDatapoints, String modelId, ModelState<EntityModel> entityState) {
-        if (coldstartDatapoints == null || coldstartDatapoints.size() == 0) {
+    private void extractTrainSamples(List<double[][]> coldstartDatapoints, String modelId, ModelState<EntityModel> entityState) {
+        if (coldstartDatapoints == null || coldstartDatapoints.size() == 0 || entityState == null) {


Is this needed for extra safety to access .getModel later on? Since we know that if modelState was null earlier an exception would've been thrown so this might not be needed. Also should we keep naming consistent or indicate reason of transition from modelState to entityState name.

yes, it is for extra safety. It is not needed now but is helpful to prevent unintended bugs as we may forget the invariant.

yes, let me change to use modelState.

amitgalitz · 2022-05-17T19:10:47Z

src/main/java/org/opensearch/ad/ml/EntityColdStarter.java

            return;
        }

        EntityModel model = entityState.getModel();
        if (model == null) {
            model = new EntityModel(null, new ArrayDeque<>(), null);
+            entityState.setModel(model);


What is the reason for setting an empty model now where previously we haven't done this.

not setting an empty model is a bug. But because the model will not be empty in current workflow it does not manifest itself.

amitgalitz · 2022-05-17T19:12:29Z

To clarify with an example, is the bug here that we might have historical data from 1pm to 6pm. Current time is 6:10pm. Could job enabled time be something like 4pm after running cold start once or more already? And then we wont use data past 4 pm for training? With this PR we will use the current time of 6:10pm as training data end time? Will this add implications for long initializations that requires more data to be ingested for initialization to end anyways?

kaituo · 2022-05-17T21:41:52Z

To clarify with an example, is the bug here that we might have historical data from 1pm to 6pm. Current time is 6:10pm. Could job enabled time be something like 4pm after running cold start once or more already? And then we wont use data past 4 pm for training? With this PR we will use the current time of 6:10pm as training data end time? Will this add implications for long initializations that requires more data to be ingested for initialization to end anyways?

The example is right except the timing number may need to change. Depends on the interval, the data fetching algorithm is different (check 2ce24a0). In one scenario, we will use last 40 samples with two tries. At most, we will look back 80 intervals. So if the interval is at least 2 minutes, we will use some of the data older than 4 pm.

What you said can happen. If we keep current code, users will have to not delete data before enabled time or manipulate data timestamps to make it look old to make following cold start to succeed. After the fix, they can keep ingesting new data and the cold start will succeed eventually. The latter UX seems better to me.

Signed-off-by: Kaituo Li <[email protected]>

ylwu-amzn · 2022-05-17T21:56:52Z

src/main/java/org/opensearch/ad/ml/EntityColdStarter.java

-                        // cold start data as existing samples all happen after job enabled time. There might
-                        // be some gaps in between the last cold start sample and the first accumulated sample.
-                        // We will need to accept that precision loss in current solution.
-                        long endTimeMs = job.getEnabledTime().toEpochMilli();


If HC detector realtime job not restarted, the enabled time won't change. If there is no enough data before job enabled time and user don't backfill historical data, there is no chance to pass cold start, right? Seems a critical bug if that's true. We'd better backfill to 1.x too.

right. yes, will backfill.

ylwu-amzn

LGTM. Thanks for fixing!

amitgalitz

LGTM, thanks for making the fix!

* Use current time as training data end time The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary. Testing done: 1. manually verified the bug is fixed. 2. fixed all related unit tests. Signed-off-by: Kaituo Li <[email protected]>

… (#556) * Use current time as training data end time (#547) The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary. Testing done: 1. manually verified the bug is fixed. 2. fixed all related unit tests. Signed-off-by: Kaituo Li <[email protected]>

kaituo requested review from amitgalitz and ylwu-amzn May 13, 2022 23:02

amitgalitz reviewed May 17, 2022

View reviewed changes

rename variable

f78e767

Signed-off-by: Kaituo Li <[email protected]>

kaituo force-pushed the coldStartRetry branch from a4c744c to f78e767 Compare May 17, 2022 21:47

ylwu-amzn reviewed May 17, 2022

View reviewed changes

kaituo requested review from ylwu-amzn and amitgalitz May 17, 2022 22:03

ylwu-amzn approved these changes May 17, 2022

View reviewed changes

amitgalitz approved these changes May 17, 2022

View reviewed changes

kaituo merged commit 2ebaa08 into opensearch-project:1.2 May 17, 2022

amitgalitz added the bug Something isn't working label Jul 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use current time as training data end time #547

Use current time as training data end time #547

kaituo commented May 13, 2022 •

edited

Loading

codecov-commenter commented May 13, 2022 •

edited

Loading

amitgalitz May 17, 2022

kaituo May 17, 2022 •

edited

Loading

amitgalitz May 17, 2022

kaituo May 17, 2022

amitgalitz May 17, 2022

kaituo May 17, 2022

amitgalitz May 17, 2022

kaituo May 17, 2022 •

edited

Loading

amitgalitz commented May 17, 2022

kaituo commented May 17, 2022 •

edited

Loading

ylwu-amzn May 17, 2022

kaituo May 17, 2022

ylwu-amzn left a comment

amitgalitz left a comment

Use current time as training data end time #547

Use current time as training data end time #547

Conversation

kaituo commented May 13, 2022 • edited Loading

Description

Issues Resolved

codecov-commenter commented May 13, 2022 • edited Loading

Codecov Report

Choose a reason for hiding this comment

kaituo May 17, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kaituo May 17, 2022 • edited Loading

Choose a reason for hiding this comment

amitgalitz commented May 17, 2022

kaituo commented May 17, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ylwu-amzn left a comment

Choose a reason for hiding this comment

amitgalitz left a comment

Choose a reason for hiding this comment

kaituo commented May 13, 2022 •

edited

Loading

codecov-commenter commented May 13, 2022 •

edited

Loading

kaituo May 17, 2022 •

edited

Loading

kaituo May 17, 2022 •

edited

Loading

kaituo commented May 17, 2022 •

edited

Loading