Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use current time as training data end time #547

Merged
merged 2 commits into from
May 17, 2022

Conversation

kaituo
Copy link
Collaborator

@kaituo kaituo commented May 13, 2022

Description

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:

  1. manually verified the bug is fixed.
  2. fixed all related unit tests.

Signed-off-by: Kaituo Li [email protected]

Note: I found the bug in 1.2, so I started with 1.2 branch. Will forward push to 1.3 and 2.x later.

Issues Resolved

#540

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:
1. manually verified the bug is fixed.
2. fixed all related unit tests.

Signed-off-by: Kaituo Li <[email protected]>
@kaituo kaituo requested review from amitgalitz and ylwu-amzn May 13, 2022 23:02
@codecov-commenter
Copy link

codecov-commenter commented May 13, 2022

Codecov Report

Merging #547 (f78e767) into 1.2 (e2f3497) will increase coverage by 0.04%.
The diff coverage is 50.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##                1.2     #547      +/-   ##
============================================
+ Coverage     76.23%   76.27%   +0.04%     
- Complexity     3963     3966       +3     
============================================
  Files           295      295              
  Lines         17180    17178       -2     
  Branches       1812     1814       +2     
============================================
+ Hits          13097    13103       +6     
+ Misses         3258     3248      -10     
- Partials        825      827       +2     
Flag Coverage Δ
plugin 76.27% <50.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...n/java/org/opensearch/ad/ml/EntityColdStarter.java 79.18% <43.75%> (-1.97%) ⬇️
.../opensearch/ad/ratelimit/CheckpointReadWorker.java 87.68% <100.00%> (+0.44%) ⬆️
...opensearch/ad/transport/AnomalyResultResponse.java 71.96% <0.00%> (-12.13%) ⬇️
...ansport/handler/AnomalyResultBulkIndexHandler.java 67.74% <0.00%> (-3.23%) ⬇️
...va/org/opensearch/ad/feature/SearchFeatureDao.java 84.56% <0.00%> (-2.08%) ⬇️
...ain/java/org/opensearch/ad/task/ADTaskManager.java 76.90% <0.00%> (+0.30%) ⬆️
.../main/java/org/opensearch/ad/NodeStateManager.java 71.89% <0.00%> (+0.65%) ⬆️
...rch/ad/transport/AnomalyResultTransportAction.java 80.86% <0.00%> (+0.68%) ⬆️
...ava/org/opensearch/ad/task/ADHCBatchTaskCache.java 90.12% <0.00%> (+1.23%) ⬆️
... and 2 more

@@ -220,6 +220,22 @@ private void coldStart(
) {
logger.debug("Trigger cold start for {}", modelId);

if (modelState == null || entity == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If data isn't present currently but will get ingested in the future and we are expecting a long initialization would either of this condition be met until then?

Copy link
Collaborator Author

@kaituo kaituo May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is method invariant and we don't expect any of them to be true, regardless of whether data is present or not.

@@ -404,42 +419,23 @@ private void getEntityColdStartData(String detectorId, Entity entity, ActionList
ActionListener<Optional<Long>> minTimeListener = ActionListener.wrap(earliest -> {
if (earliest.isPresent()) {
long startTimeMs = earliest.get().longValue();
nodeStateManager.getAnomalyDetectorJob(detectorId, ActionListener.wrap(jobOp -> {
if (!jobOp.isPresent()) {
listener.onFailure(new EndRunException(detectorId, "AnomalyDetector job is not available.", false));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before it seems like we used cold start data and samples for the training data. Can you further explain why we are moving away from this or was no sample data actually ever being used because we didn't add samples until success anyways (as you mention in description)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sample data actually ever being used because we didn't add samples until success.

private void combineTrainSamples(List<double[][]> coldstartDatapoints, String modelId, ModelState<EntityModel> entityState) {
if (coldstartDatapoints == null || coldstartDatapoints.size() == 0) {
private void extractTrainSamples(List<double[][]> coldstartDatapoints, String modelId, ModelState<EntityModel> entityState) {
if (coldstartDatapoints == null || coldstartDatapoints.size() == 0 || entityState == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed for extra safety to access .getModel later on? Since we know that if modelState was null earlier an exception would've been thrown so this might not be needed. Also should we keep naming consistent or indicate reason of transition from modelState to entityState name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is for extra safety. It is not needed now but is helpful to prevent unintended bugs as we may forget the invariant.

yes, let me change to use modelState.

return;
}

EntityModel model = entityState.getModel();
if (model == null) {
model = new EntityModel(null, new ArrayDeque<>(), null);
entityState.setModel(model);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for setting an empty model now where previously we haven't done this.

Copy link
Collaborator Author

@kaituo kaituo May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not setting an empty model is a bug. But because the model will not be empty in current workflow it does not manifest itself.

@amitgalitz
Copy link
Member

To clarify with an example, is the bug here that we might have historical data from 1pm to 6pm. Current time is 6:10pm. Could job enabled time be something like 4pm after running cold start once or more already? And then we wont use data past 4 pm for training? With this PR we will use the current time of 6:10pm as training data end time? Will this add implications for long initializations that requires more data to be ingested for initialization to end anyways?

@kaituo
Copy link
Collaborator Author

kaituo commented May 17, 2022

To clarify with an example, is the bug here that we might have historical data from 1pm to 6pm. Current time is 6:10pm. Could job enabled time be something like 4pm after running cold start once or more already? And then we wont use data past 4 pm for training? With this PR we will use the current time of 6:10pm as training data end time? Will this add implications for long initializations that requires more data to be ingested for initialization to end anyways?

The example is right except the timing number may need to change. Depends on the interval, the data fetching algorithm is different (check 2ce24a0). In one scenario, we will use last 40 samples with two tries. At most, we will look back 80 intervals. So if the interval is at least 2 minutes, we will use some of the data older than 4 pm.

What you said can happen. If we keep current code, users will have to not delete data before enabled time or manipulate data timestamps to make it look old to make following cold start to succeed. After the fix, they can keep ingesting new data and the cold start will succeed eventually. The latter UX seems better to me.

Signed-off-by: Kaituo Li <[email protected]>
// cold start data as existing samples all happen after job enabled time. There might
// be some gaps in between the last cold start sample and the first accumulated sample.
// We will need to accept that precision loss in current solution.
long endTimeMs = job.getEnabledTime().toEpochMilli();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If HC detector realtime job not restarted, the enabled time won't change. If there is no enough data before job enabled time and user don't backfill historical data, there is no chance to pass cold start, right? Seems a critical bug if that's true. We'd better backfill to 1.x too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. yes, will backfill.

@kaituo kaituo requested review from ylwu-amzn and amitgalitz May 17, 2022 22:03
Copy link
Collaborator

@ylwu-amzn ylwu-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for fixing!

Copy link
Member

@amitgalitz amitgalitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for making the fix!

@kaituo kaituo merged commit 2ebaa08 into opensearch-project:1.2 May 17, 2022
kaituo added a commit to kaituo/anomaly-detection-1 that referenced this pull request May 17, 2022
* Use current time as training data end time

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:
1. manually verified the bug is fixed.
2. fixed all related unit tests.

Signed-off-by: Kaituo Li <[email protected]>
kaituo added a commit that referenced this pull request May 17, 2022
* Use current time as training data end time

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:
1. manually verified the bug is fixed.
2. fixed all related unit tests.

Signed-off-by: Kaituo Li <[email protected]>
kaituo added a commit to kaituo/anomaly-detection-1 that referenced this pull request May 23, 2022
* Use current time as training data end time

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:
1. manually verified the bug is fixed.
2. fixed all related unit tests.

Signed-off-by: Kaituo Li <[email protected]>
kaituo added a commit to kaituo/anomaly-detection-1 that referenced this pull request May 23, 2022
* Use current time as training data end time

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:
1. manually verified the bug is fixed.
2. fixed all related unit tests.

Signed-off-by: Kaituo Li <[email protected]>
kaituo added a commit to kaituo/anomaly-detection-1 that referenced this pull request May 23, 2022
* Use current time as training data end time

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:
1. manually verified the bug is fixed.
2. fixed all related unit tests.

Signed-off-by: Kaituo Li <[email protected]>
kaituo added a commit that referenced this pull request May 23, 2022
* Use current time as training data end time

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:
1. manually verified the bug is fixed.
2. fixed all related unit tests.

Signed-off-by: Kaituo Li <[email protected]>
kaituo added a commit that referenced this pull request May 23, 2022
* Use current time as training data end time

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:
1. manually verified the bug is fixed.
2. fixed all related unit tests.

Signed-off-by: Kaituo Li <[email protected]>
kaituo added a commit that referenced this pull request May 24, 2022
… (#556)

* Use current time as training data end time (#547)

The bug happens because we use job enabled time as training data end time. But if the historical data before that time is deleted or does not exist at all, cold start might never finish. This PR uses current time as the training data end time so that cold start has a chance to succeed later. This PR also removes the code that combines cold start data and existing samples in EntityColdStartWorker because we don't add samples until cold start succeeds. Combining cold start data and existing samples is thus unnecessary.

Testing done:
1. manually verified the bug is fixed.
2. fixed all related unit tests.

Signed-off-by: Kaituo Li <[email protected]>
@amitgalitz amitgalitz added the bug Something isn't working label Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants