[Backport 1.3] Fix restart HCAD detector bug #466

Closed
wants to merge 1 commit into from

opensearch-trigger-bot[bot]

Backport 9dd9718 from #460

* Fix restart HCAD detector bug

To avoid repeatedly cold starting a model when data is sparse, HCAD keeps a cache that records which models have already gone through cold start. A second cold start attempt for the same model must wait 60 detector intervals. Previously, stopping a detector did not clear this cache, so the cache still remembered the model after a restart and cold start was not retried for some time. This PR fixes the bug by clearing the cache when a detector is stopped.

Testing done:
1. added unit and integration tests.
2. manually reproduced the issue and verified the fix.

Signed-off-by: Kaituo Li <[email protected]>
(cherry picked from commit 9dd9718)
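
The behavior described in the commit message can be illustrated with a minimal sketch. This is not the plugin's code: the class `ColdStartThrottle`, its methods, and the constant `COOL_DOWN_INTERVALS` are hypothetical names chosen for illustration, and the real logic in `EntityColdStarter` may be structured differently. The sketch only assumes what the description states: a cold start attempt is remembered per model, a second attempt waits 60 detector intervals, and stopping a detector should clear the remembered attempts.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; not the anomaly-detection plugin's API.
class ColdStartThrottle {
    // Assumed to mirror the "60 detector intervals" wait from the PR description.
    static final int COOL_DOWN_INTERVALS = 60;

    // detectorId -> (modelId -> time of the last cold start attempt)
    private final Map<String, Map<String, Instant>> lastAttempt = new ConcurrentHashMap<>();

    /** Returns true and records the attempt if cold start is allowed at time {@code now}. */
    boolean tryColdStart(String detectorId, String modelId, Duration interval, Instant now) {
        Map<String, Instant> perDetector =
            lastAttempt.computeIfAbsent(detectorId, k -> new ConcurrentHashMap<>());
        Instant previous = perDetector.get(modelId);
        if (previous != null
            && now.isBefore(previous.plus(interval.multipliedBy(COOL_DOWN_INTERVALS)))) {
            return false; // still inside the cool-down window
        }
        perDetector.put(modelId, now);
        return true;
    }

    /** Forgets every remembered attempt for a detector; meant to be called when it is stopped. */
    void clear(String detectorId) {
        lastAttempt.remove(detectorId);
    }

    public static void main(String[] args) {
        ColdStartThrottle throttle = new ColdStartThrottle();
        Duration interval = Duration.ofMinutes(1);
        Instant start = Instant.parse("2022-03-25T00:00:00Z");

        System.out.println(throttle.tryColdStart("detector", "model", interval, start));
        // Ten intervals later the model is still throttled.
        System.out.println(throttle.tryColdStart("detector", "model", interval, start.plus(Duration.ofMinutes(10))));
        // Stopping the detector clears the entry, so a restart can cold start right away.
        throttle.clear("detector");
        System.out.println(throttle.tryColdStart("detector", "model", interval, start.plus(Duration.ofMinutes(11))));
    }
}
```

Without the `clear` call, the third attempt in `main` would be rejected in the same way as the second, which corresponds to the restart bug described above.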
@codecov-commenter

Codecov Report

Merging #466 (a9d3be9) into 1.3 (bf8f2da) will decrease coverage by 0.13%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##                1.3     #466      +/-   ##
============================================
- Coverage     77.58%   77.45%   -0.14%     
+ Complexity     4105     4102       -3     
============================================
  Files           296      296              
  Lines         17669    17673       +4     
  Branches       1878     1878              
============================================
- Hits          13709    13689      -20     
- Misses         3057     3078      +21     
- Partials        903      906       +3     
| Flag | Coverage Δ | |
|---|---|---|
| plugin | 77.45% <100.00%> (-0.14%) | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...n/java/org/opensearch/ad/ml/EntityColdStarter.java | 83.73% <100.00%> (+2.59%) | ⬆️ |
| ...earch/ad/transport/DeleteModelTransportAction.java | 96.15% <100.00%> (+0.32%) | ⬆️ |
| ...ain/java/org/opensearch/ad/model/ModelProfile.java | 69.09% <0.00%> (-3.64%) | ⬇️ |
| ...java/org/opensearch/ad/task/ADBatchTaskRunner.java | 75.22% <0.00%> (-3.04%) | ⬇️ |
| ...ansport/handler/AnomalyResultBulkIndexHandler.java | 67.74% <0.00%> (-1.62%) | ⬇️ |
| ...in/java/org/opensearch/ad/model/AnomalyResult.java | 81.60% <0.00%> (-1.34%) | ⬇️ |
| .../main/java/org/opensearch/ad/ml/CheckpointDao.java | 69.55% <0.00%> (-0.65%) | ⬇️ |
| ...ain/java/org/opensearch/ad/task/ADTaskManager.java | 76.67% <0.00%> (-0.16%) | ⬇️ |
| .../main/java/org/opensearch/ad/NodeStateManager.java | 72.25% <0.00%> (+0.64%) | ⬆️ |

@ylwu-amzn
Collaborator

https://github.com/opensearch-project/anomaly-detection/runs/5668411828?check_suite_focus=true
Seems testRestartHCADDetector is flaky

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.ad.e2e.DetectionResultEvalutationIT.testRestartHCADDetector" -Dtests.seed=D2A20EB356E8BD44 -Dtests.security.manager=false -Dtests.locale=hi-IN -Dtests.timezone=Africa/Windhoek -Druntime.java=14
org.opensearch.ad.e2e.DetectionResultEvalutationIT > testRestartHCADDetector FAILED
    java.lang.AssertionError: failed all retries
        at __randomizedtesting.SeedInfo.seed([D2A20EB356E8BD44:E96CEBE21FDC2346]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.ad.e2e.DetectionResultEvalutationIT.testRestartHCADDetector(DetectionResultEvalutationIT.java:614)

@kaituo
Collaborator

kaituo commented Mar 25, 2022

will debug and fix.

@kaituo
Collaborator

kaituo commented Mar 28, 2022

The fix will be in #456. Closing this PR.

@kaituo kaituo closed this Mar 28, 2022