[Extensions] Migrates AnomalyResultAction, EntityResultAction, RCFResultAction #856

Merged (17 commits) on Apr 21, 2023

@joshpalis joshpalis commented Apr 5, 2023

Description

Part of opensearch-project/opensearch-sdk-java#383

This PR registers the actions that execute an anomaly detector and facilitate the indexing of anomaly results. Currently, job configuration, scheduling, execution, and result indexing are complete for SINGLE ENTITY detectors running REAL-TIME analysis.

Example logs from a single-entity real-time analysis job execution:

02:09:44.780 [httpclient-dispatch-1] INFO  org.opensearch.ad.transport.AnomalyResultTransportAction - Sending RCF request to ad-extension-1 for model OBMtc4cBmBpoB0ddeoRK_model_rcf_0
02:09:44.780 [httpclient-dispatch-1] INFO  org.opensearch.ad.transport.RCFResultTransportAction - Serve rcf request for OBMtc4cBmBpoB0ddeoRK_model_rcf_0
02:09:44.788 [httpclient-dispatch-1] DEBUG org.opensearch.ad.AnomalyDetectorJobRunner - Update latest realtime task for SINGLE detector OBMtc4cBmBpoB0ddeoRK, total updates: 1444
02:09:44.808 [httpclient-dispatch-5] DEBUG org.opensearch.ad.transport.handler.AnomalyIndexHandler - Succeed in saving OBMtc4cBmBpoB0ddeoRK
02:09:44.809 [httpclient-dispatch-1] INFO  org.opensearch.ad.AnomalyDetectorJobRunner - Released lock for AD job OBMtc4cBmBpoB0ddeoRK
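
The log sequence above ends with the lock for the AD job being released. A minimal sketch of that acquire, run, release pattern is shown here. `JobLockSketch` and its in-memory lock set are hypothetical stand-ins for illustration only; they are not the Job Scheduler lock service API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a per-job lock (not the Job Scheduler API).
final class JobLockSketch {
    private final Set<String> heldLocks = ConcurrentHashMap.newKeySet();

    /** Returns true if the lock for this job was free and is now held. */
    boolean acquire(String jobId) {
        return heldLocks.add(jobId);
    }

    void release(String jobId) {
        heldLocks.remove(jobId);
    }

    boolean isHeld(String jobId) {
        return heldLocks.contains(jobId);
    }

    /** Runs one job iteration; returns false if it was skipped because the lock is held. */
    boolean runJob(String jobId, Runnable detectorRun) {
        if (!acquire(jobId)) {
            return false;
        }
        try {
            detectorRun.run();
            return true;
        } finally {
            // Always release, so the next scheduled execution can acquire the lock.
            release(jobId);
        }
    }
}
```

Releasing in a `finally` block mirrors the behavior described in this PR, where the lock is released even when the run ends with a "model not ready" exception.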

Preparatory work for MULTI ENTITY detectors for REAL-TIME analysis has been done. This PR enables creating a multi-entity detector and scheduling and executing its job. The remaining item is to generate and index the anomaly results for HCAD (the blockers for multi-entity real-time analysis are noted in the issue). In this state, when a multi-entity detector job runs, the job execution ends while attempting to index an anomaly result and begins again in the next iteration.

Note, the hashring has been removed from the RCFResultTransportAction and the AnomalyResultTransportAction. All requests to run the detector will be executed by the AD extension node.
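
To illustrate the routing change, here is a rough sketch that contrasts hash-based node selection with always selecting the extension node. `RoutingSketch`, `NodeSelector`, and both factory methods are hypothetical names, and the hash-based selector is a simplified modulo scheme rather than the plugin's actual hash ring.

```java
import java.util.List;

// Hypothetical sketch: names and logic are illustrative, not the plugin's API.
final class RoutingSketch {
    interface NodeSelector {
        String nodeFor(String modelId);
    }

    /** Simplified stand-in for hash-based model ownership (not real consistent hashing). */
    static NodeSelector hashBased(List<String> nodeIds) {
        return modelId -> nodeIds.get(Math.floorMod(modelId.hashCode(), nodeIds.size()));
    }

    /** With the hash ring removed, every model is served by the single extension node. */
    static NodeSelector extensionOnly(String extensionNodeId) {
        return modelId -> extensionNodeId;
    }
}
```

For example, `RoutingSketch.extensionOnly("ad-extension-1").nodeFor(modelId)` returns `ad-extension-1` for any model ID, which matches the "Sending RCF request to ad-extension-1" log line.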

Overview of the job execution process for single entity real time analysis

  1. Ingest data and create a single-entity detector for real-time analysis.
  2. Start the detector for real-time analysis, which schedules the AD job.
  3. ADJobRunner receives a runJob() request from Job Scheduler and sends a request to acquire a lock. The first job execution then begins and triggers the AnomalyResultAction.
  4. The enabled features are processed, and the action verifies that data is available within the detection window (that is, it checks whether any entries have an @timestamp value that falls within the detection window).
  5. RCFResultAction is triggered via the client for the extension node.
  6. modelManager is used to retrieve the TRcfResult for the model ID associated with the detector.
  7. checkpointDao is used to retrieve the TRCFModel. (This returns nothing at this point, because the checkpoint index is not initialized and no model has been trained.)
  8. Attempting to process the model throws a ResourceNotFoundException, since the checkpoint is not initialized and the model is not present.
  9. The failure is returned to rcfActionListener.onFailure(), which handles the prediction failure and then invokes coldStartIfNoModel.
  10. The exception is checked to see whether it is an instance of the ResourceNotFoundException thrown by the model manager.
  11. If no previous exceptions were thrown by other cold start attempts, cold start is triggered.
  12. Within cold start, the model manager trains a model.
  13. The checkpoint DAO takes the trained model, determines whether the checkpoint index exists (initializing it if not), and indexes the model checkpoint.
  14. If no exception occurs during cold start, rcfActionListener triggers the AnomalyResultTransportAction listener.onFailure() to return a no-model exception (model not ready).
  15. The exception is returned to the ADJobRunner, which indexes it into the AnomalyResultIndex (creating the index if it is not present).
  16. The lock is released and the next execution of the job is scheduled.
  17. The next execution is triggered and steps 3 - 7 repeat, but this time the model is retrieved from the checkpoint index.
  18. The model is used to compute the ThresholdedRandomCutForest result, which is returned as an RCFResultResponse.
  19. This response is eventually indexed into the AnomalyResult index, and the next job execution is scheduled.
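
The control flow in steps 6 - 18 can be sketched roughly as follows. This is a simplified illustration under stated assumptions: `ColdStartSketch`, `ModelNotFoundException`, and the in-memory checkpoint map are hypothetical, and the "model" is just the mean of the training data rather than a random cut forest. The check for earlier cold start failures (step 11) is omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of "fail on missing model, cold start, succeed on the next run".
final class ColdStartSketch {
    /** Stand-in for the exception raised when no checkpoint or model exists. */
    static final class ModelNotFoundException extends RuntimeException {
        ModelNotFoundException(String message) {
            super(message);
        }
    }

    /** Stand-in checkpoint store: model ID -> trained model (here only a mean value). */
    private final Map<String, Double> checkpoints = new HashMap<>();

    /** Steps 6 - 8: look up the model and fail if it has not been trained yet. */
    double score(String modelId, double point) {
        Double mean = checkpoints.get(modelId);
        if (mean == null) {
            throw new ModelNotFoundException("no model for " + modelId);
        }
        return Math.abs(point - mean); // toy score: distance from the training mean
    }

    /** Steps 12 - 13: train on historical data and write the checkpoint. */
    void coldStart(String modelId, double[] history) {
        double sum = 0;
        for (double value : history) {
            sum += value;
        }
        checkpoints.put(modelId, sum / history.length);
    }

    /** One job execution: a score, or empty when the model is not ready yet. */
    Optional<Double> runOnce(String modelId, double point, double[] history) {
        try {
            return Optional.of(score(modelId, point));
        } catch (ModelNotFoundException e) {
            // Steps 9 - 14: the failure triggers cold start; this run still reports "not ready".
            coldStart(modelId, history);
            return Optional.empty();
        }
    }
}
```

Calling `runOnce` twice with the same model ID reproduces the behavior described above: the first call returns an empty result (model not ready) and the second returns a score from the checkpointed model.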

How to run data generation script

  1. Modify this line and set the refresh interval to "1s"
  2. Follow this README to install the necessary requirements
  3. Start AnomalyDetectionExtension and OpenSearch with Job Scheduler installed (enable feature flag)
  4. Start data ingestion into the local cluster: python3 generate-cosine-data-multi-entity.py -ep localhost -i server_log --shards 5 -t 10 --no-security -nh 5 -np 5 --ingestion-frequency 10 --points 30000 (takes about 30 seconds)

Detector Configurations Used to test real time analysis

Once the data is ingested, we can create a detector for the created index server_log with the following configurations. The data generation script provides test data with two feature fields (cpuTime and jvmGcTime) and two category fields (host and process) that can be used for HCAD. Once the detector is created, we can start it.

Single-Entity Detector:

{
  "name": "test-detector",
  "description": "Test detector",
  "time_field": "@timestamp",
  "indices": [
    "server_log"
  ],
  "feature_attributes": [
    {
      "feature_name": "test",
      "feature_enabled": true,
      "aggregation_query": {
        "test": {
          "avg": {
            "field": "jvmGcTime"
          }
        }
      }
    }
  ],
  "detection_interval": {
    "period": {
      "interval": 1,
      "unit": "Minutes"
    }
  },
  "window_delay": {
    "period": {
      "interval": 1,
      "unit": "Minutes"
    }
  }
}
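
As a rough illustration of how detection_interval and window_delay relate to the detection window checked during a run, the sketch below derives a window from an execution time. It assumes the window is one interval long, is shifted back by the window delay, and is half-open at the end; `DetectionWindowSketch` is a hypothetical name and this is not the plugin's implementation.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch; the exact boundary handling in the plugin may differ.
final class DetectionWindowSketch {
    final Instant start;
    final Instant end;

    private DetectionWindowSketch(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    /** Window of one interval, shifted back by the window delay (assumption). */
    static DetectionWindowSketch forExecution(Instant executionTime, Duration interval, Duration windowDelay) {
        Instant end = executionTime.minus(windowDelay);
        return new DetectionWindowSketch(end.minus(interval), end);
    }

    /** True if the @timestamp value falls inside [start, end). */
    boolean contains(Instant timestamp) {
        return !timestamp.isBefore(start) && timestamp.isBefore(end);
    }
}
```

With a 1 minute interval and a 1 minute window delay, as in the configuration above, a run at 02:09:00 would under these assumptions look at documents with @timestamp from 02:07:00 up to, but not including, 02:08:00.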

Multi-Entity Detector:

{
  "name": "test-detector",
  "description": "Test detector",
  "time_field": "@timestamp",
  "indices": [
    "server_log"
  ],
  "feature_attributes": [
    {
      "feature_name": "test",
      "feature_enabled": true,
      "aggregation_query": {
        "test": {
          "avg": {
            "field": "cpuTime"
          }
        }
      }
    }
  ],
  "detection_interval": {
    "period": {
      "interval": 1,
      "unit": "Minutes"
    }
  },
  "window_delay": {
    "period": {
      "interval": 1,
      "unit": "Minutes"
    }
  },
  "category_field": [
    "host"
  ]
}
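
Conceptually, the category_field splits the data into one entity per host, and the feature (the average of cpuTime) is computed per entity. The sketch below shows that grouping on in-memory records. `EntityFeatureSketch` and `ServerLogEntry` are hypothetical names; in the plugin this aggregation is done by OpenSearch queries, not in application code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory illustration of per-entity feature aggregation.
final class EntityFeatureSketch {
    /** Minimal stand-in for a server_log document. */
    static final class ServerLogEntry {
        final String host;
        final double cpuTime;

        ServerLogEntry(String host, double cpuTime) {
            this.host = host;
            this.cpuTime = cpuTime;
        }
    }

    /** Groups entries by the category field (host) and averages cpuTime per entity. */
    static Map<String, Double> averageCpuTimeByHost(List<ServerLogEntry> entries) {
        return entries.stream()
            .collect(Collectors.groupingBy(e -> e.host, TreeMap::new, Collectors.averagingDouble(e -> e.cpuTime)));
    }
}
```

Each key in the returned map corresponds to one entity, which is why a multi-entity detector produces one anomaly result per host per interval instead of a single result.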

Issues Resolved

opensearch-project/opensearch-sdk-java#626

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@joshpalis joshpalis marked this pull request as ready for review April 12, 2023 22:56
@joshpalis joshpalis requested review from a team, dbwiddis, owaiskazi19 and vibrantvarun April 12, 2023 22:56
codecov-commenter commented Apr 13, 2023

Codecov Report

Merging #856 (08fb4ab) into feature/extensions (86b1084) will decrease coverage by 0.65%.
The diff coverage is 1.98%.


@@                   Coverage Diff                    @@
##             feature/extensions     #856      +/-   ##
========================================================
- Coverage                 35.93%   35.28%   -0.65%     
+ Complexity                 1926     1897      -29     
========================================================
  Files                       299      299              
  Lines                     17678    17788     +110     
  Branches                   1861     1864       +3     
========================================================
- Hits                       6352     6276      -76     
- Misses                    10870    11061     +191     
+ Partials                    456      451       -5     
Flag Coverage Δ
plugin 35.28% <1.98%> (-0.65%) ⬇️

Impacted Files Coverage Δ
...va/org/opensearch/ad/AnomalyDetectorExtension.java 0.00% <0.00%> (ø)
...va/org/opensearch/ad/AnomalyDetectorJobRunner.java 0.00% <0.00%> (ø)
...rg/opensearch/ad/AnomalyDetectorProfileRunner.java 8.42% <ø> (ø)
src/main/java/org/opensearch/ad/NodeState.java 52.00% <0.00%> (-5.78%) ⬇️
.../main/java/org/opensearch/ad/NodeStateManager.java 0.64% <0.00%> (-0.13%) ⬇️
.../org/opensearch/ad/feature/CompositeRetriever.java 0.00% <ø> (ø)
...va/org/opensearch/ad/feature/SearchFeatureDao.java 4.70% <0.00%> (ø)
.../main/java/org/opensearch/ad/ml/CheckpointDao.java 61.14% <0.00%> (ø)
...ensearch/ad/rest/RestAnomalyDetectorJobAction.java 0.00% <0.00%> (ø)
.../handler/AbstractAnomalyDetectorActionHandler.java 16.86% <0.00%> (-0.13%) ⬇️
... and 5 more

... and 6 files with indirect coverage changes

private static final Logger LOG = LogManager.getLogger(ProfileTransportAction.class);
private ModelManager modelManager;
private FeatureManager featureManager;
private CacheProvider cacheProvider;
private SDKClusterService clusterService;
Member

Rename clusterService to sdkClusterService.
Rename client to sdkRestClient.

Please keep the variable naming convention consistent.


@owaiskazi19 owaiskazi19 left a comment

Overall looks good, with a few comments.

src/main/java/org/opensearch/ad/ml/CheckpointDao.java (outdated, resolved)
@@ -8,127 +8,9 @@
* Modifications Copyright OpenSearch Contributors. See
* GitHub history for details.
*/

/* @anomaly.detection Commented until we have extension support for hashring : https://github.com/opensearch-project/opensearch-sdk-java/issues/200
Member

Same as above

Member Author

I have commented this out because of the hashring: the test sets up test nodes and registers handlers for them, and the hashring is then used to stub calls to return the test nodes. Since we're temporarily rerouting all requests back to the extension node, most of these test classes don't even apply.

Member

Not blocking this PR on the above, but we should think about the tests related to HashRing. We can modify them to target the extension node.

… found message rather than catching OpenSearchStatusException

Signed-off-by: Joshua Palis <[email protected]>

@owaiskazi19 owaiskazi19 left a comment

LGTM! If @dbwiddis can also take a look!

Comment on lines +775 to +780
'org.opensearch.ad.transport.ADResultBulkTransportAction',
'org.opensearch.ad.transport.ADResultBulkRequest',
'org.opensearch.ad.transport.ADResultBulkAction',
'org.opensearch.ad.ratelimit.ResultWriteRequest',
'org.opensearch.ad.AnomalyDetectorJobRunner.1',
'org.opensearch.ad.AnomalyDetectorJobRunner.2',
Member

Curious why these classes don't have test coverage?

@joshpalis joshpalis Apr 21, 2023

The test classes for ADResultBulkTransportAction/ResultWriteWorker have been temporarily commented out for this PR. I will handle them in the PR that enables HCAD real-time analysis, as the ADResultBulkTransportAction is used to index multi-entity anomaly results. The current PR includes some preparatory work for the HCAD workflow.

@dbwiddis dbwiddis merged commit 9c0a308 into opensearch-project:feature/extensions Apr 21, 2023