the dev of [FEATURE]Auto reload model when cluster rebooted/node rejoin #711

Merged: 30 commits, Feb 20, 2023
Showing changes from 24 commits
77a1ffb
[wjunshen] #N/A feat: fix after the latest rebase
wujunshen Jan 25, 2023
9f50458
[wjunshen] #N/A feat: fix after rebase
wujunshen Jan 25, 2023
ddfb117
[wjunshen] #N/A feat: fix after rebase
wujunshen Jan 25, 2023
0c235a6
[wjunshen] #N/A feat: fix after rebase
wujunshen Jan 25, 2023
35a3703
[wjunshen] #N/A feat: fix after the latest rebase
wujunshen Jan 25, 2023
350faed
Increment version to 2.6.0-SNAPSHOT (#671)
opensearch-trigger-bot[bot] Jan 25, 2023
5126aba
fix profile API in example doc (#712)
ylwu-amzn Jan 25, 2023
c22980c
change model url to public repo in text embedding model example doc (…
ylwu-amzn Jan 26, 2023
a0f1df5
Enhance profile API to add model centric result controlled by view pa…
zane-neo Jan 31, 2023
f62ad71
add planning work nodes to model (#715)
ylwu-amzn Jan 31, 2023
eaf794d
skip running syncup job if no model index (#717)
ylwu-amzn Jan 31, 2023
3f710da
refactor: add DL model class (#722)
ylwu-amzn Feb 6, 2023
da42086
tune model config: change pooling mode to optional (#724)
ylwu-amzn Feb 6, 2023
699e06a
[wjunshen] #N/A feat: make the log readable
wujunshen Feb 8, 2023
0365674
[wjunshen] #N/A feat: add error log
wujunshen Feb 8, 2023
d779f8c
[wjunshen] #N/A feat: Refer to PR #717,just checking if index exists …
wujunshen Feb 9, 2023
9fa1025
[wjunshen] #N/A feat: change RunTimeException to MLException
wujunshen Feb 9, 2023
beef20f
[wjunshen] #N/A feat: also consider COMPLETED_WITH_ERROR
wujunshen Feb 9, 2023
5f3c2cc
[wjunshen] #N/A feat: remove ML_MODEL_RELOAD_MAX_RETRY_TIMES in Commo…
wujunshen Feb 9, 2023
facc4a1
[wjunshen] #N/A feat: remove Result class
wujunshen Feb 9, 2023
7356a82
[wjunshen] #N/A feat: change "reload" and "retry" to a full word
wujunshen Feb 9, 2023
ddab41c
[wjunshen] #N/A feat: change log info sentence
wujunshen Feb 9, 2023
76fb7f0
[wjunshen] #N/A feat: code format
wujunshen Feb 9, 2023
c0b575e
Merge branch 'opensearch-project:2.x' into 2.x
wujunshen Feb 9, 2023
78ae922
[Signed-off-by: wjunshen<[email protected]>] #N/A feat:
wujunshen Feb 16, 2023
f169ec1
[Signed-off-by: wjunshen<[email protected]>] #N/A feat:
wujunshen Feb 17, 2023
5436e6d
[Signed-off-by: wjunshen<[email protected]>] #N/A feat:
wujunshen Feb 17, 2023
71e5645
[Signed-off-by: wjunshen<[email protected]>] #N/A feat:
wujunshen Feb 17, 2023
38bf342
[Signed-off-by: wjunshen<[email protected]>] #N/A feat:
wujunshen Feb 17, 2023
467250e
[Signed-off-by: wjunshen<[email protected]>] #N/A feat:
wujunshen Feb 17, 2023
@@ -32,6 +32,10 @@ public class CommonValue {
public static final String ML_TASK_INDEX = ".plugins-ml-task";
public static final Integer ML_MODEL_INDEX_SCHEMA_VERSION = 3;
public static final Integer ML_TASK_INDEX_SCHEMA_VERSION = 1;

public static final String ML_MODEL_RELOAD_INDEX = ".plugins-ml-model-reload";
Collaborator: A general question: is it possible to achieve auto reload without a new index? Could we query the task index, find all the models loaded on the current node, and reload them after OpenSearch starts? I may have missed some earlier discussion, but it looks like the retry count and the search results could be kept in local memory.

Author: If this ML node is restarted for some unknown reason, the persisted retryTimes value still tells me how many times the models on this node have already been auto-reloaded, so I can decide whether to auto-reload this time. If the value were kept in a cache, that information would be lost and I would have to auto-reload again. Comparing the two, the former may give some performance improvement.

Collaborator: Yes, but this trades a lot more resources for that performance improvement. Is it possible to define auto reload as an ml_task and reuse the ml_task index to store retry_times? Adding 2 new fields to ml_task may be much cheaper than using a new index. Thoughts?

Author: We will elaborate after discussing this with you and Charlie.

public static final String NODE_ID_FIELD = "node_id";
public static final String MODEL_LOAD_RETRY_TIMES_FIELD = "retry_times";
public static final String USER_FIELD_MAPPING = " \""
+ CommonValue.USER
+ "\": {\n"
@@ -0,0 +1,329 @@
/*
* Copyright OpenSearch Contributors
* SPDX-License-Identifier: Apache-2.0
*/
package org.opensearch.ml.model;

import static org.opensearch.common.xcontent.XContentParserUtils.ensureExpectedToken;
import static org.opensearch.ml.common.CommonValue.ML_MODEL_RELOAD_INDEX;
import static org.opensearch.ml.common.CommonValue.ML_TASK_INDEX;
import static org.opensearch.ml.common.CommonValue.MODEL_LOAD_RETRY_TIMES_FIELD;
import static org.opensearch.ml.common.CommonValue.NODE_ID_FIELD;
import static org.opensearch.ml.settings.MLCommonsSettings.ML_COMMONS_MODEL_AUTO_RELOAD_ENABLE;
import static org.opensearch.ml.settings.MLCommonsSettings.ML_MODEL_RELOAD_MAX_RETRY_TIMES;
import static org.opensearch.ml.utils.MLNodeUtils.createXContentParserFromRegistry;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;

import lombok.extern.log4j.Log4j2;

import org.apache.commons.lang3.exception.ExceptionUtils;
import org.opensearch.action.ActionListener;
import org.opensearch.action.StepListener;
import org.opensearch.action.index.IndexAction;
import org.opensearch.action.index.IndexRequestBuilder;
import org.opensearch.action.index.IndexResponse;
import org.opensearch.action.search.SearchAction;
import org.opensearch.action.search.SearchRequestBuilder;
import org.opensearch.action.search.SearchResponse;
import org.opensearch.action.support.WriteRequest;
import org.opensearch.client.Client;
import org.opensearch.cluster.service.ClusterService;
import org.opensearch.common.settings.Settings;
import org.opensearch.common.util.CollectionUtils;
import org.opensearch.common.xcontent.NamedXContentRegistry;
import org.opensearch.common.xcontent.XContentParser;
import org.opensearch.index.IndexNotFoundException;
import org.opensearch.index.query.QueryBuilder;
import org.opensearch.index.query.QueryBuilders;
import org.opensearch.ml.cluster.DiscoveryNodeHelper;
import org.opensearch.ml.common.MLTask;
import org.opensearch.ml.common.exception.MLException;
import org.opensearch.ml.common.transport.load.MLLoadModelAction;
import org.opensearch.ml.common.transport.load.MLLoadModelRequest;
import org.opensearch.ml.utils.MLNodeUtils;
import org.opensearch.rest.RestStatus;
import org.opensearch.search.SearchHit;
import org.opensearch.search.builder.SearchSourceBuilder;
import org.opensearch.search.sort.FieldSortBuilder;
import org.opensearch.search.sort.SortBuilder;
import org.opensearch.search.sort.SortOrder;
import org.opensearch.threadpool.ThreadPool;

import com.google.common.annotations.VisibleForTesting;

/**
* Manager class for ML models and nodes. It contains ML model auto reload operations etc.
*/
@Log4j2
public class MLModelAutoReloader {

private final Client client;
private final ClusterService clusterService;
private final NamedXContentRegistry xContentRegistry;
private final DiscoveryNodeHelper nodeHelper;
private final ThreadPool threadPool;
private volatile Boolean enableAutoReloadModel;
private volatile Integer autoReloadMaxRetryTimes;

/**
* constructor method, init all the params necessary for model auto reloading
Collaborator: The character after "constructor method" here is not US-ASCII, which will cause the infra team's CI workflow to fail. See #736.

*
* @param clusterService clusterService
* @param threadPool threadPool
* @param client client
* @param xContentRegistry xContentRegistry
* @param nodeHelper nodeHelper
* @param settings settings
*/
public MLModelAutoReloader(
ClusterService clusterService,
ThreadPool threadPool,
Client client,
NamedXContentRegistry xContentRegistry,
DiscoveryNodeHelper nodeHelper,
Settings settings
) {
this.clusterService = clusterService;
this.client = client;
this.xContentRegistry = xContentRegistry;
this.nodeHelper = nodeHelper;
this.threadPool = threadPool;

enableAutoReloadModel = ML_COMMONS_MODEL_AUTO_RELOAD_ENABLE.get(settings);
autoReloadMaxRetryTimes = ML_MODEL_RELOAD_MAX_RETRY_TIMES.get(settings);
clusterService
.getClusterSettings()
.addSettingsUpdateConsumer(ML_COMMONS_MODEL_AUTO_RELOAD_ENABLE, it -> enableAutoReloadModel = it);

clusterService.getClusterSettings().addSettingsUpdateConsumer(ML_MODEL_RELOAD_MAX_RETRY_TIMES, it -> autoReloadMaxRetryTimes = it);
}

/**
* the main method: model auto reloading
*/
public void autoReloadModel() {
log.info("auto reload model enabled: {} ", enableAutoReloadModel);

// if we don't need to reload automatically, just return without doing anything
if (!enableAutoReloadModel) {
return;
}

// At opensearch startup, get local node id, if not ml node,we ignored, just return without doing anything
if (!MLNodeUtils.isMLNode(clusterService.localNode())) {
return;
}

String localNodeId = clusterService.localNode().getId();
// auto reload all models of this local ml node
threadPool.generic().submit(() -> {
Collaborator: Please change this thread pool to UPLOAD_THREAD_POOL in MachineLearningPlugin, which is dedicated to uploading models.

Author: Yes, Yaliang mentioned this as well. I have committed the latest code.

try {
autoReloadModelByNodeId(localNodeId);
} catch (ExecutionException | InterruptedException e) {
log
.error(
"the model auto-reloading has exception,and the root cause message is: {}",
ExceptionUtils.getRootCauseMessage(e)
Collaborator: How about printing the full exception stack trace here? Printing only the root cause makes this hard to debug.

Author: Can I use ExceptionUtils.getMessage(e)?

Author: I thought about it for a while and eventually changed it to ExceptionUtils.getStackTrace(e).

Collaborator (zane-neo, Feb 17, 2023): Printing the entire exception stack is useful and convenient for locating and debugging issues. You can change this to log.error("the model auto-reloading has exception,and the root cause message is: {}", e)

Author: Cool, I will modify it as you suggested.

);
throw new MLException(e);
}
});
}
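
To illustrate the logging thread above, here is a minimal, self-contained sketch of the difference between logging only a root-cause message and handing the throwable itself to the logger. It uses java.util.logging as a stand-in for the Log4j2 logger in the diff, so the exact call shapes differ from the PR. ReloadLoggingSketch, CapturingHandler, and both helper methods are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class ReloadLoggingSketch {

    /** Hypothetical handler that keeps records in memory so they can be inspected. */
    static final class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            records.add(record);
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }

    /** Logs only the root-cause message; the stack trace never reaches the handler. */
    static void logRootCauseOnly(Logger log, Throwable e) {
        Throwable root = e;
        while (root.getCause() != null) {
            root = root.getCause();
        }
        log.log(Level.SEVERE, "model auto-reload failed, root cause: " + root.getMessage());
    }

    /** Passes the throwable along, so a handler can render the whole cause chain. */
    static void logWithThrowable(Logger log, Throwable e) {
        log.log(Level.SEVERE, "model auto-reload failed", e);
    }

    public static void main(String[] args) {
        Logger log = Logger.getAnonymousLogger();
        log.setUseParentHandlers(false);
        CapturingHandler handler = new CapturingHandler();
        log.addHandler(handler);

        Exception e = new ExecutionException("wrapper", new IllegalStateException("disk full"));
        logRootCauseOnly(log, e);
        logWithThrowable(log, e);

        for (LogRecord record : handler.records) {
            System.out.println(record.getMessage() + " | thrown attached: " + (record.getThrown() != null));
        }
    }
}
```

In this sketch only the second call leaves the exception attached to the log record, which is what makes the full stack trace available to whatever formats the output.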

/**
* auto reload all the models under the node id<br> the node must be a ml node<br>
*
* @param localNodeId node id
*/
@VisibleForTesting
void autoReloadModelByNodeId(String localNodeId) throws ExecutionException, InterruptedException {
StepListener<SearchResponse> queryTaskStep = new StepListener<>();
StepListener<SearchResponse> getRetryTimesStep = new StepListener<>();
StepListener<IndexResponse> saveLatestRetryTimesStep = new StepListener<>();

if (!clusterService.state().metadata().indices().containsKey(ML_TASK_INDEX)) {
// ML_TASK_INDEX did not exist,do nothing
return;
}

queryTask(localNodeId, ActionListener.wrap(queryTaskStep::onResponse, queryTaskStep::onFailure));

getRetryTimes(localNodeId, ActionListener.wrap(getRetryTimesStep::onResponse, getRetryTimesStep::onFailure));

queryTaskStep.whenComplete(searchResponse -> {
SearchHit[] hits = searchResponse.getHits().getHits();
if (CollectionUtils.isEmpty(hits)) {
return;
}

getRetryTimesStep.whenComplete(getReTryTimesResponse -> {
int retryTimes = 0;
// if getReTryTimesResponse is null,it means we get retryTimes at the first time,and the index
// .plugins-ml-model-reload doesn't exist,so we should let retryTimes be zero(init value)
// we don't do anything
// if getReTryTimesResponse is not null,it means we have saved the value of retryTimes into the index
// .plugins-ml-model-reload,so we get the value of the field MODEL_LOAD_RETRY_TIMES_FIELD
if (getReTryTimesResponse != null) {
Map<String, Object> sourceAsMap = getReTryTimesResponse.getHits().getHits()[0].getSourceAsMap();
retryTimes = (Integer) sourceAsMap.get(MODEL_LOAD_RETRY_TIMES_FIELD);
}

// According to the node id to get retry times, if more than the max retry times, don't need to retry
// that the number of unsuccessful reload has reached the maximum number of times, do not need to reload
if (retryTimes > autoReloadMaxRetryTimes) {
log.info("Node: {} has reached to the max retry limit, failed to load models", localNodeId);
return;
Collaborator: Before returning, should we check how long the node has been in the max-retry state and reset the count to 0 after a substantial time? As it stands, it looks to me like the node will never reload again once it reaches the maximum retry times.

Author: On your first comment: I found that the ml_task index and the ml_model index are both defined in the ml_task index, so if I add 2 new fields to ml_task, the ml_model index will get these 2 fields too. That would give the ml_model index redundant attributes.
On your second comment: when we discussed the design earlier, we agreed that once the maximum retry times is reached, the model needs to be loaded manually instead of being reloaded automatically.

Collaborator: Cool, let's keep this logic then. But we should try to define a new type of ML task for auto reload and reuse MLTask to store the max_retry field, etc.

Author: Thanks.

}

try (XContentParser parser = createXContentParserFromRegistry(xContentRegistry, hits[0].getSourceRef())) {
ensureExpectedToken(XContentParser.Token.START_OBJECT, parser.nextToken(), parser);
MLTask mlTask = MLTask.parse(parser);

autoReloadModelByNodeAndModelId(localNodeId, mlTask.getModelId());

// if reload the model successfully,the number of unsuccessful reload should be reset to zero.
retryTimes = 0;
} catch (MLException e) {
retryTimes++;
log.error("Can't auto reload model in node id {} ,has tried {} times\nThe reason is:{}", localNodeId, retryTimes, e);
}

// Store the latest value of the retryTimes and node id under the index ".plugins-ml-model-reload"
saveLatestRetryTimes(
localNodeId,
retryTimes,
ActionListener.wrap(saveLatestRetryTimesStep::onResponse, saveLatestRetryTimesStep::onFailure)
);
}, getRetryTimesStep::onFailure);
}, queryTaskStep::onFailure);

saveLatestRetryTimesStep.whenComplete(response -> log.info("successfully complete all steps"), saveLatestRetryTimesStep::onFailure);
}
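
The retry bookkeeping in the method above can be sketched in isolation. This is a minimal sketch under stated assumptions, not the PR's implementation: ReloadRetryPolicy and its methods are hypothetical names, and the persisted counter is modelled as a plain int rather than a document in the reload index. The comparison mirrors the diff's `retryTimes > autoReloadMaxRetryTimes` check, so a stored count equal to the limit still allows one more attempt.

```java
final class ReloadRetryPolicy {

    private final int maxRetryTimes;

    ReloadRetryPolicy(int maxRetryTimes) {
        if (maxRetryTimes < 0) {
            throw new IllegalArgumentException("maxRetryTimes must be >= 0");
        }
        this.maxRetryTimes = maxRetryTimes;
    }

    /** Reload is skipped only once the stored count exceeds the limit, as in the diff. */
    boolean shouldAttemptReload(int storedRetryTimes) {
        return storedRetryTimes <= maxRetryTimes;
    }

    /** A successful reload resets the counter; a failed one increments it. */
    int nextRetryTimes(int storedRetryTimes, boolean reloadSucceeded) {
        return reloadSucceeded ? 0 : storedRetryTimes + 1;
    }

    /** Counts the attempts made when every reload fails, starting from a stored count of 0. */
    int attemptsWhenAlwaysFailing() {
        int stored = 0;
        int attempts = 0;
        while (shouldAttemptReload(stored)) {
            attempts++;
            stored = nextRetryTimes(stored, false);
        }
        return attempts;
    }

    public static void main(String[] args) {
        ReloadRetryPolicy policy = new ReloadRetryPolicy(3);
        System.out.println("attempts with limit 3: " + policy.attemptsWhenAlwaysFailing());
    }
}
```

Once the counter passes the limit nothing in this sketch resets it, which matches the design the author describes in the thread: after that point the model has to be loaded manually.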

/**
* auto reload 1 model under the node id
*
* @param localNodeId node id
* @param modelId model id
*/
@VisibleForTesting
void autoReloadModelByNodeAndModelId(String localNodeId, String modelId) throws MLException {
String[] allNodeIds = nodeHelper.getAllNodeIds();
Collaborator: nodeHelper.getAllNodeIds() returns all nodes, not just ML nodes. Should we reload the model on all nodes?

Author: OK, I will modify it so that the collection contains only the IDs of ML nodes.

List<String> allNodeIdList = new ArrayList<>(List.of(allNodeIds));
if (!allNodeIdList.contains(localNodeId)) {
allNodeIdList.add(localNodeId);
}
MLLoadModelRequest mlLoadModelRequest = new MLLoadModelRequest(modelId, allNodeIdList.toArray(new String[] {}), false, false);

client
.execute(
MLLoadModelAction.INSTANCE,
mlLoadModelRequest,
ActionListener
.wrap(response -> log.info("the model {} is auto reloading under the node {} ", modelId, localNodeId), exception -> {
log.error("fail to reload model " + modelId + " under the node " + localNodeId + "\nthe reason is: " + exception);
throw new MLException(
"fail to reload model " + modelId + " under the node " + localNodeId + "\nthe reason is: " + exception
);
})
);
}
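
As a rough illustration of the change agreed in the thread above (target only ML nodes), the following sketch filters a node list by role before building the array of target IDs. Everything here is hypothetical: the Node class, the reloadTargets helper, and the role name "ml" are assumptions made for the example and are not taken from the PR or from the DiscoveryNodeHelper API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class MlNodeFilterSketch {

    /** Assumed role name for ML nodes; treat it as a placeholder. */
    static final String ML_ROLE = "ml";

    /** Hypothetical, simplified view of a cluster node. */
    static final class Node {
        final String id;
        final Set<String> roles;

        Node(String id, Set<String> roles) {
            this.id = id;
            this.roles = roles;
        }
    }

    /** Keeps only nodes carrying the ML role, then makes sure the local node is included. */
    static String[] reloadTargets(List<Node> nodes, String localNodeId) {
        List<String> ids = new ArrayList<>();
        for (Node node : nodes) {
            if (node.roles.contains(ML_ROLE)) {
                ids.add(node.id);
            }
        }
        if (!ids.contains(localNodeId)) {
            ids.add(localNodeId);
        }
        return ids.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node("data-1", Set.of("data")),
            new Node("ml-1", Set.of("ml")),
            new Node("ml-2", Set.of("ml", "data"))
        );
        System.out.println(String.join(",", reloadTargets(nodes, "ml-2")));
    }
}
```

The final membership check copies the behaviour in the diff, where the local node ID is appended if the helper did not return it.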

/**
* query task index, and get the result of "task_type"="LOAD_MODEL" and "state"="COMPLETED" and
* "worker_node" match nodeId
*
* @param localNodeId one of query condition
*/
@VisibleForTesting
void queryTask(String localNodeId, ActionListener<SearchResponse> searchResponseActionListener) {
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder().from(0).size(1);
Collaborator: This query returns only the latest load-model task. If a user has 3 models, it returns 1 task for 1 model, and the tasks for the other 2 models are not returned, so we would reload just 1 model rather than all 3. Is that correct?


QueryBuilder queryBuilder = QueryBuilders
.boolQuery()
.must(QueryBuilders.matchPhraseQuery("task_type", "LOAD_MODEL"))
.must(QueryBuilders.matchPhraseQuery("worker_node", localNodeId))
.must(
QueryBuilders
.boolQuery()
.should(QueryBuilders.matchPhraseQuery("state", "COMPLETED"))
.should(QueryBuilders.matchPhraseQuery("state", "COMPLETED_WITH_ERROR"))
);
searchSourceBuilder.query(queryBuilder);

SortBuilder<FieldSortBuilder> sortBuilderOrder = new FieldSortBuilder("create_time").order(SortOrder.DESC);
searchSourceBuilder.sort(sortBuilderOrder);

SearchRequestBuilder searchRequestBuilder = new SearchRequestBuilder(client, SearchAction.INSTANCE)
.setIndices(ML_TASK_INDEX)
.setSource(searchSourceBuilder);

searchRequestBuilder.execute(ActionListener.wrap(searchResponseActionListener::onResponse, exception -> {
log.error("index {} not found, the reason is {}", ML_TASK_INDEX, exception);
throw new IndexNotFoundException("index " + ML_TASK_INDEX + " not found");
Collaborator: We can't confirm that this is an IndexNotFoundException. Please throw an MLException instead, and wrap the original exception in it like this: throw new MLException(exception)

Author: OK, I have changed it.

}));
}
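
The reviewer's question about size(1) can be made concrete with a small in-memory sketch: given several load-model task records, collect every distinct model ID that completed on a node instead of only the most recent task. This is an illustration under stated assumptions. TaskRecord and modelIdsToReload are hypothetical names, the records are plain Java objects rather than search hits, and worker-node matching is simplified to string equality.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class LoadedModelLookupSketch {

    /** States treated as reloadable, matching the two states in the query above. */
    static final Set<String> RELOADABLE_STATES = Set.of("COMPLETED", "COMPLETED_WITH_ERROR");

    /** Hypothetical, simplified view of a task document. */
    static final class TaskRecord {
        final String modelId;
        final String taskType;
        final String workerNode;
        final String state;
        final long createTime;

        TaskRecord(String modelId, String taskType, String workerNode, String state, long createTime) {
            this.modelId = modelId;
            this.taskType = taskType;
            this.workerNode = workerNode;
            this.state = state;
            this.createTime = createTime;
        }
    }

    /** Returns distinct model IDs for the node, ordered by their most recent load task. */
    static List<String> modelIdsToReload(List<TaskRecord> tasks, String nodeId) {
        List<TaskRecord> matching = new ArrayList<>();
        for (TaskRecord task : tasks) {
            if ("LOAD_MODEL".equals(task.taskType)
                && nodeId.equals(task.workerNode)
                && RELOADABLE_STATES.contains(task.state)) {
                matching.add(task);
            }
        }
        // Newest first, so the first occurrence of each model ID is its latest task.
        matching.sort(Comparator.comparingLong((TaskRecord t) -> t.createTime).reversed());
        Set<String> ids = new LinkedHashSet<>();
        for (TaskRecord task : matching) {
            ids.add(task.modelId);
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        List<TaskRecord> tasks = List.of(
            new TaskRecord("m1", "LOAD_MODEL", "node-a", "COMPLETED", 1L),
            new TaskRecord("m2", "LOAD_MODEL", "node-a", "COMPLETED_WITH_ERROR", 3L),
            new TaskRecord("m1", "LOAD_MODEL", "node-a", "COMPLETED", 5L)
        );
        System.out.println(modelIdsToReload(tasks, "node-a"));
    }
}
```

With a result size of 1, only the first entry of this list would ever be seen, which is the limitation the reviewer points out.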

/**
* get retry times from the index ".plugins-ml-model-reload" by 1 ml node
*
* @param localNodeId the filter condition to query
*/
@VisibleForTesting
void getRetryTimes(String localNodeId, ActionListener<SearchResponse> searchResponseActionListener) {
if (!clusterService.state().metadata().indices().containsKey(ML_MODEL_RELOAD_INDEX)) {
// ML_MODEL_RELOAD_INDEX did not exist, it means it is our first time to do model auto-reloading operation
searchResponseActionListener.onResponse(null);
}

SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.fetchSource(new String[] { MODEL_LOAD_RETRY_TIMES_FIELD }, null);
QueryBuilder queryBuilder = QueryBuilders.idsQuery().addIds(localNodeId);
searchSourceBuilder.query(queryBuilder);

SearchRequestBuilder searchRequestBuilder = new SearchRequestBuilder(client, SearchAction.INSTANCE)
.setIndices(ML_MODEL_RELOAD_INDEX)
.setSource(searchSourceBuilder);

searchRequestBuilder.execute(ActionListener.wrap(searchResponse -> {
SearchHit[] hits = searchResponse.getHits().getHits();
if (CollectionUtils.isEmpty(hits)) {
searchResponseActionListener.onResponse(null);
return;
}

searchResponseActionListener.onResponse(searchResponse);
}, searchResponseActionListener::onFailure));
}
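
One detail of the method above is worth spelling out: the branch for a missing reload index calls onResponse(null) but has no return, so execution continues into the search. The sketch below shows the general consequence of that pattern, a listener being completed more than once, and how an early return avoids it. It is a simplified model, not the PR's code: EarlyReturnSketch and its methods are hypothetical, plain Consumer callbacks stand in for ActionListener, and the assumption that the search reports a failure for a missing index is part of the sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class EarlyReturnSketch {

    /** Completes the callback for a missing index, then falls through to the search. */
    static void lookupWithoutReturn(boolean indexExists, Consumer<Integer> onResponse, Consumer<Exception> onFailure) {
        if (!indexExists) {
            onResponse.accept(null);
            // No return here, so the search below still runs.
        }
        search(indexExists, onResponse, onFailure);
    }

    /** Same logic, but stops after completing the callback for a missing index. */
    static void lookupWithReturn(boolean indexExists, Consumer<Integer> onResponse, Consumer<Exception> onFailure) {
        if (!indexExists) {
            onResponse.accept(null);
            return;
        }
        search(indexExists, onResponse, onFailure);
    }

    /** Stand-in for the search: assumed to fail when the index is missing. */
    private static void search(boolean indexExists, Consumer<Integer> onResponse, Consumer<Exception> onFailure) {
        if (!indexExists) {
            onFailure.accept(new IllegalStateException("index not found"));
            return;
        }
        onResponse.accept(0);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        lookupWithoutReturn(false, r -> calls.incrementAndGet(), e -> calls.incrementAndGet());
        System.out.println("callbacks without return: " + calls.get());

        calls.set(0);
        lookupWithReturn(false, r -> calls.incrementAndGet(), e -> calls.incrementAndGet());
        System.out.println("callbacks with return: " + calls.get());
    }
}
```

Completing a listener exactly once keeps downstream steps from seeing both a response and a failure for the same request.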

/**
* save retry times
* @param localNodeId node id
* @param retryTimes actual retry times
*/
@VisibleForTesting
void saveLatestRetryTimes(String localNodeId, int retryTimes, ActionListener<IndexResponse> indexResponseActionListener) {
Map<String, Object> content = new HashMap<>(2);
content.put(NODE_ID_FIELD, localNodeId);
content.put(MODEL_LOAD_RETRY_TIMES_FIELD, retryTimes);

IndexRequestBuilder indexRequestBuilder = new IndexRequestBuilder(client, IndexAction.INSTANCE, ML_MODEL_RELOAD_INDEX)
.setId(localNodeId)
.setSource(content)
.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);

indexRequestBuilder.execute(ActionListener.wrap(indexResponse -> {
if (indexResponse.status() == RestStatus.CREATED || indexResponse.status() == RestStatus.OK) {
log.info("node id:{} insert retry times successfully", localNodeId);
indexResponseActionListener.onResponse(indexResponse);
return;
}
indexResponseActionListener.onFailure(new MLException("node id:" + localNodeId + " insert retry times unsuccessfully"));
}, indexResponseActionListener::onFailure));
Collaborator: Please add a log here for when indexRequestBuilder.execute receives an exception.

Author: Done.

}
}
@@ -99,6 +99,7 @@
import org.opensearch.ml.engine.algorithms.sample.LocalSampleCalculator;
import org.opensearch.ml.indices.MLIndicesHandler;
import org.opensearch.ml.indices.MLInputDatasetHandler;
import org.opensearch.ml.model.MLModelAutoReloader;
import org.opensearch.ml.model.MLModelCacheHelper;
import org.opensearch.ml.model.MLModelManager;
import org.opensearch.ml.rest.RestMLCreateModelMetaAction;
@@ -175,6 +176,8 @@ public class MachineLearningPlugin extends Plugin implements ActionPlugin {
private MLModelChunkUploader mlModelChunkUploader;
private MLEngine mlEngine;

private MLModelAutoReloader mlModelAutoReloader;

private Client client;
private ClusterService clusterService;
private ThreadPool threadPool;
@@ -352,6 +355,9 @@ public Collection<Object> createComponents(
mlIndicesHandler
);

mlModelAutoReloader = new MLModelAutoReloader(clusterService, threadPool, client, xContentRegistry, nodeHelper, settings);
mlModelAutoReloader.autoReloadModel();

return ImmutableList
.of(
mlEngine,
@@ -373,7 +379,8 @@ public Collection<Object> createComponents(
modelHelper,
mlCommonsClusterEventListener,
clusterManagerEventListener,
- mlCircuitBreakerService
+ mlCircuitBreakerService,
+ mlModelAutoReloader
);
}

@@ -513,7 +520,9 @@ public List<Setting<?>> getSettings() {
MLCommonsSettings.ML_COMMONS_MAX_ML_TASK_PER_NODE,
MLCommonsSettings.ML_COMMONS_MAX_LOAD_MODEL_TASKS_PER_NODE,
MLCommonsSettings.ML_COMMONS_TRUSTED_URL_REGEX,
- MLCommonsSettings.ML_COMMONS_NATIVE_MEM_THRESHOLD
+ MLCommonsSettings.ML_COMMONS_NATIVE_MEM_THRESHOLD,
+ MLCommonsSettings.ML_COMMONS_MODEL_AUTO_RELOAD_ENABLE,
+ MLCommonsSettings.ML_MODEL_RELOAD_MAX_RETRY_TIMES
);
return settings;
}
@@ -56,4 +56,10 @@ private MLCommonsSettings() {}

public static final Setting<Integer> ML_COMMONS_NATIVE_MEM_THRESHOLD = Setting
.intSetting("plugins.ml_commons.native_memory_threshold", 90, 0, 100, Setting.Property.NodeScope, Setting.Property.Dynamic);

public static final Setting<Boolean> ML_COMMONS_MODEL_AUTO_RELOAD_ENABLE = Setting
.boolSetting("plugins.ml_commons.model.autoreload.enable", false, Setting.Property.NodeScope, Setting.Property.Dynamic);

public static final Setting<Integer> ML_MODEL_RELOAD_MAX_RETRY_TIMES = Setting
.intSetting("plugins.ml_commons.model.autoreload.retrytimes", 3, Setting.Property.NodeScope, Setting.Property.Dynamic);
}