dispatch ML task to ML node first #346

ylwu-amzn · 2022-06-15T19:11:58Z

Signed-off-by: Yaliang Wu [email protected]

Description

We plan to support dynamic roles in OpenSearch. See PR opensearch-project/OpenSearch#3436 and issue opensearch-project/OpenSearch#2877.

This PR changes the task dispatcher to support ML nodes. The dispatcher first checks whether the cluster has any node with the "ml" role. If it does, ML tasks are dispatched to the "ml" nodes; otherwise they are dispatched to data nodes.
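
For readers who want the selection rule in isolation, here is a minimal sketch of it, not the code in this PR. `SimpleNode` and `EligibleNodeSelector` are hypothetical stand-ins that avoid a dependency on OpenSearch classes such as `DiscoveryNode`; only the "ml first, otherwise data" rule comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a cluster node; not the OpenSearch DiscoveryNode class. */
final class SimpleNode {
    final String name;
    final Set<String> roles;

    SimpleNode(String name, Set<String> roles) {
        this.name = name;
        this.roles = roles;
    }
}

/** Hypothetical helper that mirrors the rule: prefer "ml" nodes, else fall back to data nodes. */
final class EligibleNodeSelector {
    private EligibleNodeSelector() {}

    static List<SimpleNode> selectEligibleNodes(List<SimpleNode> clusterNodes) {
        List<SimpleNode> mlNodes = new ArrayList<>();
        List<SimpleNode> dataNodes = new ArrayList<>();
        for (SimpleNode node : clusterNodes) {
            if (node.roles.contains("ml")) {
                mlNodes.add(node);
            } else if (node.roles.contains("data")) {
                dataNodes.add(node);
            }
        }
        // Data nodes are used only when no node carries the "ml" role.
        return mlNodes.isEmpty() ? dataNodes : mlNodes;
    }
}
```

With one data node and one "ml" node in the input, only the "ml" node is returned; with data nodes only, all data nodes are returned.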

Issues Resolved

close #79

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

jackiehanyang · 2022-06-16T17:49:49Z

plugin/src/main/java/org/opensearch/ml/plugin/MachineLearningPlugin.java

@@ -115,14 +113,7 @@ public class MachineLearningPlugin extends Plugin implements ActionPlugin {
    private ClusterService clusterService;
    private ThreadPool threadPool;

-    public static final Setting<Boolean> IS_ML_NODE_SETTING = Setting.boolSetting("node.ml", false, Setting.Property.NodeScope);


Curiosity question: what was the purpose of IS_ML_NODE_SETTING before, and why don't we need it now? I saw this logic was moved to the TestHelper class; what's the reason for that?

That part came from a prototype built when we first tried to support ML nodes. It is not actually used in our formal release, because OpenSearch core didn't support the ML role and we had to postpone the feature. OpenSearch now plans to support the ML role with the dynamic role feature in 2.1. We can add this back, but we no longer need the prototype/experiment code, so it was just moved to the test code.

Zhangxunmt · 2022-06-16T18:31:48Z

plugin/src/main/java/org/opensearch/ml/task/MLTaskDispatcher.java

+            DiscoveryNode[] mlNodes = eligibleMLNodes.toArray(new DiscoveryNode[0]);
+            log.debug("Find {} dedicated ML nodes: {}", eligibleMLNodes.size(), Arrays.toString(mlNodes));
+            return mlNodes;
+        } else {


Nitpick: This "else" should be redundant.

It's a code style preference. People differ on whether to follow "no else after return"; see https://stackoverflow.com/questions/46875442/unnecessary-else-after-return-no-else-return.

For me, the code is more readable with the else kept, because the returns then have the same indentation.
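
To make the two styles concrete, here is a small illustrative sketch. `ReturnStyles` and its string arrays are hypothetical and only mimic the shape of the branch under review; both methods behave identically.

```java
/** Hypothetical illustration of the two styles discussed in this thread. */
final class ReturnStyles {
    private ReturnStyles() {}

    // Style kept in this PR: both returns sit at the same indentation level.
    static String[] withElse(String[] mlNodes, String[] dataNodes) {
        if (mlNodes.length > 0) {
            return mlNodes;
        } else {
            return dataNodes;
        }
    }

    // "No else after return" style: the fallback is the method's tail.
    static String[] withoutElse(String[] mlNodes, String[] dataNodes) {
        if (mlNodes.length > 0) {
            return mlNodes;
        }
        return dataNodes;
    }
}
```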

Zhangxunmt · 2022-06-16T18:43:17Z

plugin/src/main/java/org/opensearch/ml/task/MLTaskDispatcher.java

+    /**
+     * Get eligible node to run ML task. If there are nodes with ml role, will return all these
+     * ml nodes; otherwise return all data nodes.
+     *
+     * @return array of discovery node
+     */
+    protected DiscoveryNode[] getEligibleNodes() {


Is it only preferable to run ML tasks on ML nodes? I assume ml-commons can run on data nodes as well. Also, is there any logic in ClusterState.nodes() to evaluate whether an ML node is overloaded? I just wonder whether, in the future, we'd want to add a priority-based strategy here that prioritizes ML nodes but still uses data nodes if the ML nodes are heavily loaded.

> Is it only preferable to run ML tasks on ML nodes? I assume ml-commons can run on data nodes as well.

Check the comment: "If there are nodes with ml role, will return all these ml nodes; otherwise return all data nodes."

> Also, is there any logic in ClusterState.nodes() to evaluate whether an ML node is overloaded?

Yes. We check JVM heap usage and how many ML tasks are running on a node. If a node exceeds the limit, no new ML task is dispatched to it.

> I just wonder whether, in the future, we'd want to add a priority-based strategy here that prioritizes ML nodes but still uses data nodes if the ML nodes are heavily loaded.

I think we'd better ask users to scale the cluster by adding more ML nodes or switching to a more powerful node type if the ML nodes are heavily loaded or overloaded. This is not a one-way door, though: we can always tune the code if customers really need to run models on data nodes when the ML nodes are overloaded.
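
The overload check mentioned in this reply (JVM heap usage plus the number of running ML tasks) could be sketched roughly as below. This is an assumption-laden illustration, not the plugin's implementation: `NodeLoad`, `OverloadFilter`, the threshold parameters, and the exact comparison operators are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-node load snapshot; the real dispatcher collects its own stats. */
final class NodeLoad {
    final String nodeName;
    final int heapUsedPercent;
    final int runningMlTasks;

    NodeLoad(String nodeName, int heapUsedPercent, int runningMlTasks) {
        this.nodeName = nodeName;
        this.heapUsedPercent = heapUsedPercent;
        this.runningMlTasks = runningMlTasks;
    }
}

/** Hypothetical filter: drops nodes whose heap usage or task count exceeds a limit. */
final class OverloadFilter {
    private OverloadFilter() {}

    static List<NodeLoad> dispatchable(List<NodeLoad> candidates, int maxHeapPercent, int maxRunningTasks) {
        List<NodeLoad> result = new ArrayList<>();
        for (NodeLoad load : candidates) {
            // Assumed limits: heap at or below the cap, and room for at least one more task.
            boolean heapOk = load.heapUsedPercent <= maxHeapPercent;
            boolean tasksOk = load.runningMlTasks < maxRunningTasks;
            if (heapOk && tasksOk) {
                result.add(load);
            }
        }
        return result;
    }
}
```

If the filtered list comes back empty, the caller has to decide what to do; per the reply above, the current preference is to ask users to scale the ML nodes rather than spill over to data nodes.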

dispatch ML task to ML node first

7b1da08

Signed-off-by: Yaliang Wu <[email protected]>

ylwu-amzn requested a review from a team June 15, 2022 19:11

ylwu-amzn added backport 2.x backport 2.0 and removed backport 2.0 labels Jun 15, 2022

jackiehanyang reviewed Jun 16, 2022

jackiehanyang approved these changes Jun 16, 2022

Zhangxunmt reviewed Jun 16, 2022

Zhangxunmt approved these changes Jun 16, 2022

ylwu-amzn merged commit 6cbb626 into opensearch-project:main Jun 17, 2022

opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 17, 2022

dispatch ML task to ML node first (#346)

58e2c97

Signed-off-by: Yaliang Wu <[email protected]> (cherry picked from commit 6cbb626)

opensearch-trigger-bot bot mentioned this pull request Jun 17, 2022

[Backport 2.x] dispatch ML task to ML node first #347

Merged

ylwu-amzn added a commit that referenced this pull request Jun 17, 2022

dispatch ML task to ML node first (#346) (#347)

b04caff

Signed-off-by: Yaliang Wu <[email protected]> (cherry picked from commit 6cbb626) Co-authored-by: Yaliang Wu <[email protected]>
