[ML] Fail model deployment if all allocations cannot be provided #88656
Conversation
When we cannot scale up (autoscaling is disabled), we should fail requests to start a trained model deployment whose allocations cannot be provided.
Pinging @elastic/ml-core (Team:ML)
if (maxLazyMLNodes <= nodes.size()
    && trainedModelAssignment.isSatisfied(nodes.stream().map(DiscoveryNode::getId).collect(Collectors.toSet())) == false) {
    String msg = "Could not start deployment because there are not enough resources to provide all requested allocations";
    logger.debug(() -> format("[%s] %s", modelId, msg));
    exception = new ElasticsearchStatusException(msg, RestStatus.TOO_MANY_REQUESTS);
    return true;
}
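For readers outside the ML plugin, here is a minimal paraphrase of that guard. Only maxLazyMLNodes, DiscoveryNode::getId, TrainedModelAssignment#isSatisfied, the error message, and the 429 status come from the excerpt above; the wrapper class, method, and variable names are illustrative, and the import path for TrainedModelAssignment is assumed.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.elasticsearch.ElasticsearchStatusException;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.xpack.core.ml.inference.assignment.TrainedModelAssignment; // import path assumed

// Illustrative paraphrase of the guard in the diff above, not the actual plugin code.
final class AllocationGuardSketch {

    static void failIfAllocationsCannotBeProvided(
        int maxLazyMLNodes,                            // lazy ML node limit, as in the excerpt
        List<DiscoveryNode> mlNodes,                   // ML-capable nodes currently in the cluster
        TrainedModelAssignment trainedModelAssignment  // the deployment being started
    ) {
        // Once the cluster already has at least maxLazyMLNodes ML nodes, no further
        // ML nodes will be added lazily, so waiting for more capacity is pointless.
        boolean noMoreNodesWillBeAdded = maxLazyMLNodes <= mlNodes.size();

        // isSatisfied(nodeIds) reports whether every requested allocation of the
        // deployment has been placed on the given set of nodes.
        Set<String> currentNodeIds = mlNodes.stream().map(DiscoveryNode::getId).collect(Collectors.toSet());
        boolean allocationsMissing = trainedModelAssignment.isSatisfied(currentNodeIds) == false;

        if (noMoreNodesWillBeAdded && allocationsMissing) {
            // The real predicate records the exception and returns true; throwing
            // here keeps the sketch self-contained.
            throw new ElasticsearchStatusException(
                "Could not start deployment because there are not enough resources to provide all requested allocations",
                RestStatus.TOO_MANY_REQUESTS
            );
        }
    }
}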
We should also check that the current node size is less than the max node size (vertical scaling vs. horizontal).
Well spotted. I've added a commit that factors out an isScalingPossible method and uses it in both places where we make that check.
@@ -527,6 +532,15 @@ public boolean test(ClusterState clusterState) {
            );
            return false;
        }

        private boolean isScalingPossible(List<DiscoveryNode> nodes) {
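The body of the new method isn't shown in this excerpt. As a rough sketch of the idea discussed in this thread (not the merged code), where mlNativeMemoryOf and maxMLNodeSizeBytes are hypothetical stand-ins for however the node's ML memory and the configured maximum ML node size are obtained:

private boolean isScalingPossible(List<DiscoveryNode> nodes) {
    // Horizontal scaling: the cluster may still add ML nodes lazily.
    boolean canAddMoreNodes = nodes.size() < maxLazyMLNodes;

    // Vertical scaling (the case raised in the review comment above): the largest
    // current ML node is smaller than the maximum node size the operator allows,
    // so an existing node could still be replaced by a bigger one.
    long largestCurrentMlNodeBytes = nodes.stream()
        .mapToLong(this::mlNativeMemoryOf)             // hypothetical helper
        .max()
        .orElse(0L);
    boolean canGrowExistingNodes = largestCurrentMlNodeBytes < maxMLNodeSizeBytes;  // hypothetical field

    return canAddMoreNodes || canGrowExistingNodes;
}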
I think you should add a comment that this only considers memory, and in the future it would be nice to consider CPU too.
It means there's a discrepancy in how we handle different situations:
- If autoscaling is enabled, a cluster is scaled to maximum size, there are 2 free CPUs, and you ask to start a deployment that needs 3 CPUs, then you get told you cannot.
- If autoscaling is enabled, a cluster is scaled to one step below its maximum size, there are no free CPUs, and you ask to start a deployment that needs 100000 CPUs, then that's fine: we start it, the cluster scales to its maximum size, and the deployment goes ahead, but with far fewer than 100000 CPUs allocated to it.
While autoscaling doesn't understand CPU, there's not a lot we can do about this, but it's worth at least adding comments to acknowledge where we are today.
I have added a TODO comment.
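The committed wording isn't quoted here, but a TODO of roughly this shape (assumed, not the actual comment) would capture the limitation described above:

// TODO: this check (and ML autoscaling in general) only considers memory.
// Autoscaling does not currently understand CPU/processors, so a deployment whose
// requested allocations exceed the available CPUs can still be started once memory
// fits. Revisit when autoscaling can scale on processors.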
* upstream/master: (40 commits)
  Fix CI job naming
  [ML] disallow autoscaling downscaling in two trained model assignment scenarios (elastic#88623)
  Add "Vector Search" area to changelog schema
  [DOCS] Update API key API (elastic#88499)
  Enable the pipeline on the feature branch (elastic#88672)
  Adding the ability to register a PeerFinderListener to Coordinator (elastic#88626)
  [DOCS] Fix transform painless example syntax (elastic#88364)
  [ML] Muting InternalCategorizationAggregationTests testReduceRandom (elastic#88685)
  Fix double rounding errors for disk usage (elastic#88683)
  Replace health request with a state observer. (elastic#88641)
  [ML] Fail model deployment if all allocations cannot be provided (elastic#88656)
  Upgrade to OpenJDK 18.0.2+9 (elastic#88675)
  [ML] make bucket_correlation aggregation generally available (elastic#88655)
  Adding cardinality support for random_sampler agg (elastic#86838)
  Use custom task instead of generic AckedClusterStateUpdateTask (elastic#88643)
  Reinstate test cluster throttling behavior (elastic#88664)
  Mute testReadBlobWithPrematureConnectionClose
  Simplify plugin descriptor tests (elastic#88659)
  Add CI job for testing more job parallelism
  [ML] make deployment infer requests fully cancellable (elastic#88649)
  ...