[ML] Fail model deployment if all allocations cannot be provided #88656

Conversation

dimitris-athanasiou (Contributor):

When we cannot scale up (autoscaling is disabled), we should fail
requests to start a trained model deployment whose allocations
cannot be provided.

elasticsearchmachine added the Team:ML (Meta label for the ML team) label on Jul 20, 2022
elasticsearchmachine (Collaborator):

Pinging @elastic/ml-core (Team:ML)

benwtrent self-requested a review on July 20, 2022 at 16:07
Comment on lines 513 to 519
if (maxLazyMLNodes <= nodes.size()
&& trainedModelAssignment.isSatisfied(nodes.stream().map(DiscoveryNode::getId).collect(Collectors.toSet())) == false) {
String msg = "Could not start deployment because there are not enough resources to provide all requested allocations";
logger.debug(() -> format("[%s] %s", modelId, msg));
exception = new ElasticsearchStatusException(msg, RestStatus.TOO_MANY_REQUESTS);
return true;
}
Reviewer (Member):

We should also check to make sure that the current node size is less than the max node size (vertical scaling vs horizontal).
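
For illustration only, a minimal sketch of the extra condition being suggested here; mlNativeMemoryOf and maxMLNodeSizeBytes are hypothetical names used for this sketch, not the merged code:

// Vertical-scaling headroom: an existing ML node could still grow if its
// current ML native memory is below the configured maximum ML node size.
// Both names below are hypothetical placeholders.
boolean hasVerticalHeadroom = nodes.stream()
    .anyMatch(node -> mlNativeMemoryOf(node) < maxMLNodeSizeBytes);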

dimitris-athanasiou (Contributor, Author):

Well spotted. I've added a commit that factors out an isScalingPossible method and uses it in both places where we make that check.

@@ -527,6 +532,15 @@ public boolean test(ClusterState clusterState) {
);
return false;
}

private boolean isScalingPossible(List<DiscoveryNode> nodes) {
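
The method body is truncated in this hunk. As a rough sketch only, not the merged implementation, a check combining horizontal and vertical scaling could look roughly like this; mlNativeMemoryOf and maxMLNodeSizeBytes are hypothetical placeholders:

private boolean isScalingPossible(List<DiscoveryNode> nodes) {
    // Horizontal scaling: more lazy ML nodes can still be added to the cluster.
    boolean canAddNodes = maxLazyMLNodes > nodes.size();
    // Vertical scaling: at least one existing ML node is smaller than the
    // configured maximum ML node size, so it could still be resized upwards.
    // mlNativeMemoryOf(...) and maxMLNodeSizeBytes are hypothetical names.
    boolean canGrowNodes = nodes.stream()
        .anyMatch(node -> mlNativeMemoryOf(node) < maxMLNodeSizeBytes);
    return canAddNodes || canGrowNodes;
}
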
Reviewer (Contributor):

I think you should add a comment that this only considers memory, and in the future it would be nice to consider CPU too.

It means there's a discrepancy in how we handle different situations:

  • If autoscaling is enabled, a cluster is scaled to its maximum size, there are 2 free CPUs, and you ask to start a deployment that needs 3 CPUs, then you are told you cannot.
  • If autoscaling is enabled, a cluster is scaled to one step below its maximum size, there are no free CPUs, and you ask to start a deployment that needs 100000 CPUs, then that's fine: we start it, the cluster scales to its maximum size, and the deployment goes ahead, but with far fewer than 100000 CPUs allocated to it.

While autoscaling doesn't understand CPU, there's not a lot we can do about this, but it's worth at least adding comments to acknowledge where we are today.

dimitris-athanasiou (Contributor, Author):

I have added a TODO comment.
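
For illustration only, the kind of note being asked for might read roughly as follows; the exact wording in the merged commit may differ:

// TODO: this check currently only considers memory. Once autoscaling
// understands CPU (processor) requirements, it should take CPU into
// account as well, so that the two situations described above are
// handled consistently.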

dimitris-athanasiou merged commit 154b924 into elastic:master on Jul 21, 2022
dimitris-athanasiou deleted the fail-starting-deployment-if-model-cannot-be-satisfied branch on July 21, 2022 at 12:27
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 22, 2022
* upstream/master: (40 commits)
  Fix CI job naming
  [ML] disallow autoscaling downscaling in two trained model assignment scenarios (elastic#88623)
  Add "Vector Search" area to changelog schema
  [DOCS] Update API key API (elastic#88499)
  Enable the pipeline on the feature branch (elastic#88672)
  Adding the ability to register a PeerFinderListener to Coordinator (elastic#88626)
  [DOCS] Fix transform painless example syntax (elastic#88364)
  [ML] Muting InternalCategorizationAggregationTests testReduceRandom (elastic#88685)
  Fix double rounding errors for disk usage (elastic#88683)
  Replace health request with a state observer. (elastic#88641)
  [ML] Fail model deployment if all allocations cannot be provided (elastic#88656)
  Upgrade to OpenJDK 18.0.2+9 (elastic#88675)
  [ML] make bucket_correlation aggregation generally available (elastic#88655)
  Adding cardinality support for random_sampler agg (elastic#86838)
  Use custom task instead of generic AckedClusterStateUpdateTask (elastic#88643)
  Reinstate test cluster throttling behavior (elastic#88664)
  Mute testReadBlobWithPrematureConnectionClose
  Simplify plugin descriptor tests (elastic#88659)
  Add CI job for testing more job parallelism
  [ML] make deployment infer requests fully cancellable (elastic#88649)
  ...
Labels: :ml Machine learning, >non-issue, Team:ML (Meta label for the ML team), v8.4.0