[ML] Fail model deployment if all allocations cannot be provided #88656
Conversation
When we cannot scale up (autoscaling is disabled), we should fail requests to start a trained model deployment whose allocations cannot be provided.
Pinging @elastic/ml-core (Team:ML)
if (maxLazyMLNodes <= nodes.size()
    && trainedModelAssignment.isSatisfied(nodes.stream().map(DiscoveryNode::getId).collect(Collectors.toSet())) == false) {
    String msg = "Could not start deployment because there are not enough resources to provide all requested allocations";
    logger.debug(() -> format("[%s] %s", modelId, msg));
    exception = new ElasticsearchStatusException(msg, RestStatus.TOO_MANY_REQUESTS);
    return true;
}
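For readers outside the ML plugin, here is a minimal paraphrase of that guard. Only maxLazyMLNodes, DiscoveryNode::getId, TrainedModelAssignment#isSatisfied, the error message, and the 429 status come from the excerpt above; the wrapper class, method, and variable names are illustrative, and the import path for TrainedModelAssignment is assumed.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.elasticsearch.ElasticsearchStatusException;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.xpack.core.ml.inference.assignment.TrainedModelAssignment; // import path assumed

// Illustrative paraphrase of the guard in the diff above, not the actual plugin code.
final class AllocationGuardSketch {

    static void failIfAllocationsCannotBeProvided(
        int maxLazyMLNodes,                            // lazy ML node limit, as in the excerpt
        List<DiscoveryNode> mlNodes,                   // ML-capable nodes currently in the cluster
        TrainedModelAssignment trainedModelAssignment  // the deployment being started
    ) {
        // Once the cluster already has at least maxLazyMLNodes ML nodes, no further
        // ML nodes will be added lazily, so waiting for more capacity is pointless.
        boolean noMoreNodesWillBeAdded = maxLazyMLNodes <= mlNodes.size();

        // isSatisfied(nodeIds) reports whether every requested allocation of the
        // deployment has been placed on the given set of nodes.
        Set<String> currentNodeIds = mlNodes.stream().map(DiscoveryNode::getId).collect(Collectors.toSet());
        boolean allocationsMissing = trainedModelAssignment.isSatisfied(currentNodeIds) == false;

        if (noMoreNodesWillBeAdded && allocationsMissing) {
            // The real predicate records the exception and returns true; throwing
            // here keeps the sketch self-contained.
            throw new ElasticsearchStatusException(
                "Could not start deployment because there are not enough resources to provide all requested allocations",
                RestStatus.TOO_MANY_REQUESTS
            );
        }
    }
}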
We should also check that the current node size is less than the max node size (vertical scaling vs. horizontal).
Well spotted. I've added a commit that factors out an isScalingPossible method and uses it in both places where we make that check.
@@ -527,6 +532,15 @@ public boolean test(ClusterState clusterState) {
            );
            return false;
        }

        private boolean isScalingPossible(List<DiscoveryNode> nodes) {
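The body of the new method isn't shown in this excerpt. As a rough sketch of the idea discussed in this thread (not the merged code), where mlNativeMemoryOf and maxMLNodeSizeBytes are hypothetical stand-ins for however the node's ML memory and the configured maximum ML node size are obtained:

private boolean isScalingPossible(List<DiscoveryNode> nodes) {
    // Horizontal scaling: the cluster may still add ML nodes lazily.
    boolean canAddMoreNodes = nodes.size() < maxLazyMLNodes;

    // Vertical scaling (the case raised in the review comment above): the largest
    // current ML node is smaller than the maximum node size the operator allows,
    // so an existing node could still be replaced by a bigger one.
    long largestCurrentMlNodeBytes = nodes.stream()
        .mapToLong(this::mlNativeMemoryOf)             // hypothetical helper
        .max()
        .orElse(0L);
    boolean canGrowExistingNodes = largestCurrentMlNodeBytes < maxMLNodeSizeBytes;  // hypothetical field

    return canAddMoreNodes || canGrowExistingNodes;
}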
I think you should add a comment that this only considers memory, and in the future it would be nice to consider CPU too.
It means there's a discrepancy in how we handle different situations:
- If autoscaling is enabled, a cluster is scaled to maximum size, there are 2 free CPUs, and you ask to start a deployment that needs 3 CPUs, then you get told you cannot.
- If autoscaling is enabled, a cluster is scaled to one step below its maximum size, there are no free CPUs, and you ask to start a deployment that needs 100000 CPUs, then that's fine: we start it, the cluster scales to its maximum size, and the deployment goes ahead, but with far fewer than 100000 CPUs allocated to it.
While autoscaling doesn't understand CPU, there's not a lot we can do about this, but it's worth at least adding comments to acknowledge where we are today.
I have added a TODO comment.
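The committed wording isn't quoted here, but a TODO of roughly this shape (assumed, not the actual comment) would capture the limitation described above:

// TODO: this check (and ML autoscaling in general) only considers memory.
// Autoscaling does not currently understand CPU/processors, so a deployment whose
// requested allocations exceed the available CPUs can still be started once memory
// fits. Revisit when autoscaling can scale on processors.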
* upstream/master: (40 commits)
  Fix CI job naming
  [ML] disallow autoscaling downscaling in two trained model assignment scenarios (elastic#88623)
  Add "Vector Search" area to changelog schema
  [DOCS] Update API key API (elastic#88499)
  Enable the pipeline on the feature branch (elastic#88672)
  Adding the ability to register a PeerFinderListener to Coordinator (elastic#88626)
  [DOCS] Fix transform painless example syntax (elastic#88364)
  [ML] Muting InternalCategorizationAggregationTests testReduceRandom (elastic#88685)
  Fix double rounding errors for disk usage (elastic#88683)
  Replace health request with a state observer. (elastic#88641)
  [ML] Fail model deployment if all allocations cannot be provided (elastic#88656)
  Upgrade to OpenJDK 18.0.2+9 (elastic#88675)
  [ML] make bucket_correlation aggregation generally available (elastic#88655)
  Adding cardinality support for random_sampler agg (elastic#86838)
  Use custom task instead of generic AckedClusterStateUpdateTask (elastic#88643)
  Reinstate test cluster throttling behavior (elastic#88664)
  Mute testReadBlobWithPrematureConnectionClose
  Simplify plugin descriptor tests (elastic#88659)
  Add CI job for testing more job parallelism
  [ML] make deployment infer requests fully cancellable (elastic#88649)
  ...