[ML] Previously assigned models should get at least one allocation #88855
Conversation
Pinging @elastic/ml-core (Team:ML)
When ML nodes are replaced for some reason (cluster resize, upgrade, etc.), it is possible that some models cannot be allocated at all. Then, while the cluster is temporarily undersized, all cores are given to allocations of the models that have survived. If those ML nodes return later, there may be model deployments that were previously allocated that now do not get any allocations, because our planner tries to preserve all current allocations.

Operationally, this is not what serves our users best. Instead, since we are already in a cluster that does not have enough resources to fully allocate all model deployments, we should try to give at least one allocation to each model that has previously been allocated.

In order to know whether a model has previously been allocated, this commit adds a field to `TrainedModelAssignment` called `max_assigned_allocations`, which records the maximum number of allocations a deployment has received in its lifetime. We can then use this to establish whether a deployment has ever been allocated.

Finally, we modify the `AssignmentPlanner` so that, after computing a plan, we check whether the plan gives at least one allocation to all previously allocated models. If not, we compute a plan that tries to give at least one allocation to each previously allocated model; we can solve this using bin-packing alone. Given that plan, we can invoke the planner one more time to optimize the rest of the allocations while preserving the single allocations for previously allocated models.
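The two-phase idea described above can be sketched in plain Java. This is a minimal, illustrative sketch, not Elasticsearch's actual `AssignmentPlanner`: the `Deployment` record, the `plan` method, and the first-fit packing strategy are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the fallback described above: deployments with
// maxAssignedAllocations > 0 (i.e. previously assigned) each receive one
// allocation first via simple bin-packing, and only then are the remaining
// cores handed out. Names are illustrative, not Elasticsearch's API.
public class AtLeastOneAllocationSketch {

    record Deployment(String id, int requestedAllocations, int maxAssignedAllocations) {
        boolean previouslyAssigned() { return maxAssignedAllocations > 0; }
    }

    /** Returns deployment id -> total allocations, given per-node free core counts. */
    static Map<String, Integer> plan(List<Deployment> deployments, int[] nodeCores) {
        int[] free = nodeCores.clone();
        Map<String, Integer> allocations = new LinkedHashMap<>();
        deployments.forEach(d -> allocations.put(d.id(), 0));

        // Phase 1: bin-pack one allocation for each previously assigned deployment.
        for (Deployment d : deployments) {
            if (d.previouslyAssigned()) {
                placeOne(d.id(), free, allocations);
            }
        }
        // Phase 2: distribute the remaining cores, preserving the phase-1 allocations.
        for (Deployment d : deployments) {
            while (allocations.get(d.id()) < d.requestedAllocations()
                    && placeOne(d.id(), free, allocations)) {
                // keep placing until the request is satisfied or cores run out
            }
        }
        return allocations;
    }

    /** First-fit: place one allocation (one core) on the first node with room. */
    private static boolean placeOne(String id, int[] free, Map<String, Integer> allocations) {
        for (int n = 0; n < free.length; n++) {
            if (free[n] > 0) {
                free[n]--;
                allocations.merge(id, 1, Integer::sum);
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Undersized cluster: 3 cores total, while the requests sum to 6.
        List<Deployment> deployments = List.of(
            new Deployment("model-a", 4, 4),   // previously had 4 allocations
            new Deployment("model-b", 2, 2)    // previously had 2 allocations
        );
        Map<String, Integer> plan = plan(deployments, new int[] { 2, 1 });
        System.out.println(plan.get("model-a") + " " + plan.get("model-b")); // prints "2 1"
    }
}
```

Without phase 1, a purely greedy planner would give `model-a` all three cores and starve `model-b` entirely; with the phase-1 pass, every previously assigned deployment keeps at least one allocation.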
d6d83b8 to 6c35bee (compare)
@benwtrent I have done some renaming which I hope makes things a bit clearer.
It would be good for @wwang500 to confirm this satisfies the bug fix.
This disables the validation that we can fully allocate a model deployment on start-up. We want to test a specific scenario before merging the PR, and that validation makes it much harder to test. I will revert this before merging.
@elasticmachine update branch
…st-one-allocation
@elasticmachine update branch
…st-one-allocation
@elasticmachine update branch
…st-one-allocation
This reverts commit df49287.
This has now been tested. I'll proceed to merge and backport.
…on (elastic#88855) Backport of elastic#88855
…on (#88855) (#89068) Backport of #88855