[ML] Previously assigned models should get at least one allocation #88855

dimitris-athanasiou · 2022-07-27T15:01:57Z

When ML nodes are replaced for some reason (cluster resize, upgrade, etc.),
it is possible that some models cannot be allocated at all. Then, while
the cluster is temporarily undersized, all cores are given to allocations
of the models that have survived. If those ML nodes return later, some model
deployments that were previously allocated may still not get any allocations,
because our planner tries to preserve all current allocations.

Operationally, this does not serve our users best. Instead, since we are
already in a cluster that does not have enough resources to fully allocate
all model deployments, we should try to give at least one allocation to each
model that has previously been allocated.

To know whether a model has previously been allocated, this commit adds a field
to TrainedModelAssignment called max_assigned_allocations, which records the
maximum number of allocations a deployment has received in its lifetime. We can
then use this to establish whether a deployment has ever been allocated.
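
As an illustration of the bookkeeping described above, here is a minimal sketch.
It is not the real TrainedModelAssignment class: the class name AssignmentSketch
and all of its methods are hypothetical, and serialization is left out. It only
shows the idea of a high-water mark that never decreases, so a deployment that
currently has zero allocations can still be recognized as previously allocated.

```java
// Hypothetical sketch, not the actual TrainedModelAssignment implementation.
final class AssignmentSketch {
    private int currentAllocations;
    // High-water mark: the largest allocation count this deployment ever had.
    private int maxAssignedAllocations;

    /** Records a new allocation count and raises the high-water mark if needed. */
    void setCurrentAllocations(int allocations) {
        if (allocations < 0) {
            throw new IllegalArgumentException("allocations must be >= 0");
        }
        currentAllocations = allocations;
        maxAssignedAllocations = Math.max(maxAssignedAllocations, allocations);
    }

    int currentAllocations() {
        return currentAllocations;
    }

    int maxAssignedAllocations() {
        return maxAssignedAllocations;
    }

    /** A deployment counts as previously allocated once the mark is above zero. */
    boolean hasEverBeenAllocated() {
        return maxAssignedAllocations > 0;
    }
}
```

Because the mark is only ever raised, dropping to zero allocations during a
node replacement does not erase the fact that the deployment was once running.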

Finally, we modify the AssignmentPlanner so that, after computing a plan, we
check whether the plan gives at least one allocation to every previously allocated
model. If not, we compute a plan that tries to give at least one allocation to each
previously allocated model. This can be solved with bin-packing alone. With that
plan in hand, we invoke the planner one more time to optimize the rest of the
allocations while preserving the single allocations for previously allocated models.
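
The three steps above can be sketched roughly as follows. This is a simplified
illustration under stated assumptions, not the AssignmentPlanner code: the class
AtLeastOneAllocationSketch, its Model and Node types, and every method name are
hypothetical. It considers only cores (not memory), it tracks allocation counts
per model rather than per node, it uses first-fit decreasing as one possible
bin-packing heuristic, and a greedy fill stands in for the second planner pass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "at least one allocation" flow; cores only.
final class AtLeastOneAllocationSketch {
    static final class Model {
        final String id;
        final int threadsPerAllocation;
        final int targetAllocations;
        final int maxAssignedAllocations;

        Model(String id, int threadsPerAllocation, int targetAllocations, int maxAssignedAllocations) {
            this.id = id;
            this.threadsPerAllocation = threadsPerAllocation;
            this.targetAllocations = targetAllocations;
            this.maxAssignedAllocations = maxAssignedAllocations;
        }

        boolean previouslyAllocated() {
            return maxAssignedAllocations > 0;
        }
    }

    static final class Node {
        final String id;
        int freeCores;

        Node(String id, int cores) {
            this.id = id;
            this.freeCores = cores;
        }
    }

    /** Step 1: does the plan give every previously allocated model at least one allocation? */
    static boolean satisfiesPreviouslyAllocated(List<Model> models, Map<String, Integer> plan) {
        for (Model m : models) {
            if (m.previouslyAllocated() && plan.getOrDefault(m.id, 0) < 1) {
                return false;
            }
        }
        return true;
    }

    /** Step 2: first-fit decreasing bin-packing of one allocation per previously allocated model. */
    static Map<String, Integer> reserveOneEach(List<Model> models, List<Node> nodes) {
        List<Model> candidates = new ArrayList<>(models);
        candidates.removeIf(m -> !m.previouslyAllocated());
        // Place the most demanding models first.
        candidates.sort(Comparator.<Model>comparingInt(m -> m.threadsPerAllocation).reversed());
        Map<String, Integer> plan = new LinkedHashMap<>();
        for (Model m : candidates) {
            for (Node n : nodes) {
                if (n.freeCores >= m.threadsPerAllocation) {
                    n.freeCores -= m.threadsPerAllocation;
                    plan.put(m.id, 1);
                    break; // one allocation reserved; move on to the next model
                }
            }
        }
        return plan;
    }

    /** Step 3 (stand-in): greedily add allocations without removing the reserved ones. */
    static Map<String, Integer> fillRemaining(List<Model> models, List<Node> nodes, Map<String, Integer> reserved) {
        Map<String, Integer> plan = new LinkedHashMap<>(reserved);
        for (Model m : models) {
            for (Node n : nodes) {
                while (plan.getOrDefault(m.id, 0) < m.targetAllocations && n.freeCores >= m.threadsPerAllocation) {
                    n.freeCores -= m.threadsPerAllocation;
                    plan.merge(m.id, 1, Integer::sum);
                }
            }
        }
        return plan;
    }
}
```

In this sketch the reservation pass runs before any optimization, so a model
that held all the cores while the cluster was undersized cannot starve the
others once capacity is recomputed.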

elasticsearchmachine · 2022-07-27T15:02:21Z

Pinging @elastic/ml-core (Team:ML)

dimitris-athanasiou · 2022-07-28T11:10:04Z

@benwtrent I have done some renaming which I hope makes things a bit clearer.

benwtrent

It would be good for @wwang500 to confirm this satisfies the bug fix.

dimitris-athanasiou · 2022-07-28T13:25:30Z

@wwang500 I have added a temporary commit to this PR where I disable the validation added in #88656 so you can repeat your test exactly as you did before. You can do so by using the docker image generated by the CI.

benwtrent · 2022-07-28T16:51:26Z

@elasticmachine update branch

dimitris-athanasiou · 2022-07-29T10:54:16Z

@elasticmachine update branch

dimitris-athanasiou · 2022-08-01T12:53:41Z

@elasticmachine update branch

dimitris-athanasiou · 2022-08-02T16:48:31Z

This has now been tested. I'll proceed to merge and backport.

dimitris-athanasiou added >non-issue :ml Machine learning v8.4.0 v8.5.0 labels Jul 27, 2022

elasticsearchmachine added the Team:ML label (Meta label for the ML team) Jul 27, 2022

mark-vieira removed the v8.4.0 label Jul 27, 2022

dimitris-athanasiou added the v8.4.1 label Jul 27, 2022

benwtrent reviewed Jul 27, 2022

dimitris-athanasiou added 3 commits July 28, 2022 13:24

Some renaming for clarity

aa78147

Fix BWC serialization version as main is now 8.5

6c35bee

dimitris-athanasiou force-pushed the previously-assigned-models-should-get-at-least-one-allocation branch from d6d83b8 to 6c35bee Compare July 28, 2022 11:09

benwtrent approved these changes Jul 28, 2022

Temporary commit to enable testing a specific scenario

df49287

This disables the validation that we can fully allocate a model deployment on start-up. We want to test a specific scenario before merging the PR, and that validation makes it much harder to test. Will revert this before merging.

dimitris-athanasiou added the cloud-deploy label (Publish cloud docker image for Cloud-First-Testing) Jul 28, 2022

Merge branch 'main' into previously-assigned-models-should-get-at-least-one-allocation

7e49ddc

Merge branch 'main' into previously-assigned-models-should-get-at-least-one-allocation

022b9d4

elasticmachine and others added 2 commits August 1, 2022 22:53

Merge branch 'main' into previously-assigned-models-should-get-at-least-one-allocation

2b3d52a

Revert "Temporary commit to enable testing a specific scenario"

5b6bdf3

This reverts commit df49287.

dimitris-athanasiou merged commit 735f7d1 into elastic:main Aug 3, 2022

dimitris-athanasiou deleted the previously-assigned-models-should-get-at-least-one-allocation branch August 3, 2022 09:48

dimitris-athanasiou mentioned this pull request Aug 3, 2022

[8.4][ML] Previously assigned models should get at least one allocati… #89068

Merged

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Aug 3, 2022

[ML] Mute model deployment rolling upgrade tests for backport

0fdf2cd

... of elastic#88855

dimitris-athanasiou mentioned this pull request Aug 3, 2022

[ML] Mute model deployment rolling upgrade tests for backport #89069

Merged

dimitris-athanasiou added a commit that referenced this pull request Aug 3, 2022

[ML] Mute model deployment rolling upgrade tests for backport (#89069)

0d38d10

... of #88855

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Aug 3, 2022

[ML] Adjust assignment serialization versions after backport

5a9984c

... of elastic#88855

dimitris-athanasiou mentioned this pull request Aug 3, 2022

[ML] Adjust assignment serialization versions after backport #89071

Merged

dimitris-athanasiou added a commit that referenced this pull request Aug 3, 2022

[ML] Adjust assignment serialization versions after backport (#89071)

3708ca6

... of #88855

mark-vieira added v8.4.0 and removed v8.4.1 labels Aug 15, 2022
