Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Previously assigned models should get at least one allocation #88855

Conversation

dimitris-athanasiou
Copy link
Contributor

When for some reason ML nodes are replaced (cluster resize, upgrade, etc.),
it is possible that some models cannot be allocated at all. Then, while
the cluster is temporarily undersized, all cores are given for allocations
of the models that have survived. If those ML nodes return later, there may
be model deployments that were previously allocated that now do not get any
allocations. The reason is that our planner will try to preserve all current
allocations.

Operationally, this is not what serves best our users. Instead, as we are
already in a cluster that does not have enough resources to fully allocate
all model deployments, we should try to give at least one allocation to each
model that has previously been allocated.

In order to know a model has previously been allocated, this commit adds a field
to TrainedModelAssignment called max_assigned_allocations which records the
max number of allocations a deployment has received in its life. We can then use
this to establish whether a deployment has ever been allocated.

Finally, we modify the AssignmentPlanner so that after computing a plan we
check whether the plan gives at least one allocation to all previously allocated models.
If not, we then compute a plan that tries to give at least one allocation to each
previously allocated model. We can solve this just using bin-packing. Having that
plan we can invoke the planner one more time to optimize the rest of the allocations
whilst preserving the single allocations for previously allocated models.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jul 27, 2022
When for some reason ML nodes are replaced (cluster resize, upgrade, etc.),
it is possible that some models cannot be allocated at all. Then, while
the cluster is temporarily undersized, all cores are given for allocations
of the models that have survived. If those ML nodes return later, there may
be model deployments that were previously allocated that now do not get any
allocations. The reason is that our planner will try to preserve all current
allocations.

Operationally, this is not what serves best our users. Instead, as we are
already in a cluster that does not have enough resources to fully allocate
all model deployments, we should try to give at least one allocation to each
model that has previously been allocated.

In order to know a model has previously been allocated, this commit adds a field
to `TrainedModelAssignment` called `max_assigned_allocations` which records the
max number of allocations a deployment has received in its life. We can then use
this to establish whether a deployment has ever been allocated.

Finally, we modify the `AssignmentPlanner` so that after computing a plan we
check whether the plan gives at least one allocation to all previously allocated models.
If not, we then compute a plan that tries to give at least one allocation to each
previously allocated model. We can solve this just using bin-packing. Having that
plan we can invoke the planner one more time to optimize the rest of the allocations
whilst preserving the single allocations for previously allocated models.
@dimitris-athanasiou dimitris-athanasiou force-pushed the previously-assigned-models-should-get-at-least-one-allocation branch from d6d83b8 to 6c35bee Compare July 28, 2022 11:09
@dimitris-athanasiou
Copy link
Contributor Author

@benwtrent I have done some renaming which I hope makes things a bit clearer.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good for @wwang500 to confirm this satisfies the bug fix.

This disables the validation that we can fully allocate a model
deployment on start-up. We want to test a specific scenario before
merging the PR which that validation makes it much harder to test.
Will revert this before merging.
@dimitris-athanasiou dimitris-athanasiou added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Jul 28, 2022
@dimitris-athanasiou
Copy link
Contributor Author

dimitris-athanasiou commented Jul 28, 2022

@wwang500 I have added a temporary commit to this PR where I disable the validation added in #88656 so you can repeat your test exactly as you did before. You can do so by using the docker image generated by the CI.

@benwtrent
Copy link
Member

@elasticmachine update branch

@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

@dimitris-athanasiou
Copy link
Contributor Author

This has now been tested. I'll proceed to merge and backport.

@dimitris-athanasiou dimitris-athanasiou merged commit 735f7d1 into elastic:main Aug 3, 2022
@dimitris-athanasiou dimitris-athanasiou deleted the previously-assigned-models-should-get-at-least-one-allocation branch August 3, 2022 09:48
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Aug 3, 2022
…on (elastic#88855)

When for some reason ML nodes are replaced (cluster resize, upgrade, etc.),
it is possible that some models cannot be allocated at all. Then, while
the cluster is temporarily undersized, all cores are given for allocations
of the models that have survived. If those ML nodes return later, there may
be model deployments that were previously allocated that now do not get any
allocations. The reason is that our planner will try to preserve all current
allocations.

Operationally, this is not what serves best our users. Instead, as we are
already in a cluster that does not have enough resources to fully allocate
all model deployments, we should try to give at least one allocation to each
model that has previously been allocated.

In order to know a model has previously been allocated, this commit adds a field
to `TrainedModelAssignment` called `max_assigned_allocations` which records the
max number of allocations a deployment has received in its life. We can then use
this to establish whether a deployment has ever been allocated.

Finally, we modify the `AssignmentPlanner` so that after computing a plan we
check whether the plan gives at least one allocation to all previously allocated models.
If not, we then compute a plan that tries to give at least one allocation to each
previously allocated model. We can solve this just using bin-packing. Having that
plan we can invoke the planner one more time to optimize the rest of the allocations
whilst preserving the single allocations for previously allocated models.

Backport of elastic#88855
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Aug 3, 2022
dimitris-athanasiou added a commit that referenced this pull request Aug 3, 2022
…on (#88855) (#89068)

When for some reason ML nodes are replaced (cluster resize, upgrade, etc.),
it is possible that some models cannot be allocated at all. Then, while
the cluster is temporarily undersized, all cores are given for allocations
of the models that have survived. If those ML nodes return later, there may
be model deployments that were previously allocated that now do not get any
allocations. The reason is that our planner will try to preserve all current
allocations.

Operationally, this is not what serves best our users. Instead, as we are
already in a cluster that does not have enough resources to fully allocate
all model deployments, we should try to give at least one allocation to each
model that has previously been allocated.

In order to know a model has previously been allocated, this commit adds a field
to `TrainedModelAssignment` called `max_assigned_allocations` which records the
max number of allocations a deployment has received in its life. We can then use
this to establish whether a deployment has ever been allocated.

Finally, we modify the `AssignmentPlanner` so that after computing a plan we
check whether the plan gives at least one allocation to all previously allocated models.
If not, we then compute a plan that tries to give at least one allocation to each
previously allocated model. We can solve this just using bin-packing. Having that
plan we can invoke the planner one more time to optimize the rest of the allocations
whilst preserving the single allocations for previously allocated models.

Backport of #88855
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Aug 3, 2022
@mark-vieira mark-vieira added v8.4.0 and removed v8.4.1 labels Aug 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml Machine learning >non-issue Team:ML Meta label for the ML team v8.4.0 v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants