
ML is causing a scale up when it's actually requesting a scale down #74709

Closed
benwtrent opened this issue Jun 29, 2021 · 2 comments · Fixed by #74691
Labels
>bug · :Distributed Coordination/Autoscaling · :ml Machine learning · Team:Distributed (Obsolete) · Team:ML

Comments

@benwtrent (Member)

Issue

Versions: 7.11-7.13

Fixed in: 7.14+

Due to poor memory estimations, it is possible for a scale down request to accidentally require a scale up.

Here is a response that epitomizes the scenario:

    "ml": {
      "required_capacity": {
        "node": {
          "memory": 2520765440
        },
        "total": {
          "memory": 2520765440
        }
      },
      "current_capacity": {
        "node": {
          "storage": 0,
          "memory": 2147483648
        },
        "total": {
          "storage": 0,
          "memory": 6442450944
        }
      },
      "current_nodes": [
        {
          "name": "instance-0000000099"
        },
        {
          "name": "instance-0000000100"
        },
        {
          "name": "instance-0000000101"
        }
      ],
      "deciders": {
        "ml": {
          "required_capacity": {
            "node": {
              "memory": 2520765440
            },
            "total": {
              "memory": 2520765440
            }
          },
          "reason_summary": "Requesting scale down as tier and/or node size could be smaller",
          "reason_details": {
            "waiting_analytics_jobs": [],
            "waiting_anomaly_jobs": [],
            "configuration": {},
            "perceived_current_capacity": {
              "node": {
                "memory": 2503160627
              },
              "total": {
                "memory": 6074310888
              }
            },
            "required_capacity": {
              "node": {
                "memory": 2520765440
              },
              "total": {
                "memory": 2520765440
              }
            },
            "reason": "Requesting scale down as tier and/or node size could be smaller"
          }
        }
      }
    }

Note that the current node size is actually 2GB (2147483648 bytes), but ML's estimate is inflated by inappropriate rounding (2520765440 bytes, roughly 2.35GB). Because the required node capacity exceeded the current node capacity, this caused a scale up instead of the intended scale down.
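The mismatch that triggers the unwanted scale up can be seen by comparing the two node figures directly. A minimal, purely illustrative sketch (the class and variable names are hypothetical, not the actual autoscaling decider code):

    // Illustrative sketch of the failure mode, not the actual Elasticsearch code.
    public class ScaleDownOvershoot {
        public static void main(String[] args) {
            long currentNodeMemory = 2_147_483_648L;    // 2GiB node, from current_capacity above
            long mlRequiredNodeMemory = 2_520_765_440L; // ML's rounded-up estimate from the same response

            // ML intends a scale down, but the orchestrator only compares required vs. current capacity.
            if (mlRequiredNodeMemory > currentNodeMemory) {
                System.out.println("Orchestrator scales UP to satisfy " + mlRequiredNodeMemory + " bytes");
            } else {
                System.out.println("Orchestrator can scale down");
            }
        }
    }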

Workaround

If you are on an affected Elasticsearch version and this scenario occurs, you can statically set the minimum and maximum autoscaling sizes for ML inside Elastic Cloud.

@elasticmachine added the Team:ML and Team:Distributed (Obsolete) labels on Jun 29, 2021
@elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

benwtrent added a commit that referenced this issue Jun 30, 2021
[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (#74691)

This commit addresses two problems:

 - Our memory estimations are not very exact. Consequently, it is possible to request too much or too little by a handful of KBs. While this is not a large issue in ESS, it may be for custom tier sizes.
 - When scaling down, it was possible for part of the scale down to actually be a scale up! This was due to floating point rounding errors and poor estimations. Even though our estimations are better now, it is best to NOT request higher resources in a scale down, no matter what.

One of the ways we improve the calculation is during JVM size calculations. Instead of having the knot point be `2gb`, it has been changed to `1.2gb`. This accounts for the "window of uncertainty" for JVM sizes.

closes: #74709
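The second bullet of the commit message, never requesting higher resources during a scale down, amounts to capping the requested capacity at the current capacity. A rough sketch under that assumption (names and structure are illustrative, not the real decider implementation):

    // Hypothetical sketch of the "never ask for more during a scale down" guard
    // described in the commit message above.
    public class ScaleDownClamp {

        /** Cap a scale-down request at current capacity so rounding noise cannot flip it into a scale up. */
        static long clampForScaleDown(long requiredBytes, long currentBytes) {
            return Math.min(requiredBytes, currentBytes);
        }

        public static void main(String[] args) {
            long currentNode = 2_147_483_648L;   // 2GiB node from current_capacity
            long currentTier = 6_442_450_944L;   // 6GiB tier from current_capacity
            long requiredNode = 2_520_765_440L;  // inflated ML node estimate
            long requiredTier = 2_520_765_440L;  // ML tier estimate

            // Without the clamp, requiredNode > currentNode would force a scale up.
            System.out.println("node: " + clampForScaleDown(requiredNode, currentNode)); // 2147483648
            System.out.println("tier: " + clampForScaleDown(requiredTier, currentTier)); // 2520765440
        }
    }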
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Jun 30, 2021

[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (elastic#74691)

benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Jun 30, 2021

[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (elastic#74691)
benwtrent added a commit that referenced this issue Jul 1, 2021

[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (#74691) (#74780)

benwtrent added a commit that referenced this issue Jul 1, 2021

[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (#74691) (#74781)

benwtrent added a commit that referenced this issue Jul 1, 2021

[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (#74691) (#74782)