
ML is causing a scale up when it's actually requesting a scale down #74709

Closed
benwtrent opened this issue Jun 29, 2021 · 2 comments · Fixed by #74691
Labels
>bug · :Distributed Coordination/Autoscaling · :ml Machine learning · Team:Distributed (Obsolete) · Team:ML

Comments

@benwtrent (Member)

Issue

Versions: 7.11-7.13

Fixed in: 7.14+

Due to poor memory estimations, it is possible for a scale down request to accidentally require a scale up.

Here is a response that epitomizes the scenario:

    "ml": {
      "required_capacity": {
        "node": {
          "memory": 2520765440
        },
        "total": {
          "memory": 2520765440
        }
      },
      "current_capacity": {
        "node": {
          "storage": 0,
          "memory": 2147483648
        },
        "total": {
          "storage": 0,
          "memory": 6442450944
        }
      },
      "current_nodes": [
        {
          "name": "instance-0000000099"
        },
        {
          "name": "instance-0000000100"
        },
        {
          "name": "instance-0000000101"
        }
      ],
      "deciders": {
        "ml": {
          "required_capacity": {
            "node": {
              "memory": 2520765440
            },
            "total": {
              "memory": 2520765440
            }
          },
          "reason_summary": "Requesting scale down as tier and/or node size could be smaller",
          "reason_details": {
            "waiting_analytics_jobs": [],
            "waiting_anomaly_jobs": [],
            "configuration": {},
            "perceived_current_capacity": {
              "node": {
                "memory": 2503160627
              },
              "total": {
                "memory": 6074310888
              }
            },
            "required_capacity": {
              "node": {
                "memory": 2520765440
              },
              "total": {
                "memory": 2520765440
              }
            },
            "reason": "Requesting scale down as tier and/or node size could be smaller"
          }
        }
      }
    }

Note that the current node size is actually 2GB (2147483648 bytes), but ML's estimate is inflated by inappropriate rounding (2520765440 bytes, roughly 2.35GB). Because the required node capacity exceeded the current node capacity, this caused a scale up instead of the intended scale down.
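The mismatch that triggers the unwanted scale up can be seen by comparing the two node figures directly. A minimal, purely illustrative sketch (the class and variable names are hypothetical, not the actual autoscaling decider code):

    // Illustrative sketch of the failure mode, not the actual Elasticsearch code.
    public class ScaleDownOvershoot {
        public static void main(String[] args) {
            long currentNodeMemory = 2_147_483_648L;    // 2GiB node, from current_capacity above
            long mlRequiredNodeMemory = 2_520_765_440L; // ML's rounded-up estimate from the same response

            // ML intends a scale down, but the orchestrator only compares required vs. current capacity.
            if (mlRequiredNodeMemory > currentNodeMemory) {
                System.out.println("Orchestrator scales UP to satisfy " + mlRequiredNodeMemory + " bytes");
            } else {
                System.out.println("Orchestrator can scale down");
            }
        }
    }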

Workaround

If you are on an affected Elasticsearch version and this scenario occurs, you can statically set the minimum and maximum autoscaling sizes for ML inside Elastic Cloud.

@elasticmachine added the Team:ML and Team:Distributed (Obsolete) labels on Jun 29, 2021
@elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

benwtrent added a commit that referenced this issue Jun 30, 2021
[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (#74691)

This commit addresses two problems:

 - Our memory estimations are not very exact. Consequently, it is possible to request too much or too little by a handful of KBs. While this is not a large issue in ESS, it may be for custom tier sizes.
 - When scaling down, it was possible for part of the scale down to actually be a scale up! This was due to floating point rounding errors and poor estimations. Even though our estimations are better now, it is best to NOT request higher resources in a scale down, no matter what.

One of the ways we improve the calculation is during JVM size calculations. Instead of having the knot point be `2gb`, it has been changed to `1.2gb`. This accounts for the "window of uncertainty" for JVM sizes.

closes: #74709
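The second bullet of the commit message, never requesting higher resources during a scale down, amounts to capping the requested capacity at the current capacity. A rough sketch under that assumption (names and structure are illustrative, not the real decider implementation):

    // Hypothetical sketch of the "never ask for more during a scale down" guard
    // described in the commit message above.
    public class ScaleDownClamp {

        /** Cap a scale-down request at current capacity so rounding noise cannot flip it into a scale up. */
        static long clampForScaleDown(long requiredBytes, long currentBytes) {
            return Math.min(requiredBytes, currentBytes);
        }

        public static void main(String[] args) {
            long currentNode = 2_147_483_648L;   // 2GiB node from current_capacity
            long currentTier = 6_442_450_944L;   // 6GiB tier from current_capacity
            long requiredNode = 2_520_765_440L;  // inflated ML node estimate
            long requiredTier = 2_520_765_440L;  // ML tier estimate

            // Without the clamp, requiredNode > currentNode would force a scale up.
            System.out.println("node: " + clampForScaleDown(requiredNode, currentNode)); // 2147483648
            System.out.println("tier: " + clampForScaleDown(requiredTier, currentTier)); // 2520765440
        }
    }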
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Jun 30, 2021

[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (elastic#74691)

benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Jun 30, 2021

[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (elastic#74691)
benwtrent added a commit that referenced this issue Jul 1, 2021

[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (#74691) (#74780)

benwtrent added a commit that referenced this issue Jul 1, 2021

[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (#74691) (#74781)

benwtrent added a commit that referenced this issue Jul 1, 2021

[ML] prevent accidentally asking for more resources when scaling down and improve scaling size estimations (#74691) (#74782)