
Cannot remove topology blocks once added #445

Closed · shashwat-sec opened this issue Feb 24, 2022 · 9 comments
Labels: bug (Something isn't working), theme:topology

shashwat-sec commented Feb 24, 2022

Readiness Checklist

  • I am running the latest version
  • I checked the documentation and found no answer
  • I checked to make sure that this issue has not already been filed
  • I am reporting the issue to the correct repository (for multi-repository projects)

Current Behavior

elasticsearch {
  dynamic "topology" {
    for_each = [for i in var.dr ? ["hot_content"] : ["coordinating", "hot_content", "master"] : {
      topology = i
    }]
    content {
      id         = topology.value.topology
      zone_count = var.zone_count
      size       = topology.value.topology == "hot_content" ? var.data_size : topology.value.topology == "coordinating" ? var.coordinating_size : topology.value.topology == "master" ? var.master_size : null
    }
  }
}

I am trying to create a scaled-down Elastic cluster with just hot_content in a single zone when var.dr is set to true; otherwise it should create a full multi-zone cluster.
Creating the scaled-down version works fine. Scaling up by changing var.dr to false also works fine.
But when I try to scale the cluster down again by setting var.dr to true, it tries to delete the hot_content block and modify the coordinating block into hot_content.

Plan output:

~ elasticsearch {
            # (7 unchanged attributes hidden)

          ~ topology {
              ~ id                        = "coordinating" -> "hot_content"
                # (6 unchanged attributes hidden)
            }
          - topology {
              - config                    = [] -> null
              - id                        = "hot_content" -> null
              - instance_configuration_id = "gcp.data.highio.1" -> null
              - node_roles                = [
                  - "data_content",
                  - "data_hot",
                  - "remote_cluster_client",
                  - "transform",
                ] -> null
              - size                      = "1g" -> null
              - size_resource             = "memory" -> null
              - zone_count                = 1 -> null

              - autoscaling {
                  - max_size          = "128g" -> null
                  - max_size_resource = "memory" -> null
                }
            }
          - topology {
              - config                    = [] -> null
              - id                        = "master" -> null
              - instance_configuration_id = "gcp.master.1" -> null
              - node_roles                = [
                  - "master",
                  - "remote_cluster_client",
                ] -> null
              - size                      = "4g" -> null
              - size_resource             = "memory" -> null
              - zone_count                = 1 -> null
            }

            # (1 unchanged block hidden)
        }

Applying this plan errors out:

Error: failed updating deployment: 2 errors occurred:
│       * api error: clusters.cluster_invalid_plan: Cluster must contain at least a master topology element and a data topology element. 'master' node type is missing,'data' node type is missing,'master' node type exists in more than one topology element (resources.elasticsearch[0].cluster_topology)
│       * api error: deployments.elasticsearch.node_roles_error: Invalid node_roles configuration: The data roles in the plan must be the same as the data roles in the template [id = hot_content] (resources.elasticsearch[0])

I also tested keeping the coordinating block as-is and deleting just the master block.
The plan output looks fine:

 ~ elasticsearch {
            # (7 unchanged attributes hidden)

          - topology {
              - config                    = [] -> null
              - id                        = "master" -> null
              - instance_configuration_id = "gcp.master.1" -> null
              - node_roles                = [
                  - "master",
                  - "remote_cluster_client",
                ] -> null
              - size                      = "4g" -> null
              - size_resource             = "memory" -> null
              - zone_count                = 1 -> null
            }

            # (3 unchanged blocks hidden)
        }

But applying this plan also errors out:

Error: failed updating deployment: 1 error occurred:
│       * api error: clusters.cluster_invalid_plan: Cluster must contain at least a master topology element and a data topology element. 'master' node type is missing,'master' node type exists in more than one topology element (resources.elasticsearch[0].cluster_topology)

Expected Behavior

Ideally, the provider should create/delete topology elements as we add/remove those topology blocks from the elasticsearch block. This scaling up and down works fine through the console, so the provider should follow similar behaviour.


Steps to Reproduce

The Terraform code and steps to reproduce are included above.


Your Environment

  • Version used: 0.3.0
  • Running against Elastic Cloud SaaS or Elastic Cloud Enterprise and version: Elastic Cloud SaaS
  • Environment name and version (e.g. Go 1.9):
  • Server type and version:
  • Operating System and version:
  • Link to your project:
shashwat-sec added the bug and Team:Delivery labels on Feb 24, 2022
tobio (Member) commented Feb 27, 2022

This is a Terraform issue, not one specific to this provider. The order in which these resources are declared in the rendered Terraform definition files matters.

You can likely update your resource definition to include the hot_content tier first and the 'optional' elements second, i.e.:

for_each = [for i in var.dr ? ["hot_content"] : ["coordinating", "hot_content", "master"]: {

becomes

for_each = [for i in var.dr ? ["hot_content"] : ["hot_content", "coordinating", "master"]: {
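In context, the reordered dynamic block would look like this (a sketch reusing the expressions from the original report, with only the list order changed):

dynamic "topology" {
  # hot_content always comes first; the 'optional' tiers follow.
  for_each = [for i in var.dr ? ["hot_content"] : ["hot_content", "coordinating", "master"] : {
    topology = i
  }]
  content {
    id         = topology.value.topology
    zone_count = var.zone_count
    size       = topology.value.topology == "hot_content" ? var.data_size : topology.value.topology == "coordinating" ? var.coordinating_size : topology.value.topology == "master" ? var.master_size : null
  }
}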

shashwat-sec (Author) commented

@tobio I tried that as well. But somehow, it is expecting coordinating to be at the top.

tobio (Member) commented Mar 16, 2022

@shashwat-sec sorry about the bad suggestion there. Digging into the code, it looks like this behaviour is deeply tied to how these resources are managed.

We can look at fixing this behaviour; however, that will take some time. An option that works right now is to include all the expected topology elements but set the size to 0 for the elements you don't want present in the deployment. Something like:

elasticsearch {
  topology {
    id = "hot_content"
    zone_count = var.zone_count
    size = var.data_size
  }

  topology {
    id = "coordinating"
    zone_count = var.zone_count
    size = var.dr ? 0 : var.coordinating_size
  }

  topology {
    id = "master"
    zone_count = var.zone_count
    size = var.dr ? 0 : var.master_size
  }
} 
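With all three elements always declared, toggling var.dr only changes sizes rather than adding or removing blocks, so Terraform never has to reorder or delete topology elements from the list.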

IanMoroney commented

@tobio, I'm actually experiencing the same issue, but it appears to be worse than described.
In addition to the plan changing depending on whether you declare these topologies and set them to 0, I also have two identical clusters described in the same way, and the topology order for each of them is different.

The staging plan wants to change the first declared topology from hot_content to cold, and the prod deployment wants to set the first topology back to hot_content, so I can't even describe the Terraform resource the same way for my staging and prod clusters.

This urgently needs fixing, either so that ordering isn't an issue, or so that the provider expects the same order every time and keeps it consistent.

tobio (Member) commented Jun 21, 2022

@IanMoroney can you include your resource/module definition and the values of any vars defined? Is autoscale=true in one of the prod/staging deployments?

this urgently needs fixing so that the ordering isn't an issue, or expect the same order every time, and to make it consistent.

Agreed, this behaviour is very frustrating. There's some ongoing investigation into solving this problem, but unfortunately there's no quick win to make ordering a non-issue. The provider should already expect the same order every time; send through the definition and we can work out what's going on from there.

IanMoroney commented

It is possible that you may not be able to replicate my exact scenario, as my two ES deployments run wildly different versions.

main.tf


resource "ec_deployment" "search" {

  name = "${var.environment}-search"

  region                 = "azure-northeurope"
  version                = var.elasticsearch_version
  deployment_template_id = "azure-io-optimized"

  elasticsearch {
    topology {
      id            = "hot_content"
      size          = var.elasticsearch_size
      size_resource = "memory"
      zone_count    = 2
    }
  }

  kibana {}

}

variable "environment" {
  type        = string
  description = "The environment where resources are being provisioned. Mainly used as a name prefix."
}


variable "elasticsearch_version" {
  type        = string
  description = "The version of elasticsearch to provision the cluster."
}
variable "elasticsearch_size" {
  type        = string
  description = "The size of elasticsearch deployment."
}

In the staging environment (on the EC deployment itself), autoscaling is enabled. It is not currently defined in the Terraform, and maybe that's contributing to the confusion.
Prod doesn't have autoscaling enabled, which is likely why its plan shows no changes.

staging.tfvars

environment                     = "staging"
elasticsearch_version           = "7.16.1"
elasticsearch_size              = "1g"

prod.tfvars

environment                     = "prod"
elasticsearch_version           = "7.9.3"
elasticsearch_size              = "29g"

staging plan file:

Terraform will perform the following actions:

  # ec_deployment.b2c_search[0] will be updated in-place
  ~ resource "ec_deployment" "b2c_search" {
        id                     = "33f9f08f2a235c1cc51c6468dd549a7a"
        name                   = "staging-b2c-search"
        tags                   = {}
        # (6 unchanged attributes hidden)

      ~ elasticsearch {
            # (7 unchanged attributes hidden)

          ~ topology {
              ~ id                        = "cold" -> "hot_content"
              ~ size                      = "0g" -> "2g"
              ~ zone_count                = 1 -> 2
                # (4 unchanged attributes hidden)

                # (1 unchanged block hidden)
            }
          - topology {
              - config                    = [] -> null
              - id                        = "frozen" -> null
              - instance_configuration_id = "azure.es.datafrozen.lsv2" -> null
              - node_roles                = [
                  - "data_frozen",
                ] -> null
              - size                      = "0g" -> null
              - size_resource             = "memory" -> null
              - zone_count                = 1 -> null

              - autoscaling {
                  - max_size          = "120g" -> null
                  - max_size_resource = "memory" -> null
                }
            }
          - topology {
              - config                    = [] -> null
              - id                        = "hot_content" -> null
              - instance_configuration_id = "azure.data.highio.l32sv2" -> null
              - node_roles                = [
                  - "data_content",
                  - "data_hot",
                  - "ingest",
                  - "master",
                  - "remote_cluster_client",
                  - "transform",
                ] -> null
              - size                      = "2g" -> null
              - size_resource             = "memory" -> null
              - zone_count                = 2 -> null

              - autoscaling {
                  - max_size             = "29g" -> null
                  - max_size_resource    = "memory" -> null
                  - policy_override_json = jsonencode(
                        {
                          - proactive_storage = {
                              - forecast_window = "30 m"
                            }
                        }
                    ) -> null
                }
            }
          - topology {
              - config                    = [] -> null
              - id                        = "ml" -> null
              - instance_configuration_id = "azure.ml.d64sv3" -> null
              - node_roles                = [
                  - "ml",
                  - "remote_cluster_client",
                ] -> null
              - size                      = "0g" -> null
              - size_resource             = "memory" -> null
              - zone_count                = 1 -> null

              - autoscaling {
                  - max_size          = "60g" -> null
                  - max_size_resource = "memory" -> null
                  - min_size          = "0g" -> null
                  - min_size_resource = "memory" -> null
                }
            }
          - topology {
              - config                    = [] -> null
              - id                        = "warm" -> null
              - instance_configuration_id = "azure.data.highstorage.e16sv3" -> null
              - node_roles                = [
                  - "data_warm",
                  - "remote_cluster_client",
                ] -> null
              - size                      = "0g" -> null
              - size_resource             = "memory" -> null
              - zone_count                = 2 -> null

              - autoscaling {
                  - max_size          = "116g" -> null
                  - max_size_resource = "memory" -> null
                }
            }

            # (1 unchanged block hidden)
        }


        # (2 unchanged blocks hidden)
    }

prod plan file:

no changes need to be made

IanMoroney commented

So it does appear that autoscaling is the culprit here for me.
When autoscaling is enabled, it adds these phantom topologies which don't actually exist yet. If autoscaling is disabled, they all go away and the plan is happy.

The ordering issue is still apparent though when autoscaling is enabled, so I wonder if autoscaling forces a topology order?

tobio (Member) commented Jun 22, 2022

@IanMoroney that's correct. When autoscaling is enabled, all topology elements that may be autoscaled into existence must be defined. We've recently merged a docs change with an updated example for exactly this scenario. Reading over those docs again, we should detail exactly which elements have a non-zero max_size by default (that's cold, frozen, hot_content, ml, and warm).
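For illustration, a minimal sketch of that shape, assuming the pre-1.0 ec_deployment schema (where autoscale is a string attribute on the elasticsearch block); the literal sizes are placeholders, and only the tier ids come from the list above:

elasticsearch {
  # Assumption: autoscale is a string attribute in the 0.x schema.
  autoscale = "true"

  # With autoscaling on, declare every tier that can be autoscaled into
  # existence, even the ones that should start at zero size.
  topology {
    id   = "hot_content"
    size = "2g"
  }
  topology {
    id   = "cold"
    size = "0g"
  }
  topology {
    id   = "frozen"
    size = "0g"
  }
  topology {
    id   = "ml"
    size = "0g"
  }
  topology {
    id   = "warm"
    size = "0g"
  }
}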

Lmk if you've got any feedback on those docs as well; there's always room for improvement.

dimuon (Contributor) commented Mar 1, 2023

Closed by #567

dimuon closed this as completed on Mar 1, 2023