
[refactor]: Upgrade to support v18.X of community EKS module #91

Merged
merged 33 commits into master from refactor-vnext on Nov 17, 2022

Conversation

jrsdav
Contributor

@jrsdav jrsdav commented Oct 17, 2022

Fixes:

Motivation

This Terraform module relies on a community-driven AWS EKS module and has used version v17.X for the majority of our releases.

The most recent version of the community EKS module, v18.X, introduced significant breaking changes that prevented us from staying current with updates.

Reworking this module to be compatible with v18.X of the community EKS module has long been a goal, and it also presented a good opportunity to refine our configuration approach for the StreamNative platform.

This PR brings in the changes necessary to make this module compatible with v18.X, along with significant (and opinionated) changes that refine how we configure AWS EKS environments for StreamNative.

Modifications

Default Behavior

The behavior of the module has changed significantly. With a "vanilla" (default) configuration, the following will occur:

All Resources

  • The input additional_tags applies a custom set of AWS resource tags to all applicable resources created by this module (see the sketch below).
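
As an illustration, a minimal usage sketch of passing additional_tags might look like the following; the module source, cluster name, version, and tag values are assumptions for the example, not values taken from this PR:

module "sn_cluster" {
  source = "streamnative/cloud/aws" # placeholder source for illustration

  cluster_name    = "example-sn-cluster"
  cluster_version = "1.23"

  # Applied to every taggable AWS resource the module creates
  additional_tags = {
    Environment = "sandbox"
    Team        = "cloud-platform"
  }
}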

Cluster addons (autoscaler, cert-manager, csi, external-dns, metrics-server, node-termination-handler):

  • The cluster addons will continue to be installed by default
  • Addons will create an IAM role & IAM Policy

Node Groups

  • Desired node group size is set to 0 (with the exception of the small node group, which is set to 1).
  • Root volume disk size now defaults to 100 GiB
  • Node groups will be created for every availability zone and every instance class defined by the node_pool_instance_types default value (e.g. 3 AZs * 4 instance classes = 12 node groups); see the sketch below
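
As a hedged sketch of constraining the node group fan-out (the module source, instance types, and values shown are illustrative assumptions, not defaults from this PR):

module "sn_cluster" {
  source = "streamnative/cloud/aws" # placeholder source for illustration

  cluster_name = "example-sn-cluster"

  # With 3 availability zones, two instance classes yields 3 * 2 = 6 node groups
  node_pool_instance_types = ["m6i.large", "m6i.xlarge"]

  # Root volume size in GiB (100 is the new default)
  node_pool_disk_size = 100
}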

Note: The node group changes in this module will cause Terraform to want to destroy existing node groups. Because of this, it is recommended that you remove node group references from existing configurations before upgrading to this version (you will have to manually migrate away from and decommission the legacy node groups). An upgrade guide will be provided with more details.

Modified Behavior

Here is a brief overview of what can now be modified as part of the module:

  • The (now legacy) use_runtime_mode input is sticking around for backwards compatibility. Its sole responsibility is managing the EKS cluster IAM role in this module, as opposed to having it created/managed by the community EKS module. Continue setting it to true if you did so previously.
  • The enable_bootstrap input controls all of the add-ons installed by this module, including Istio (unless overridden by specifying enable_istio = true). If enable_bootstrap is set to false, nothing will be installed on the cluster (this allows us to use ArgoCD for installing add-ons for StreamNative's own configurations).
  • When the create_iam_policies input is set to false, the module will attach an existing IAM policy to each add-on IRSA IAM role rather than create one. It defaults to IAM policies named StreamNativeCloudRuntimePolicy and StreamNativeCloudLBPolicy, but custom policy ARNs can be provided using the sncloud_services_iam_policy_arn and sncloud_services_lb_policy_arn inputs. (See the sketch after this list.)
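
The sketch below combines these inputs; the module source is a placeholder, the account ID is fictitious, and the policy ARNs are only examples of the shape these inputs expect:

module "sn_cluster" {
  source = "streamnative/cloud/aws" # placeholder source for illustration

  cluster_name = "example-sn-cluster"

  # Keep legacy management of the EKS cluster IAM role for pre-existing clusters
  use_runtime_mode = true

  # Skip the bundled add-ons (e.g. when ArgoCD installs them), but keep Istio
  enable_bootstrap = false
  enable_istio     = true

  # Attach pre-existing IAM policies to the add-on IRSA roles instead of creating new ones
  create_iam_policies             = false
  sncloud_services_iam_policy_arn = "arn:aws:iam::123456789012:policy/StreamNativeCloudRuntimePolicy"
  sncloud_services_lb_policy_arn  = "arn:aws:iam::123456789012:policy/StreamNativeCloudLBPolicy"
}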

Removed

The following has been removed from the module:

Resources:

  • Calico
  • External Secrets
  • Functions Node Group
  • Custom CloudWatch Configurations

Sub-modules:

  • backup-resources (added directly in the main module)
  • managed-cloud (moved to github.com/streamnative/terraform-managed-cloud)
  • vault-resources (no longer used or needed)
  • tiered-storage-resources (added directly in the main module)

Variables:

  • aws_partition
  • calico_* (multiple inputs)
  • cluster_log_kms_key_id
  • disk_encryption_kms_key_id (renamed to disk_encryption_kms_key_arn)
  • enable_aws_load_balancer_controller
  • enable_calico
  • enable_cert_manager (now installed with enable_bootstrap)
  • enable_cluster_autoscaler (now installed with enable_bootstrap)
  • enable_csi (now installed with enable_bootstrap)
  • enable_external_secrets
  • enable_external_dns (now installed with enable_bootstrap)
  • enable_func_pool_monitoring
  • enable_metrics_server (now installed with enable_bootstrap)
  • external_secrets_* (multiple values)
  • extra_node_pool_instance_types
  • func_pool_* (multiple values)
  • istio_network_loadbancer (now controlled with disable_public_pulsar_endpoint)
  • map_additional_aws_accounts
  • map_additional_iam_users
  • node_pool_ami_is_eks_optimized (renamed to node_pool_ebs_optimized)
  • wait_for_cluster_timeout

Outputs:

  • cloudwatch_log_group_arn
  • node_groups (changed to eks_node_groups)
  • worker_iam_role_arn
  • worker_security_group_id (changed to eks_node_group_security_group_id)
  • worker_https_ingress_security_group_rule

Added

The following has been added to the module:

Resources:

  • Tiered Storage (used for Pulsar topic offloading)
  • Velero (used for backups)

Variables (a brief usage sketch follows this list):

  • s3_encryption_kms_key_arn
  • cluster_security_group_additional_rules
  • cluster_security_group_id
  • create_cluster_security_group
  • create_iam_policies
  • create_node_security_group
  • disable_public_pulsar_endpoint
  • disable_public_eks_endpoint
  • disk_encryption_kms_key_arn
  • enable_bootstrap
  • enable_sncloud_control_plane_access
  • node_security_group_additional_rules
  • node_security_group_id
  • node_pool_disk_iops
  • node_pool_ebs_optimized
  • velero_backup_schedule
  • velero_excluded_namespaces
  • velero_helm_chart_name
  • velero_helm_chart_repository
  • velero_helm_chart_version
  • velero_namespace
  • velero_plugin_version
  • velero_policy_arn
  • velero_settings
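
For illustration, a hedged sketch using a few of the new inputs; the module source, the values, and the exact types assumed here (a cron string for the schedule, a list for excluded namespaces) are assumptions for the example, not defaults from this PR:

module "sn_cluster" {
  source = "streamnative/cloud/aws" # placeholder source for illustration

  cluster_name = "example-sn-cluster"

  # Keep the EKS API and Pulsar endpoints publicly reachable
  disable_public_eks_endpoint    = false
  disable_public_pulsar_endpoint = false

  # Velero backup configuration
  velero_namespace           = "velero"
  velero_backup_schedule     = "0 5 * * *" # daily at 05:00 UTC (assumed cron format)
  velero_excluded_namespaces = ["kube-system"]
}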

Outputs:

  • eks_cluster_endpoint
  • eks_cluster_platform_version
  • eks_node_group_iam_role_arn
  • eks_node_group_security_group_id
  • eks_node_groups
  • tiered_storage_s3_bucket_arn
  • velero_s3_bucket_arn

Verifying this change

These changes are being verified against existing (upgrade/migration) and new environments.

Documentation

Check the box below.

Need to update docs?

  • doc-required

    (If you need help on updating docs, create a doc issue)

  • no-need-doc

    (Please explain why)

  • doc

    (If this PR contains doc changes)

Todo

  • More testing around node security groups
  • Update README.md
  • Write migration guide doc

@jrsdav jrsdav requested a review from a team as a code owner October 17, 2022 17:22
@github-actions
Contributor

@jrsdav: Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@github-actions github-actions bot added the doc-info-missing This pr needs to mark a document option in description label Oct 17, 2022
gengmao
gengmao previously approved these changes Oct 17, 2022
Contributor

@gengmao gengmao left a comment


LGTM.

@jrsdav jrsdav changed the title Refactor vnext [refactor]: Upgrade to support v18.X of community EKS module Oct 17, 2022
@jrsdav jrsdav marked this pull request as draft October 19, 2022 16:37
@gengmao
Contributor

gengmao commented Oct 19, 2022

@jrsdav do use_runtime_mode and create_iam_policies default to false for new pool members? Keep them (or set them to true) just for pool members provisioned before this refactor, correct?

@jrsdav
Contributor Author

jrsdav commented Oct 21, 2022

@jrsdav do use_runtime_mode and create_iam_policies default to false for new pool members? Keep them (or set them to true) just for pool members provisioned before this refactor, correct?

@gengmao Good catch -- use_runtime_policy should be set to false since we want the parent EKS module to manage the cluster's IAM role for net-new clusters.

But create_iam_policies should default to true to account for Platform/DIY users (otherwise the IAM roles will fail to provision correctly because the IAM policy StreamNativeCloudRuntimePolicy is not present).

@gengmao
Contributor

gengmao commented Nov 1, 2022

@jrsdav does the new version of the community AWS EKS module still self-manage the Amazon VPC CNI plugin? Is it possible to switch to AWS-managed k8s add-ons to avoid issues like https://github.com/streamnative/eng-support-tickets/issues/102 (part of https://github.com/streamnative/cloud-ops/issues/264)?

@jrsdav
Contributor Author

jrsdav commented Nov 2, 2022

@jrsdav does the new version of the community AWS EKS module still self-manage the Amazon VPC CNI plugin? Is it possible to switch to AWS-managed k8s add-ons to avoid issues like streamnative/eng-support-tickets#102 (part of streamnative/cloud-ops#264)?

@gengmao we will want to manage the plugin in self-managed mode as part of the bootstrap, as opposed to using the AWS-managed plugin. This gives us more control over versions and allows for cleaner adoption by existing self-managed environments.

I have created an issue to track this as part of the ArgoCD work.

John Daniel Maguire and others added 2 commits November 4, 2022 12:42
@@ -151,21 +159,21 @@ resource "helm_release" "cluster_autoscaler" {
}
]
image = {
tag = lookup(local.k8s_to_autoscaler_version, var.cluster_version, "v1.20.1") # image.tag defaults to the version corresponding to var.cluster_version's default value and must manually be updated
tag = lookup(local.k8s_to_autoscaler_version, var.cluster_version, "v1.21.1") # image.tag defaults to the version corresponding to var.cluster_version's default value and must manually be updated
Contributor


I think we should default to 1.22 or 1.23.

description = "Desired number of worker nodes in the node pool."
type = number
}

variable "node_pool_disk_size" {
default = 50
default = 100
description = "Disk size in GiB for worker nodes in the node pool. Defaults to 50."
Contributor


Suggested change
description = "Disk size in GiB for worker nodes in the node pool. Defaults to 50."
description = "Disk size in GiB for worker nodes in the node pool. Defaults to 100."

@@ -138,7 +114,7 @@ variable "cluster_autoscaler_helm_chart_repository" {
}

variable "cluster_autoscaler_helm_chart_version" {
default = "9.19.2"
default = "9.21.0"
description = "Helm chart version for the cluster-autoscaler. Defaults to \"9.10.4\". See https://github.com/kubernetes/autoscaler/tree/master/charts/cluster-autoscaler for more details."
Contributor

@jdmaguire jdmaguire Nov 9, 2022


I think we should get rid of these "Defaults to X" notes in the descriptions. When we generate the docs, terraform-docs actually puts the default in the table.

Comment on lines +211 to 215
variable "disable_public_pulsar_endpoint" {
default = false
description = "Whether or not to make the Istio Gateway use a public facing or internal network load balancer. If set to \"true\", additional configuration is required in order to manage the cluster from the StreamNative console"
type = bool
}
Contributor

@jdmaguire jdmaguire Nov 9, 2022


I like this new variable name. disable_public_pulsar_endpoint is more descriptive than the old name of istio_network_loadbancer

default = "gp3"
description = "Disk type for function worker nodes. Defaults to gp3."
variable "hosted_zone_id" {
default = "*"
Contributor


Is * a good default?

@@ -109,11 +116,12 @@ locals {
"1.20" = "v1.20.1",
"1.21" = "v1.21.1",
"1.22" = "v1.22.1",
"1.23" = "v1.23.0"
Contributor


Suggested change
"1.23" = "v1.23.0"
"1.23" = "v1.23.1"

The .1 is released. Should we use it?

Contributor


1.24.0 and 1.25.0 are also released. Should we add them to the map?

@jrsdav jrsdav changed the title [refactor WIP]: Upgrade to support v18.X of community EKS module [refactor]: Upgrade to support v18.X of community EKS module Nov 9, 2022
@jrsdav jrsdav marked this pull request as ready for review November 9, 2022 20:24
@addisonj
Contributor

I fixed the merge conflict

@jdmaguire jdmaguire merged commit 2413b7e into master Nov 17, 2022
@jdmaguire jdmaguire deleted the refactor-vnext branch November 17, 2022 19:24