k8s 1.30, updates and fix scale from zero (#164)
tmisch authored Oct 1, 2024
2 parents f4d1766 + c5fa92d commit 7d7274f
Showing 8 changed files with 56 additions and 19 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -485,6 +485,7 @@ Encryption is enabled at all AWS resources that are created by Terraform:
| Name | Type |
|------|------|
| [aws_autoscaling_group_tag.execnodes](https://registry.terraform.io/providers/hashicorp/aws/5.37.0/docs/resources/autoscaling_group_tag) | resource |
+| [aws_autoscaling_group_tag.execnodes_node-template_resources_ephemeral-storage](https://registry.terraform.io/providers/hashicorp/aws/5.37.0/docs/resources/autoscaling_group_tag) | resource |
| [aws_autoscaling_group_tag.gpuexecnodes](https://registry.terraform.io/providers/hashicorp/aws/5.37.0/docs/resources/autoscaling_group_tag) | resource |
| [aws_autoscaling_group_tag.gpuivsnodes](https://registry.terraform.io/providers/hashicorp/aws/5.37.0/docs/resources/autoscaling_group_tag) | resource |
| [aws_cloudwatch_log_group.flowlogs](https://registry.terraform.io/providers/hashicorp/aws/5.37.0/docs/resources/cloudwatch_log_group) | resource |
@@ -548,7 +549,7 @@ Encryption is enabled at all AWS resources that are created by Terraform:
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_cloudwatch_retention"></a> [cloudwatch\_retention](#input\_cloudwatch\_retention) | Global cloudwatch retention period for the EKS, VPC, SSM, and PostgreSQL logs. | `number` | `7` | no |
-| <a name="input_cluster_autoscaler_helm_config"></a> [cluster\_autoscaler\_helm\_config](#input\_cluster\_autoscaler\_helm\_config) | Cluster Autoscaler Helm Config | `any` | <pre>{<br> "version": "9.34.1"<br>}</pre> | no |
+| <a name="input_cluster_autoscaler_helm_config"></a> [cluster\_autoscaler\_helm\_config](#input\_cluster\_autoscaler\_helm\_config) | Cluster Autoscaler Helm Config | `any` | `{}` | no |
| <a name="input_codemeter"></a> [codemeter](#input\_codemeter) | Download link for codemeter rpm package. | `string` | `"https://www.wibu.com/support/user/user-software/file/download/13346.html?tx_wibudownloads_downloadlist%5BdirectDownload%5D=directDownload&tx_wibudownloads_downloadlist%5BuseAwsS3%5D=0&cHash=8dba7ab094dec6267346f04fce2a2bcd"` | no |
| <a name="input_ecr_pullthrough_cache_rule_config"></a> [ecr\_pullthrough\_cache\_rule\_config](#input\_ecr\_pullthrough\_cache\_rule\_config) | Specifies if ECR pull through cache rule and accompanying resources will be created. Key 'enable' indicates whether pull through cache rule needs to be enabled for the cluster. When 'enable' is set to 'true', key 'exist' indicates whether pull through cache rule already exists for region's private ECR. If key 'enable' is set to 'true', IAM policy will be attached to the cluster's nodes. Additionally, if 'exist' is set to 'false', credentials for upstream registry and pull through cache rule will be created | <pre>object({<br> enable = bool<br> exist = bool<br> })</pre> | <pre>{<br> "enable": false,<br> "exist": false<br>}</pre> | no |
| <a name="input_enable_aws_for_fluentbit"></a> [enable\_aws\_for\_fluentbit](#input\_enable\_aws\_for\_fluentbit) | Install FluentBit to send container logs to CloudWatch. | `bool` | `false` | no |
@@ -568,7 +569,7 @@ Encryption is enabled at all AWS resources that are created by Terraform:
| <a name="input_ivsGpuNodeDiskSize"></a> [ivsGpuNodeDiskSize](#input\_ivsGpuNodeDiskSize) | The disk size in GiB of the nodes for the IVS gpu job execution | `number` | `100` | no |
| <a name="input_ivsGpuNodePool"></a> [ivsGpuNodePool](#input\_ivsGpuNodePool) | Specifies whether an additional node pool for IVS gpu job execution is added to the kubernetes cluster | `bool` | `false` | no |
| <a name="input_ivsGpuNodeSize"></a> [ivsGpuNodeSize](#input\_ivsGpuNodeSize) | The machine size of the GPU nodes for IVS jobs | `list(string)` | <pre>[<br> "g4dn.2xlarge"<br>]</pre> | no |
-| <a name="input_kubernetesVersion"></a> [kubernetesVersion](#input\_kubernetesVersion) | The version of the EKS cluster. | `string` | `"1.28"` | no |
+| <a name="input_kubernetesVersion"></a> [kubernetesVersion](#input\_kubernetesVersion) | The kubernetes version of the EKS cluster. | `string` | `"1.30"` | no |
| <a name="input_licenseServer"></a> [licenseServer](#input\_licenseServer) | Specifies whether a license server VM will be created. | `bool` | `false` | no |
| <a name="input_linuxExecutionNodeCountMax"></a> [linuxExecutionNodeCountMax](#input\_linuxExecutionNodeCountMax) | The maximum number of Linux nodes for the job execution | `number` | `10` | no |
| <a name="input_linuxExecutionNodeCountMin"></a> [linuxExecutionNodeCountMin](#input\_linuxExecutionNodeCountMin) | The minimum number of Linux nodes for the job execution | `number` | `0` | no |
15 changes: 14 additions & 1 deletion k8s.tf
@@ -38,7 +38,7 @@ module "eks-addons" {
     dependency_update = true
   }
 
-  cluster_autoscaler_helm_config = var.cluster_autoscaler_helm_config
+  cluster_autoscaler_helm_config = merge(local.cluster_autoscaler_helm_config, var.cluster_autoscaler_helm_config)
   #depends_on = [module.eks.managed_node_groups]
 }
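
Reviewer note: a minimal sketch of how the `merge()` call in this hunk behaves, assuming Terraform's documented shallow-merge semantics (for duplicate keys, the later argument wins). The local names `defaults`, `user_override`, and `effective`, the shortened tag string, and the override version `0.0.0-example` are hypothetical and not part of this commit.

```hcl
locals {
  # Stands in for local.cluster_autoscaler_helm_config (values shortened, hypothetical).
  defaults = {
    version = "9.37.0"
    set = [
      {
        name  = "autoDiscovery.tags"
        value = "{tag-a,tag-b}"
      }
    ]
  }

  # Stands in for what a user might pass as var.cluster_autoscaler_helm_config.
  user_override = {
    version = "0.0.0-example" # hypothetical pinned chart version
  }

  # merge() is shallow: it compares top-level keys only, and later arguments win.
  effective = merge(local.defaults, local.user_override)
}

output "effective_helm_config" {
  value = local.effective
}
```

In this sketch `effective.version` comes from `user_override`, while `effective.set` is kept from `defaults`. Because the merge is shallow, a caller who passes their own `set` list would replace the default list as a whole, including the `autoDiscovery.tags` entry, rather than extend it.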

@@ -72,6 +72,19 @@ resource "aws_autoscaling_group_tag" "execnodes" {
   }
 }
 
+# see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup
+# https://github.com/kubernetes/autoscaler/issues/1869#issuecomment-518530724
+resource "aws_autoscaling_group_tag" "execnodes_node-template_resources_ephemeral-storage" {
+  autoscaling_group_name = data.aws_eks_node_group.execnodes.resources[0].autoscaling_groups[0].name
+
+  tag {
+    key = "k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage"
+    value = "${var.linuxExecutionNodeDiskSize}G"
+
+    propagate_at_launch = true
+  }
+}
+
 resource "aws_autoscaling_group_tag" "gpuexecnodes" {
   count = var.gpuNodePool ? 1 : 0
   autoscaling_group_name = data.aws_eks_node_group.gpuexecnodes[0].resources[0].autoscaling_groups[0].name
26 changes: 26 additions & 0 deletions locals.tf
@@ -33,6 +33,32 @@ locals {
# Using a one-line command for gpuPostUserData to avoid issues due to different line endings between Windows and Linux.
gpuPostUserData = "sudo yum -y erase nvidia-driver \nsudo yum -y install make gcc \nsudo yum -y update \nsudo yum -y install gcc kernel-devel-$(uname -r) \nsudo curl -fSsl -O https://us.download.nvidia.com/tesla/${var.gpuNvidiaDriverVersion}/NVIDIA-Linux-x86_64-${var.gpuNvidiaDriverVersion}.run \nsudo chmod +x NVIDIA-Linux-x86_64*.run \nsudo CC=/usr/bin/gcc10-cc ./NVIDIA-Linux-x86_64*.run -s --no-dkms --install-libglvnd \nsudo touch /etc/modprobe.d/nvidia.conf \necho \"options nvidia NVreg_EnableGpuFirmware=0\" | sudo tee --append /etc/modprobe.d/nvidia.conf \nsudo reboot"

+  # https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/8a06a6e7006e4bed5630bd49c7434d76c59e0b5e/modules/kubernetes-addons/variables.tf#L183
+  cluster_autoscaler_autodiscovery_tags = [
+    # Helm value array can only be fully replaced, there is no mechanism to just append values to the list of default tags.
+    # Thus, we manually add the default values from
+    # https://github.com/kubernetes/autoscaler/blob/19fe7aba7ec4007084ccea82221b8a52bac42b34/charts/cluster-autoscaler/values.yaml#L23
+    # here as well:
+    "k8s.io/cluster-autoscaler/enabled",
+    "k8s.io/cluster-autoscaler/${var.infrastructurename}",
+    # and now our additional value(s)
+    # see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup
+    "k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage"
+  ]
+  cluster_autoscaler_helm_config = {
+    # NOTE: This version needs to be updated at least on kubernetes version changes (variables.tf: 'kubernetesVersion').
+    # See https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#releases
+    # to determine the correct version.
+    version = "9.37.0"
+
+    set = [
+      {
+        name = "autoDiscovery.tags"
+        value = "{${join(",", local.cluster_autoscaler_autodiscovery_tags)}}"
+      }
+    ]
+  }
+
   default_managed_node_pools = {
     "default" = {
       node_group_name = "default"
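
Reviewer note: a small sketch of what the `autoDiscovery.tags` expression in this hunk evaluates to, assuming `var.infrastructurename` were set to the hypothetical value `demo`. The names `example_tags` and `example_value` are made up for illustration and do not exist in the repository.

```hcl
locals {
  # Same three tags as in the hunk, with "demo" standing in for var.infrastructurename.
  example_tags = [
    "k8s.io/cluster-autoscaler/enabled",
    "k8s.io/cluster-autoscaler/demo",
    "k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage",
  ]

  # join() concatenates the list with commas; the literal braces around the
  # interpolation turn the result into Helm's {a,b,c} list syntax for set values.
  example_value = "{${join(",", local.example_tags)}}"
}

output "example_autodiscovery_tags_value" {
  value = local.example_value
}
```

Under these assumptions the output is the single string `{k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/demo,k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage}`, which Helm parses as a three-element list. This is why the chart's default tags have to be repeated by hand: the list is replaced as a whole, not appended to.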
5 changes: 3 additions & 2 deletions modules/simphera_aws_instance/minio-storage.tf
@@ -1,7 +1,8 @@
 resource "aws_iam_role" "minio_iam_role" {
   name = "${local.instancename}-s3-role"
   description = "IAM role for the MinIO service account"
-  tags = var.tags
+  #depends_on = [aws_iam_policy.minio_policy]
+  tags = var.tags
   assume_role_policy = jsonencode({
     "Version" : "2012-10-17",
     "Statement" : [
@@ -37,7 +38,7 @@ resource "aws_iam_policy" "minio_policy" {

 resource "aws_iam_role" "executor_role" {
   name = "${var.name}-executoragentlinux"
-
+  #depends_on = [ aws_iam_policy.minio_policy ]
   # Terraform's "jsonencode" function converts a
   # Terraform expression result to valid JSON syntax.
   assume_role_policy = jsonencode({
4 changes: 2 additions & 2 deletions modules/simphera_aws_instance/variables.tf
@@ -85,13 +85,13 @@ variable "postgresqlMaxStorageKeycloak" {
 variable "db_instance_type_keycloak" {
   type = string
   description = "PostgreSQL database instance type for Keycloak data"
-  default = "db.t3.large"
+  default = "db.t4g.large"
 }
 
 variable "db_instance_type_simphera" {
   type = string
   description = "PostgreSQL database instance type for SIMPHERA data"
-  default = "db.t3.large"
+  default = "db.t4g.large"
 }
 
 variable "k8s_namespace" {
6 changes: 2 additions & 4 deletions terraform.json.example
@@ -1,8 +1,6 @@
 {
   "cloudwatch_retention": 7,
-  "cluster_autoscaler_helm_config": {
-    "version": "9.34.1"
-  },
+  "cluster_autoscaler_helm_config": {},
"codemeter": "https://www.wibu.com/support/user/user-software/file/download/13346.html?tx_wibudownloads_downloadlist%5BdirectDownload%5D=directDownload&tx_wibudownloads_downloadlist%5BuseAwsS3%5D=0&cHash=8dba7ab094dec6267346f04fce2a2bcd",
"ecr_pullthrough_cache_rule_config": {
"enable": false,
@@ -31,7 +29,7 @@
   "ivsGpuNodeSize": [
     "g4dn.2xlarge"
   ],
-  "kubernetesVersion": "1.28",
+  "kubernetesVersion": "1.30",
   "licenseServer": false,
   "linuxExecutionNodeCountMax": 10,
   "linuxExecutionNodeCountMin": 0,
8 changes: 3 additions & 5 deletions terraform.tfvars.example
@@ -3,9 +3,7 @@
 cloudwatch_retention = 7
 
 # Cluster Autoscaler Helm Config
-cluster_autoscaler_helm_config = {
-  "version": "9.34.1"
-}
+cluster_autoscaler_helm_config = {}

# Download link for codemeter rpm package.
codemeter = "https://www.wibu.com/support/user/user-software/file/download/13346.html?tx_wibudownloads_downloadlist%5BdirectDownload%5D=directDownload&tx_wibudownloads_downloadlist%5BuseAwsS3%5D=0&cHash=8dba7ab094dec6267346f04fce2a2bcd"
@@ -81,8 +79,8 @@ ivsGpuNodeSize = [
   "g4dn.2xlarge"
 ]
 
-# The version of the EKS cluster.
-kubernetesVersion = "1.28"
+# The kubernetes version of the EKS cluster.
+kubernetesVersion = "1.30"
 
 # Specifies whether a license server VM will be created.
 licenseServer = false
6 changes: 3 additions & 3 deletions variables.tf
@@ -138,8 +138,8 @@ variable "codemeter" {

 variable "kubernetesVersion" {
   type = string
-  description = "The version of the EKS cluster."
-  default = "1.28"
+  description = "The kubernetes version of the EKS cluster."
+  default = "1.30"
 }
 
 variable "vpcId" {
@@ -329,5 +329,5 @@ variable "cloudwatch_retention" {
 variable "cluster_autoscaler_helm_config" {
   type = any
   description = "Cluster Autoscaler Helm Config"
-  default = { "version" : "9.34.1" }
+  default = {}
 }
