feat: Add pattern that demonstrates using ML capacity block reservation with self-managed node group (#1941)
1 parent f9fca1d, commit 447cb5f
Showing 11 changed files with 288 additions and 13 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
--- | ||
title: ML Capacity Block Reservation (CBR) | ||
--- | ||
|
||
{% | ||
include-markdown "../../patterns/ml-capacity-block/README.md" | ||
%} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# EKS w/ ML Capacity Block Reservation (CBR) | ||
|
||
This pattern demonstrates how to use ML capacity block reservations (CBR) with Amazon EKS. The solution consists primarily of two components: | ||
|
||
!!! warning | ||
Self-managed node groups are currently required to support capacity block reservations within EKS. This pattern will be updated to demonstrate EKS managed node groups once the EKS service supports them. | ||
|
||
1. Restrict the subnets provided to the self-managed node group that uses the CBR to the availability zone where the CBR has been allocated. For example, if the CBR is allocated to `us-west-2b`, provide only subnet IDs that reside in `us-west-2b`. If subnets in other AZs are provided, it's possible to encounter an error such as `InvalidParameterException: The following supplied instance types do not exist ...`. This error is not guaranteed to appear and may seem random, because the underlying autoscaling group provisions nodes into different AZs at random. It only occurs when the autoscaling group tries to provision instances into an AZ where capacity is not allocated and there is insufficient on-demand capacity for the desired instance type. | ||
|
||
2. The launch template should specify the `instance_market_options` and `capacity_reservation_specification` arguments. This is how the node group uses the CBR: these arguments tell the autoscaling group to launch instances using the provided capacity reservation. | ||
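
The subnet restriction described in item 1 could be sketched as follows. This is a minimal illustration, not part of the pattern: `cbr_az` and `cbr_subnet_ids` are hypothetical names, and the sketch assumes that `module.vpc.private_subnets` is returned in the same order as `local.azs` (one private subnet per AZ).

```terraform
locals {
  # Assumption: the AZ where the capacity block reservation was allocated
  cbr_az = "us-west-2b"

  # Keep only the private subnet whose position matches the CBR's AZ.
  # Assumes module.vpc creates private subnets in the same order as local.azs.
  cbr_subnet_ids = [
    for idx, az in local.azs : module.vpc.private_subnets[idx] if az == local.cbr_az
  ]
}
```

The self-managed node group could then set `subnet_ids = local.cbr_subnet_ids` so that its autoscaling group only launches instances in the AZ that holds the reserved capacity. If `local.cbr_az` is not among `local.azs`, the resulting list is empty, so it is worth confirming that the VPC actually has a subnet in that AZ.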
|
||
<b>Links:</b> | ||
|
||
- [EKS - Capacity Blocks for ML](https://docs.aws.amazon.com/eks/latest/userguide/capacity-blocks.html) | ||
- [EC2 - Capacity Blocks for ML](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html) | ||
|
||
## Code | ||
|
||
```terraform hl_lines="53-93" | ||
{% include "../../patterns/ml-capacity-block/eks.tf" %} | ||
``` | ||
|
||
## Deploy | ||
|
||
See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started/#prerequisites) for the prerequisites and steps to deploy this pattern. | ||
|
||
## Destroy | ||
|
||
{% | ||
include-markdown "../../docs/_partials/destroy.md" | ||
%} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,144 @@ | ||
################################################################################ | ||
# Required Input | ||
################################################################################ | ||
|
||
# See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-blocks-using.html | ||
# on how to obtain an ML capacity block reservation. Once acquired, you can provide | ||
# the reservation ID through this input to deploy the pattern | ||
variable "capacity_reservation_id" { | ||
description = "The ID of the ML capacity block reservation to use for the node group" | ||
type = string | ||
} | ||
|
||
################################################################################ | ||
# Cluster | ||
################################################################################ | ||
|
||
module "eks" { | ||
source = "terraform-aws-modules/eks/aws" | ||
version = "~> 20.9" | ||
|
||
cluster_name = local.name | ||
cluster_version = "1.29" | ||
|
||
# Give the Terraform identity admin access to the cluster | ||
# which will allow it to deploy resources into the cluster | ||
enable_cluster_creator_admin_permissions = true | ||
cluster_endpoint_public_access = true | ||
|
||
cluster_addons = { | ||
coredns = {} | ||
kube-proxy = {} | ||
vpc-cni = {} | ||
} | ||
|
||
# Add security group rules on the node group security group to | ||
# allow EFA traffic | ||
enable_efa_support = true | ||
|
||
vpc_id = module.vpc.vpc_id | ||
subnet_ids = module.vpc.private_subnets | ||
|
||
eks_managed_node_groups = { | ||
# This node group is for core addons such as CoreDNS | ||
default = { | ||
instance_types = ["m5.large"] | ||
|
||
min_size = 1 | ||
max_size = 2 | ||
desired_size = 2 | ||
} | ||
} | ||
|
||
# Note: ML capacity block reservations are only supported | ||
# on self-managed node groups at this time | ||
self_managed_node_groups = { | ||
odcr = { | ||
# The EKS AL2 GPU AMI provides all of the necessary components | ||
# for accelerated workloads w/ EFA | ||
ami_type = "AL2_x86_64_GPU" | ||
instance_type = "p5.48xlarge" | ||
|
||
pre_bootstrap_user_data = <<-EOT | ||
# Mount instance store volumes in RAID-0 for kubelet and containerd | ||
# https://github.com/awslabs/amazon-eks-ami/blob/master/doc/USER_GUIDE.md#raid-0-for-kubelet-and-containerd-raid0 | ||
/bin/setup-local-disks raid0 | ||
# Ensure only GPU workloads are scheduled on this node group | ||
export KUBELET_EXTRA_ARGS='--node-labels=vpc.amazonaws.com/efa.present=true,nvidia.com/gpu.present=true \ | ||
--register-with-taints=nvidia.com/gpu=true:NoSchedule' | ||
EOT | ||
|
||
min_size = 2 | ||
max_size = 2 | ||
desired_size = 2 | ||
|
||
# This will: | ||
# 1. Create a placement group to place the instances close to one another | ||
# 2. Ignore subnets that reside in AZs that do not support the instance type | ||
# 3. Expose all of the available EFA interfaces on the launch template | ||
enable_efa_support = true | ||
|
||
# ML capacity block reservation | ||
instance_market_options = { | ||
market_type = "capacity-block" | ||
} | ||
capacity_reservation_specification = { | ||
capacity_reservation_target = { | ||
capacity_reservation_id = var.capacity_reservation_id | ||
} | ||
} | ||
} | ||
} | ||
|
||
tags = local.tags | ||
} | ||
|
||
################################################################################ | ||
# Helm charts | ||
################################################################################ | ||
|
||
resource "helm_release" "nvidia_device_plugin" { | ||
name = "nvidia-device-plugin" | ||
repository = "https://nvidia.github.io/k8s-device-plugin" | ||
chart = "nvidia-device-plugin" | ||
version = "0.14.5" | ||
namespace = "nvidia-device-plugin" | ||
create_namespace = true | ||
wait = false | ||
|
||
values = [ | ||
<<-EOT | ||
affinity: | ||
nodeAffinity: | ||
requiredDuringSchedulingIgnoredDuringExecution: | ||
nodeSelectorTerms: | ||
- matchExpressions: | ||
- key: 'nvidia.com/gpu.present' | ||
operator: In | ||
values: | ||
- 'true' | ||
EOT | ||
] | ||
} | ||
|
||
resource "helm_release" "aws_efa_device_plugin" { | ||
name = "aws-efa-k8s-device-plugin" | ||
repository = "https://aws.github.io/eks-charts" | ||
chart = "aws-efa-k8s-device-plugin" | ||
version = "v0.4.4" | ||
namespace = "kube-system" | ||
wait = false | ||
|
||
values = [ | ||
<<-EOT | ||
nodeSelector: | ||
vpc.amazonaws.com/efa.present: 'true' | ||
tolerations: | ||
- key: nvidia.com/gpu | ||
operator: Exists | ||
effect: NoSchedule | ||
EOT | ||
] | ||
} |
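
For reference, the `instance_market_options` and `capacity_reservation_specification` arguments on the self-managed node group above are launch template settings. A standalone launch template carrying the same settings might look like the sketch below. The resource name `cbr_example` and the `name_prefix` value are hypothetical, and the sketch assumes an AWS provider version that accepts the `capacity-block` market type, which the pattern's `>= 5.34` constraint suggests.

```terraform
# Hypothetical standalone equivalent of the node group's capacity block settings
resource "aws_launch_template" "cbr_example" {
  name_prefix   = "cbr-example-"
  instance_type = "p5.48xlarge"

  # Launch instances as part of a capacity block rather than on-demand or spot
  instance_market_options {
    market_type = "capacity-block"
  }

  # Target the specific reservation supplied through the pattern's input variable
  capacity_reservation_specification {
    capacity_reservation_target {
      capacity_reservation_id = var.capacity_reservation_id
    }
  }
}
```

In the pattern itself the EKS module creates the launch template, so this resource is only meant to show where the two arguments end up.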
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
terraform { | ||
required_version = ">= 1.3" | ||
|
||
required_providers { | ||
aws = { | ||
source = "hashicorp/aws" | ||
version = ">= 5.34" | ||
} | ||
helm = { | ||
source = "hashicorp/helm" | ||
version = ">= 2.9" | ||
} | ||
} | ||
|
||
# ## Used for end-to-end testing on project; update to suit your needs | ||
# backend "s3" { | ||
# bucket = "terraform-ssp-github-actions-state" | ||
# region = "us-west-2" | ||
# key = "e2e/ml-capacity-block/terraform.tfstate" | ||
# } | ||
} | ||
|
||
provider "aws" { | ||
region = local.region | ||
} | ||
|
||
provider "helm" { | ||
kubernetes { | ||
host = module.eks.cluster_endpoint | ||
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data) | ||
|
||
exec { | ||
api_version = "client.authentication.k8s.io/v1beta1" | ||
command = "aws" | ||
# This requires the awscli to be installed locally where Terraform is executed | ||
args = ["eks", "get-token", "--cluster-name", module.eks.cluster_name] | ||
} | ||
} | ||
} | ||
|
||
################################################################################ | ||
# Common data/locals | ||
################################################################################ | ||
|
||
data "aws_availability_zones" "available" {} | ||
|
||
locals { | ||
name = basename(path.cwd) | ||
region = "us-west-2" | ||
|
||
vpc_cidr = "10.0.0.0/16" | ||
azs = slice(data.aws_availability_zones.available.names, 0, 3) | ||
|
||
tags = { | ||
Blueprint = local.name | ||
GithubRepo = "github.com/aws-ia/terraform-aws-eks-blueprints" | ||
} | ||
} | ||
|
||
################################################################################ | ||
# Supporting Resources | ||
################################################################################ | ||
|
||
module "vpc" { | ||
source = "terraform-aws-modules/vpc/aws" | ||
version = "~> 5.0" | ||
|
||
name = local.name | ||
cidr = local.vpc_cidr | ||
|
||
azs = local.azs | ||
private_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 4, k)] | ||
public_subnets = [for k, v in local.azs : cidrsubnet(local.vpc_cidr, 8, k + 48)] | ||
|
||
enable_nat_gateway = true | ||
single_nat_gateway = true | ||
|
||
public_subnet_tags = { | ||
"kubernetes.io/role/elb" = 1 | ||
} | ||
|
||
private_subnet_tags = { | ||
"kubernetes.io/role/internal-elb" = 1 | ||
} | ||
|
||
tags = local.tags | ||
} |
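
Because `local.azs` takes only the first three availability zones returned for the region, the AZ that holds the capacity block reservation may not be among them if the region has more than three. One possible guard, sketched under assumptions, is a precondition on an output. The `capacity_reservation_az` variable and output are hypothetical additions and not part of this commit.

```terraform
# Hypothetical input: the AZ where the capacity block reservation was allocated
variable "capacity_reservation_az" {
  description = "Availability zone of the ML capacity block reservation (e.g. us-west-2b)"
  type        = string
}

output "capacity_reservation_az" {
  description = "Availability zone of the ML capacity block reservation"
  value       = var.capacity_reservation_az

  # Fail the plan early if no subnet will exist in the reservation's AZ
  precondition {
    condition     = contains(local.azs, var.capacity_reservation_az)
    error_message = "The capacity block AZ is not one of the AZs in local.azs, so the VPC has no subnet there."
  }
}
```

With this in place, `terraform plan` would report the mismatch before any node group is created, rather than surfacing later as a provisioning error in the autoscaling group.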