-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hpc benckmark to unit test, and add "capacity-reservation" flag to deployer #470
Conversation
var capacityReservationId string | ||
if opts.CapacityReservation { | ||
for _, cr := range capacityReservations.CapacityReservations { | ||
if *cr.InstanceType == opts.InstanceTypes[0] && cr.State == "active" && *cr.AvailableInstanceCount >= int32(opts.Nodes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to enforce that only a single instance type is passed to the deployer when --capacity-reservation
is used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the PR to use a filter that retrieves capacity reservations, rather than using a for loop. I think it’s acceptable not to enforce that only a single instance type is passed; if multiple types are provided, we can select whichever is available.
@@ -267,7 +284,7 @@ func (m *NodegroupManager) createUnmanagedNodegroupWithEFA(infra *Infrastructure | |||
}, | |||
{ | |||
ParameterKey: aws.String("SubnetIds"), | |||
ParameterValue: aws.String(infra.subnetsPrivate[0]), // this is load bearing! EFA requires a private subnet | |||
ParameterValue: aws.String(strings.Join(infra.subnetsPrivate, ",")), // this is load bearing! EFA requires a private subnet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to change anything else for EFA cross-subnet to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first subnet in the list may be in a different az than the reserved capacity. We need to have a way to choose the az that matches the reserved capacity.
kubetest2/internal/deployers/eksapi/templates/unmanaged-nodegroup-efa.yaml
Show resolved
Hide resolved
@@ -201,6 +205,9 @@ Resources: | |||
DeleteOnTermination: true | |||
VolumeSize: !Ref NodeDiskSize | |||
VolumeType: gp2 | |||
CapacityReservationSpecification: | |||
CapacityReservationTarget: | |||
CapacityReservationId: !Ref CapacityReservationId |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question here, what happens if we don't have a capacityReservationId?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the template to not specify the target when it is not set.
return fmt.Errorf("no subnet found for availability zone %s", az) | ||
} | ||
subnetId = *subnet.Subnets[0].SubnetId | ||
klog.Infof("Using capacity reservation: %s", capacityReservationId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do you want to also log the subnet Id? or it can be found from other routes?
if opts.CapacityReservation { | ||
capacityReservations, err := m.clients.EC2().DescribeCapacityReservations(context.TODO(), &ec2.DescribeCapacityReservationsInput{ | ||
Filters: []ec2types.Filter{ | ||
{ | ||
Name: aws.String("instance-type"), | ||
Values: opts.InstanceTypes, | ||
}, | ||
{ | ||
Name: aws.String("state"), | ||
Values: []string{"active"}, | ||
}, | ||
}, | ||
}) | ||
if err != nil { | ||
return fmt.Errorf("failed to describe capacity reservation") | ||
} | ||
var az string | ||
for _, cr := range capacityReservations.CapacityReservations { | ||
if *cr.AvailableInstanceCount >= int32(opts.Nodes) { | ||
capacityReservationId = *cr.CapacityReservationId | ||
az = *cr.AvailabilityZone | ||
break | ||
} | ||
} | ||
if capacityReservationId == "" { | ||
return fmt.Errorf("no capacity reservation found for instance type %s with %d nodes count", opts.InstanceTypes[0], opts.Nodes) | ||
} | ||
subnet, err := m.clients.EC2().DescribeSubnets(context.TODO(), &ec2.DescribeSubnetsInput{ | ||
Filters: []ec2types.Filter{ | ||
{ | ||
Name: aws.String("availability-zone"), | ||
Values: []string{az}, | ||
}, | ||
{ | ||
Name: aws.String("subnet-id"), | ||
Values: infra.subnetsPrivate, | ||
}, | ||
}, | ||
}) | ||
if err != nil { | ||
return fmt.Errorf("failed to describe subnet") | ||
} | ||
if subnet == nil || len(subnet.Subnets) == 0 { | ||
return fmt.Errorf("no subnet found for availability zone %s", az) | ||
} | ||
subnetId = *subnet.Subnets[0].SubnetId | ||
klog.Infof("Using capacity reservation: %s", capacityReservationId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be moved to a helper function like getSubnetWithCapacity
for better readability
var az string | ||
for _, cr := range capacityReservations.CapacityReservations { | ||
if *cr.AvailableInstanceCount >= int32(opts.Nodes) { | ||
capacityReservationId = *cr.CapacityReservationId |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have capacity reserved cross multiple AZs? If yes, we need to extend this to a list as Carter mentioned, the EFA cross AZ communication is supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-limits
EFA OS-bypass traffic can't cross Availability Zones, VPCs, or AWS accounts.
We are excited to announce that AWS now supports cross-subnet communication between Elastic Fabric Adapter (EFA) interfaces for Amazon EC2 instances within the same Availability Zone (AZ).
I thought EFA doesn't support cross AZ communication, and I saw the NCCL test fails on it.
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.