fix(eks): AMI changes in managed SSM store param causes rolling update of ASG #9746
This might be the PR with the highest explanation/code ratio I've ever made :)
When the value of an AMI in a managed SSM store parameter changes, it should not cause a replacement of the ASG nodes. The reasoning is that managed params can change over time with no control on the user's part. Because of this, the change will not be reflected in `cdk diff`, which creates a situation where every deployment can potentially cause node replacement without notice.

There are two scenarios in which the cluster interacts with an `AutoScalingGroup`:
addCapacity
When one uses `cluster.addCapacity`, we implicitly create an `AutoScalingGroup` that uses either the `BottleRocketImage` or the `EksOptimizedImage` as the machine image, with no option to customize it. Both of these images fetch their AMIs from a managed SSM parameter (`/aws/service/eks/optimized-ami` or `/aws/service/bottlerocket`). This means that we create the situation described above by default.

aws-cdk/packages/@aws-cdk/aws-eks/lib/cluster.ts
Lines 779 to 785 in 5af718b
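
To make the "change without notice" concrete, here is a minimal, self-contained model of why an update to a managed parameter never shows up in a diff. It is a sketch, not CDK or CloudFormation code: it assumes the template stores only the parameter name and that the AMI ID is looked up at deploy time. `Template`, `ssmStore`, `diffTemplates`, and `resolveAmi` are hypothetical names, and the parameter path and AMI IDs are made up.

```typescript
// Simplified model, NOT the real CDK/CloudFormation implementation.
// The template stores only the SSM parameter *name*; the AMI ID is looked up at deploy time.
type Template = { imageIdParameter: string; updateType: "None" | "RollingUpdate" };

// Hypothetical stand-in for the managed SSM parameter store (path and values are made up).
const ssmStore = new Map<string, string>([["/example/managed/ami", "ami-1111"]]);

// A diff compares template contents only, so it cannot see parameter values.
function diffTemplates(a: Template, b: Template): string[] {
  const changes: string[] = [];
  if (a.imageIdParameter !== b.imageIdParameter) changes.push("imageIdParameter");
  if (a.updateType !== b.updateType) changes.push("updateType");
  return changes;
}

// Deploy-time resolution of the parameter name to a concrete AMI ID.
function resolveAmi(t: Template): string {
  const ami = ssmStore.get(t.imageIdParameter);
  if (ami === undefined) throw new Error(`unknown parameter ${t.imageIdParameter}`);
  return ami;
}

const template: Template = { imageIdParameter: "/example/managed/ami", updateType: "RollingUpdate" };

const amiAtFirstDeploy = resolveAmi(template);
ssmStore.set("/example/managed/ami", "ami-2222"); // the managed value changes upstream
const amiAtSecondDeploy = resolveAmi(template);

const templateChanges = diffTemplates(template, template); // same template both times
const amiChanged = amiAtFirstDeploy !== amiAtSecondDeploy;
const nodesReplaced = amiChanged && template.updateType === "RollingUpdate";

console.log({ templateChanges, amiChanged, nodesReplaced });
```

In this model the user's template is identical across both deployments, so the diff is empty, yet the resolved AMI differs and a rolling update policy would replace the nodes.
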
It seems like a more reasonable default in this case would be to use `UpdateType.NONE` instead of `UpdateType.RollingUpdate`.

Note that in such a case, even if the user explicitly changes the machine image configuration (by specifying a different `machineImageType`), node replacement will not occur, even though `cdk diff` will clearly show a configuration change.

In any case, the `updateType` can always be explicitly passed to mitigate any issue caused by the default behavior.

addAutoScalingGroup
When one uses `cluster.addAutoScalingGroup`, the `AutoScalingGroup` is created by the user. The default value for `updateType` in the `AutoScalingGroup` construct is `UpdateType.NONE`, so unless the user explicitly configured `UpdateType.RollingUpdate`, node replacement should not occur.

Having said that, when a user specifies `UpdateType.RollingUpdate`, it's not very intuitive that this update might happen without any explicit configuration change. This is in fact documented in the images that use SSM to fetch the AMI:

aws-cdk/packages/@aws-cdk/aws-ec2/lib/machine-image.ts
Lines 216 to 226 in 5af718b
There is no way for us to selectively apply the update policy: we either don't use it at all, meaning intentional user changes won't replace nodes either, or we use it for all changes, meaning implicit changes will cause replacement as well.
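
The all-or-nothing trade-off can be laid out as a small truth table. The following is a simplified decision model under the assumptions stated in this description, not CDK code. `AmiChange`, `Outcome`, `outcome`, and `isSurprising` are hypothetical names introduced only for illustration.

```typescript
// Simplified decision model, NOT CDK code. All names are hypothetical.
type UpdateType = "None" | "RollingUpdate";
type AmiChange = "implicit-ssm-change" | "explicit-user-change";

interface Outcome {
  shownInDiff: boolean;   // does the change appear in `cdk diff`?
  nodesReplaced: boolean; // does the deployment replace the ASG nodes?
}

function outcome(updateType: UpdateType, change: AmiChange): Outcome {
  return {
    // Only a change to the user's own configuration alters the template.
    shownInDiff: change === "explicit-user-change",
    // The policy cannot tell the two kinds of change apart.
    nodesReplaced: updateType === "RollingUpdate",
  };
}

// A case is surprising when what the diff shows and what the deployment does disagree.
function isSurprising(o: Outcome): boolean {
  return o.shownInDiff !== o.nodesReplaced;
}

const updateTypes: UpdateType[] = ["None", "RollingUpdate"];
const amiChanges: AmiChange[] = ["implicit-ssm-change", "explicit-user-change"];

for (const updateType of updateTypes) {
  for (const change of amiChanges) {
    const o = outcome(updateType, change);
    console.log(updateType, change, o, isSurprising(o) ? "surprising" : "as expected");
  }
}
```

Each update type has exactly one surprising case in this model: with a rolling update, an implicit change replaces nodes without appearing in the diff; with no update policy, an explicit change appears in the diff but does not replace nodes.
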
Ideally, we should consider moving away from using these managed SSM params in launch configurations, but that requires some additional investigation.
The PR simply suggests removing the `UpdateType.RollingUpdate` default from the `addCapacity` method, as a form of balance between all the considerations mentioned above.

Fixes #7273
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license