fix: "iam:PassRole" defined in CFN to work properly in AWS China #6839

Merged on Sep 3, 2024 (3 commits)

@artem-nefedov artem-nefedov commented Aug 22, 2024

Fixes #6843

Description

This change fixes the "User <role> is not authorized to perform: iam:PassRole on resource..." error in AWS China.

How was this change tested?

Manually in aws-cn partition.

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@artem-nefedov artem-nefedov requested a review from a team as a code owner August 22, 2024 09:06
@artem-nefedov artem-nefedov requested a review from njtran August 22, 2024 09:06

netlify bot commented Aug 22, 2024

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit 140f403
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/66d2e8d084ab820008c43589
😎 Deploy Preview https://deploy-preview-6839--karpenter-docs-prod.netlify.app

@artem-nefedov artem-nefedov changed the title Fix "iam:PassRole" defined in CFN to work properly in AWS China fix: "iam:PassRole" defined in CFN to work properly in AWS China Aug 22, 2024
@artem-nefedov
Contributor Author

artem-nefedov commented Aug 22, 2024

It looks like this is not enough to fix the problem, which is weird.

Originally, with the unmodified CFN template, I got as far as a NodeClaim being created, but it was stuck in unknown status due to a "not authorized to perform: iam:PassRole" error. After manually editing IAM policy to include .cn suffix in the condition, NodeClaim got unstuck and everything worked fine after that.

Based on this result, I assumed that the fix in CFN should work. However, after actually testing it from scratch with the new CFN template, the EC2NodeClass got stuck in unknown status, and the controller log contains the following:

{"level":"ERROR","time":"2024-08-22T12:14:08.847Z","logger":"controller","message":"Reconciler error","commit":"5bdf9c3","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"system"},"namespace":"","name":"align-default","reconcileID":"cc596ee6-c7d7-4559-9911-a6a634de7507","error":"creating instance profile, adding role \"KarpenterNodeRole-foo\" to instance profile \"foo_17739752741978510565\", AccessDenied: User: arn:aws-cn:sts::redacted:assumed-role/eksctl-foo-addon-ia-Role1-rAgitn6Z4WLZ/1724327649811376400 is not authorized to perform: iam:PassRole on resource: arn:aws-cn:iam::redacted:role/KarpenterNodeRole-foo because no identity-based policy allows the iam:PassRole action\n\tstatus code: 403, request id: 96b8440a-6cd2-4cf7-8ec2-0a0eabab5d34"}

Is it possible that the controller makes some IAM policy simulation calls with a hardcoded "ec2.amazonaws.com" principal?

I have now created an issue for this, since it turned out not to be as simple a fix as I hoped.

@artem-nefedov
Contributor Author

Update: completely removing the condition block fixes the problem.
However, I'm not sure this is the desired solution.
If everyone is OK with such a change, I can update the PR.

@jonathan-innis jonathan-innis force-pushed the cfn-fix-aws-cn branch 2 times, most recently from 7db25d2 to e1dba85 Compare August 22, 2024 16:52
jonathan-innis previously approved these changes Aug 22, 2024
Contributor

@jonathan-innis jonathan-innis left a comment

LGTM 🚀

Contributor

@jonathan-innis jonathan-innis left a comment

/karpenter snapshot

@coveralls

coveralls commented Aug 22, 2024

Pull Request Test Coverage Report for Build 10624166177

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.02%) to 83.025%

Files with Coverage Reduction | New Missed Lines | %
pkg/providers/amifamily/ami.go | 1 | 91.67%

Totals
Change from base Build 10636160457: -0.02%
Covered Lines: 5512
Relevant Lines: 6639

💛 - Coveralls

@artem-nefedov
Contributor Author

@jonathan-innis Please check my last 2 comments before merging.
While the change is correct from the CFN standpoint, the controller still encounters a problem in AWS China (I don't understand why), and I only managed to avoid the problem by removing the condition block completely.

@jonathan-innis
Contributor

> and I only managed to avoid the problem by removing the condition block completely

The principal that you are giving access to looks correct to me, so I'm not sure why you would be running into issues with PassRole. Please let me know, but we can hold this PR until then.

@jonathan-innis
Contributor

> After manually editing IAM policy to include .cn suffix in the condition, NodeClaim got unstuck and everything worked fine after that

It sounds like you were able to get this to work by just editing the condition key originally so I'm surprised if there's anything else that we need to do here.

Contributor

@jonathan-innis jonathan-innis left a comment

/karpenter snapshot

Contributor

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-910a4a4d3040567162eb108477d4f1d14d33e583.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-910a4a4d3040567162eb108477d4f1d14d33e583" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

@artem-nefedov
Contributor Author

> > After manually editing IAM policy to include .cn suffix in the condition, NodeClaim got unstuck and everything worked fine after that
>
> It sounds like you were able to get this to work by just editing the condition key originally so I'm surprised if there's anything else that we need to do here.

Yes, but this implies that we're already at the stage where the NodeClaim is created, and the EC2NodeClass was created and reconciled before, which for me only works with old condition (without .cn suffix), or without a condition at all.
If I apply the new CFN with the .cn suffix first, the whole process fails much earlier because the EC2NodeClass can't reconcile due to a similar permission error.

@jonathan-innis
Contributor

> which for me only works with old condition (without .cn suffix), or without a condition at all

That sounds odd to me -- I'm surprised that that's the case

@artem-nefedov
Contributor Author

> That sounds odd to me -- I'm surprised that that's the case

I also don't understand the reason, but those are the test results.
If you have access to the aws-cn partition, you can try it yourself - should be quite simple to reproduce. If not, I can gather more data if you tell me what to look at.

@jonathan-innis
Contributor

> If you have access to the aws-cn partition, you can try it yourself - should be quite simple to reproduce

Yeah, I was going to try and track this down. Don't have an account at this point, but we'll see what I can do :)

@jonathan-innis
Contributor

Hmmm -- you're right. I was able to repro with the same policy and I'm getting authorization failures. I'm following up with some folks on the EC2 team to see if we can track down what's going on here.

@jonathan-innis
Contributor

Ok, so I think I figured out what you're running into here: This seems to be failing because one call (iam:AddRoleToInstanceProfile) expects one iam:PassedToService context key and another call (ec2:RunInstances) expects another. If I get the instance profile created with the role and then drop ec2.amazonaws.com, leaving just ec2.amazonaws.com.cn, everything just works. The problem is getting to that point. I had to craft my policy like below to get it to succeed always.

{
    "Sid": "AllowPassingInstanceRole",
    "Effect": "Allow",
    "Resource": "<>",
    "Action": "iam:PassRole",
    "Condition": {
        "StringLike": {
            "iam:PassedToService": [
                "ec2.amazonaws.com",
                "ec2.amazonaws.com.cn"
            ]
        }
    }
},

Something seems weird here (this definitely isn't behavior I would expect), so I'm following up with some folks to see if I can make some headway.
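
The behavior described in this comment can be illustrated with a toy model. The sketch below is not how IAM is implemented; it only mimics a multi-valued StringLike condition, where a request is allowed if its context value matches any listed pattern. Which call presents which service principal is an assumption inferred from the test results reported in this thread, and the names `PASSED_TO_SERVICE`, `pass_role_allowed`, and `both_calls_allowed` are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumption inferred from this thread (not from AWS documentation): the
# iam:PassedToService value each call presents in the aws-cn partition.
PASSED_TO_SERVICE = {
    "iam:AddRoleToInstanceProfile": "ec2.amazonaws.com",
    "ec2:RunInstances": "ec2.amazonaws.com.cn",
}


def pass_role_allowed(allowed_services, call):
    """Toy StringLike check: allow if the value matches any listed pattern.

    fnmatchcase is only an approximation of IAM's "*" and "?" wildcards.
    """
    value = PASSED_TO_SERVICE[call]
    return any(fnmatchcase(value, pattern) for pattern in allowed_services)


def both_calls_allowed(allowed_services):
    """True only if every call in the toy model passes the condition."""
    return all(pass_role_allowed(allowed_services, call) for call in PASSED_TO_SERVICE)
```

Under these assumptions, a condition listing only one of the two principals blocks one of the two calls, which matches the symptoms reported earlier: the old condition fails at NodeClaim launch, and the `.cn`-only condition fails earlier, at EC2NodeClass reconciliation.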

@jonathan-innis
Contributor

jonathan-innis commented Aug 29, 2024

Ok, confirmed with the EC2 folks -- this is just "the way it is" and won't be changed now, so we have to code a conditional into the CF code to make sure both SPs get added in China. @artem-nefedov Is this something that you want to add?

@jonathan-innis jonathan-innis self-assigned this Aug 29, 2024
@jonathan-innis
Contributor

Potentially related: aws/aws-cdk#1282

@artem-nefedov
Contributor Author

artem-nefedov commented Aug 29, 2024

@jonathan-innis I've updated the PR accordingly. It might make sense to also add service names from other AWS partitions (gov and such), but I don't know what those are.

@artem-nefedov
Contributor Author

@jonathan-innis My bad, I didn't notice the "conditional" part, let me change this

@artem-nefedov
Contributor Author

artem-nefedov commented Aug 30, 2024

I use a mapping instead of a condition because there could be more than 2 partitions.

@jonathan-innis Check 441e058. Does this look correct to you? (other partitions may need to be added)

EDIT: Nope, doesn't work, throws a syntax error. Is it even possible to use functions inside json that's passed as a string? If not, I don't see how this can be done conditionally.
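
As a rough illustration of the mapping idea (in Python rather than CloudFormation, so it says nothing about the syntax error above), a partition-to-principals lookup could look like the sketch below. The names `PASS_ROLE_SERVICES` and `pass_role_statement` are hypothetical, and only the two partitions discussed in this thread are listed; entries for other partitions are deliberately left out because their service names are not given here.

```python
# Hypothetical mapping: partition name -> service principals allowed in the
# iam:PassedToService condition. Only partitions from this thread are listed.
PASS_ROLE_SERVICES = {
    "aws": ["ec2.amazonaws.com"],
    "aws-cn": ["ec2.amazonaws.com", "ec2.amazonaws.com.cn"],
}


def pass_role_statement(partition, role_arn):
    """Build an AllowPassingInstanceRole statement for one partition.

    Raises KeyError for partitions that are not in the mapping.
    """
    services = list(PASS_ROLE_SERVICES[partition])  # copy, so callers cannot mutate the mapping
    return {
        "Sid": "AllowPassingInstanceRole",
        "Effect": "Allow",
        "Resource": role_arn,
        "Action": "iam:PassRole",
        "Condition": {"StringLike": {"iam:PassedToService": services}},
    }
```

A lookup table scales to more than two partitions, which is the stated reason for preferring it over a two-way condition.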

@jonathan-innis
Contributor

jonathan-innis commented Aug 30, 2024

> Is it even possible to use functions inside json that's passed as a string

Ah, yeah, that's a good point -- we are doing a substitution at the top above that should allow us to replace everything in the string, but you would have to have the function resolve the value outside of that context.

The good news is that we are close to completely removing the need to "stringify" this whole thing. The whole reason we had to is because you can't template permission policies around keys for aws:RequestTag and aws:ResourceTag in YAML. With the addition of eks:eks-cluster-name, we should be able to remove scoping on kubernetes.io/cluster/<cluster-name> and replace it with eks:eks-cluster-name -- allowing us to use YAML.

I suspect that would also simplify the integration here as well.
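
The point about resolving the value outside the string can be sketched with a Python analogy. This is not CloudFormation syntax; it only shows the general pattern of computing a value first and then substituting it into a policy that is kept as text. `POLICY_TEMPLATE`, the `$services` placeholder, and `render_policy` are hypothetical names.

```python
import json
from string import Template

# The policy is kept as a string, as in the thread. Anything written inside
# this string is plain text, so the list has to be computed elsewhere.
POLICY_TEMPLATE = Template(
    '{"Effect": "Allow", "Action": "iam:PassRole", '
    '"Condition": {"StringLike": {"iam:PassedToService": $services}}}'
)


def render_policy(services):
    """Resolve the value outside the string, then substitute and parse it."""
    resolved = json.dumps(services)  # the value is computed before substitution
    return json.loads(POLICY_TEMPLATE.substitute(services=resolved))
```

For example, `render_policy(["ec2.amazonaws.com", "ec2.amazonaws.com.cn"])` yields a parsed statement whose condition lists both principals.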

cc: @jigisha620

@artem-nefedov
Contributor Author

Would it be acceptable to put a hardcoded list for the time being in order to unblock AWS China users?
I don't think that would compromise security in AWS Global.

Contributor

@jonathan-innis jonathan-innis left a comment

LGTM 🚀

@jonathan-innis
Contributor

> Would it be acceptable to put a hardcoded list for the time being in order to unblock AWS China users

Yep, seems reasonable to me.

Contributor

@jonathan-innis jonathan-innis left a comment

/karpenter snapshot

Contributor

github-actions bot commented Sep 3, 2024

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-140f403054f65610dfecfa7e030b888d12b86b5d.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-140f403054f65610dfecfa7e030b888d12b86b5d" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

@jonathan-innis jonathan-innis merged commit c83c154 into aws:main Sep 3, 2024
40 of 42 checks passed
@artem-nefedov
Contributor Author

@jonathan-innis For some reason, the URLs for all releases (including the new v1.0.2) still return the old policy without the list:
https://raw.githubusercontent.com/aws/karpenter-provider-aws/v1.0.2/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml
This is despite the documentation section here having been changed to include the list:
https://karpenter.sh/v1.0/reference/cloudformation/#allowpassinginstancerole

Successfully merging this pull request may close these issues.

iam:PassRole not working with provided CFN template in AWS China
3 participants