[E2E] Zero csi driver aws credentials to fallback to use instance profile role #4260

Closed
wyike opened this issue May 10, 2023 · 4 comments · Fixed by #4262
Labels: kind/bug, needs-priority, needs-triage

Comments

@wyike
Contributor

wyike commented May 10, 2023

/kind bug

What steps did you take and what happened:
[A clear and concise description of what the bug is.]

What did you expect to happen:
The explicit AWS credentials in the CSI driver manifest should be removed. Once they are removed, the CSI add-on falls back to the workload cluster's control plane instance role and gets its credentials from the metadata service. As long as explicit AWS credentials are used, we cannot catch bugs in this basic scenario.

The credentials were also not present in the original CSI add-on test, which used IMDSv1 to retrieve credentials.
After #4147, IMDSv2 is enabled but the hop limit is set to 1, so the CSI add-on failed to retrieve credentials. To let the tests pass, #4147 added explicit AWS credentials to the YAML.
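For context, here is a minimal sketch of the IMDSv2 flow that a credential fallback to the instance role relies on. It assumes the standard EC2 metadata endpoint at 169.254.169.254; the function names are illustrative and are not taken from the CSI driver. With a hop limit of 1, the token response generally cannot reach a pod that sits one network hop behind the node, which would explain the failure described above.

```python
import urllib.request

# Standard EC2 instance metadata endpoint (link-local address).
IMDS = "http://169.254.169.254"


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Build the IMDSv2 session token request (an HTTP PUT)."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def credentials_request(token: str, role_name: str = "") -> urllib.request.Request:
    """Build the request for the role list (empty role_name) or one role's credentials."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_role_credentials(timeout: float = 2.0) -> str:
    """Run the full flow. Only works on an EC2 instance that can reach IMDS."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(credentials_request(token), timeout=timeout) as resp:
        role = resp.read().decode().strip()
    with urllib.request.urlopen(credentials_request(token, role), timeout=timeout) as resp:
        return resp.read().decode()  # JSON document with temporary credentials
```

Calling `fetch_role_credentials()` from inside a pod is a quick way to see whether the instance role is reachable at all; a timeout on the first request points at the hop limit rather than at IAM permissions.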

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Cluster-api-provider-aws version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot added the kind/bug, needs-priority, and needs-triage labels on May 10, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wyike
Contributor Author

wyike commented May 10, 2023

Zeroing the credentials causes two test cases to fail to start the StatefulSet pod after the Kubernetes version upgrade:

"CSI=external CCM=external AWSCSIMigration=on: upgrade to v1.23"
"CSI=external CCM=in-tree AWSCSIMigration=on: upgrade to v1.23"

I will investigate.

@wyike
Contributor Author

wyike commented May 10, 2023

/assign

@wyike
Contributor Author

wyike commented May 10, 2023

The StatefulSet pod fails:

May 10 10:17:46 ip-10-0-116-133 kubelet[1560]: E0510 10:17:46.361832    1560 pod_workers.go:951] "Error syncing pod, skipping" err="unmounted volumes=[intree-volumes], unattached volumes=[intree-volumes kube-api-access-phhcv]: timed out waiting for the condition" pod="default/intree-nginx-statefulset-0" podUID=44d2a582-d343-4f41-aa37-e39a5ceba5b6

CSI controller log:

E0510 10:33:56.111109       1 driver.go:120] "GRPC error" err=<
    rpc error: code = Internal desc = Could not attach volume "vol-0475f64ff83e38dd9" to node "i-021bdeb1a8dbf7694": could not attach volume "vol-0475f64ff83e38dd9" to node "i-021bdeb1a8dbf7694": UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: QZ9VYB_X6KMr32CxKJ62QakiNeyAEaZJJWew3S0ZQIP8JpyAA0-iOCjtpwpyNc7PAMAid8NqTp5FMnPtpHiwVXK8gaHWflk8qDk02i38qYWf7FLlL7ol4oQvj8OCI7uHDDdWj48HeOYu1zkv3_jU4ffiYaRLMbjLMzZyujFRi7ki0P-qWS02hFoEwvNeBoRfBP9rta1c0Dn3NShaHXj35gJETQKqjjIVqJWUQwDHHAS6rdbzEvC-_aFoJeTIiTX17Vc5h10fE_9AO73QrMfTakT4vKTTwJEDigQGdQWGsUYX2RNhOlZhbDUuQC5dK9Jm_oPwi-e73QkAmqE8qHih_uNODrbWKWpOtb-7ihrD7cTapAFHW9EZT4SM3oGjm3Zc-UTP8DcfaIYOuwlzr1uQVzM-lFcwFm9r0j2Fmd952eF9vXtDWkubWmlQoiVo1UqUWHaDWID91A1cqedapXRyavB5bpvHtXOYKyZlH4lT3-ZBv9ycMyjuwW3BcfHBLwsuDINrZsGUKdI2BlF2sfIlSDIFJ_tWssxbvK4nnyuLOGVjBWpyJFnNe4q6KsU8CBq5id_Nnp80YmvrHYMHSoynTOfX4yYPOyR5HRdVXUnjhMpS9yyI0RVpnBjnUso-PWSxjmdQ10oIUggZyBDOQk9LuPExZkudfBKCNK1i786wSscGR1vsPCagN0_tGZzuLz0_OT1rNYqryb8DoTvpmCI9iBhSLuFFyNxxAuNIKSOSN4-Gf_ZpPfCXgq06zkv42yAXUodQaFMH2VjqtYRkHrP_vIfN6rPY2oDJTV2bkOfBio3O3JkBiht3MZBbOOehTcwkIlQ0CwQvEhbD5k3gCVd_bPtU8TRHNydP1clLkjsyb-HG7bwdYObbAxug3HYrRBAzSZLxwqu0cPcnn6pHSWqDHh3s_aHqsuGSGmegq9Y
        status code: 403, request id: 4db6b54f-e6f5-49c3-9ae7-4784bebc2345

Decoding the authorization failure message:

aws sts decode-authorization-message --encoded-message QZ9VYB_X6KMr32CxKJ62QakiNeyAEaZJJWew3S0ZQIP8JpyAA0-iOCjtpwpyNc7PAMAid8NqTp5FMnPtpHiwVXK8gaHWflk8qDk02i38qYWf7FLlL7ol4oQvj8OCI7uHDDdWj48HeOYu1zkv3_jU4ffiYaRLMbjLMzZyujFRi7ki0P-qWS02hFoEwvNeBoRfBP9rta1c0Dn3NShaHXj35gJETQKqjjIVqJWUQwDHHAS6rdbzEvC-_aFoJeTIiTX17Vc5h10fE_9AO73QrMfTakT4vKTTwJEDigQGdQWGsUYX2RNhOlZhbDUuQC5dK9Jm_oPwi-e73QkAmqE8qHih_uNODrbWKWpOtb-7ihrD7cTapAFHW9EZT4SM3oGjm3Zc-UTP8DcfaIYOuwlzr1uQVzM-lFcwFm9r0j2Fmd952eF9vXtDWkubWmlQoiVo1UqUWHaDWID91A1cqedapXRyavB5bpvHtXOYKyZlH4lT3-ZBv9ycMyjuwW3BcfHBLwsuDINrZsGUKdI2BlF2sfIlSDIFJ_tWssxbvK4nnyuLOGVjBWpyJFnNe4q6KsU8CBq5id_Nnp80YmvrHYMHSoynTOfX4yYPOyR5HRdVXUnjhMpS9yyI0RVpnBjnUso-PWSxjmdQ10oIUggZyBDOQk9LuPExZkudfBKCNK1i786wSscGR1vsPCagN0_tGZzuLz0_OT1rNYqryb8DoTvpmCI9iBhSLuFFyNxxAuNIKSOSN4-Gf_ZpPfCXgq06zkv42yAXUodQaFMH2VjqtYRkHrP_vIfN6rPY2oDJTV2bkOfBio3O3JkBiht3MZBbOOehTcwkIlQ0CwQvEhbD5k3gCVd_bPtU8TRHNydP1clLkjsyb-HG7bwdYObbAxug3HYrRBAzSZLxwqu0cPcnn6pHSWqDHh3s_aHqsuGSGmegq9Y | sed 's/\\"/"/g' | sed 's/^"//' | sed 's/"$//'
{
    "DecodedMessage": "{"allowed":false,"explicitDeny":false,"matchedStatements":{"items":[]},"failures":{"items":[]},"context":{"principal":{"id":"AROAYHXN32BLFH57QXONL:i-021bdeb1a8dbf7694","arn":"arn:aws:sts::566360789078:assumed-role/nodes.cluster-api-provider-aws.sigs.k8s.io/i-021bdeb1a8dbf7694"},"action":"ec2:AttachVolume","resource":"arn:aws:ec2:us-east-2:566360789078:volume/vol-0475f64ff83e38dd9","conditions":{"items":[{"key":"566360789078:kubernetes.io/created-for/pvc/name","values":{"items":[{"value":"intree-volumes-intree-nginx-statefulset-0"}]}},{"key":"566360789078:kubernetes.io/cluster/csi-ccm-external-upgrade-bmxkno","values":{"items":[{"value":"owned"}]}},{"key":"aws:Resource","values":{"items":[{"value":"volume/vol-0475f64ff83e38dd9"}]}},{"key":"aws:Account","values":{"items":[{"value":"566360789078"}]}},{"key":"566360789078:kubernetes.io/created-for/pvc/namespace","values":{"items":[{"value":"default"}]}},{"key":"566360789078:Name","values":{"items":[{"value":"csi-ccm-external-upgrade-bmxkno-dynamic-pvc-b7434eb3-2b72-427d-a23f-652b86ad2f12"}]}},{"key":"ec2:AvailabilityZone","values":{"items":[{"value":"us-east-2a"}]}},{"key":"ec2:Encrypted","values":{"items":[{"value":"false"}]}},{"key":"ec2:ResourceTag/kubernetes.io/cluster/csi-ccm-external-upgrade-bmxkno","values":{"items":[{"value":"owned"}]}},{"key":"ec2:ResourceTag/Name","values":{"items":[{"value":"csi-ccm-external-upgrade-bmxkno-dynamic-pvc-b7434eb3-2b72-427d-a23f-652b86ad2f12"}]}},{"key":"ec2:VolumeType","values":{"items":[{"value":"gp2"}]}},{"key":"ec2:ResourceTag/kubernetes.io/created-for/pv/name","values":{"items":[{"value":"pvc-b7434eb3-2b72-427d-a23f-652b86ad2f12"}]}},{"key":"aws:Region","values":{"items":[{"value":"us-east-2"}]}},{"key":"aws:Service","values":{"items":[{"value":"ec2"}]}},{"key":"ec2:VolumeID","values":{"items":[{"value":"vol-0475f64ff83e38dd9"}]}},{"key":"ec2:VolumeSize","values":{"items":[{"value":"4"}]}},{"key":"ec2:ResourceTag/kubernetes.io/created-for/pvc/namespace
","values":{"items":[{"value":"default"}]}},{"key":"aws:Type","values":{"items":[{"value":"volume"}]}},{"key":"ec2:VolumeIOPS","values":{"items":[{"value":"100"}]}},{"key":"ec2:ResourceTag/kubernetes.io/created-for/pvc/name","values":{"items":[{"value":"intree-volumes-intree-nginx-statefulset-0"}]}},{"key":"ec2:Region","values":{"items":[{"value":"us-east-2"}]}},{"key":"aws:ARN","values":{"items":[{"value":"arn:aws:ec2:us-east-2:566360789078:volume/vol-0475f64ff83e38dd9"}]}},{"key":"566360789078:kubernetes.io/created-for/pv/name","values":{"items":[{"value":"pvc-b7434eb3-2b72-427d-a23f-652b86ad2f12"}]}}]}}}
}

The nodes.cluster-api-provider-aws.sigs.k8s.io role does not have permission to perform the ec2:AttachVolume action on arn:aws:ec2:us-east-2:566360789078:volume/vol-0475f64ff83e38dd9, per https://repost.aws/knowledge-center/ec2-not-auth-launch.
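The fields that lead to this conclusion can also be pulled out of the decoded message programmatically. The following is a minimal sketch, assuming it is given the raw JSON printed by `aws sts decode-authorization-message` (without the sed un-escaping, so that DecodedMessage is still a valid JSON string). The function name and the trimmed sample are illustrative only.

```python
import json


def summarize_denial(cli_output: str) -> dict:
    """Extract who was denied which action on which resource.

    Expects the raw CLI output, where DecodedMessage is itself a JSON
    document stored as a string.
    """
    decoded = json.loads(json.loads(cli_output)["DecodedMessage"])
    context = decoded["context"]
    return {
        "allowed": decoded["allowed"],
        "explicit_deny": decoded["explicitDeny"],
        "principal": context["principal"]["arn"],
        "action": context["action"],
        "resource": context["resource"],
    }


# Trimmed stand-in for the CLI output, using values from the message above.
SAMPLE = json.dumps({
    "DecodedMessage": json.dumps({
        "allowed": False,
        "explicitDeny": False,
        "matchedStatements": {"items": []},
        "context": {
            "principal": {
                "arn": "arn:aws:sts::566360789078:assumed-role/"
                       "nodes.cluster-api-provider-aws.sigs.k8s.io/i-021bdeb1a8dbf7694",
            },
            "action": "ec2:AttachVolume",
            "resource": "arn:aws:ec2:us-east-2:566360789078:volume/vol-0475f64ff83e38dd9",
        },
    })
})

if __name__ == "__main__":
    for field, value in summarize_denial(SAMPLE).items():
        print(f"{field}: {value}")
```

Here `allowed` and `explicitDeny` are both false and `matchedStatements` is empty, which indicates an implicit deny: no policy attached to the nodes role grants the action, rather than some policy forbidding it.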

According to the documentation, we need to pin the CSI controller to the control plane:

IMPORTANT WARNING: The CRDs from the AWS EBS CSI driver and AWS external cloud provider gives issue while installing the respective controllers on the AWS Cluster, it doesn't allow statefulsets to create the volume on existing EC2 instance. We need the CSI controller deployment and CCM pinned to the control plane which has right permissions to create, attach and mount the volumes to EC2 instances. To achieve this, you should add the node affinity rules to the CSI driver controller deployment and CCM DaemonSet manifests.

# Tolerate the control plane taints (legacy master key and current control-plane key).
tolerations:
- key: node-role.kubernetes.io/master
  effect: NoSchedule
- effect: NoSchedule
  key: node-role.kubernetes.io/control-plane
# Require a node labeled as control plane; the two selector terms are ORed.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
        - matchExpressions:
            - key: node-role.kubernetes.io/master
              operator: Exists

Unfortunately, the CSI add-on upgrade change removed this part. I'll add it back.
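To guard against this part being dropped again, the pinning could be checked mechanically. The sketch below is a simplified illustration and not part of the CAPA test suite: it assumes the pod spec has already been loaded into a plain dict, the helper name is hypothetical, and it ignores catch-all tolerations (an empty key with operator Exists).

```python
CONTROL_PLANE_KEYS = {
    "node-role.kubernetes.io/control-plane",
    "node-role.kubernetes.io/master",
}


def is_pinned_to_control_plane(pod_spec: dict) -> bool:
    """Return True if the pod spec requires a control plane node
    and tolerates at least one control plane NoSchedule taint."""
    node_affinity = (pod_spec.get("affinity") or {}).get("nodeAffinity") or {}
    required = node_affinity.get("requiredDuringSchedulingIgnoredDuringExecution") or {}
    terms = required.get("nodeSelectorTerms") or []

    def selects_control_plane(term: dict) -> bool:
        # Expressions inside one term are ANDed, so a single matching
        # expression is enough to restrict that term to control plane nodes.
        return any(
            expr.get("key") in CONTROL_PLANE_KEYS and expr.get("operator") == "Exists"
            for expr in term.get("matchExpressions") or []
        )

    # Terms are ORed, so every term must select a control plane label;
    # otherwise the pod could still be scheduled onto a worker.
    affinity_ok = bool(terms) and all(selects_control_plane(t) for t in terms)

    tolerated = {
        t.get("key")
        for t in pod_spec.get("tolerations") or []
        if t.get("effect") == "NoSchedule"
    }
    return affinity_ok and bool(tolerated & CONTROL_PLANE_KEYS)
```

Note that the check expects requiredDuringSchedulingIgnoredDuringExecution to be nested under nodeAffinity. If the two keys sit at the same indentation level, nodeAffinity parses as empty and the function returns False.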

From this exercise, I also learned that exposing AWS credentials in the CSI add-on is indeed too permissive and dangerous: it lets the pod do many things directly, such as attaching and detaching volumes on workers. We should limit the permissions of pods (e.g. the CSI add-on) by relying on the instance profile role.
