Update installation and troubleshooting documentation
jacobwolfaws committed May 19, 2023
1 parent 7e68026 commit b27f31b
Showing 3 changed files with 142 additions and 83 deletions.
85 changes: 4 additions & 81 deletions docs/README.md
@@ -7,6 +7,10 @@ The [Amazon FSx for Lustre](https://aws.amazon.com/fsx/lustre/) Container Storag

### Troubleshooting
For help with troubleshooting, please refer to our [troubleshooting doc](https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blob/master/docs/troubleshooting.md).

### Installation
For installation and deployment instructions, please refer to our [installation doc](https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blob/master/docs/install.md).

### CSI Specification Compatibility Matrix
| AWS FSx for Lustre CSI Driver \ CSI Version | v0.3.0 | v1.x.x |
|---------------------------------------------|--------|--------|
@@ -78,87 +82,6 @@ The following sections are Kubernetes-specific. If you are a Kubernetes user, us
**Notes**:
* For dynamically provisioned volumes, only one subnet is allowed inside a storageclass's `parameters.subnetId`. This is a [limitation](https://docs.aws.amazon.com/fsx/latest/APIReference/API_CreateFileSystem.html#FSx-CreateFileSystem-request-SubnetIds) that is enforced by FSx for Lustre.
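
For reference, a minimal StorageClass for dynamic provisioning might look like the sketch below; the subnet and security group IDs are placeholders, and only commonly used parameters are shown, so consult the driver's examples for the full list of supported parameters.

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  # Only a single subnet ID is allowed here, per the FSx for Lustre limitation noted above.
  subnetId: subnet-0123456789abcdef0
  securityGroupIds: sg-0123456789abcdef0
```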

### Installation
#### Set up driver permission
The driver requires IAM permissions to talk to the Amazon FSx for Lustre service to create and delete file systems on the user's behalf. There are several methods to grant the driver IAM permissions:
* Using a secret object - create an IAM user with the proper permissions, put that user's credentials in the [secret manifest](../deploy/kubernetes/secret.yaml), then deploy the secret.

```sh
curl https://raw.githubusercontent.com/kubernetes-sigs/aws-fsx-csi-driver/master/deploy/kubernetes/secret.yaml > secret.yaml
# Edit the secret with user credentials
kubectl apply -f secret.yaml
```
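
As an illustration of what the secret carries, the sketch below holds the IAM user's access key pair; the secret name and key names here are assumptions, so check the repository's `secret.yaml` for the exact fields the driver expects.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-secret           # assumed name; use the name defined in the repo's secret.yaml
  namespace: kube-system
stringData:
  key_id: "<AWS_ACCESS_KEY_ID of the IAM user>"          # assumed key name, for illustration
  access_key: "<AWS_SECRET_ACCESS_KEY of the IAM user>"  # assumed key name, for illustration
```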

* Using the worker node instance profile - grant all worker nodes the proper permissions by attaching the following policy to the instance profile of the worker nodes.

```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"iam:CreateServiceLinkedRole",
"iam:AttachRolePolicy",
"iam:PutRolePolicy"
],
"Resource": "arn:aws:iam::*:role/aws-service-role/s3.data-source.lustre.fsx.amazonaws.com/*"
},
{
"Action":"iam:CreateServiceLinkedRole",
"Effect":"Allow",
"Resource":"*",
"Condition":{
"StringLike":{
"iam:AWSServiceName":[
"fsx.amazonaws.com"
]
}
}
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"fsx:CreateFileSystem",
"fsx:DeleteFileSystem",
"fsx:DescribeFileSystems",
"fsx:TagResource"
],
"Resource": ["*"]
}
]
}
```

#### Deploy driver
```sh
kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-0.9"
# Alternatively, to pull from
# public ECR (public.ecr.aws/fsx-csi-driver/aws-fsx-csi-driver) instead of
# private ECR (602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-fsx-csi-driver):
kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/ecr-public?ref=release-0.9"
```

Alternatively, you can install the driver using Helm:

```sh
helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver/
helm repo update
helm upgrade --install aws-fsx-csi-driver --namespace kube-system aws-fsx-csi-driver/aws-fsx-csi-driver
```

##### Upgrading from version release-0.4 to release-0.5 of the kustomize configuration

In the master branch and the next release there are breaking changes that require you to pass `--force` to `kubectl apply`:
```sh
kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master" --force
```

##### Upgrading from version 0.x to 1.x of the Helm chart

Version 1.0.0 removed and renamed almost all values to be more consistent with the EBS and EFS CSI driver helm charts. For details, see the [CHANGELOG](./charts/aws-fsx-csi-driver/CHANGELOG.md).

### Examples
Before running the examples, you need to:
* Familiarize yourself with how to set up Kubernetes on AWS and [create an FSx for Lustre file system](https://docs.aws.amazon.com/fsx/latest/LustreGuide/getting-started.html#getting-started-step1) if you are using static provisioning.
114 changes: 114 additions & 0 deletions docs/install.md
@@ -0,0 +1,114 @@
# Installation

## Prerequisites

* Kubernetes Version >= 1.20

* If you are using a self-managed cluster, ensure the flag `--allow-privileged=true` is set for `kube-apiserver`.

* Important: If you intend to use the Volume Snapshot feature, the [Kubernetes Volume Snapshot CRDs](https://github.com/kubernetes-csi/external-snapshotter/tree/master/client/config/crd) must be installed **before** the FSx for Lustre CSI driver. For installation instructions, see [CSI Snapshotter Usage](https://github.com/kubernetes-csi/external-snapshotter#usage).

## Installation
### Set up driver permissions
The driver requires IAM permissions to interact with the Amazon FSx for Lustre service to create/delete file systems and volumes on the user's behalf.
There are several methods to grant the driver IAM permissions:
* Using [IAM roles for ServiceAccounts](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) (**Recommended**) - Create a Kubernetes service account for the driver and attach the AmazonFSxFullAccess AWS-managed policy to it with the following command. If your cluster is in the AWS GovCloud Regions, replace `arn:aws:` with `arn:aws-us-gov:`; likewise, if your cluster is in the AWS China Regions, replace `arn:aws:` with `arn:aws-cn:`.
```sh

export cluster_name=my-csi-fsx-cluster
export region_code=region-code

eksctl create iamserviceaccount \
    --name fsx-csi-controller-sa \
    --namespace kube-system \
    --cluster $cluster_name \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
    --approve \
    --role-name AmazonEKSFSxLustreCSIDriverFullAccess \
    --region $region_code
```
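
Under the hood, `eksctl` creates an IAM role with the policy attached and annotates the driver's service account so the controller pods can assume that role via IAM roles for service accounts; the resulting service account looks roughly like this sketch (the account ID is a placeholder).

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fsx-csi-controller-sa
  namespace: kube-system
  annotations:
    # IRSA: the controller pods exchange their service account token for this role's credentials
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/AmazonEKSFSxLustreCSIDriverFullAccess
```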

* Using IAM [instance profile](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html) - Create the following IAM policy and attach the policy to the instance profile IAM role of your cluster's worker nodes.
See [here](https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html) for guidelines on how to access your EKS node IAM role.
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"iam:CreateServiceLinkedRole",
"iam:AttachRolePolicy",
"iam:PutRolePolicy"
],
"Resource": "arn:aws:iam::*:role/aws-service-role/s3.data-source.lustre.fsx.amazonaws.com/*"
},
{
"Action":"iam:CreateServiceLinkedRole",
"Effect":"Allow",
"Resource":"*",
"Condition":{
"StringLike":{
"iam:AWSServiceName":[
"fsx.amazonaws.com"
]
}
}
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"fsx:CreateFileSystem",
"fsx:DeleteFileSystem",
"fsx:DescribeFileSystems",
"fsx:TagResource"
],
"Resource": ["*"]
}
]
}
```



### Configure driver toleration settings
By default, the driver controller tolerates the `CriticalAddonsOnly` taint with `tolerationSeconds` configured as `300`, and the driver node pods tolerate all taints.
If you don't want to deploy the driver node pods on all nodes, set the Helm value `node.tolerateAllTaints` to false before deployment.
Add entries to the `node.tolerations` value to configure customized tolerations for nodes.
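
As a sketch, assuming the chart exposes the `node.tolerateAllTaints` and `node.tolerations` values described above, a custom Helm values file restricting the node DaemonSet's tolerations could look like this (the taint key and value are hypothetical):

```yaml
# values-tolerations.yaml (hypothetical override file, passed via `helm upgrade -f`)
node:
  tolerateAllTaints: false
  tolerations:
    # Only schedule the driver node pods onto nodes carrying this dedicated taint
    - key: dedicated
      operator: Equal
      value: fsx-lustre
      effect: NoSchedule
```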

### Configure node startup taint
There are potential race conditions on node startup (especially when a node first joins the cluster) where pods or processes that rely on the FSx for Lustre CSI Driver can act on a node before the FSx for Lustre CSI Driver has started up and become fully ready. To combat this, the FSx for Lustre CSI Driver contains a feature to automatically remove a taint from the node on startup. Users can taint their nodes when they join the cluster and/or on startup to prevent other pods from running or being scheduled on the node before the FSx for Lustre CSI Driver becomes ready.

This feature is activated by default, and cluster administrators should use the taint `fsx.csi.aws.com/agent-not-ready:NoExecute` (any effect will work, but `NoExecute` is recommended). For example, EKS Managed Node Groups [support automatically tainting nodes](https://docs.aws.amazon.com/eks/latest/userguide/node-taints-managed-node-groups.html).
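
For example, an eksctl managed node group could apply the startup taint as in the sketch below; the cluster and node group names are placeholders, the exact schema should be checked against your eksctl version, and the driver removes the taint once it is ready on the node.

```yaml
# cluster.yaml (excerpt) -- hypothetical eksctl ClusterConfig
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-csi-fsx-cluster
  region: us-east-1
managedNodeGroups:
  - name: fsx-workers
    taints:
      # Keep workloads off the node until the FSx for Lustre CSI driver removes this taint
      - key: fsx.csi.aws.com/agent-not-ready
        effect: NoExecute
```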

### Deploy driver
You may deploy the FSx for Lustre CSI driver via Kustomize or Helm.

#### Kustomize
```sh
kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
```

*Note: Using the master branch to deploy the driver is not supported as the master branch may contain upcoming features incompatible with the currently released stable version of the driver.*

#### Helm
- Add the `aws-fsx-csi-driver` Helm repository.
```sh
helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver
helm repo update
```

- Install the latest release of the driver.
```sh
helm upgrade --install aws-fsx-csi-driver \
--namespace kube-system \
aws-fsx-csi-driver/aws-fsx-csi-driver
```

Review the [configuration values](https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blob/master/charts/aws-fsx-csi-driver/values.yaml) for the Helm chart.

Once the driver has been deployed, verify that the pods are running:
```sh
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-fsx-csi-driver
```
26 changes: 24 additions & 2 deletions docs/troubleshooting.md
@@ -15,11 +15,13 @@ For common Lustre issues, you can refer to the [AWS Lustre troubleshooting guide
##### Characteristics:

1. The underlying file system has a large number of files
2. When calling `kubectl get pod <pod-name>` you see an error message similar to this:
2. Your configuration requires setting volume ownership, which can be verified by seeing `Setting volume ownership` in the kubelet logs
3. When calling `kubectl get pod <pod-name>` you see an error message similar to this:
```
Warning FailedMount kubelet Unable to attach or mount volumes: unmounted volumes=[fsx-volume-name], unattached volumes=[fsx-volume-name]: timed out waiting for the condition
```


##### Likely Cause:
Volume ownership is being set recursively on every file in the volume, which prevents the pod from mounting the volume for an extended period of time. See https://github.com/kubernetes/kubernetes/issues/69699

@@ -28,7 +30,27 @@ Volume ownership is being set recursively on every file in the volume, which pre

For more information on configuring securityContext, see https://kubernetes.io/docs/tasks/configure-pod-container/security-context/.
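
One common way to limit the recursive ownership change, based on the upstream Kubernetes guidance linked above rather than a driver-specific setting, is to set `fsGroupChangePolicy: "OnRootMismatch"` in the pod's securityContext so ownership is only changed when the volume root does not already match the expected `fsGroup`; a minimal sketch with hypothetical names follows.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fsx-app        # hypothetical pod name
spec:
  securityContext:
    fsGroup: 1000
    # Skip the recursive chown/chmod when the volume root already has the expected ownership
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: fsx-volume-name
          mountPath: /data
  volumes:
    - name: fsx-volume-name
      persistentVolumeClaim:
        claimName: fsx-claim   # hypothetical PVC bound to the FSx for Lustre volume
```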



#### Issue: Pod is stuck in ContainerCreating when trying to mount a volume.

##### Characteristics:

1. When calling `kubectl get pod <pod-name>` you see an error message similar to this:
```
Warning FailedMount kubelet Unable to attach or mount volumes: unmounted volumes=[fsx-volume-name], unattached volumes=[fsx-volume-name]: timed out waiting for the condition
```
2. In the kubelet logs you see an error message similar to this:
```
kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name fsx.csi.aws.com not found in the list of registered CSI drivers
```


##### Likely Cause:
A transient race condition is occurring on node startup, where CSI RPC calls are made before the CSI driver is ready on the node.

##### Mitigation:
Refer to our [installation documentation](https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blob/master/docs/install.md#configure-node-startup-taint) for instructions on configuring the node startup taint.


#### Issue: Pods fail to mount file system with the following error:

