From c0c187c094ac45485a5bcbdddb3807b1fdb3c7f0 Mon Sep 17 00:00:00 2001 From: Alik Saring Date: Tue, 15 Feb 2022 14:42:22 -0500 Subject: [PATCH 1/2] Update Resiliency documentation for pod affinity and kubelet service failure --- content/docs/resiliency/_index.md | 8 ++++---- content/docs/resiliency/deployment.md | 7 +++++-- content/docs/resiliency/usecases.md | 2 ++ 3 files changed, 11 insertions(+), 6 deletions(-) diff --git a/content/docs/resiliency/_index.md b/content/docs/resiliency/_index.md index 6aaa5551fb..5bb7be6d28 100644 --- a/content/docs/resiliency/_index.md +++ b/content/docs/resiliency/_index.md @@ -17,7 +17,7 @@ For more background on the forced deletion of Pods in a StatefulSet, please visi CSM for Resiliency is designed to make Kubernetes Applications, including those that utilize persistent storage, more resilient to various failures. The first component of the Resiliency module is a pod monitor that is specifically designed to protect stateful applications from various failures. It is not a standalone application, but rather is deployed as a _sidecar_ to CSI (Container Storage Interface) drivers, in both the driver's controller pods and the driver's node pods. Deploying CSM for Resiliency as a sidecar allows it to make direct requests to the driver through the Unix domain socket that Kubernetes sidecars use to make CSI requests. -Some of the methods CSM for Resiliency invokes in the driver are standard CSI methods, such as NodeUnpublishVolume, NodeUnstageVolume, and ControllerUnpublishVolume. CSM for Resiliency also uses proprietary calls that are not part of the standard CSI specification. Currently, there is only one, ValidateVolumeHostConnectivity that returns information on whether a host is connected to the storage system and/or whether any I/O activity has happened in the recent past from a list of specified volumes. 
This allows CSM for Resiliency to make more accurate determinations about the state of the system and its persistent volumes. +Some of the methods CSM for Resiliency invokes in the driver are standard CSI methods, such as NodeUnpublishVolume, NodeUnstageVolume, and ControllerUnpublishVolume. CSM for Resiliency also uses proprietary calls that are not part of the standard CSI specification. Currently, there is only one, ValidateVolumeHostConnectivity, which reports whether a host is connected to the storage system and/or whether any I/O activity has occurred recently on a list of specified volumes. This allows CSM for Resiliency to make more accurate determinations about the state of the system and its persistent volumes. CSM for Resiliency is designed to adhere to the affinity settings of pods. Accordingly, CSM for Resiliency is adapted to and qualified with each CSI driver it is to be used with. Different storage systems have different nuances and characteristics that CSM for Resiliency must take into account. @@ -28,7 +28,7 @@ CSM for Resiliency provides the following capabilities: {{}} | Capability | PowerScale | Unity | PowerStore | PowerFlex | PowerMax | | - | :-: | :-: | :-: | :-: | :-: | -| Detect pod failures for the following failure types - Node failure, K8S Control Plane Network failure, Array I/O Network failure | no | yes | no | yes | no | +| Detect pod failures for the following failure types - Node failure, K8S Control Plane Network failure, K8S Control Plane failure, Array I/O Network failure | no | yes | no | yes | no | | Cleanup pod artifacts from failed nodes | no | yes | no | yes | no | | Revoke PV access from failed nodes | no | yes | no | yes | no | {{
}} @@ -38,7 +38,7 @@ CSM for Resiliency provides the following capabilities: {{}} | COP/OS | Supported Versions | |-|-| -| Kubernetes | 1.20, 1.21, 1.22 | +| Kubernetes | 1.20, 1.21, 1.22 1.23 | | Red Hat OpenShift | 4.8, 4.9 | | RHEL | 7.x, 8.x | | CentOS | 7.8, 7.9 | @@ -54,7 +54,7 @@ CSM for Resiliency provides the following capabilities: ## Supported CSI Drivers -CSM for Authorization supports the following CSI drivers and versions. +CSM for Resiliency supports the following CSI drivers and versions. {{
}} | Storage Array | CSI Driver | Supported Versions | | ------------- | ---------- | ------------------ | diff --git a/content/docs/resiliency/deployment.md b/content/docs/resiliency/deployment.md index 3710f604a1..34fd8c6bbd 100644 --- a/content/docs/resiliency/deployment.md +++ b/content/docs/resiliency/deployment.md @@ -29,6 +29,7 @@ podmon: - "--csisock=unix:/var/run/csi/csi.sock" - "--labelvalue=csi-vxflexos" - "--mode=controller" + - "--skipArrayConnectionValidation=false" - "--driver-config-params=/vxflexos-config-params/driver-config-params.yaml" node: args: @@ -55,7 +56,7 @@ To install CSM for Resiliency with the driver, the following changes are require | mode | Required | Must be set to "controller" for controller-podmon and "node" for node-podmon. | controller & node | | csisock | Required | This should be left as set in the helm template for the driver. For controller:
`-csisock=unix:/var/run/csi/csi.sock`
For node it will vary depending on the driver's identity:
`-csisock=unix:/var/lib/kubelet/plugins`
`/vxflexos.emc.dell.com/csi_sock` | controller & node | | leaderelection | Required | Boolean value that should be set true for controller and false for node. The default value is true. | controller & node | -| skipArrayConnectionValidation | Optional | Boolean value that if set to true will cause controllerPodCleanup to skip the validation that no I/O is ongoing before cleaning up the pod. | controller | +| skipArrayConnectionValidation | Optional | Boolean value that, if set to true, causes controllerPodCleanup to skip the validation that no I/O is ongoing before cleaning up the pod. Setting it to true causes controllerPodCleanup to run on a K8S Control Plane failure (kubelet service down). | controller | | labelKey | Optional | String value that sets the label key used to denote pods to be monitored by CSM for Resiliency. It will make life easier if this key is the same for all driver types, and drivers are differentiated by different labelValues (see below). If the label keys are the same across all drivers you can do `kubectl get pods -A -l labelKey` to find all the CSM for Resiliency protected pods. labelKey defaults to "podmon.dellemc.com/driver". | controller & node | | labelValue | Required | String that sets the value that denotes pods to be monitored by CSM for Resiliency. This must be specific for each driver. Defaults to "csi-vxflexos" for CSI Driver for Dell EMC PowerFlex and "csi-unity" for CSI Driver for Dell EMC Unity | controller & node | | arrayConnectivityPollRate | Optional | The minimum polling rate in seconds to determine if the array has connectivity to a node. Should not be set to less than 5 seconds. See the specific section for each array type for additional guidance. 
| controller | @@ -79,6 +80,7 @@ podmon: - "-mode=controller" - "-arrayConnectivityPollRate=5" - "-arrayConnectivityConnectionLossThreshold=3" + - "--skipArrayConnectionValidation=false" - "--driver-config-params=/vxflexos-config-params/driver-config-params.yaml" node: args: @@ -104,6 +106,7 @@ podmon: - "-labelvalue=csi-unity" - "-driverPath=csi-unity.dellemc.com" - "-mode=controller" + - "--skipArrayConnectionValidation=false" - "--driver-config-params=/unity-config/driver-config-params.yaml" node: args: @@ -135,7 +138,7 @@ This is a list of parameters that can be adjusted for CSM for Resiliency: | PODMON_NODE_LOG_LEVEL | String | "debug" |Logging level for the node podmon sidecar. Standard values: 'info', 'error', 'warning', 'debug', 'trace' | | PODMON_ARRAY_CONNECTIVITY_POLL_RATE | Integer (>0) | 15 |An interval in seconds to poll the underlying array | | PODMON_ARRAY_CONNECTIVITY_CONNECTION_LOSS_THRESHOLD | Integer (>0) | 3 |A value representing the number of failed connection poll intervals before marking the array connectivity as lost | -| PODMON_SKIP_ARRAY_CONNECTION_VALIDATION | Boolean | false |Flag to disable the array connectivity check | +| PODMON_SKIP_ARRAY_CONNECTION_VALIDATION | Boolean | false |Flag to disable the array connectivity check. Set to true for a NoSchedule or NoExecute taint due to a K8S Control Plane failure (kubelet failure) | Here is an example of the parameters: diff --git a/content/docs/resiliency/usecases.md b/content/docs/resiliency/usecases.md index bacefca590..daac595325 100644 --- a/content/docs/resiliency/usecases.md +++ b/content/docs/resiliency/usecases.md @@ -36,3 +36,5 @@ CSM for Resiliency's design is focused on detecting the following types of hardw 2. K8S Control Plane Network Failure. Control Plane Network Failure often has the same K8S failure signature (the node is tainted with NoSchedule or NoExecute). 
However, if there is a separate Array I/O interface, CSM for Resiliency can often detect that the Array I/O Network may be active even though the Control Plane Network is down. 3. Array I/O Network failure is detected by polling the array to determine if the array has a healthy connection to the node. The capabilities to do this vary greatly by array and communication protocol type (Fibre Channel, iSCSI, NFS, NVMe, or PowerFlex SDC IP protocol). By monitoring the Array I/O Network separately from the Control Plane Network, CSM for Resiliency has two different indicators of whether the node is healthy or not. + +4. K8S Control Plane Failure. Control Plane Failure is defined as the failure of the kubelet on a given node. K8S Control Plane failures are generally discovered by receipt of a Node event with a NoSchedule or NoExecute taint, or by detection of such a taint when retrieving the Node via the K8S API. From dec9ee69acc87b4d6af6241cbb69e05edfd34296 Mon Sep 17 00:00:00 2001 From: Alik Saring Date: Fri, 18 Feb 2022 10:34:52 -0500 Subject: [PATCH 2/2] PR review changes --- content/docs/resiliency/_index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/resiliency/_index.md b/content/docs/resiliency/_index.md index 5bb7be6d28..1b985e52ba 100644 --- a/content/docs/resiliency/_index.md +++ b/content/docs/resiliency/_index.md @@ -38,7 +38,7 @@ CSM for Resiliency provides the following capabilities: {{
}} | COP/OS | Supported Versions | |-|-| -| Kubernetes | 1.20, 1.21, 1.22 1.23 | +| Kubernetes | 1.21, 1.22, 1.23 | | Red Hat OpenShift | 4.8, 4.9 | | RHEL | 7.x, 8.x | | CentOS | 7.8, 7.9 |
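
The taint-based detection that the usecases.md change describes for K8S Control Plane failures can be sketched roughly as follows. This is a minimal illustration in Python, not the podmon implementation: `has_failure_taint` and the sample node dicts are hypothetical, and the sketch assumes a Node object shaped like the JSON the K8S API returns, with taints listed under `spec.taints`.

```python
from typing import Any, Mapping

# Taint effects named in the use-case description in this patch.
FAILURE_EFFECTS = frozenset({"NoSchedule", "NoExecute"})


def has_failure_taint(node: Mapping[str, Any]) -> bool:
    """Return True if the Node carries a NoSchedule or NoExecute taint.

    `node` is assumed to be a dict shaped like the Node JSON from the
    K8S API (taints under spec.taints). Hypothetical helper, for
    illustration only.
    """
    taints = (node.get("spec") or {}).get("taints") or []
    return any(t.get("effect") in FAILURE_EFFECTS for t in taints)


# Sample inputs (hypothetical node names).
healthy = {"metadata": {"name": "worker-1"}, "spec": {}}
tainted = {
    "metadata": {"name": "worker-2"},
    "spec": {
        "taints": [
            {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute"}
        ]
    },
}

for node in (healthy, tainted):
    print(node["metadata"]["name"], has_failure_taint(node))
```

Because the patch notes that a Control Plane Network failure often shows the same taints, a check like this can only flag a node as suspect; it cannot tell the failure types apart without the separate array connectivity signal.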