Update Resiliency documentation for pod affinity and kubelet failure #141

Merged 3 commits on Feb 21, 2022
Changes from 2 commits
8 changes: 4 additions & 4 deletions content/docs/resiliency/_index.md
@@ -17,7 +17,7 @@ For more background on the forced deletion of Pods in a StatefulSet, please visi

CSM for Resiliency is designed to make Kubernetes Applications, including those that utilize persistent storage, more resilient to various failures. The first component of the Resiliency module is a pod monitor that is specifically designed to protect stateful applications from various failures. It is not a standalone application, but rather is deployed as a _sidecar_ to CSI (Container Storage Interface) drivers, in both the driver's controller pods and the driver's node pods. Deploying CSM for Resiliency as a sidecar allows it to make direct requests to the driver through the Unix domain socket that Kubernetes sidecars use to make CSI requests.

Some of the methods CSM for Resiliency invokes in the driver are standard CSI methods, such as NodeUnpublishVolume, NodeUnstageVolume, and ControllerUnpublishVolume. CSM for Resiliency also uses proprietary calls that are not part of the standard CSI specification. Currently there is only one, ValidateVolumeHostConnectivity, which reports whether a host is connected to the storage system and/or whether any I/O activity has recently occurred on a list of specified volumes. This allows CSM for Resiliency to make more accurate determinations about the state of the system and its persistent volumes.
Some of the methods CSM for Resiliency invokes in the driver are standard CSI methods, such as NodeUnpublishVolume, NodeUnstageVolume, and ControllerUnpublishVolume. CSM for Resiliency also uses proprietary calls that are not part of the standard CSI specification. Currently there is only one, ValidateVolumeHostConnectivity, which reports whether a host is connected to the storage system and/or whether any I/O activity has recently occurred on a list of specified volumes. This allows CSM for Resiliency to make more accurate determinations about the state of the system and its persistent volumes. CSM for Resiliency is also designed to adhere to the pods' affinity settings.

Accordingly, CSM for Resiliency is adapted to and qualified with each CSI driver it is to be used with. Different storage systems have different nuances and characteristics that CSM for Resiliency must take into account.

@@ -28,7 +28,7 @@ CSM for Resiliency provides the following capabilities:
{{<table "table table-striped table-bordered table-sm">}}
| Capability | PowerScale | Unity | PowerStore | PowerFlex | PowerMax |
| - | :-: | :-: | :-: | :-: | :-: |
| Detect pod failures for the following failure types - Node failure, K8S Control Plane Network failure, Array I/O Network failure | no | yes | no | yes | no |
| Detect pod failures for the following failure types - Node failure, K8S Control Plane Network failure, K8S Control Plane failure, Array I/O Network failure | no | yes | no | yes | no |
| Cleanup pod artifacts from failed nodes | no | yes | no | yes | no |
| Revoke PV access from failed nodes | no | yes | no | yes | no |
{{</table>}}
@@ -38,7 +38,7 @@ CSM for Resiliency provides the following capabilities:
{{<table "table table-striped table-bordered table-sm">}}
| COP/OS | Supported Versions |
|-|-|
| Kubernetes | 1.20, 1.21, 1.22 |
| Kubernetes | 1.21, 1.22, 1.23 |
| Red Hat OpenShift | 4.8, 4.9 |
| RHEL | 7.x, 8.x |
| CentOS | 7.8, 7.9 |
@@ -54,7 +54,7 @@ CSM for Resiliency provides the following capabilities:

## Supported CSI Drivers

CSM for Authorization supports the following CSI drivers and versions.
shanmydell marked this conversation as resolved.
CSM for Resiliency supports the following CSI drivers and versions.
{{<table "table table-striped table-bordered table-sm">}}
| Storage Array | CSI Driver | Supported Versions |
| ------------- | ---------- | ------------------ |
7 changes: 5 additions & 2 deletions content/docs/resiliency/deployment.md
@@ -29,6 +29,7 @@ podmon:
- "--csisock=unix:/var/run/csi/csi.sock"
- "--labelvalue=csi-vxflexos"
- "--mode=controller"
- "--skipArrayConnectionValidation=false"
- "--driver-config-params=/vxflexos-config-params/driver-config-params.yaml"
node:
args:
@@ -55,7 +56,7 @@ To install CSM for Resiliency with the driver, the following changes are require
| mode | Required | Must be set to "controller" for controller-podmon and "node" for node-podmon. | controller & node |
| csisock | Required | This should be left as set in the helm template for the driver. For controller: <br> `-csisock=unix:/var/run/csi/csi.sock` <br> For node it will vary depending on the driver's identity: <br> `-csisock=unix:/var/lib/kubelet/plugins`<br>`/vxflexos.emc.dell.com/csi_sock` | controller & node |
| leaderelection | Required | Boolean value that should be set true for controller and false for node. The default value is true. | controller & node |
| skipArrayConnectionValidation | Optional | Boolean value that if set to true will cause controllerPodCleanup to skip the validation that no I/O is ongoing before cleaning up the pod. | controller |
| skipArrayConnectionValidation | Optional | Boolean value that, if set to true, causes controllerPodCleanup to skip the validation that no I/O is ongoing before cleaning up the pod. Setting it to true also causes controllerPodCleanup to run on K8S Control Plane failure (kubelet service down). | controller |
| labelKey | Optional | String value that sets the label key used to denote pods to be monitored by CSM for Resiliency. It will make life easier if this key is the same for all driver types, and drivers are differentiated by different labelValues (see below). If the label keys are the same across all drivers you can do `kubectl get pods -A -l labelKey` to find all the CSM for Resiliency protected pods. labelKey defaults to "podmon.dellemc.com/driver". | controller & node |
| labelValue | Required | String that sets the value that denotes pods to be monitored by CSM for Resiliency. This must be specific for each driver. Defaults to "csi-vxflexos" for CSI Driver for Dell EMC PowerFlex and "csi-unity" for CSI Driver for Dell EMC Unity | controller & node |
| arrayConnectivityPollRate | Optional | The minimum polling rate in seconds to determine if the array has connectivity to a node. Should not be set to less than 5 seconds. See the specific section for each array type for additional guidance. | controller |
@@ -79,6 +80,7 @@ podmon:
- "-mode=controller"
- "-arrayConnectivityPollRate=5"
- "-arrayConnectivityConnectionLossThreshold=3"
- "--skipArrayConnectionValidation=false"
- "--driver-config-params=/vxflexos-config-params/driver-config-params.yaml"
node:
args:
@@ -104,6 +106,7 @@ podmon:
- "-labelvalue=csi-unity"
- "-driverPath=csi-unity.dellemc.com"
- "-mode=controller"
- "--skipArrayConnectionValidation=false"
- "--driver-config-params=/unity-config/driver-config-params.yaml"
node:
args:
@@ -135,7 +138,7 @@ This is a list of parameters that can be adjusted for CSM for Resiliency:
| PODMON_NODE_LOG_LEVEL | String | "debug" |Logging level for the node podmon sidecar. Standard values: 'info', 'error', 'warning', 'debug', 'trace' |
| PODMON_ARRAY_CONNECTIVITY_POLL_RATE | Integer (>0) | 15 |An interval in seconds to poll the underlying array |
| PODMON_ARRAY_CONNECTIVITY_CONNECTION_LOSS_THRESHOLD | Integer (>0) | 3 |A value representing the number of failed connection poll intervals before marking the array connectivity as lost |
| PODMON_SKIP_ARRAY_CONNECTION_VALIDATION | Boolean | false |Flag to disable the array connectivity check |
| PODMON_SKIP_ARRAY_CONNECTION_VALIDATION | Boolean | false |Flag to disable the array connectivity check. Set to true to handle a NoSchedule or NoExecute taint caused by K8S Control Plane failure (kubelet failure) |
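
The interaction between the poll rate and the connection loss threshold can be illustrated with a minimal Python sketch. It is not the podmon implementation: `ConnectivityTracker` and `detection_window_seconds` are hypothetical names, and the sketch assumes that only consecutive failed polls count toward the threshold and that polls are evenly spaced.

```python
class ConnectivityTracker:
    """Hypothetical helper: counts failed polls against a loss threshold."""

    def __init__(self, loss_threshold: int = 3) -> None:
        if loss_threshold <= 0:
            raise ValueError("loss_threshold must be > 0")
        self.loss_threshold = loss_threshold
        self.failed_polls = 0

    def record_poll(self, connected: bool) -> bool:
        """Record one poll result; return True once connectivity counts as lost."""
        # Assumption: a successful poll resets the failure count.
        self.failed_polls = 0 if connected else self.failed_polls + 1
        return self.failed_polls >= self.loss_threshold


def detection_window_seconds(poll_rate_s: int, loss_threshold: int) -> int:
    """Approximate time before connectivity is marked lost (evenly spaced polls)."""
    if poll_rate_s <= 0 or loss_threshold <= 0:
        raise ValueError("both values must be > 0")
    return poll_rate_s * loss_threshold


tracker = ConnectivityTracker(loss_threshold=3)
results = [tracker.record_poll(False) for _ in range(3)]
```

Under these assumptions, the table's default values (a poll rate of 15 seconds and a threshold of 3) give a window of roughly 45 seconds before connectivity is treated as lost.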

Here is an example of the parameters:

2 changes: 2 additions & 0 deletions content/docs/resiliency/usecases.md
@@ -36,3 +36,5 @@ CSM for Resiliency's design is focused on detecting the following types of hardw
2. K8S Control Plane Network Failure. Control Plane Network Failure often has the same K8S failure signature (the node is tainted with NoSchedule or NoExecute). However, if there is a separate Array I/O interface, CSM for Resiliency can often detect that the Array I/O Network may be active even though the Control Plane Network is down.

3. Array I/O Network failure is detected by polling the array to determine if the array has a healthy connection to the node. The capabilities to do this vary greatly by array and communication protocol type (Fibre Channel, iSCSI, NFS, NVMe, or PowerFlex SDC IP protocol). By monitoring the Array I/O Network separately from the Control Plane Network, CSM for Resiliency has two different indicators of whether the node is healthy or not.

4. K8S Control Plane Failure. Control Plane Failure is defined as failure of the kubelet on a given node. K8S Control Plane failures are generally discovered by receipt of a Node event with a NoSchedule or NoExecute taint, or by detection of such a taint when retrieving the Node via the K8S API.
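
The taint-based failure signature described in this list can be sketched in a few lines of Python. This is an illustration only, not how CSM for Resiliency is implemented: `Taint` is a simplified stand-in for the Kubernetes node taint object, and `has_failure_taint` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Iterable

# Taint effects named in the use-case text as the failure signature.
FAILURE_EFFECTS = frozenset({"NoSchedule", "NoExecute"})


@dataclass(frozen=True)
class Taint:
    """Simplified stand-in for a Kubernetes node taint (key and effect only)."""
    key: str
    effect: str


def has_failure_taint(taints: Iterable[Taint]) -> bool:
    """Return True if any taint carries a NoSchedule or NoExecute effect."""
    return any(t.effect in FAILURE_EFFECTS for t in taints)


# Example inputs: one node with a soft scheduling preference, one unreachable node.
healthy = [Taint("example.com/dedicated", "PreferNoSchedule")]
unreachable = [Taint("node.kubernetes.io/unreachable", "NoExecute")]
```

A real monitor would obtain the taint list from a Node event or from the Node object returned by the K8S API. As the list above notes, the taint alone does not distinguish a node failure from a control plane failure, so it would need to be combined with the array connectivity indicator.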