From 2cd2270cce95c098f93f932b9c84a2a4dd653b4c Mon Sep 17 00:00:00 2001 From: Suleyman Akbas Date: Tue, 2 May 2023 10:37:44 +0200 Subject: [PATCH] chore: improve the design docs Signed-off-by: Suleyman Akbas --- README.md | 20 ++--- controllers/lvmcluster_controller.go | 2 +- docs/README.md | 17 +--- docs/design/lvm-operator-manager.md | 60 +++++++++++++ docs/design/lvmo-units.md | 23 ----- docs/design/operator.md | 41 --------- docs/design/reconciler.md | 89 +++++++------------ docs/design/thin-pool.md | 126 --------------------------- docs/design/thin-provisioning.md | 121 +++++++++++++++++++++++++ docs/design/topolvm-csi.md | 35 -------- docs/design/vg-manager.md | 35 ++------ 11 files changed, 232 insertions(+), 337 deletions(-) create mode 100644 docs/design/lvm-operator-manager.md delete mode 100644 docs/design/lvmo-units.md delete mode 100644 docs/design/operator.md delete mode 100644 docs/design/thin-pool.md create mode 100644 docs/design/thin-provisioning.md delete mode 100644 docs/design/topolvm-csi.md diff --git a/README.md b/README.md index 78390cc14..23d949ec3 100644 --- a/README.md +++ b/README.md @@ -107,12 +107,12 @@ After the CR is deployed, the following actions are executed: - A Logical Volume Manager (LVM) volume group named `vg1` is created, utilizing all available disks on the cluster. - A thin pool named `thin-pool-1` is created within `vg1`, with a size equivalent to 90% of `vg1`. - The TopoLVM Container Storage Interface (CSI) plugin is deployed, resulting in the launch of the `topolvm-controller` and `topolvm-node` pods. -- A Storage Class and a Volume Snapshot Class are created, both named `lvms-vg1`. This facilitates storage provisioning for OpenShift workloads. The Storage Class is configured with the `WaitForFirstConsumer` volume binding mode that is utilized in a multi-node configuration to optimize the scheduling of pod placement. This strategy prioritizes the allocation of pods to nodes with the greatest amount of available storage capacity. +- A storage class and a volume snapshot class are created, both named `lvms-vg1`. This facilitates storage provisioning for OpenShift workloads. The storage class is configured with the `WaitForFirstConsumer` volume binding mode that is utilized in a multi-node configuration to optimize the scheduling of pod placement. This strategy prioritizes the allocation of pods to nodes with the greatest amount of available storage capacity. - The LVMS system also creates two additional internal CRs to support its functionality: - * `LVMVolumeGroup` is generated and managed by LVMS to monitor the individual Volume Groups across multiple nodes in the cluster. - * `LVMVolumeGroupNodeStatus` is created by the VG Manager. This CR is used to monitor the status of volume groups on individual nodes in the cluster. + * `LVMVolumeGroup` is generated and managed by LVMS to monitor the individual volume groups across multiple nodes in the cluster. + * `LVMVolumeGroupNodeStatus` is created by the [Volume Group Manager](docs/design/vg-manager.md). This CR is used to monitor the status of volume groups on individual nodes in the cluster. -Wait until the LVMCluster reaches the `Ready` status: +Wait until the `LVMCluster` reaches the `Ready` status: ```bash $ oc get lvmclusters.lvm.topolvm.io my-lvmcluster @@ -128,11 +128,11 @@ $ oc get pods -w The `topolvm-node` pod remains in the initialization phase until the `vg-manager` completes all the necessary preparations. 
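For orientation, a sketch of what the generated `lvms-vg1` storage class described above might look like is shown below. The provisioner and parameter keys are assumptions based on TopoLVM conventions rather than a verbatim copy of the object LVMS creates; inspect the real object with `oc get sc lvms-vg1 -o yaml`.

```yaml
# Hypothetical sketch of the storage class created for device class vg1.
# Parameter keys and defaults are assumptions; check the object on your cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lvms-vg1
provisioner: topolvm.io                  # TopoLVM CSI driver name (assumed)
parameters:
  topolvm.io/device-class: vg1           # assumed key mapping the class to the vg1 device class
volumeBindingMode: WaitForFirstConsumer  # delays binding until a pod is scheduled
allowVolumeExpansion: true               # assumption; LVMS may configure this differently
```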
-Once all the pods have been launched, the LVMS is ready to manage your Logical Volumes and make them available for use in your applications.
+Once all the pods have been launched, the LVMS is ready to manage your logical volumes and make them available for use in your applications.
 
 ### Inspecting the storage objects on the node
 
-Prior to the deployment of the Logical Volume Manager Storage (LVMS), there are no pre-existing LVM Physical Volumes (PVs), Volume Groups (VGs), or Logical Volumes (LVs) associated with the disks.
+Prior to the deployment of the Logical Volume Manager Storage (LVMS), there are no pre-existing LVM physical volumes, volume groups, or logical volumes associated with the disks.
 
 ```bash
 sh-4.4# lsblk
@@ -228,7 +228,7 @@ spec:
 EOF
 ```
 
-Once the pod has been created and associated with the corresponding PVC, the PVC will be bound, and the pod will transition to the Running state in due course.
+Once the pod has been created and associated with the corresponding PVC, the PVC will be bound, and the pod will transition to the `Running` state in due course.
 
 ```bash
 $ oc get pvc,pods
@@ -246,21 +246,21 @@ To perform a full cleanup, follow these steps:
 
 1. Remove all the application pods which are using PVCs created with LVMS, and then remove all these PVCs.
 
-2. Ensure that there are no remaining LogicalVolume custom resources that were created by LVMS.
+2. Ensure that there are no remaining `LogicalVolume` custom resources that were created by LVMS.
 
 ```bash
 $ oc get logicalvolumes.topolvm.io
 No resources found
 ```
 
-3. Remove the LVMCluster CR.
+3. Remove the `LVMCluster` CR.
 
 ```bash
 $ oc delete lvmclusters.lvm.topolvm.io my-lvmcluster
 lvmcluster.lvm.topolvm.io "my-lvmcluster" deleted
 ```
 
-4. Verify that the only remaining resource in the `openshift-storage` namespace is the operator.
+4. Verify that the only remaining resource in the `openshift-storage` namespace is the Operator.
 
 ```bash
 oc get pods -n openshift-storage
diff --git a/controllers/lvmcluster_controller.go b/controllers/lvmcluster_controller.go
index efe8dd7b0..4286ce77f 100644
--- a/controllers/lvmcluster_controller.go
+++ b/controllers/lvmcluster_controller.go
@@ -335,7 +335,7 @@ func (r *LVMClusterReconciler) getExpectedVgCount(ctx context.Context, instance
 	return vgCount, nil
 }
 
-// NOTE: when updating this, please also update docs/design/operator.md
+// NOTE: when updating this, please also update docs/design/reconciler.md
 type resourceManager interface {
 
 	// getName should return a camelCase name of this unit of reconciliation
diff --git a/docs/README.md b/docs/README.md
index 66e136291..fac557673 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,15 +1,6 @@
 # Contents
 
-1. [Reconcile][reconciler]
-2. [VG Manager][vg-manager]
-3. [CSI Units][topolvm-csi]
-4. [LVMO Units][lvmo-units]
-5. [Thin Pools][thin_pool]
-6. [Operator][operator]
-
-[reconciler]: design/reconciler.md
-[vg-manager]: design/vg-manager.md
-[topolvm-csi]: design/topolvm-csi.md
-[lvmo-units]: design/lvmo-units.md
-[thin_pool]: design/thin-pool.md
-[operator]: design/operator.md
+1. [Reconciler Design](design/reconciler.md)
+2. [The LVM Operator Manager](design/lvm-operator-manager.md)
+3. [The Volume Group Manager](design/vg-manager.md)
+4. 
[Thin Provisioning](design/thin-provisioning.md) diff --git a/docs/design/lvm-operator-manager.md b/docs/design/lvm-operator-manager.md new file mode 100644 index 000000000..bfe6f87d8 --- /dev/null +++ b/docs/design/lvm-operator-manager.md @@ -0,0 +1,60 @@ +# The LVM Operator Manager + +The LVM Operator Manager runs the LVM Cluster controller/reconciler that manages the following reconcile units: + +- [LVMCluster Custom Resource (CR)](#lvmcluster-custom-resource--cr-) +- [TopoLVM CSI](#topolvm-csi) + * [CSI Driver](#csi-driver) + * [TopoLVM Controller](#topolvm-controller) + * [Topolvm Node and lvmd](#topolvm-node-and-lvmd) + * [TopoLVM Scheduler](#topolvm-scheduler) +- [Storage Classes](#storage-classes) +- [Volume Group Manager](#volume-group-manager) +- [LVM Volume Groups](#lvm-volume-groups) +- [Openshift Security Context Constraints (SCCs)](#openshift-security-context-constraints--sccs-) + +Upon receiving a valid [LVMCluster custom resource](#lvmcluster-custom-resource--cr-), the LVM Cluster Controller initiates the reconciliation process to set up the TopoLVM Container Storage Interface (CSI) along with all the required resources for using locally available storage through Logical Volume Manager (LVM). + +## LVMCluster Custom Resource (CR) + +The LVMCluster CR is a crucial component of the LVM Operator, as it represents the volume groups that should be created and managed across nodes with custom node selector, toleration, and device selectors. This CR must be created and edited by the user in the namespace where the Operator is also installed. However, it is important to note that only a single CR instance is supported. The user can choose to specify the devices in `deviceSelector.paths` field to be used for the volume group, or if no paths are specified, all available disks will be used. The `status` field is updated based on the status of volume group creation across nodes. It is through the LVMCluster CR that the LVM Operator can create and manage the required volume groups, ensuring that they are available for use by the applications running on the OpenShift cluster. + +The LVM Cluster Controller generates an LVMVolumeGroup CR for each `deviceClass` present in the LVMCluster CR. The Volume Group Manager controller manages the reconciliation of the LVMVolumeGroups. The LVM Cluster Controller also collates the device class status across nodes from LVMVolumeGroupNodeStatus and updates the status of LVMCluster CR. + +> Note: Each device class corresponds to a single volume group. + +## TopoLVM CSI + +The LVM Operator deploys the TopoLVM CSI plugin, which enables dynamic provisioning of local storage. For more detailed information about TopoLVM, consult the [TopoLVM documentation](https://github.com/topolvm/topolvm/tree/main/docs). + +### CSI Driver + +The *csiDriver* reconcile unit creates the TopoLVM CSIDriver resource. + +### TopoLVM Controller + +The *topolvmController* reconcile unit is responsible for deploying a single instance of the TopoLVM Controller plugin Deployment and ensuring that any necessary updates are made to the Deployment. As part of this process, an init container is used to generate openssl certificates that are utilized by the TopoLVM Controller. However, it should be noted that this method will be replaced with the use of cert-manager in the near future. + +### Topolvm Node and lvmd + +The *topolvmNode* reconcile unit is responsible for deploying and managing the TopoLVM Node plugin and lvmd daemon set. 
It scales the DaemonSet based on the node selector specified in the devicesClasses field in the LVMCluster CR. During initialization, an init container polls for the availability of the lvmd configuration file before starting the `lvmd` and `topolvm-node` containers. + +### TopoLVM Scheduler + +The TopoLVM Scheduler is **not** used in LVMS for scheduling Pods. Instead, the CSI StorageCapacity tracking feature is utilized by the Kubernetes scheduler to determine the Node on which to provision storage. This feature provides the necessary information to the scheduler regarding the available storage on each Node, allowing it to make an informed decision about where to place the Pod. + +## Storage Classes + +The *topolvmStorageClass* reconcile unit is responsible for creating and managing all storage classes associated with the device classes specified in the LVMCluster CR. Each storage class is named with a prefix of 'lvms-' followed by the name of the corresponding device class in the LVMCluster CR. + +## Volume Group Manager + +The *vgManager* reconcile unit is responsible for deploying and managing the [Volume Group Manager](./vg-manager.md). + +## LVM Volume Groups + +The *lvmVG* reconcile unit is responsible for deploying and managing the LVMVolumeGroup CRs. It creates individual LVMVolumeGroup CRs for each deviceClass specified in the LVMCluster CR. These CRs are then used by the [Volume Group Manager](./vg-manager.md) to create volume groups and generate the lvmd config file for TopoLVM. + +## Openshift Security Context Constraints (SCCs) + +The Operator requires elevated permissions to interact with the host's LVM commands, which are executed through `nsenter`. When deployed on an OpenShift cluster, all the necessary Security Context Constraints (SCCs) are created by the *openshiftSccs* reconcile unit. This ensures that the `vg-manager`, `topolvm-node`, and `lvmd` containers have the required permissions to function properly. diff --git a/docs/design/lvmo-units.md b/docs/design/lvmo-units.md deleted file mode 100644 index 340788484..000000000 --- a/docs/design/lvmo-units.md +++ /dev/null @@ -1,23 +0,0 @@ -## LVM Volume Groups - -- *lvmVG* reconcile units deploys and manages LVMVolumeGroup CRs -- The LVMVG resource manager creates individual LVMVolumeGroup CRs for each - deviceClass in the LVMCluster CR. The vgmanager controller watches the LVMVolumeGroup - and creates the required volume groups on the individual nodes based on the - specified deviceSelector and nodeSelector. -- The corresponding CRs forms the basis of `vgManager` unit to create volume - groups and the lvmd config file for TopoLVM. - -## Openshift SCCs - -- When the operator is deployed on an Openshift cluster all the required - SCCs are created by `openshiftSccs` reconcile unit -- The `vg-manager`, `topolvm-node` and `lvmd` containers need elevated - permissions to access host LVM commands using `nsenter`. 
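To make the SCC discussion above more concrete, here is a deliberately simplified sketch of the kind of SecurityContextConstraints object the *openshiftSccs* reconcile unit could create for the `vg-manager` containers. The name, the service account, and the individual permission flags are illustrative assumptions, not the actual SCCs shipped by LVMS.

```yaml
# Illustrative SCC sketch only; the real SCCs created by the openshiftSccs
# reconcile unit have their own names, users, and settings.
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: lvms-vgmanager-example              # hypothetical name
allowPrivilegedContainer: true              # LVM commands are run on the host via nsenter
allowHostDirVolumePlugin: true              # hostPath access to devices and config files
allowHostPID: true                          # nsenter needs to join the host PID namespace
allowHostIPC: false
allowHostNetwork: false
allowHostPorts: false
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
fsGroup:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
volumes: ["hostPath", "configMap", "emptyDir", "secret"]
users:
  - system:serviceaccount:openshift-storage:vg-manager   # assumed service account name
```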
- -## Storage Classes - -- *topolvmStorageClass* resource units creates and manages all the storage - classes corresponding to the deviceClasses in the LVMCluster -- Storage Class name is generated with a prefix "lvms-" added to name of the - device class in the LVMCluster CR diff --git a/docs/design/operator.md b/docs/design/operator.md deleted file mode 100644 index 0f4e0fc65..000000000 --- a/docs/design/operator.md +++ /dev/null @@ -1,41 +0,0 @@ -# Operator design - -# Controllers and their managed resources - - -- **lvmcluster-controller:** Running in the operator deployment, it will create all resources that don't require information from the node. When applicable, the health of the underlying resource is updated in the LVMCluster status.: - - vgmanager daemonset - - lvmd daemonset - - TopoLVM CSIDriver CR - - TopoLVM CSI Driver Controller Deployment (controller is the name of the csi-component) - - TopoLVM CSI Driver Node Daemonset - - needs an initContainer to block until lvmd config file is read -- **The vg-manager:** A daemonset with one instance per selected node, it will create all resources that require knowledge from the node. - - volumegroups and thinpools - - lvmd config file - - - -Each unit of reconciliation should implement the reconcileUnit interface. -This will be run by the controller, and errors and success will be propagated to the status and events. -This interface is defined in [lvmcluster_controller.go](../../controllers/lvmcluster_controller.go) - -``` -type resourceManager interface { - - // getName should return a camelCase name of this unit of reconciliation - getName() string - - // ensureCreated should check the resources managed by this unit - ensureCreated(*LVMClusterReconciler, context.Context, lvmv1alpha1.LVMCluster) error - - // ensureDeleted should wait for the resources to be cleaned up - ensureDeleted(*LVMClusterReconciler, context.Context, lvmv1alpha1.LVMCluster) error - - // updateStatus should optionally update the CR's status about the health of the managed resource - // each unit will have updateStatus called induvidually so - // avoid status fields like lastHeartbeatTime and have a - // status that changes only when the operands change. - updateStatus(*LVMClusterReconciler, context.Context, lvmv1alpha1.LVMCluster) error -} -``` diff --git a/docs/design/reconciler.md b/docs/design/reconciler.md index e90a10b29..b3c831588 100644 --- a/docs/design/reconciler.md +++ b/docs/design/reconciler.md @@ -1,59 +1,30 @@ -# Operator design - -## Controllers and their managed resources - -### lvmcluster-controller - -- On receiving a valid LVMCluster CR, lvmcluster-controller reconciles the - following resource units for setting up [Topolvm](topolvm-repo) CSI and all - the supporting resources to use storage local to the node via Logical Volume - Manager (lvm) -- *csiDriver*: Reconciles TopoLVM CSI Driver -- *topolvmController*: Reconciles TopoLVM controller plugin -- *lvmVG*: Reconciles volume groups from LVMCluster CR -- *openshiftSccs*: Manages SCCs when the operator is run in Openshift - environment -- *topolvmNode*: Reconciles TopoLVM nodeplugin along with lvmd -- *vgManager*: Responsible for creation of Volume Groups -- *topolvmStorageClass*: Manages storage class life cycle based on - devicesClasses in LVMCluster CR -- The LVMO creates an LVMVolumeGroup CR for each deviceClass in the - LVMCluster CR. The LVMVolumeGroups are reconciled by the vgmanager controllers. 
-- In addition to managing the above resource units, lvmcluster-controller collates
-  the status of deviceClasses across nodes from LVMVolumeGroupNodeStatus and
-  updates status of LVMCluster CR
-- `resourceManager` interface is defined in
-  [lvmcluster_controller.go][contorller]
-- Depending on the resource unit some of the methods can be no-op
-
-Note:
-- Above names refers to the struct which satisfies `resourceManager` interface
-- Please refer to the topolvm [design][topolvm-design] doc to know more about TopoLVM
-  CSI
-- Any new resource units should also implement `resourceManager` interface
-
-### Lifecycle of Custom Resources
-
-- [LVMCluster CR][lvmcluster] represents the volume groups that should be
-  created and managed across nodes with custom node selector, toleration and
-  device selectors
-- Should be created and edited by user in operator installed namespace
-- Only a single CR instance with a single volume group is supported.
-- The user can choose to specify the devices to be used for the volumegroup.
-- All available disks will be used if no devicePaths are specified,.
-- All fields in `status` are updated based on the status of volume groups
-  creation across nodes
-
-Note:
-- Device Class and Volume Group can be read interchangeably
-- Multiple CRs exist to separate concerns of which component deployed by LVMO
-  updates which CR there by reducing multiple reconcile loops and colliding
-  requests/updates to Kubernetes API Server
-- Feel free to raise a github [issue][issue] for open discussions about API
-  changes if required
-
-[topolvm-repo]: https://github.com/topolvm/topolvm
-[topolvm-design]: https://github.com/topolvm/topolvm/blob/main/docs/design.md
-[controller]: ../../controllers/lvmcluster_controller.go
-[lvmcluster]: ../../api/v1alpha1/lvmcluster_types.go
-[issue]: https://github.com/openshift/lvm-operator/issues
+# Operator Design
+
+The LVM Operator consists of two managers:
+
+- The [LVM Operator Manager](lvm-operator-manager.md) runs in a deployment called `lvms-operator` and manages multiple reconciliation units.
+- The [Volume Group Manager](vg-manager.md) runs in a daemon set called `vg-manager` and manages a single reconciliation unit.
+
+## Implementation Notes
+
+Each unit of reconciliation should implement the `resourceManager` interface. This will be run by the controller, and errors and success will be propagated to the status and events. This interface is defined in [lvmcluster_controller.go](../../controllers/lvmcluster_controller.go).
+
+```go
+type resourceManager interface {
+
+	// getName should return a camelCase name of this unit of reconciliation
+	getName() string
+
+	// ensureCreated should check the resources managed by this unit
+	ensureCreated(*LVMClusterReconciler, context.Context, lvmv1alpha1.LVMCluster) error
+
+	// ensureDeleted should wait for the resources to be cleaned up
+	ensureDeleted(*LVMClusterReconciler, context.Context, lvmv1alpha1.LVMCluster) error
+
+	// updateStatus should optionally update the CR's status about the health of the managed resource
+	// each unit will have updateStatus called individually so
+	// avoid status fields like lastHeartbeatTime and have a
+	// status that changes only when the operands change.
+ updateStatus(*LVMClusterReconciler, context.Context, lvmv1alpha1.LVMCluster) error +} +``` diff --git a/docs/design/thin-pool.md b/docs/design/thin-pool.md deleted file mode 100644 index 8a2b61d50..000000000 --- a/docs/design/thin-pool.md +++ /dev/null @@ -1,126 +0,0 @@ -# LVMO: Thin provisioning - -## Summary -- LVM thin provisioning allows the creation of volumes whose combined virtual size is greater than that of the available storage. - -**Advantages**: -- Storage space can be used more effectively. More users can be accommodated for the same amount of storage space when compared to thick provisioning. This significantly reduces upfront hardware cost for the storage admins. -- Faster clones and snapshots - -**Disadvantages** : -- Reduced performance when compared to thick volumes. -- Over-allocation of the space. This can be mitigated by better monitoring of the disk usage. - -LVM does this by allocating the blocks in a thin LV from a special "thin pool LV". A thin pool LV must be created before thin LVs can be created within it. - -The LVMO will create a thin-pool LV in the volume group in order to create thinly provisioned volumes. - - -## Proposal: -- The `deviceClass` API in the `LVMClusterSpec` will contain the mapping between a device-class and a thin-pool in volume group. -- One device-class will be mapped to a single thin pool. -- User should be able to configure the thin-pool size based on percentage of the available volume group size. -- Default chunk size of the thin pool will be 128 kiB -- `lvmd.yaml` config file should be updated with the device class, volume group and thin-pool mapping. -- Alerts should be triggered if the thin-pool `data` or `metadata` usage crosses a predefined threshold limit. - - -## Design Details -### API changes: - -- `LVMClusterSpec.Storage.DeviceClass.ThinPoolConfig` will have the mapping between device class, volume group and the thin-pool. -- One DeviceClass can be mapped to only one thin-pool. - -- `LVMCluster` API changes -```go= -+ type ThinPoolConfig struct{ -+ // Name of the thin pool to be created -+ // +kubebuilder:validation:Required -+ // +required -+ Name string `json:"name,omitempty"` - -+ // SizePercent represents percentage of remaining space in the volume group that should be used -+ // for creating the thin pool. -+ // +kubebuilder:validation:default=90 -+ // +kubebuilder:validation:Minimum=10 -+ // +kubebuilder:validation:Maximum=90 -+ SizePercent int `json:"sizePercent,omitempty"` - -+ // OverProvisionRatio represents the ratio of overprovision that can -+ // be allowed on thin pools -+ // +kubebuilder:validation:Minimum=2 -+ OverprovisionRatio int `json:"overprovisionRatio,omitempty"` -} - -type DeviceClass struct { - Name string `json:"name,omitempty"` - - DeviceSelector *DeviceSelector `json:"deviceSelector,omitempty"` - NodeSelector *corev1.NodeSelector `json:"nodeSelector,omitempty"` - -+ // ThinPoolConfig contains configurations for the thin-pool -+ // +kubebuilder:validation:Required -+ // +required -+ ThinPoolConfig *ThinPoolConfig `json:"thinPoolConfig,omitempty"` -} -``` - - -- Following new fields will added to `DeviceClass` API - - **ThinPoolConfig** API contains information related to a thin pool.These configuration options are: - - **Name**: Name of the thin-pool - - **SizePercent**: Size of the thin pool to be created with respect to available free space in the volume group. It represents percentage value and not absolute size values. Size value should range between 10-90. 
It defaults to 90 if no value is provided. - - **OverprovisionRatio**: The factor by which additional storage can be provisioned compared to the available storage in the thin pool. - -- `LVMVolumeGroup` API changes: - -``` go= -type LVMVolumeGroupSpec struct { - // DeviceSelector is a set of rules that should match for a device to be - // included in this VolumeGroup - // +optional - DeviceSelector *DeviceSelector `json:"deviceSelector,omitempty"` - - // NodeSelector chooses nodes - // +optional - NodeSelector *corev1.NodeSelector `json:"nodeSelector,omitempty"` - -+ // ThinPoolConfig contains configurations for the thin-pool -+ // +kubebuilder:validation:Required -+ // +required -+ ThinPoolConfig *ThinPoolConfig `json:"thinPoolConfig,omitempty"` -} -``` - -### VolumeGroup Manager -- Volume Group Manager is responsible for creating the thin-pools after creating the volume group. -- Command used for creating the thin-pool: - ``` - lvcreate -L %FREE -c -T / - ``` - where: - - Size is `LVMClusterSpec.Storage.DeviceClass.ThinPoolConfig.SizePercent` - - chunk size is 128KiB, which is the default. - -- VG manager will also update the `lvmd.yaml` file to map volume group and its thin-pool to the topolvm device class. -- Sample `lvmd.yaml` config file -``` yaml= -device-classes: - - name: ssd-thin - volume-group: myvg1 - spare-gb: 10 - type: thin - thin-pool-config: - name: pool0 - overprovision-ratio: 5.0 -``` - -### Monitoring and Alerts -- Available thin-pool size (both data and metadata) should be provided by topolvm as prometheus metrics. -- Threshold limits for the thin-pool should be provide as static values in the PrometheusRule. -- If the data or metadata usage for a particular thin-pool crosses a threshold, appropriate alerts should be triggered. - - -### Open questions -- What should be the chunk size of the thin-pools? - - Use default size a 128 kiB for now. diff --git a/docs/design/thin-provisioning.md b/docs/design/thin-provisioning.md new file mode 100644 index 000000000..198b07272 --- /dev/null +++ b/docs/design/thin-provisioning.md @@ -0,0 +1,121 @@ +# Thin Provisioning + +## Summary +- LVM thin provisioning allows the creation of volumes whose combined virtual size is greater than that of the available storage. + +**Advantages**: +- Storage space can be used more effectively. More users can be accommodated for the same amount of storage space when compared to thick provisioning. This significantly reduces upfront hardware cost for the storage admins. +- Faster clones and snapshots + +**Disadvantages** : +- Reduced performance when compared to thick volumes. +- Over-allocation of the space. This can be mitigated by better monitoring of the disk usage. + +LVM does this by allocating the blocks in a thin LV from a special "thin pool LV". A thin pool LV must be created before thin LVs can be created within it. + +The LVMS will create a thin pool LV in the Volume Group in order to create thinly provisioned volumes. + +## Design Details + +- The `deviceClass` API in the `LVMClusterSpec` contains the mapping between a device-class and a thin-pool in volume group. +- One device-class is mapped to a single thin pool. +- Users can configure the thin pool size based on percentage of the available Volume Group size. +- Default chunk size of the thin pool is 128 kiB. +- `lvmd.yaml` config file is updated with the device class, volume group and thin-pool mapping. +- Alerts are triggered if the thin-pool `data` or `metadata` usage crosses a predefined threshold limit. 
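To make the sizing rules above concrete, the fragment below works through an assumed example: a device class backed by a 100 GiB volume group with `sizePercent: 90` and an overprovision ratio of 10. The numbers are arbitrary; only the field names come from the API described in this document.

```yaml
# Worked example (assumed numbers) of the thin-pool sizing described above.
# Fragment of an LVMCluster spec.storage section:
deviceClasses:
  - name: vg1
    thinPoolConfig:
      name: thin-pool-1
      sizePercent: 90          # 90% of the ~100 GiB volume group -> ~90 GiB thin pool
      overprovisionRatio: 10   # roughly 10 x 90 GiB = 900 GiB of thin volumes may be
                               # provisioned before capacity tracking treats the pool as full
```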
+
+### API
+
+- `LVMClusterSpec.Storage.DeviceClass.ThinPoolConfig` has the mapping between device class, volume group and the thin-pool.
+- One DeviceClass can be mapped to only one thin-pool.
+
+- `LVMCluster` API changes:
+  ```go
+  + type ThinPoolConfig struct{
+  + // Name of the thin pool to be created
+  + // +kubebuilder:validation:Required
+  + // +required
+  + Name string `json:"name,omitempty"`
+
+  + // SizePercent represents percentage of remaining space in the volume group that should be used
+  + // for creating the thin pool.
+  + // +kubebuilder:validation:default=90
+  + // +kubebuilder:validation:Minimum=10
+  + // +kubebuilder:validation:Maximum=90
+  + SizePercent int `json:"sizePercent,omitempty"`
+
+  + // OverProvisionRatio represents the ratio of overprovision that can
+  + // be allowed on thin pools
+  + // +kubebuilder:validation:Minimum=2
+  + OverprovisionRatio int `json:"overprovisionRatio,omitempty"`
+  }
+
+  type DeviceClass struct {
+    Name string `json:"name,omitempty"`
+
+    DeviceSelector *DeviceSelector `json:"deviceSelector,omitempty"`
+    NodeSelector *corev1.NodeSelector `json:"nodeSelector,omitempty"`
+
+  + // ThinPoolConfig contains configurations for the thin-pool
+  + // +kubebuilder:validation:Required
+  + // +required
+  + ThinPoolConfig *ThinPoolConfig `json:"thinPoolConfig,omitempty"`
+  }
+  ```
+
+- The following new fields are added to the `DeviceClass` API:
+  - **ThinPoolConfig** API contains information related to a thin pool. These configuration options are:
+    - **Name**: Name of the thin-pool
+    - **SizePercent**: Size of the thin pool to be created with respect to available free space in the volume group. It represents a percentage value, not an absolute size. The value should range between 10 and 90. It defaults to 90 if no value is provided.
+    - **OverprovisionRatio**: The factor by which additional storage can be provisioned compared to the available storage in the thin pool.
+
+- `LVMVolumeGroup` API changes:
+
+  ```go
+  type LVMVolumeGroupSpec struct {
+    // DeviceSelector is a set of rules that should match for a device to be
+    // included in this VolumeGroup
+    // +optional
+    DeviceSelector *DeviceSelector `json:"deviceSelector,omitempty"`
+
+    // NodeSelector chooses nodes
+    // +optional
+    NodeSelector *corev1.NodeSelector `json:"nodeSelector,omitempty"`
+
+  + // ThinPoolConfig contains configurations for the thin-pool
+  + // +kubebuilder:validation:Required
+  + // +required
+  + ThinPoolConfig *ThinPoolConfig `json:"thinPoolConfig,omitempty"`
+  }
+  ```
+
+### Volume Group Manager
+- [Volume Group Manager](vg-manager.md) is responsible for creating the thin pools after creating the volume group.
+- Command used for creating a thin pool:
+
+  ```bash
+  lvcreate -l <size>%FREE -c <chunk size> -T <vg name>/<thinpool name>
+  ```
+
+  where:
+  - Size is `LVMClusterSpec.Storage.DeviceClass.ThinPoolConfig.SizePercent`
+  - chunk size is 128KiB, which is the default.
+
+- VG manager also updates the `lvmd.yaml` file to map Volume Group and its thin pool to the TopoLVM device class.
+- Sample `lvmd.yaml` config file
+
+```yaml
+device-classes:
+  - name: ssd-thin
+    volume-group: myvg1
+    spare-gb: 10
+    type: thin
+    thin-pool-config:
+      name: pool0
+      overprovision-ratio: 5.0
+```
+
+### Monitoring and Alerts
+- Available thin pool size (both data and metadata) is provided by TopoLVM as Prometheus metrics.
+- Threshold limits for the thin pool are provided as static values in the PrometheusRule.
+- If the data or metadata usage for a particular thin-pool crosses a threshold, appropriate alerts are triggered.
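The alerting behaviour above can be pictured with a PrometheusRule of roughly the following shape. The metric name, threshold, and labels are placeholders for illustration; the actual rule shipped with LVMS uses the metrics exported by TopoLVM and its own threshold values.

```yaml
# Illustrative PrometheusRule sketch; the metric name and threshold are assumed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: thin-pool-alerts-example          # hypothetical name
  namespace: openshift-storage
spec:
  groups:
    - name: thin-pool.rules
      rules:
        - alert: ThinPoolDataUsageHigh
          # placeholder metric for the data usage percentage of a thin pool
          expr: topolvm_thinpool_data_percent > 75
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Thin pool data usage has crossed the warning threshold.
```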
diff --git a/docs/design/topolvm-csi.md b/docs/design/topolvm-csi.md deleted file mode 100644 index cbc590611..000000000 --- a/docs/design/topolvm-csi.md +++ /dev/null @@ -1,35 +0,0 @@ -# TopoLVM CSI - -- LVM Operator deploys the TopoLVM CSI plugin which provides dynamic provisioning of - local storage. -- Please refer to TopoLVM [docs][topolvm-docs] for more details on topolvm - -## CSI Driver - -- *csiDriver* reconcile unit creates the Topolvm CSIDriver resource - -## TopoLVM Controller - -- *topolvmController* reconcile unit deploys a single TopoLVM Controller plugin - deployment and manages any updates to the deployment -- The TopoLVM scheduler is not used for pod scheduling. The CSI StorageCapacity - tracking feature is used by the scheduler to determine the node on which - to provision storage. -- An init container generates openssl certs to be used in topolvm-controller - which will be soon replaced with cert-manager - -## Topolvm Node and LVMd - -- *topolvmNode* reconcile unit deploys and manages the TopoLVM node plugin and lvmd - daemonset and scales it based on the node selector specified in the devicesClasses - in LVMCluster -- An init container polls for the availability of lvmd config file before - starting the lvmd and topolvm-node containers - -## Deletion - -- All the resources above will be removed by their respective reconcile units when - LVMCluster CR governing then is deleted - - -[topolvm-docs]: https://github.com/topolvm/topolvm/tree/main/docs diff --git a/docs/design/vg-manager.md b/docs/design/vg-manager.md index 6616311cc..e68a4aaf3 100644 --- a/docs/design/vg-manager.md +++ b/docs/design/vg-manager.md @@ -1,36 +1,13 @@ -# Volume Group Manager +# The Volume Group Manager -## Creation - -- `vg-manager` daemonset pods are created by the LVMCluster controller on LVMCluster CR creation -- They run on all nodes which match the Node Selector specified in - the CR. They run on all schedulable nodes if no nodeSelector is specified. -- A controller owner reference is set on the daemonset so it is cleaned up - when the LVMCluster CR is deleted. - -## Reconciliation - -- The vg-manager daemonset consists of individual controller pods, each of - which handles the on node operations for the node it is running on. -- The vg-manager controller reconciles LVMVolumeGroup CRs which are created - by the LVMO. -- The vg-manager will determine the disks that match the filters - specified (currently not implemented) on the node it is running on and create - an LVM VG with them. It then creates the lvmd.yaml config file for lvmd. -- vg-manager also updates LVMVolumeGroupStatus with observed status of volume - groups for the node on which it is running +The Volume Group Manager manages a single controller/reconciler, which runs as `vg-manager` daemon set pods on a cluster. They are responsible for performing on-node operations for the node they are running on. They first identify disks that match the filters specified for the node. Next, they watch for the LVMVolumeGroup resource and create the necessary volume groups and thin pools on the node based on the specified deviceSelector and nodeSelector. Once the volume groups are created, vg-manager generates the `lvmd.yaml` configuration file for lvmd to use. Additionally, vg-manager updates the LVMVolumeGroupStatus with the observed status of the volume groups on the node where it is running. 
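For reference, an LVMVolumeGroup instance of the shape the vg-manager reconciles might look like the example below; the device path and node name are placeholders, and the field names follow the `LVMVolumeGroupSpec` excerpt in [thin-provisioning.md](thin-provisioning.md). The observed per-node results are then reported back through the LVMVolumeGroupNodeStatus CR.

```yaml
# Hypothetical LVMVolumeGroup instance; device path and node name are placeholders.
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMVolumeGroup
metadata:
  name: vg1
  namespace: openshift-storage
spec:
  deviceSelector:
    paths:
      - /dev/nvme1n1                     # placeholder device path
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - worker-0                 # placeholder node name
  thinPoolConfig:
    name: thin-pool-1
    sizePercent: 90
    overprovisionRatio: 10
```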
## Deletion -- `vg-manager` daemonset is garbage collected when LVMCluster CR is deleted +A controller owner reference is set on the daemon set, so it is cleaned up when the LVMCluster CR is deleted. ## Considerations -- Storing lvmd config file on host seemed to be better when compared against - below options: - 1. Single configmap: Storing all the lvmd config file contents across nodes - into a single configmap involves extra processing to segment the config - according to the node and save that before being consumed by lvmd - 2. Multiple configmaps: Although this is doable having multiple configmaps - limits topolvm nodeplugin not to be deployed as a daemonset since configmap - should be unique for a daemonset +Storing the lvmd config file on the host provides a superior solution when compared to other options: +- Single config map: The process of storing the configuration file contents of lvmd across multiple nodes in a single config map requires additional processing to segment the configuration based on each individual node and store it accordingly before it can be consumed by lvmd. +- Multiple config maps: Although technically possible, using multiple config maps to store lvmd config file contents across nodes would limit the deployment of TopoLVM node plugin as a daemon set. This is because each daemon set requires a unique config map.
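As a sketch of the host-file approach favoured above, the vg-manager daemon set can expose the node-local lvmd configuration to the TopoLVM containers with a plain hostPath mount, with no per-node templating required. The file path and container name below are assumptions for illustration, not necessarily what LVMS uses.

```yaml
# Illustrative pod spec fragment: consuming a node-local lvmd config via hostPath.
# The path /etc/topolvm/lvmd.yaml is an assumed example location.
volumes:
  - name: lvmd-config
    hostPath:
      path: /etc/topolvm/lvmd.yaml       # file written on each node by vg-manager (assumed path)
      type: File
containers:
  - name: lvmd                           # placeholder container name
    volumeMounts:
      - name: lvmd-config
        mountPath: /etc/topolvm/lvmd.yaml
        readOnly: true
```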