
Implementation Process for Scaling nodes' disks - increase/replace disks. #112

Closed
prudhvigodithi opened this issue Apr 20, 2022 · 10 comments

@prudhvigodithi
Member

Hey, I just created this issue as a thought on how we can achieve this disk reconciler task, i.e. increasing the disk size.

Solution [1]:

  • Blue/Green model: add new nodes with the new disk size, adjust the OpenSearch cluster, and gracefully remove the old nodes and their underlying PVs.
  • The cluster should remain functional with this approach.
  • On failure, remove the newly added nodes and re-adjust the OpenSearch cluster.

Solution [2]:

  • Work with the existing volumes: just increase the PVCs' size, let Kubernetes and the underlying cloud provider take care of the backend volume adjustment, and restart the OpenSearch cluster one pod at a time (see the sketch below).
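
A minimal sketch of what Solution [2] could look like, assuming an example StatefulSet named my-first-cluster-masters with 3 replicas and a storage class that supports online expansion (all names and sizes are illustrative):

# patch each PVC to the new size; the CSI driver / cloud provider expands the backing volume
for i in 0 1 2; do
  kubectl patch pvc data-my-first-cluster-masters-$i \
    -p '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'
done

# restart the pods one at a time so the filesystem resize is picked up
kubectl rollout restart statefulset my-first-cluster-masters
kubectl rollout status statefulset my-first-cluster-masters
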
@segalziv
Contributor

As the operator is cluster-agnostic, I think we should take the first approach of starting with a Blue/Green deployment.

The second solution of increasing the existing volume is an optimization that is disk- and cloud-specific; we can add those optimizations in the future as needed.

@swoehrl-mw
Collaborator

I agree. The blue/green model should be the base/fallback approach. PVC resizing can be, as @segalziv stated, an optimization if it is available, maybe in a way that the cluster-admin can activate/deactivate the feature when installing the operator (under the assumption that the cluster admin knows best if PVC resizing works in the kubernetes cluster).
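
Purely as an illustration of such a switch (the chart values key below is hypothetical, not an existing operator option), the cluster-admin could opt in at install time roughly like this:

# hypothetical example only: the pvcResizeEnabled key is made up for illustration
helm install opensearch-operator opensearch-operator/opensearch-operator \
  --set manager.pvcResizeEnabled=true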

@prudhvigodithi
Member Author

prudhvigodithi commented Apr 24, 2022

I have followed the steps below and was able to scale the disk with no downtime (not tied to any specific underlying cloud provider):

kubectl get sts --output yaml > teststs.yaml

# leave the running pods in place and delete only the statefulset object
kubectl delete sts my-first-cluster-masters --cascade=false

## target the 1st pod to scale the disk
kubectl delete pods my-first-cluster-masters-0

## use pod my-first-cluster-masters-0 as the 1st pod to get the new disk size
kubectl patch pvc data-my-first-cluster-masters-0 -p '{ "spec": { "resources": { "requests": { "storage": "40Gi" }}}}'

## re-apply the statefulset with the new desired volume size
kubectl apply -f teststs.yaml

## leave the 1st pod my-first-cluster-masters-0, which now has the new disk size, running
kubectl scale sts my-first-cluster-masters --replicas=1

## patch the remaining volumes
kubectl patch pvc data-my-first-cluster-masters-1 -p '{ "spec": { "resources": { "requests": { "storage": "40Gi" }}}}'
kubectl patch pvc data-my-first-cluster-masters-2 -p '{ "spec": { "resources": { "requests": { "storage": "40Gi" }}}}'

## scale back to the existing number of pods
kubectl scale sts my-first-cluster-masters --replicas=3

## validation
kubectl exec -it my-first-cluster-masters-0 -- df -h
Defaulted container "opensearch" out of: opensearch, init (init)
Filesystem      Size  Used Avail Use% Mounted on
overlay          80G   14G   67G  17% /
tmpfs            64M     0   64M   0% /dev
tmpfs           3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/nvme0n1p1   80G   14G   67G  17% /etc/hosts
shm              64M     0   64M   0% /dev/shm
/dev/nvme1n1     40G   53M   40G   1% /usr/share/opensearch/data
tmpfs           3.8G   12K  3.8G   1% /usr/share/opensearch/config/tls-http
tmpfs           3.8G   12K  3.8G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           3.8G   36K  3.8G   1% /usr/share/opensearch/config/tls-transport
tmpfs           3.8G     0  3.8G   0% /proc/acpi
tmpfs           3.8G     0  3.8G   0% /sys/firmware

This is not strictly a blue/green approach, but with it we keep a functional cluster while the disk scaling happens in the backend (essentially zero downtime).
I feel this approach is cleaner and is done the Kubernetes way.
The above works assuming the underlying storage class has allowVolumeExpansion: true. The user is expected to set this before considering disk scaling; otherwise we can add logic in the operator to pre-check and stop the scaling if it is not set (see the check below).
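
For reference, the check itself is a single field on the storage class; a minimal sketch (gp2 is just an example storage class name):

# only proceed with disk scaling if the storage class allows expansion
kubectl get storageclass gp2 -o jsonpath='{.allowVolumeExpansion}'
# expected output: true
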
Kubernetes enhancement issues to watch that are related to StatefulSet volume expansion:
kubernetes/enhancements#661
kubernetes/enhancements#2842

So to summarize: going with blue/green works, but it adds overhead on our end when there is a straightforward process on the Kubernetes side. We might also get this feature out of the box in future Kubernetes versions, in which case we would need to revamp the blue/green strategy code to migrate to the supported volume expansion process.
Open for further discussion.

@segalziv @swoehrl-mw @dbason @idanl21

@swoehrl-mw
Collaborator

@prudhvigodithi Unless I'm missing something, what you are doing is only using the resize feature for PVCs, which is supported only for select storage classes and clusters. So it will not work in all situations. For supported clusters using that is good, but we should still provide a solution for situations where automatic resize is not available.
From my perspective the only question is whether we start the implementation with the PVC resize and just document its limited support, or start with the generic blue/green node replacement implementation and later add PVC resize as an optional optimization for supported clusters.

@prudhvigodithi
Member Author

Hey @swoehrl-mw, you have a good point, but PVC resize is supported by multiple storage classes, and it is the Kubernetes way of doing the resizing.
As per the doc https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/:
Kubernetes v1.11 ships with volume expansion support for the following in-tree volume plugins: AWS-EBS, GCE-PD, Azure Disk, Azure File, Glusterfs, Cinder, Portworx, and Ceph RBD
This way we don't need to resort to a custom solution, and we will be doing the resizing the Kubernetes way.
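
If it helps, expansion support can be checked across all storage classes in a cluster with something like:

# list storage classes and whether they allow volume expansion
kubectl get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,EXPANDABLE:.allowVolumeExpansion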

@prudhvigodithi
Member Author

Also, just to add @swoehrl-mw: since PVC resizing is the standard way to increase disks, I would suggest we add this to the operator, and based on feedback or user issues we can go with the blue/green approach.

Also, I feel that for this blue/green approach we might need to engineer a complex solution just to add disk size: a statefulset does not directly allow editing its volumes, so we would need to create a new statefulset with the new size and have a mechanism to copy the existing data over, and depending on the amount of data this can take a lot of time.

Another issue I see is that TLS cluster formation happens with certs created from the statefulset name.
Example:
/usr/share/opensearch/config/tls-transport/my-first-cluster-masters-test-0.crt
So for blue/green we cannot directly add new statefulset pods with the new storage size to the existing cluster. Some additional bootstrapping logic would be required to create new certs for the new statefulset. I don't see this setup fitting together cleanly, and it looks to me like adding more complex logic :)
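
For example, the transport certs mounted into a pod today can be inspected like this (pod name taken from the earlier example), showing they are tied to the statefulset/pod names:

# the mounted certs carry the statefulset/pod name
kubectl exec -it my-first-cluster-masters-0 -- ls /usr/share/opensearch/config/tls-transport/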

Just to summarize: how about going with the Kubernetes-supported way, and based on feedback we can consider blue/green?
@idanl21 @segalziv @swoehrl-mw

@kfox1111

Some other options to leave the data in place:
Currently you can just tweak the backing PVs to have the right size; it will honor it.
If you scale up the statefulset later, you need to make sure you resize those new volumes later too.
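
Roughly, and also untested here, the first option (tweaking the backing PV directly) would amount to something like the following; the PV name is a placeholder:

# find the PV bound to the PVC, then patch its recorded capacity
kubectl get pvc data-my-first-cluster-masters-0 -o jsonpath='{.spec.volumeName}'
kubectl patch pv <pv-name> -p '{"spec":{"capacity":{"storage":"40Gi"}}}'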

Second, I have not tested this, but you can delete a statefulset (even without deleting the pods, with the right flag I think) and the PVCs should stay. You should then be able to scale up the PVCs and put the statefulset back with the updated value too. That should let future PVC creation from scale-up get the new value.

@prudhvigodithi
Member Author

prudhvigodithi commented Apr 27, 2022

Second, I have not tested this, but you can delete a statefulset (even without deleting the pods, with the right flag I think) and the PVCs should stay. You should then be able to scale up the PVCs and put the statefulset back with the updated value too. That should let future PVC creation from scale-up get the new value.

Exactly @kfox1111, that's true, this is what the steps above do; the flag is --cascade=false :)
#112 (comment)

@swoehrl-mw
Collaborator

@prudhvigodithi I agree. Let's go with pvc resize first to cover the majority of clusters with (hopefully) little work on the operator part. And based on feedback we can think about solutions for the rest later on.

@prudhvigodithi
Member Author

Hey, closing this issue as we have the initial PR #117 merged. Will reopen if we have to go with a new implementation process moving forward.
