
Implementation Process for Scaling nodes' disks - increase/replace disks. #112

Closed
prudhvigodithi opened this issue Apr 20, 2022 · 10 comments

@prudhvigodithi
Member

Hey, I just created this issue as a thought on how we can achieve this disk reconciler task, i.e. increasing the disk size.

Solution [1]:

  • Blue/Green model: add new nodes with the new disk size, adjust the OpenSearch cluster, and gracefully remove the old nodes and their underlying PVs.
  • The cluster should remain functional with this approach.
  • On failure, remove the newly added nodes and re-adjust the OpenSearch cluster.

Solution [2]:

  • Work with the existing volumes: just increase the PVCs' size, let Kubernetes and the underlying cloud provider take care of the backend volume adjustment, and restart the OpenSearch cluster one pod at a time (see the sketch below).
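
A minimal sketch of what Solution [2] could look like, assuming an example StatefulSet named my-first-cluster-masters with 3 replicas and a storage class that supports online expansion (all names and sizes are illustrative):

# patch each PVC to the new size; the CSI driver / cloud provider expands the backing volume
for i in 0 1 2; do
  kubectl patch pvc data-my-first-cluster-masters-$i \
    -p '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'
done

# restart the pods one at a time so the filesystem resize is picked up
kubectl rollout restart statefulset my-first-cluster-masters
kubectl rollout status statefulset my-first-cluster-masters
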
@segalziv
Contributor

As the operator is cluster-agnostic, I think we should take the first approach of starting with a Blue/Green deployment.

The second solution of increasing the existing volume is an optimization that is disk- and cloud-specific; we can add those optimizations in the future as needed.

@swoehrl-mw
Collaborator

I agree. The blue/green model should be the base/fallback approach. PVC resizing can be, as @segalziv stated, an optimization if it is available, maybe in a way that the cluster-admin can activate/deactivate the feature when installing the operator (under the assumption that the cluster admin knows best if PVC resizing works in the kubernetes cluster).
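
Purely as an illustration of such a switch (the chart values key below is hypothetical, not an existing operator option), the cluster-admin could opt in at install time roughly like this:

# hypothetical example only: the pvcResizeEnabled key is made up for illustration
helm install opensearch-operator opensearch-operator/opensearch-operator \
  --set manager.pvcResizeEnabled=true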

@prudhvigodithi
Member Author

prudhvigodithi commented Apr 24, 2022

I have followed the steps below and was able to scale the disk with no downtime (not tied to any specific underlying cloud provider):

kubectl get sts --output yaml > teststs.yaml

# leave the running pods in place and delete only the statefulset object
kubectl delete sts my-first-cluster-masters --cascade=false

## target the 1st pod to scale the disk
kubectl delete pods my-first-cluster-masters-0

## use pod my-first-cluster-masters-0 as the 1st pod to get the new disk size
kubectl patch pvc data-my-first-cluster-masters-0 -p '{ "spec": { "resources": { "requests": { "storage": "40Gi" }}}}'

## re-apply the statefulset with the new desired volume size
kubectl apply -f teststs.yaml

## leave the 1st pod my-first-cluster-masters-0, which now has the new disk size, running
kubectl scale sts my-first-cluster-masters --replicas=1

## patch the remaining volumes
kubectl patch pvc data-my-first-cluster-masters-1 -p '{ "spec": { "resources": { "requests": { "storage": "40Gi" }}}}'
kubectl patch pvc data-my-first-cluster-masters-2 -p '{ "spec": { "resources": { "requests": { "storage": "40Gi" }}}}'

## scale back to the existing number of pods
kubectl scale sts my-first-cluster-masters --replicas=3

## validation
kubectl exec -it my-first-cluster-masters-0 -- df -h
Defaulted container "opensearch" out of: opensearch, init (init)
Filesystem      Size  Used Avail Use% Mounted on
overlay          80G   14G   67G  17% /
tmpfs            64M     0   64M   0% /dev
tmpfs           3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/nvme0n1p1   80G   14G   67G  17% /etc/hosts
shm              64M     0   64M   0% /dev/shm
/dev/nvme1n1     40G   53M   40G   1% /usr/share/opensearch/data
tmpfs           3.8G   12K  3.8G   1% /usr/share/opensearch/config/tls-http
tmpfs           3.8G   12K  3.8G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           3.8G   36K  3.8G   1% /usr/share/opensearch/config/tls-transport
tmpfs           3.8G     0  3.8G   0% /proc/acpi
tmpfs           3.8G     0  3.8G   0% /sys/firmware

This is not strictly a blue/green approach, but with it we keep a functional cluster while the disk scaling happens in the backend (essentially zero downtime).
I feel this approach is cleaner and is done the Kubernetes way.
The above works assuming the underlying storage class has allowVolumeExpansion: true. The user is expected to set this before considering disk scaling; otherwise we can add logic in the operator to pre-check and stop the scaling if it is not set (see the check below).
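
For reference, the check itself is a single field on the storage class; a minimal sketch (gp2 is just an example storage class name):

# only proceed with disk scaling if the storage class allows expansion
kubectl get storageclass gp2 -o jsonpath='{.allowVolumeExpansion}'
# expected output: true
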
Kubernetes enhancement issues to watch that are related to StatefulSet volume expansion:
kubernetes/enhancements#661
kubernetes/enhancements#2842

So to summarize: going with blue/green works, but it adds overhead on our end when there is a straightforward process on the Kubernetes side. We might also get this feature out of the box in future Kubernetes versions, in which case we would need to revamp the blue/green strategy code to migrate to the supported volume expansion process.
Open for further discussion.

@segalziv @swoehrl-mw @dbason @idanl21

@swoehrl-mw
Collaborator

@prudhvigodithi Unless I'm missing something, what you are doing is only using the resize feature for PVCs, which is supported only for select storage classes and clusters. So it will not work in all situations. For supported clusters using that is good, but we should still provide a solution for situations where automatic resize is not available.
From my perspective the only question is whether we start the implementation with the PVC resize and just document its limited support, or start with the generic blue/green node replacement implementation and later add PVC resize as an optional optimization for supported clusters.

@prudhvigodithi
Member Author

Hey @swoehrl-mw, you have a good point, but PVC resize is supported by multiple storage classes, and it is the Kubernetes way of doing the resizing.
As per the doc https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/:
Kubernetes v1.11 ships with volume expansion support for the following in-tree volume plugins: AWS-EBS, GCE-PD, Azure Disk, Azure File, Glusterfs, Cinder, Portworx, and Ceph RBD
This way we don't need to resort to a custom solution, and we will be doing the resizing the Kubernetes way.
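
If it helps, expansion support can be checked across all storage classes in a cluster with something like:

# list storage classes and whether they allow volume expansion
kubectl get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,EXPANDABLE:.allowVolumeExpansion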

@prudhvigodithi
Member Author

Also, just to add @swoehrl-mw: since PVC resizing is the standard way to increase disks, I would suggest we add this to the operator, and based on feedback or user issues we can go with the blue/green approach.

Also, I feel that for this blue/green approach we might need to engineer a complex solution just to add disk size: a statefulset does not directly allow editing its volumes, so we would need to create a new statefulset with the new size and have a mechanism to copy the existing data over, and depending on the amount of data this can take a lot of time.

Another issue I see is that TLS cluster formation happens with certs created from the statefulset name.
Example:
/usr/share/opensearch/config/tls-transport/my-first-cluster-masters-test-0.crt
So for blue/green we cannot directly add new statefulset pods with the new storage size to the existing cluster. Some additional bootstrapping logic would be required to create new certs for the new statefulset. I don't see this setup fitting together cleanly, and it looks to me like adding more complex logic :)
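
For example, the transport certs mounted into a pod today can be inspected like this (pod name taken from the earlier example), showing they are tied to the statefulset/pod names:

# the mounted certs carry the statefulset/pod name
kubectl exec -it my-first-cluster-masters-0 -- ls /usr/share/opensearch/config/tls-transport/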

Just to summarize: how about going with the Kubernetes-supported way, and based on feedback we can consider blue/green?
@idanl21 @segalziv @swoehrl-mw

@kfox1111

Some other options to leave the data in place:
Currently you can just tweak the backing PVs to have the right size; it will honor it.
If you scale up the statefulset later, you need to make sure you resize those new volumes later too.
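
Roughly, and also untested here, the first option (tweaking the backing PV directly) would amount to something like the following; the PV name is a placeholder:

# find the PV bound to the PVC, then patch its recorded capacity
kubectl get pvc data-my-first-cluster-masters-0 -o jsonpath='{.spec.volumeName}'
kubectl patch pv <pv-name> -p '{"spec":{"capacity":{"storage":"40Gi"}}}'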

Second, I have not tested this, but you can delete a statefulset (even without deleting the pods, with the right flag I think) and the PVCs should stay. You should then be able to scale up the PVCs and put the statefulset back with the updated value too. That should let future PVC creation from scale-up get the new value.

@prudhvigodithi
Member Author

prudhvigodithi commented Apr 27, 2022

Second, I have not tested this, but you can delete a statefulset (even without deleting the pods, with the right flag I think) and the PVCs should stay. You should then be able to scale up the PVCs and put the statefulset back with the updated value too. That should let future PVC creation from scale-up get the new value.

Exactly @kfox1111, that's true, this is what the steps above do; the flag is --cascade=false :)
#112 (comment)

@swoehrl-mw
Collaborator

@prudhvigodithi I agree. Let's go with pvc resize first to cover the majority of clusters with (hopefully) little work on the operator part. And based on feedback we can think about solutions for the rest later on.

@prudhvigodithi
Member Author

Hey, closing this issue as we have the initial PR #117 merged. Will reopen if we have to go with a new implementation process moving forward.
