*: implement PersistentVolume for etcd data design part 1 #1434
hack/test
Outdated
go test "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
# Run tests with PV support enabled
PV_TEST=true go test "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
To see how the tests work on your Jenkins, I just reran them with PV support enabled. Do you have a better idea for doing this without duplicating the code?
Create a separate set of tests.
For example, test/e2e/etcd_on_pv_test.go.
We will pool some PVs ahead of time to reduce time overhead and cost on the cloud.
Since this part doesn't change the logic but only places etcd data on a PV, I was thinking of rerunning all the tests with PV enabled. Using separate test files means a lot of duplicated code and maintenance burden. Perhaps we could just have different timeouts for the non-PV and PV cases?
@hongchaodeng gentle ICMP ECHO_REQUEST. I'm quite blocked trying to understand how you'd like to organize tests. See above comment.
I truly apologize... This message was buried under too many notifications...
Yes. That sounds right. But what I worry about is that this creates too many temporary persistent disks on our GCE, which is costly... For the initial stage, let's just create one test to make sure it works. We can think of ideas to pool those PVs later?
test/e2e/basic_test.go
Outdated
@@ -40,7 +40,7 @@ func TestCreateCluster(t *testing.T) {
}
}()
if _, err := e2eutil.WaitUntilSizeReached(t, f.CRClient, 3, 6, testEtcd); err != nil {
if _, err := e2eutil.WaitUntilSizeReached(t, f.CRClient, 3, 20, testEtcd); err != nil {
Since creating and mounting PVs takes some time I just blindly increased the retries. Better ideas?
Don't change any timeouts in the tests.
All test timeouts are delicately tuned: we understand why we allocate this much time and have expectations for how long each operation takes.
Let's create a separate set of tests for the etcd PV scenario.
Additionally, we will probably pool some PVs before the tests in the future. But we don't need to worry about this right now.
pkg/cluster/cluster.go
Outdated
// TODO: We set timeout to 60s here since PVC binding could take up to 60s for GCE/PD. See https://github.com/kubernetes/kubernetes/issues/40972 .
// Change the wait time once there are official p99 SLA.
err = retryutil.Retry(4*time.Second, 15, func() (bool, error) {
This part is copied from https://github.com/coreos/etcd-operator/blob/master/pkg/util/k8sutil/backup.go#L63. I think it should be unified, but this will need some more refactoring. One key difference is that etcd pods and (with this patch) PVCs are created (the API call) here, while for backups they are created inside pkg/util/k8sutil.
We should not wait for the PVC to be ready here.
It should reconcile until the PVC is ready before creating the pod.
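One way to read this suggestion, as a minimal sketch: instead of blocking in a retry loop, each reconcile pass takes at most one step and creates the pod only once the claim is bound. The types and names below (`claimPhase`, `member`, `reconcileMember`) are hypothetical stand-ins for illustration, not the operator's actual API.

```go
package main

import "fmt"

// claimPhase is a hypothetical stand-in for the PVC phase reported by Kubernetes.
type claimPhase string

const (
	claimPending claimPhase = "Pending"
	claimBound   claimPhase = "Bound"
)

// member is a hypothetical record of what currently exists for one etcd member.
type member struct {
	name       string
	claim      claimPhase // empty means no PVC has been created yet
	podCreated bool
}

// reconcileMember performs at most one step per pass and never blocks.
// It returns a short description of the action taken.
func reconcileMember(m *member) string {
	switch {
	case m.claim == "":
		// Assumption: a freshly created PVC starts out Pending.
		m.claim = claimPending
		return "created PVC"
	case m.claim != claimBound:
		return "PVC not bound yet, retry on next pass"
	case !m.podCreated:
		m.podCreated = true
		return "created pod"
	default:
		return "nothing to do"
	}
}

func main() {
	m := &member{name: "example-0000"}
	fmt.Println(reconcileMember(m)) // first pass creates the claim
	fmt.Println(reconcileMember(m)) // claim is still pending, so no pod yet
	m.claim = claimBound            // simulate the volume binder finishing
	fmt.Println(reconcileMember(m)) // now the pod can be created
}
```

With this shape the 60s wait disappears from the code path: a slow binding simply costs a few extra reconcile passes.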
pkg/cluster/cluster.go
Outdated
}
// TODO: We set timeout to 60s here since PVC binding could take up to 60s for GCE/PD. See https://github.com/kubernetes/kubernetes/issues/40972 .
// Change the wait time once there are official p99 SLA.
Should we wait for the PVC to be created, or should we just define it and then continue creating the pod? It will also work without waiting: the pod will be scheduled only when the PVC reaches the Bound state.
We should not wait for the PVC to be ready here.
It should reconcile until the PVC is ready before creating the pod.
"etcd_node": m.Name,
"etcd_cluster": clusterName,
"app": "etcd",
},
@hongchaodeng I remember that you said to just set the owner ref without using labels and I agree. I just noticed that with the current gc logic this could lead to a lot of logged warnings (https://github.com/coreos/etcd-operator/pull/1434/files#diff-40a34818cc02bcfb053e6102e57f3177R173 ) since every pvc in the namespace is inspected. So I added the same labels used for the other resources just to do an initial filtering.
Not sure what the context is.
Re. labels, using:
"etcd_cluster": clusterName,
"app": "etcd",
is good enough.
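As a rough illustration of the initial filtering discussed here, the two labels can be turned into an equality-based label selector string so that only the cluster's own PVCs are listed. The helper `selectorFor` and the cluster name `example` are hypothetical; the sketch only builds the string and makes no API call.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// selectorFor builds an equality-based label selector ("k1=v1,k2=v2")
// from a label map. Keys are sorted so the output is deterministic.
func selectorFor(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	// The two labels suggested in the review; "example" is a made-up cluster name.
	labels := map[string]string{
		"etcd_cluster": "example",
		"app":          "etcd",
	}
	fmt.Println(selectorFor(labels))
}
```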
pkg/apis/etcd/v1beta2/cluster.go
Outdated
@@ -27,6 +27,8 @@ import (
const (
defaultBaseImage = "quay.io/coreos/etcd"
defaultVersion = "3.1.8"
minPVSize = 512 // 512MiB
minPVSizeInMB
minPodPV...
pkg/apis/etcd/v1beta2/cluster.go
Outdated
@@ -206,6 +207,11 @@ func (c *ClusterSpec) Validate() error {
return errors.New("spec: pod labels contains reserved label")
}
}
if c.Pod.PV != nil {
if c.Pod.PV.VolumeSizeInMB < minPVSize {
Maybe we just set it back to the minimum instead of returning an error? I forget how the k8s resource limit works for the similar case, though. We probably want consistency here.
Having a default sounds like a better option to prevent user mistakes.
Additionally, we should also write docs recommending a LimitRange to require minimum storage:
https://kubernetes.io/docs/tasks/administer-cluster/limit-storage-consumption/#limitrange-to-limit-requests-for-storage
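A minimal sketch of the defaulting approach, assuming a constant named `minPodPVSizeInMB` and a trimmed-down `pvSource` type (both hypothetical; only the `VolumeSizeInMB` field comes from the diff): a request below the floor is raised to the minimum rather than rejected by validation.

```go
package main

import "fmt"

// minPodPVSizeInMB is a hypothetical name for the 512 MiB floor from the diff.
const minPodPVSizeInMB = 512

// pvSource mirrors only the field discussed in the review; it is not the real type.
type pvSource struct {
	VolumeSizeInMB int
}

// applyDefaults raises a too-small request to the minimum instead of failing.
// It is safe to call on a nil receiver, which models "no PV configured".
func (p *pvSource) applyDefaults() {
	if p != nil && p.VolumeSizeInMB < minPodPVSizeInMB {
		p.VolumeSizeInMB = minPodPVSizeInMB
	}
}

func main() {
	small := &pvSource{VolumeSizeInMB: 100}
	small.applyDefaults()
	fmt.Println(small.VolumeSizeInMB) // raised to the minimum

	large := &pvSource{VolumeSizeInMB: 2048}
	large.applyDefaults()
	fmt.Println(large.VolumeSizeInMB) // left untouched
}
```

The trade-off is that the spec silently changes, so the effective size should be visible to the user somewhere; that is a design choice the thread leaves open.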
pkg/cluster/cluster.go
Outdated
@@ -404,7 +411,12 @@ func (c *Cluster) startSeedMember(recoverFromBackup bool) error {
SecureClient: c.isSecureClient(),
}
ms := etcdutil.NewMemberSet(m)
if err := c.createPod(ms, m, "new", recoverFromBackup); err != nil {
if c.UsePodPV() {
ShouldUsePodPV or IsPVEnabled?
IsPodPVEnabled ?
pkg/cluster/cluster.go
Outdated
return fmt.Errorf("failed to create persistent volume claim for seed member (%s): %v", m.Name, err)
}
}
if err := c.createPod(ms, m, "new", recoverFromBackup, c.UsePodPV()); err != nil {
No need to pass in usePodPV here. The receiver is still on c.
pkg/garbagecollection/gc.go
Outdated
@@ -158,3 +161,28 @@ func (gc *GC) collectDeployment(option metav1.ListOptions, runningSet map[types.
return nil
}
func (gc *GC) collectPVCs(option metav1.ListOptions, runningSet map[types.UID]bool) error {
When this lands, we will probably have already moved to k8s 1.8, which supports CRD GC natively, so we will not need this anymore.
/cc @hongchaodeng
+1
We will rm -rf the entire custom GC code :)
done
pkg/debug/debug_logger.go
Outdated
@@ -75,6 +75,10 @@ func (dl *DebugLogger) LogPodDeletion(podName string) {
dl.fileLogger.Infof("deleted pod (%s)", podName)
}
func (dl *DebugLogger) LogPVCDeletion(pvcName string) {
Let's not add this logging.
The debugLogger was built for self-hosted hack..
pkg/util/k8sutil/k8sutil.go
Outdated
return pvc
}
func NewEtcdPod(m *etcdutil.Member, initialCluster []string, clusterName, state, token string, cs api.ClusterSpec, usePVC bool, owner metav1.OwnerReference) *v1.Pod {
Don't add a new param.
Instead, add a new func AddPVCIntoPod() to compose the spec.
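The composition idea could look roughly like the sketch below: the pod constructor stays unaware of PVs, and a separate step swaps the data volume's source afterwards. The `pod` and `volume` types, the volume name `etcd-data`, and the lower-case helper names are simplified, hypothetical stand-ins for the real Kubernetes structs.

```go
package main

import "fmt"

// volume is a simplified stand-in for a pod volume.
// An empty claimName models an emptyDir-backed volume.
type volume struct {
	name      string
	claimName string
}

// pod is a simplified stand-in for a pod spec.
type pod struct {
	name    string
	volumes []volume
}

// dataVolumeName is an assumed name for the etcd data volume.
const dataVolumeName = "etcd-data"

// newPod builds the pod without any knowledge of persistent volumes.
func newPod(name string) *pod {
	return &pod{name: name, volumes: []volume{{name: dataVolumeName}}}
}

// addPVCIntoPod points the data volume at the given claim,
// adding the volume if the pod does not have one yet.
func addPVCIntoPod(p *pod, claim string) {
	for i := range p.volumes {
		if p.volumes[i].name == dataVolumeName {
			p.volumes[i].claimName = claim
			return
		}
	}
	p.volumes = append(p.volumes, volume{name: dataVolumeName, claimName: claim})
}

func main() {
	p := newPod("example-0000")
	addPVCIntoPod(p, "example-0000-pvc") // only called when PV support is enabled
	fmt.Printf("%s uses claim %q\n", p.name, p.volumes[0].claimName)
}
```

Keeping the constructor signature unchanged means callers that never use PVs are not touched by the feature.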
pkg/util/etcdutil/member.go
Outdated
@@ -70,6 +70,10 @@ func (m *Member) PeerURL() string {
return fmt.Sprintf("%s://%s:2380", m.peerScheme(), m.Addr())
}
func (m *Member) PVCName() string {
This is not really an etcd-related method.
It should be in k8sutil.
done
@sgotti Is there anything I can help with?
@mrIncompetent
@hongchaodeng @mrIncompetent Sorry, I've been busy with other stuff and missed your comment notifications. I'm fine if you want to take over; otherwise I'll be able to update it in the next few days.
@sgotti @mrIncompetent
56019ef to 4d30f6c
This patch implements part 1 of the etcd data on persistent volumes design. When the pod pvsource is defined in the spec, it creates a PVC for every etcd member and uses it as the volume for etcd data. A PVC without a member is removed during reconcile.
4d30f6c to cc4ca65
@hongchaodeng Updated
Awesome work.
Looks very good!
Just some minor code issues.
I will give it a try later over the weekend too and provide more feedback.
@@ -37,6 +37,15 @@ func NewCluster(genName string, size int) *api.EtcdCluster {
}
}
func AddPV(c *api.EtcdCluster, storageClass string) {
AddPVToCluster
@@ -415,6 +422,11 @@ func (c *Cluster) startSeedMember(recoverFromBackup bool) error {
SecureClient: c.isSecureClient(),
}
ms := etcdutil.NewMemberSet(m)
if c.IsPodPVEnabled() {
if err := c.createPVC(m); err != nil {
This is duplicated in two places. Can you combine it with createPod() into a new function? Maybe createResource?
if !k8sutil.IsKubernetesResourceNotFoundError(err) {
return err
}
if c.isDebugLoggerEnabled() {
Can you discard this path:
if c.isDebugLoggerEnabled() {...}
Just ignore isDebugLoggerEnabled for now.
func (c *Cluster) removePVC(name string) error {
ns := c.cluster.Namespace
err := c.config.KubeCli.Core().PersistentVolumeClaims(ns).Delete(name, nil)
if err != nil {
if err != nil && !NotFound(err) {...
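For the removePVC suggestion (`if err != nil && !NotFound(err)`), a minimal sketch of the pattern: deleting a claim that is already gone counts as success, and only other failures are reported. The sentinel errors and the `removeClaim` helper are hypothetical; the real code would use the operator's own NotFound check against the Kubernetes API error.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for Kubernetes API errors.
var (
	errNotFound  = errors.New("not found")
	errForbidden = errors.New("forbidden")
)

// removeClaim deletes a claim through del and treats "not found" as success,
// so that repeated reconcile passes stay idempotent.
func removeClaim(name string, del func(string) error) error {
	if err := del(name); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("failed to delete PVC (%s): %w", name, err)
	}
	return nil
}

func main() {
	alreadyGone := func(string) error { return errNotFound }
	denied := func(string) error { return errForbidden }

	fmt.Println(removeClaim("example-0000", alreadyGone)) // ignored
	fmt.Println(removeClaim("example-0000", denied))      // surfaced to the caller
}
```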
@@ -537,6 +575,29 @@ func (c *Cluster) pollPods() (running, pending []*v1.Pod, err error) {
return running, pending, nil
}
func (c *Cluster) pollPVCs() (pvcs []*v1.PersistentVolumeClaim, err error) {
doc?
@@ -32,7 +32,7 @@ import (
// reconcile reconciles cluster current state to desired state specified by spec.
// - it tries to reconcile the cluster to desired size.
// - if the cluster needs for upgrade, it tries to upgrade old member one by one.
func (c *Cluster) reconcile(pods []*v1.Pod) error {
func (c *Cluster) reconcile(pods []*v1.Pod, pvcs []*v1.PersistentVolumeClaim) error {
change doc?
@@ -209,6 +214,14 @@ func newEtcdServiceManifest(svcName, clusterName, clusterIP string, ports []v1.S
return svc
}
func AddEtcdVolumeToPod(pod *v1.Pod, m *etcdutil.Member, usePVC bool) {
doc?
@@ -228,6 +241,36 @@ func NewSeedMemberPod(clusterName string, ms etcdutil.MemberSet, m *etcdutil.Mem
return pod
}
func NewPVC(m *etcdutil.Member, cs api.ClusterSpec, clusterName, namespace string, owner metav1.OwnerReference) *v1.PersistentVolumeClaim {
doc?
"github.com/coreos/etcd-operator/test/e2e/framework"
)
func TestCreateClusterWithPV(t *testing.T) {
doc?
},
},
Spec: v1.PersistentVolumeClaimSpec{
StorageClassName: &cs.Pod.PV.StorageClass,
Since storageClassName is optional in the cluster specification, leaving it unspecified there leads to storageClassName = "" in the PVC specification. When storageClassName is not specified in the cluster specification, it should be omitted from the PVC so that the default StorageClass is used. Otherwise the user has no way to make use of the default storage class. Please have a look at https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
I didn't realize empty means default.
If so, the correct thing to do is to change the storageClass field into a pointer. But this is not urgent here. We can fix it after this PR.
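The pointer approach can be sketched as follows, assuming a hypothetical helper `storageClassPtr`: an empty name in the cluster spec becomes a nil pointer, so the field is left out of the PVC and the cluster's default StorageClass can apply, while a non-empty name is passed through.

```go
package main

import "fmt"

// storageClassPtr converts the optional storage class name from the cluster
// spec into the pointer form used by the PVC spec. An empty name yields nil,
// which leaves storageClassName unset instead of setting it to "".
func storageClassPtr(name string) *string {
	if name == "" {
		return nil
	}
	return &name
}

func main() {
	// "fast" is a made-up storage class name for the example.
	for _, name := range []string{"", "fast"} {
		if p := storageClassPtr(name); p == nil {
			fmt.Println("storageClassName omitted, cluster default applies")
		} else {
			fmt.Printf("storageClassName=%q\n", *p)
		}
	}
}
```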
Hi @sgotti
@hongchaodeng I don't have a lot of spare time these days, so feel free to finish it. Thanks.
Mind if I close this PR in favor of #1861?
Done in #1861