Skip to content
This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

*: implement PersistentVolume for etcd data design part 1 #1434

Closed
wants to merge 1 commit into from

Conversation

sgotti
Copy link
Member

@sgotti sgotti commented Sep 22, 2017

this patch implements part 1 of the etcd data on persistent volumes design.

when pod pvsource is defined in the spec it'll create a PVC for every etcd
member and use it as the volume for etcd data.

pvc without a member will be removed during the reconcile.

NOTES

  • As discussed during the design phase I haven't changed to pod restart policy to Always. So beside putting data on a PV this patch shoudln't change any other logic and behavior.

hack/test Outdated
go test "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
# Run tests with PV support enabled
PV_TEST=true go test "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To see how the tests works on your jenkins I just rerun them enabling pv support. Do you have better ideas on how to do this without duplicating the code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a separate set of test.
For example, test/e2e/etcd_on_pv_test.go .
We will pool some PV ahead to reduce time overhead and cost on cloud.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this part doesn't change the logic but only places etcd data on a PV I was thinking to redo all the tests with PV enabled. Using a different test files means a lot of duplicated code and maintenance burden. Perhaps we could have just different timeouts for the non PV and PV cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hongchaodeng gentle ICMP ECHO_REQUEST. I'm quite blocked trying to understand how you'd like to organize tests. See above comment.

Copy link
Member

@hongchaodeng hongchaodeng Oct 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truly apologize... This message is inundated in too many notifications...

Yes. That sounds right. But the thing I worry about is that this way it creates too many temporary persistent disks on our GCE and is costly... For initial stage, let's just create one test to make sure it work. We can think of ideas to poll those PVs later?

@@ -40,7 +40,7 @@ func TestCreateCluster(t *testing.T) {
}
}()

if _, err := e2eutil.WaitUntilSizeReached(t, f.CRClient, 3, 6, testEtcd); err != nil {
if _, err := e2eutil.WaitUntilSizeReached(t, f.CRClient, 3, 20, testEtcd); err != nil {
Copy link
Member Author

@sgotti sgotti Sep 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since creating and mounting PVs takes some time I just blindly increased the retries. Better ideas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't change any timeout in test.
All test timeout is delicately tuned because we understand why we allocate this much time and have expectation how long each operation takes.

Let's create a separate set of test for etcd pv scenario.
Additionally, we will probably poll some PVs before test in the future. But we don't need to worry about this right now.


// TODO: We set timeout to 60s here since PVC binding could take up to 60s for GCE/PD. See https://github.com/kubernetes/kubernetes/issues/40972 .
// Change the wait time once there are official p99 SLA.
err = retryutil.Retry(4*time.Second, 15, func() (bool, error) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is copied from https://github.com/coreos/etcd-operator/blob/master/pkg/util/k8sutil/backup.go#L63. I think is shoul unified but this will need some more refactoring. One key difference is that etcd pods and (with this patch) PVCs are created (the api call) here while for backups they are created inside pkg/util/k8sutil.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not wait PVC ready here.
It should reconcile until PVC is ready before creating the pod.

}

// TODO: We set timeout to 60s here since PVC binding could take up to 60s for GCE/PD. See https://github.com/kubernetes/kubernetes/issues/40972 .
// Change the wait time once there are official p99 SLA.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we wait for pvc to be created or should we just define them and then continue creating the pod? It'll work also without waiting and the pod will be scheduled only when the pvc goes in bound state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not wait PVC ready here.
It should reconcile until PVC is ready before creating the pod.

"etcd_node": m.Name,
"etcd_cluster": clusterName,
"app": "etcd",
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hongchaodeng I remember that you said to just set the owner ref without using labels and I agree. I just noticed that with the current gc logic this could lead to a lot of logged warnings (https://github.com/coreos/etcd-operator/pull/1434/files#diff-40a34818cc02bcfb053e6102e57f3177R173 ) since every pvc in the namespace is inspected. So I added the same labels used for the other resources just to do an initial filtering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what's the context.
Re. labels, use:


"etcd_cluster": clusterName,
"app":          "etcd",

is good enough

@@ -27,6 +27,8 @@ import (
const (
defaultBaseImage = "quay.io/coreos/etcd"
defaultVersion = "3.1.8"

minPVSize = 512 // 512MiB
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minPVSizeInMB

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minPodPV...

@@ -206,6 +207,11 @@ func (c *ClusterSpec) Validate() error {
return errors.New("spec: pod labels contains reserved label")
}
}
if c.Pod.PV != nil {
if c.Pod.PV.VolumeSizeInMB < minPVSize {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we just set it back to min instead of erroring it? i forgot about how k8s resource limit works here for the similar case though. probably we want the consistency here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a default sounds a better option to prevent user mistakes.

Additionally, we should also write docs to recommend setting limitrange to require minimum storage:
https://kubernetes.io/docs/tasks/administer-cluster/limit-storage-consumption/#limitrange-to-limit-requests-for-storage

@@ -404,7 +411,12 @@ func (c *Cluster) startSeedMember(recoverFromBackup bool) error {
SecureClient: c.isSecureClient(),
}
ms := etcdutil.NewMemberSet(m)
if err := c.createPod(ms, m, "new", recoverFromBackup); err != nil {
if c.UsePodPV() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShouldUsePodPV or IsPVEnabled?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsPodPVEnabled ?

return fmt.Errorf("failed to create persistent volume claim for seed member (%s): %v", m.Name, err)
}
}
if err := c.createPod(ms, m, "new", recoverFromBackup, c.UsePodPV()); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to pass in usePodPV here. the receiver is still onc.

@@ -158,3 +161,28 @@ func (gc *GC) collectDeployment(option metav1.ListOptions, runningSet map[types.

return nil
}

func (gc *GC) collectPVCs(option metav1.ListOptions, runningSet map[types.UID]bool) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgotti

when this lands, we probably already move to k8s 1.8, which supports crd GC natively. so we do not need this thing anymore.

/cc @hongchaodeng

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
We will rm -rf entire custom GC code :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

hack/test Outdated
go test "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
# Run tests with PV support enabled
PV_TEST=true go test "./test/e2e/" -run "$E2E_TEST_SELECTOR" -timeout 30m --race --kubeconfig $KUBECONFIG --operator-image $OPERATOR_IMAGE --namespace ${TEST_NAMESPACE}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a separate set of test.
For example, test/e2e/etcd_on_pv_test.go .
We will pool some PV ahead to reduce time overhead and cost on cloud.

@@ -75,6 +75,10 @@ func (dl *DebugLogger) LogPodDeletion(podName string) {
dl.fileLogger.Infof("deleted pod (%s)", podName)
}

func (dl *DebugLogger) LogPVCDeletion(pvcName string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not add this logging.
The debugLogger was built for self-hosted hack..

@@ -206,6 +207,11 @@ func (c *ClusterSpec) Validate() error {
return errors.New("spec: pod labels contains reserved label")
}
}
if c.Pod.PV != nil {
if c.Pod.PV.VolumeSizeInMB < minPVSize {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a default sounds a better option to prevent user mistakes.

Additionally, we should also write docs to recommend setting limitrange to require minimum storage:
https://kubernetes.io/docs/tasks/administer-cluster/limit-storage-consumption/#limitrange-to-limit-requests-for-storage

@@ -27,6 +27,8 @@ import (
const (
defaultBaseImage = "quay.io/coreos/etcd"
defaultVersion = "3.1.8"

minPVSize = 512 // 512MiB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minPodPV...

@@ -404,7 +411,12 @@ func (c *Cluster) startSeedMember(recoverFromBackup bool) error {
SecureClient: c.isSecureClient(),
}
ms := etcdutil.NewMemberSet(m)
if err := c.createPod(ms, m, "new", recoverFromBackup); err != nil {
if c.UsePodPV() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsPodPVEnabled ?


// TODO: We set timeout to 60s here since PVC binding could take up to 60s for GCE/PD. See https://github.com/kubernetes/kubernetes/issues/40972 .
// Change the wait time once there are official p99 SLA.
err = retryutil.Retry(4*time.Second, 15, func() (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not wait PVC ready here.
It should reconcile until PVC is ready before creating the pod.

return pvc
}

func NewEtcdPod(m *etcdutil.Member, initialCluster []string, clusterName, state, token string, cs api.ClusterSpec, usePVC bool, owner metav1.OwnerReference) *v1.Pod {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add a new param.

Instead, add a new func AddPVCIntoPod() to compose the spec.

@@ -158,3 +161,28 @@ func (gc *GC) collectDeployment(option metav1.ListOptions, runningSet map[types.

return nil
}

func (gc *GC) collectPVCs(option metav1.ListOptions, runningSet map[types.UID]bool) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
We will rm -rf entire custom GC code :)

@@ -70,6 +70,10 @@ func (m *Member) PeerURL() string {
return fmt.Sprintf("%s://%s:2380", m.peerScheme(), m.Addr())
}

func (m *Member) PVCName() string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really etcd related method..
It should be in k8sutil.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -40,7 +40,7 @@ func TestCreateCluster(t *testing.T) {
}
}()

if _, err := e2eutil.WaitUntilSizeReached(t, f.CRClient, 3, 6, testEtcd); err != nil {
if _, err := e2eutil.WaitUntilSizeReached(t, f.CRClient, 3, 20, testEtcd); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't change any timeout in test.
All test timeout is delicately tuned because we understand why we allocate this much time and have expectation how long each operation takes.

Let's create a separate set of test for etcd pv scenario.
Additionally, we will probably poll some PVs before test in the future. But we don't need to worry about this right now.

@mrIncompetent
Copy link

@sgotti Is there anything i can help with?
I would love to use this feature!

@hongchaodeng
Copy link
Member

@mrIncompetent
Hasn't seen any update for a while.
Would you like to take it over? I'm fine to move forward if you can submit a PR.

@sgotti
Copy link
Member Author

sgotti commented Oct 17, 2017

@hongchaodeng @mrIncompetent Sorry, I've ben busy with other stuff and lost your comments notifications. I'm fine if you want to take over or I'll be able to update it in the next days.

@hongchaodeng
Copy link
Member

@sgotti
No worries.

@mrIncompetent
Just wait for the PR and help with review. Thanks!

@sgotti sgotti force-pushed the pv_etcd_data_part1 branch 3 times, most recently from 56019ef to 4d30f6c Compare October 27, 2017 15:28
this patch implements part 1 of the etcd data on persistent volumes design.

when pod pvsource is defined in the spec it'll create a PVC for every etcd
member and use it as the volume for etcd data.

pvc without a member will be removed during the reconcile.
@sgotti sgotti force-pushed the pv_etcd_data_part1 branch from 4d30f6c to cc4ca65 Compare October 27, 2017 16:46
@sgotti
Copy link
Member Author

sgotti commented Oct 27, 2017

@hongchaodeng Updated

Copy link
Member

@hongchaodeng hongchaodeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work.
Looks very good!
Just some minor code issues.
I will give it a try later over the weekend too and provide more feedback.

@@ -37,6 +37,15 @@ func NewCluster(genName string, size int) *api.EtcdCluster {
}
}

func AddPV(c *api.EtcdCluster, storageClass string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AddPVToCluster

@@ -415,6 +422,11 @@ func (c *Cluster) startSeedMember(recoverFromBackup bool) error {
SecureClient: c.isSecureClient(),
}
ms := etcdutil.NewMemberSet(m)
if c.IsPodPVEnabled() {
if err := c.createPVC(m); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is duplicated in two places. Can you combine it with createPod() into a new function? Maybe createResource?

if !k8sutil.IsKubernetesResourceNotFoundError(err) {
return err
}
if c.isDebugLoggerEnabled() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you discard this path:

if c.isDebugLoggerEnabled() {...}

Just ignore isDebugLoggerEnabled now

func (c *Cluster) removePVC(name string) error {
ns := c.cluster.Namespace
err := c.config.KubeCli.Core().PersistentVolumeClaims(ns).Delete(name, nil)
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err !=nil && !NotFound(err){...

"etcd_node": m.Name,
"etcd_cluster": clusterName,
"app": "etcd",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what's the context.
Re. labels, use:


"etcd_cluster": clusterName,
"app":          "etcd",

is good enough

@@ -537,6 +575,29 @@ func (c *Cluster) pollPods() (running, pending []*v1.Pod, err error) {
return running, pending, nil
}

func (c *Cluster) pollPVCs() (pvcs []*v1.PersistentVolumeClaim, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc?

@@ -32,7 +32,7 @@ import (
// reconcile reconciles cluster current state to desired state specified by spec.
// - it tries to reconcile the cluster to desired size.
// - if the cluster needs for upgrade, it tries to upgrade old member one by one.
func (c *Cluster) reconcile(pods []*v1.Pod) error {
func (c *Cluster) reconcile(pods []*v1.Pod, pvcs []*v1.PersistentVolumeClaim) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change doc?

@@ -209,6 +214,14 @@ func newEtcdServiceManifest(svcName, clusterName, clusterIP string, ports []v1.S
return svc
}

func AddEtcdVolumeToPod(pod *v1.Pod, m *etcdutil.Member, usePVC bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc?

@@ -228,6 +241,36 @@ func NewSeedMemberPod(clusterName string, ms etcdutil.MemberSet, m *etcdutil.Mem
return pod
}

func NewPVC(m *etcdutil.Member, cs api.ClusterSpec, clusterName, namespace string, owner metav1.OwnerReference) *v1.PersistentVolumeClaim {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc?

"github.com/coreos/etcd-operator/test/e2e/framework"
)

func TestCreateClusterWithPV(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc?

},
},
Spec: v1.PersistentVolumeClaimSpec{
StorageClassName: &cs.Pod.PV.StorageClass,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since storageClassName is optional in cluster specification, unspecified storageClassName in cluster spec leads to storageClassName = "" in PVC specification. In case storageClassName is not specified in cluster specification, it should be removed from PVC to use default storageClass. Otherwise there won't be option for user to make use of default storage class. Please have look at this https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize empty means default.
If so, the correct thing to do is to change the storageclads field into a pointer. But This is not urgent here. We can fix It after This pr

@hongchaodeng
Copy link
Member

Hi @sgotti
Do you have time to address it in next few days? If not, I am having holidays and can help finish the work.

@sgotti
Copy link
Member Author

sgotti commented Nov 27, 2017

@hongchaodeng I don't have a lot of spare time in these days so feel free to finish it. Thanks.

@hongchaodeng
Copy link
Member

Mind if I close this PR in favor of #1861?

@hongchaodeng
Copy link
Member

Done in #1861

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants