Replace etcd operator with StatefulSet #50

Open
etiennedi opened this issue Feb 26, 2019 · 12 comments
@etiennedi
Member

As described in #40, the etcd operator is unfortunately not suitable for production use. A simple StatefulSet (I think there is such a chart in the incubator repository) might be better: coreos/etcd-operator#1323 (comment)

@idcrosby
Contributor

The friendly folk at Bitnami have a Helm chart for etcd which looks promising (and uses a StatefulSet). I'm testing it out now. https://github.com/bitnami/charts/tree/master/bitnami/etcd

@idcrosby
Contributor

idcrosby commented Mar 4, 2019

I have a work in progress PR here: #53

The verification script doesn't seem to run properly. It just hangs after printing:

Downloading contextionary vocabulary... succesfully downloaded.
Reading contextionary... succesfully parsed contextionary.

All pods are up and running, and the weaviate logs don't show anything interesting:

2019/03/04 19:19:44 INFO: Temp folder created....
2019/03/04 19:19:44 INFO: Config file found, loading environment...
2019/03/04 19:19:44 INFO: Running in DEBUG-mode.
2019/03/04 19:19:44 INFO: Contextionary loaded from disk.
2019/03/04 19:19:44 INFO: No network configured, not joining one.

Any thoughts on what might be happening or how to debug this, @etiennedi?

@etiennedi
Member Author

I'll take a look now, @idcrosby. Will let you know what I find.

etiennedi self-assigned this Mar 5, 2019
@etiennedi
Member Author

Simply from looking at the build logs my assumption would be that something around the distributed locking doesn't work as intended.

This would prevent weaviate from starting the HTTP server properly, which in turn would leave the verification script unable to download the swagger JSON.

I'll check out the PR branch, build the helm chart and try and apply it to a minikube cluster. Maybe I can reproduce it. If not, we'd have to trigger a "real cluster" build without the destroy step and see the state there.
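The failure chain described above (no lock, no HTTP server, verification script waiting forever) suggests giving the readiness check a deadline. A minimal sketch in Python, not taken from the actual verification script: `http_ok` and `wait_until_ready` are hypothetical helper names, and the URL in the usage comment is a placeholder.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET on url answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, deadline_s=120.0, interval_s=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or deadline_s has elapsed.

    Returns True on success and False on timeout, so the caller can
    fail the build with a clear message instead of hanging.
    """
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


# Hypothetical usage; SWAGGER_URL is a placeholder, not the real endpoint:
# if not wait_until_ready(lambda: http_ok(SWAGGER_URL)):
#     raise SystemExit("weaviate did not serve the swagger JSON in time")
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.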

@idcrosby
Contributor

idcrosby commented Mar 5, 2019

@etiennedi I can quickly bring up a real cluster with this setup and debug it. How could I verify the distributed locking setup?

@etiennedi
Member Author

etiennedi commented Mar 5, 2019

My first approach would simply be to check the weaviate logs to see how far it gets. Additionally, the locks are regular key-value entries in etcd, so you could exec into the etcd container and run ETCDCTL_API=3 etcdctl get --prefix /w (all keys start with /weaviate, so the --prefix flag effectively dumps all entries).
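The same check can be scripted from outside the cluster. This is a rough sketch under assumptions: the pod name is whatever the chart produces (the `etcd-0` in the usage comment is hypothetical), the container image is assumed to ship the `env` binary, and `etcd_dump_command` and `dump_lock_keys` are made-up helper names.

```python
import subprocess


def etcd_dump_command(pod, prefix="/weaviate", namespace=None):
    """Build the kubectl argv that lists all etcd keys under prefix.

    ETCDCTL_API=3 is set through env inside the container, which
    assumes the image provides the env binary.
    """
    argv = ["kubectl"]
    if namespace:
        argv += ["--namespace", namespace]
    argv += ["exec", pod, "--",
             "env", "ETCDCTL_API=3",
             "etcdctl", "get", "--prefix", prefix]
    return argv


def dump_lock_keys(pod, **kwargs):
    """Run the command and return etcdctl's stdout (needs kubectl access to the cluster)."""
    result = subprocess.run(etcd_dump_command(pod, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical usage; the pod name is an assumption:
# print(dump_lock_keys("etcd-0"))
```

An empty result while weaviate is starting would hint that no lock was ever written.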

@etiennedi
Member Author

Good idea with the real cluster. The resource requirements are definitely quite big, so minikube with the default VM doesn't work. I'll let you run the setup and instead watch this issue closely for your updates. Thanks.

@idcrosby
Contributor

idcrosby commented Mar 5, 2019

Etcd was configured with client auth enabled, but weaviate is not configured to authenticate with etcd using client certs.

I think for our use case (etcd not being exposed externally and not storing any sensitive data) we can disable client authentication. I'm configuring this now.

Also, it would be a good idea to have weaviate log an error if it cannot connect to etcd.

cc @etiennedi

@etiennedi
Member Author

etiennedi commented Mar 5, 2019

> I think for our use case (etcd not being exposed externally and not storing any sensitive data) we can disable client authentication. I'm configuring this now.

Agreed.

This reminds me: I noticed some of our k8s Services are of type LoadBalancer, which means that if you apply the chart on a default GKE cluster, some endpoints will immediately be publicly available, which might not be the best default. I'll open a separate issue for this.
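One way to spot such Services before applying the chart to a real cluster is to inspect the output of kubectl get services -o json. A small sketch, assuming that output has been parsed into a dict; the Service names in the sample are invented for illustration and `public_services` is a hypothetical helper.

```python
import json


def public_services(service_list):
    """Return the names of Services whose spec.type is LoadBalancer."""
    names = []
    for item in service_list.get("items", []):
        if item.get("spec", {}).get("type") == "LoadBalancer":
            names.append(item["metadata"]["name"])
    return names


# Trimmed sample in the shape of `kubectl get services -o json`;
# the names are made up and not taken from the chart.
SAMPLE = json.loads("""
{
  "kind": "List",
  "items": [
    {"metadata": {"name": "weaviate"}, "spec": {"type": "LoadBalancer"}},
    {"metadata": {"name": "etcd"}, "spec": {"type": "ClusterIP"}}
  ]
}
""")

print(public_services(SAMPLE))
```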

> Also, would be a good idea to have weaviate log an error if it cannot connect to etcd

What was the current behavior? I would have expected it to fail with an error because of https://github.com/creativesoftwarefdn/weaviate/blob/develop/restapi/configure_weaviate.go#L680-L683

@idcrosby
Contributor

idcrosby commented Mar 5, 2019

@etiennedi weaviate stays running and doesn't log any errors. The only thing in the logs is:

2019/03/04 19:19:44 INFO: Temp folder created....
2019/03/04 19:19:44 INFO: Config file found, loading environment...
2019/03/04 19:19:44 INFO: Running in DEBUG-mode.
2019/03/04 19:19:44 INFO: Contextionary loaded from disk.
2019/03/04 19:19:44 INFO: No network configured, not joining one.

@etiennedi
Member Author

Interesting. So client creation doesn't error, but it simply never acquires the lock. Weird. But thanks for noticing. I'll create a separate issue over in creativesoftwarefdn/weaviate.
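To illustrate the behavior asked for earlier in the thread (an error in the logs rather than silence), here is a language-neutral sketch in Python of bounding a blocking lock call with a deadline. It is not weaviate's code: `acquire_with_deadline` is a hypothetical name and `blocking_acquire` stands in for whatever call takes the etcd lock.

```python
import logging
import threading

logger = logging.getLogger("startup")


def acquire_with_deadline(blocking_acquire, timeout_s=30.0):
    """Run blocking_acquire() in a daemon thread and wait at most timeout_s.

    Returns True if the call finished in time without raising;
    otherwise logs an error and returns False.
    """
    done = threading.Event()
    errors = []

    def worker():
        try:
            blocking_acquire()
        except Exception as exc:  # report any failure instead of losing it in the thread
            errors.append(exc)
        finally:
            done.set()

    # A daemon thread does not keep the process alive if the call never returns.
    threading.Thread(target=worker, daemon=True).start()

    if not done.wait(timeout_s):
        logger.error("lock not acquired within %.1fs; is etcd reachable "
                     "and is client authentication configured?", timeout_s)
        return False
    if errors:
        logger.error("lock acquisition failed: %s", errors[0])
        return False
    return True
```

In Go, passing a context with a timeout to the lock call would likely be the more natural way to get the same effect.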

@idcrosby
Contributor

idcrosby commented Mar 5, 2019

Looks like everything is working now: https://travis-ci.com/SeMI-network/weaviate-infra/builds/103239676

I'll remove the WIP label from the PR and assign it to you, @etiennedi, for review.
