Make Patroni Kubernetes native #500
Conversation
Member status is stored in the pods' `metadata.annotations`. Other structures have changed slightly: `leader` and `optime/leader` are merged into the `cluster-name-leader` ConfigMap, `initialize` and `config` are merged into the `cluster-name-config` ConfigMap, and `failover` and `sync` stay as they are. Unfortunately, Kubernetes doesn't provide an API for atomic deletes, therefore we just empty the metadata instead of deleting objects.
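For illustration only (not code from this PR), a minimal sketch of reading member status back from pod annotations with the `kubernetes` Python client; the namespace, label selector, and annotation key are assumptions:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

# Assumed namespace and labels; real values depend on how the cluster was deployed.
pods = v1.list_namespaced_pod('default', label_selector='application=spilo,version=patronidemo')
for pod in pods.items:
    annotations = pod.metadata.annotations or {}
    # The member's state lives in the pod annotations; the key name 'status'
    # is an assumption made for this example.
    print(pod.metadata.name, annotations.get('status'))
```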
@jberkus, I think you should really try it. It uses the Kubernetes API and ConfigMaps to store cluster state. Here you can find a Dockerfile and a Kubernetes manifest to deploy it: https://github.com/zalando/patroni/tree/feature/k8s/kubernetes
Will test! FWIW, it's possible Kube will add leader elections in the future.
This is not a critical bug, because the `attempt_to_acquire_leader` method was still returning False in this case.
In addition to that, implement additional checks around manual failover and recover when synchronous_mode is enabled.
* possibility to specify client certs and cacert
* possibility to specify a token
* compatibility with python-consul-0.7.1
And set the correct postgres state in pause mode
The latest one has some problems with None values received instead of empty lists: kubernetes-client/python#376
Indeed, this is a problem with the kubernetes 4.0.0 module. I've updated requirements.txt and pinned a working version there: kubernetes==3.0.0
I built a new Spilo image as explained above. I hope this was the right thing to do. The result is this:
Let me know if this is not the right place to further discuss this. I don't want to abuse this PR. Would you be willing to further assist me? I could create an issue or you can find me on Kubernetes Slack.
Oh, I told you to build it with "--build-arg DEMO=true": that flag builds an image with postgres 10 only and without a lot of heavy stuff. It is good enough to try with minikube, for example, but it is not for production because there is no wal-e inside. And you hit a bug where $PATH wasn't propagated to Patroni and it failed to run pg_ctl initdb. The normal Spilo image contains postgres 9.3, 9.4, 9.5, 9.6 and 10.
I built the new image without demo and tried it with Postgres 10 and 9.6. Everything now starts up fine with both.
I think I know what the problem is. At some point we started creating Services with a "named" port:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: &cluster_name patronidemo
  labels:
    application: spilo
    version: *cluster_name
spec:
  type: ClusterIP
  ports:
  - port: 5432
    targetPort: 5432
    name: postgresql
```

This commit explains why: zalando/spilo@2be341a#diff-8c54fa1e5677a832585d18f396619701. If the port name in the Service and in the Endpoints object doesn't match, the service will not work. I think in your case you've created a Service with no name assigned for port=5432.
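As an illustration of the fix (not from the PR), here is a minimal sketch using the `kubernetes` Python client that creates an Endpoints object whose port carries the same name as the Service port; the namespace and IP address are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# The port name ('postgresql') must match the name used in the Service spec,
# otherwise the Service will not route traffic to these endpoints.
endpoints = client.V1Endpoints(
    metadata=client.V1ObjectMeta(name='patronidemo'),
    subsets=[client.V1EndpointSubset(
        addresses=[client.V1EndpointAddress(ip='10.2.3.4')],  # placeholder pod IP
        ports=[client.V1EndpointPort(name='postgresql', port=5432)],
    )],
)
v1.create_namespaced_endpoints('default', endpoints)
```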
Thanks, that was the missing piece. It works for both modes.
I also experimented with parallel pod management and rolling updates. Do you see a problem with this?
Endpoints. Otherwise there is a race condition: #536
I don't think that it will "form correctly". Patroni will notice that there is a
Database clusters are stateful and usually you don't delete and recreate them all the time. There is not much we can improve.
I've never played with rolling upgrades so far, but I think it might be dangerous. It is very important not to terminate the next pod until the previous one becomes healthy enough (it has started streaming from the master and the replication lag is close to 0).
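For example, a readiness-style check between pod rotations could look roughly like the sketch below. This is not part of the PR; it assumes psycopg2, a monitoring connection to the master, PostgreSQL 10 function names, and a hypothetical application_name per replica:

```python
import psycopg2

def replica_caught_up(master_dsn, application_name, max_lag_bytes=1024 * 1024):
    """Return True if the named replica is streaming and its replay lag is small."""
    with psycopg2.connect(master_dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                # PostgreSQL 10 naming; 9.6 uses pg_current_xlog_location()/replay_location.
                "SELECT state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)"
                " FROM pg_stat_replication WHERE application_name = %s",
                (application_name,),
            )
            row = cur.fetchone()
    return (row is not None and row[0] == 'streaming'
            and row[1] is not None and row[1] <= max_lag_bytes)

# Hypothetical usage before rotating the next pod in a rolling update:
# if replica_caught_up('host=patronidemo user=postgres', 'patronidemo-1'):
#     proceed_with_next_pod()
```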
I've seen the operator project and was wondering which way to go. I'm looking for a solution without an additional DCS like Etcd. Is this also possible with the operator?
Postgres-operator doesn't care what DCS is used by Patroni. It just passes some environment variables to Spilo.
So, some testing questions:
The latest Spilo image supporting kube-native is
In order to enable the Kubernetes API for leader election you should set
OK, so KUBERNETES_USE_CONFIGMAP is in spilo but not in upstream patroni?
Yes. In Patroni the kubernetes configs are different:
OK, trying use_endpoints in OpenShift, will report back. Clearly I need to write a config doc for this.
This is AFAIK already running in production; can we merge this PR?
@jberkus - yes, I got a version of this working with OpenShift. I actually did use a heavily modified Spilo because I wanted the archiving and backup features from there. There are plenty of issues to resolve when building the image: for example, initdb will fail because getpwnam() doesn't work since the container user does not exist, everything needs to run as the fake root user, anything that will be modified from the container needs g+rw permissions set on it, and setuid will not work at all, so the cron daemon needs to be replaced. I used the ConfigMap-based approach; the race condition is an acceptable risk for now. I haven't tried to integrate the latest version, but that is just because I have been busy with other tasks.
👍 |
To date, this branch has done well in all of my testing. I have yet to hit a specific bug with it. See follow-up issue ...
👍 |
In order to be able to find all objects related to our Patroni cluster we use `labels` and `labelSelector` (see `patroni.yaml`).

Unfortunately the Kubernetes API doesn't provide a way to expire objects, only compare-and-set functionality, therefore we had to implement leader election by periodically updating the annotations of the `<scope>-leader` ConfigMap object. The basic idea is taken from https://github.com/kubernetes/client-go/tree/master/tools/leaderelection. Every node in the cluster periodically checks the annotations of the `<scope>-leader` ConfigMap object: if the annotations were changed, that means we have a leader; if the annotations weren't changed during `ttl` seconds, the cluster doesn't have a leader.
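For illustration only (not code from this PR), a minimal sketch of such a compare-and-set update with the `kubernetes` Python client; it relies on the API server rejecting a replace whose `resourceVersion` is stale, and the annotation names and namespace are assumptions:

```python
import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()

def try_acquire_leader(name, namespace, member, ttl=30):
    """Attempt to take or renew leadership via annotations of the <scope>-leader ConfigMap."""
    cm = v1.read_namespaced_config_map(name, namespace)
    annotations = cm.metadata.annotations or {}
    last_renew = float(annotations.get('renewTime', 0))
    current_leader = annotations.get('leader')
    # Only take over if there is no leader, the lease expired, or we already hold it.
    if current_leader not in (None, member) and time.time() - last_renew < ttl:
        return False
    annotations.update({'leader': member, 'renewTime': str(time.time())})
    cm.metadata.annotations = annotations
    try:
        # replace_namespaced_config_map keeps metadata.resource_version from the read above,
        # so the API server rejects the write (409 Conflict) if someone updated it in between.
        v1.replace_namespaced_config_map(name, namespace, cm)
        return True
    except ApiException as e:
        if e.status == 409:  # lost the compare-and-set race
            return False
        raise

# Hypothetical usage: try_acquire_leader('patronidemo-leader', 'default', 'patronidemo-0')
```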
List of ConfigMaps Patroni works with:

* `initialize` and `config` keys are stored as annotations of the `<scope>-config` ConfigMap
* `leader` and `optime/leader` keys are stored as annotations of the `<scope>-leader` ConfigMap
* the `failover` key is stored as annotations of the `<scope>-failover` ConfigMap
* the `sync` key is stored as annotations of the `<scope>-sync` ConfigMap
Open questions: