Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IMPORTANT: Missing ssl_ca_file in Spilo Config #435

Closed
MarkCupitt opened this issue Jan 10, 2024 · 11 comments
Closed

IMPORTANT: Missing ssl_ca_file in Spilo Config #435

MarkCupitt opened this issue Jan 10, 2024 · 11 comments

Comments

@MarkCupitt
Copy link

MarkCupitt commented Jan 10, 2024

Patroni supports the inclusion of ssl_ca_file so that PostgreSQL can use custom tls certs with root cert validation (we require it for teleport db access, our database is airgapped, so we proxy to it via teleport)

patroni/features/environment.py supports ssl_ca_file
patroni/patroni/postgresql/validator.py supports ssl_ca_file
helm-charts/charts/timescaledb-single/values.yaml supports ssl_ca_file

However

timescaledb-docker-ha/scripts/configure_spilo.py omits ssl_ca_file

which means that postgres never gets the ca and tls fails for any session that requires tls validation

This makes TimescaleDb unusable in many scenarios

@MarkCupitt
Copy link
Author

I'm thinking that adding one line may solve this ...

timescaledb-docker-ha/scripts/configure_spilo.py

    ssl: 'on'
    ssl_cert_file: {{SSL_CERTIFICATE_FILE}}
    ssl_key_file: {{SSL_PRIVATE_KEY_FILE}}
    ssl_ca_file: {{{postgresql.parameters.ssl_ca_file}}}
    shared_preload_libraries: 'bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon'

There may need to be some logic somewhere to exclude it if its not supplied, not sure what postgreSql will do if its set to empty

@MarkCupitt
Copy link
Author

So, I built a custom image after making the above changes using:

PG_MAJOR="15" PG_VERSIONS="15.5" TIMESCALEDB_VERSIONS="2.13.0" make build

Tagged it and uploaded it to my Dockerhub, docker pull markcupitt/timescaledb:v1.0.0, loaded it to our Kubernetes cluster and can confirm I am now able to connect via SSL as originally expected.

I will next try it by removing the ca file from the pod, to test what PostgreSQL does when no ca is specified, but this ends up in the postgres parameters

ssl_ca_file: 

@MarkCupitt
Copy link
Author

MarkCupitt commented Jan 10, 2024

I did this test, and can confirm that PostgreSQL threw a fit :) with the following log entry

2024-01-10 04:11:52 UTC [32]: [659e1908.20-1] @,app= [F0000] FATAL:  could not load root certificate file "/etc/certificate/ca.crt": SSL error code 2147483650

So some logic in there somewhere is obviously required to ensure that parameter is populated with the correct ca or not included

Possibly in patroni/features/environment.py around line 218, remove the entry for ssl_ca_cert if empty or the file is non existent

@MarkCupitt
Copy link
Author

MarkCupitt commented Jan 10, 2024

Some additional issues with the Helm chart can be found at:

timescale/helm-charts#641
timescale/helm-charts#642
timescale/helm-charts#639

@alexeyklyukin
Copy link
Member

@MarkCupitt one may need to add support for the ssl_ca_file into the helm chart, but configure_spilo.py is a wrong place to add it. It is still there for compatibility reasons with the zalando postgres-operator, although that compatibility is going to be removed at some point in the future, since the original spilo image will probably drift away from this one, and Patroni supports new parameters that the configure_spilo.py here is not aware of.

I think if one wants to retain this compatibility, the right actions would be moving conifgure_spilo.py and augment_patroni_configuraiton.py into the scripts configmap in the helm chart, decoupling it from this docker image.

@MarkCupitt
Copy link
Author

MarkCupitt commented Jan 12, 2024

@alexeyklyukin If we include the ssl_ca_file in the helm chart, which we can in the postgres section, it IS included in the configmap, however it gets stripped out and does not end up in the postgres config in the container, I have verified that.

If I add the ssl_ca_file into a running container postgres parameters and reload the config, it works as expected.

SO whatever script runs at container init is the point of failure, I presumed it was the configure_spilo.py, in the container as part of the patroni deployment process but based on what you say, there must be some other process that runs????

Do you know what that is, so I can look at a fix, I have only been able to find the configure_spilo script in the container that seems to access that parameter... I did a global repo search

I don't think this is a helm issue as the chart does what it is supposed to and ssl_ca_file does end up in the config map, and is mounted, it breaks down after that point

@alexeyklyukin
Copy link
Member

@MarkCupitt the container init is at

#!/bin/bash
function log {
echo "$(date '+%Y-%m-%d %H:%M:%S') - bootstrap - $1"
}
# Making sure the data directory exists and has the right permission
install -m 0700 -d "${PGDATA}"
# pgBackRest container
[ "${K8S_SIDECAR}" == "pgbackrest" ] && exec /pgbackrest_entrypoint.sh
# Spilo is the original Docker image containing Patroni. The image uses
# some scripts to convert a SPILO_CONFIGURATION into a configuration for Patroni.
# At some point, we want to probably get rid of this script and do all this ourselves.
# For now, if the environment variable is set, we consider that a feature flag to use
# the original Spilo configuration script
[ ! -z "${SPILO_CONFIGURATION}" ] && {
python3 /scripts/configure_spilo.py patroni patronictl certificate
# The current postgres-operator does not pass on all the variables set by the Custom Resource.
# We need a bit of extra work to be done
# Issue: https://github.com/zalando/postgres-operator/issues/574
python3 /scripts/augment_patroni_configuration.py /home/postgres/postgres.yml
}
if [ -f "${PGDATA}/postmaster.pid" ]; then
# the postmaster will refuse to start if the pid of the pidfile is currently
# in use by the same OS user. This protection mechanism however is not strict
# enough in a container environment, as we only have the pids in our own namespace.
# The Volume containing the data directory could accidentally be mounted
# inside multiple containers, so relying on visibility of the pid is not enough.
#
# There is only 1 way for us to communicate to the other postmaster (in another container?)
# on the same $PGDATA: by removing the pidfile.
#
# The other postmaster will shutdown immediately as soon as it determines that its
# pidfile has been removed. This is a Very Good Thing: it prevents multiple postmasters
# on the same directory, even in a container environment.
# See also https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7e2a18)
#
# The downside of this change is that it will delay the startup of a crashed container;
# as we're dealing with data, we'll choose correctness over uptime in this instance.
log "Removing stale pidfile ..."
rm "${PGDATA}/postmaster.pid"
log "Sleeping a little to ensure no other postmaster is running anymore"
sleep 65
fi
exec patroni /home/postgres/postgres.yml

unless you specify the SPILO_CONFIGURATION environment variable, which the helm chart doesn't,
you'll get straight to Patroni configuration defined in the values.yaml of the helm chart.

In the helm chart values, you have the the certificateSecretName in the helm chart, allowing your to override the secret for mounting certificates into the container, and patroni.bootstrap.dcs.postgresql.parameters where you can reference ssl_ca_file, pointing it to the content you've put into the certificateSecretName secret.

The default behavior, if you don't specify the custom secret in the certificateSecretName, is to write a self-signed one at https://github.com/timescale/helm-charts/blob/7ded6b654c956a3f6dc119d90b47a0262eba600e/charts/timescaledb-single/templates/secret-certificate.yaml

@MarkCupitt
Copy link
Author

@alexeyklyukin in a Kubernetes environment adding a Ca in certificateSecretName breaks Patroni. This certificate is used to talk to the Kubernetes API and of course. on a custom cert, kubernetes does not know about it. Patroni DCS uses teh Kubernetes API for its object store. Its the first thing we tried.

The self signed cert does happen, but that cert cannot be used with Teleport per our use case we originally stated

ssl_ca_file in postgres parameters is REMOVEd by patroni, I gave you the code where that happened

This is going in circles .. we will have to find alternatives I guess .. an adjust manage our own builds ..

Bottom line is that the container needs to be fixed, we proved it and have it running in our own build, I gave the changes we made to get it working .. the only consideration is if a ca file is not specified, then some logic is required to remove the ssl_ca_file form the config or postgres will break, I suspect that people believe that certificateSecretName is the way to do it, its not, it breaks Patroni in a Kubernetes environment as I stated above

@MarkCupitt
Copy link
Author

@alexeyklyukin our decision at this stage is that until timescaledb can fix the containers so we can specify a ca, we have no choice but to maintain our own builds (Which we are using now) ... We will drop the helm deployment and use a manifest based deployment using KLUCTL as the deployment tool. We will make our code and configs available in our GIthub Repos for anyone else who needs it, once we have it looking respectable

We will happily go back to official builds once the container can be given a Ca in the postgres parms section of the patroni config and it's not removed, unfortunately, we need to progress our project, our timelines are slipping significantly now

@MarkCupitt
Copy link
Author

@alexeyklyukin if you feel that there is nothing wrong with the container or the patroni script , then please close this ticket, The container is unusable for us the way it is right now ..

However I really hope you guys will look at this issue, replicate it, and fix it so we can go back to the official builds OR tell us what we need to do to supply our own ca if its is indeed NOT the patroni script in the container that is the root cause of this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants