
Docs Request: configuration for backups and perhaps restoring backups #630

Closed
Bram-Zijp opened this issue Jul 24, 2019 · 12 comments

@Bram-Zijp

Bram-Zijp commented Jul 24, 2019

I'm struggling to get WAL-E and logical backups to work. I went through all the configuration options I could find by looking at the source code and at environment-variable-related issues on GitHub. I'm trying to connect to Digital Ocean Spaces, but without success.

I'm getting the following error:

wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-push".
        STRUCTURED: time=2019-07-24T05:26:13.333584-00 pid=33
wal_e.blobstore.s3.s3_util WARNING  MSG: WALE_S3_ENDPOINT defined, ignoring AWS_REGION
        HINT: AWS_REGION is only intended for use with AWS S3, and not interface-compatible use cases supported by WALE_S3_ENDPOINT
        STRUCTURED: time=2019-07-24T05:26:13.366082-00 pid=33
2019-07-24 05:26:14.023 43 LOG Starting pgqd 3.3
2019-07-24 05:26:14.023 43 LOG auto-detecting dbs ...
wal_e.operator.backup INFO     MSG: start upload postgres version metadata
        DETAIL: Uploading to s3://project/spilo/project-postgres-cluster/wal/basebackups_005/base_000000010000000000000003_00000208/extended_version.txt.
        STRUCTURED: time=2019-07-24T05:26:14.282007-00 pid=33
wal_e.operator.backup WARNING  MSG: blocking on sending WAL segments
        DETAIL: The backup was not completed successfully, but we have to wait anyway.  See README: TODO about pg_cancel_backup
        STRUCTURED: time=2019-07-24T05:26:14.382303-00 pid=33

Configuration I have so far:

# referenced from the operator's main ConfigMap (the default acid minimal cluster using the latest images) via the key: pod_environment_configmap
apiVersion: v1
kind: ConfigMap
metadata:
  name: project-postgres-pod-config
  namespace: default
data:
  WAL_S3_BUCKET: project
  # AWS_REGION: eu-central-1
  AWS_ENDPOINT: https://fra1.digitaloceanspaces.com
  AWS_ACCESS_KEY_ID: REPLACE_AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY: REPLACE_AWS_SECRET_ACCESS_KEY
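
(For reference, an untested variant of the same ConfigMap that sets WAL-E's endpoint variable directly. The protocol+convention://host:port format is taken from the WAL-E docs; whether setting it explicitly here behaves differently from AWS_ENDPOINT is an assumption on my part.)

apiVersion: v1
kind: ConfigMap
metadata:
  name: project-postgres-pod-config
  namespace: default
data:
  WAL_S3_BUCKET: project
  WALE_S3_ENDPOINT: https+path://fra1.digitaloceanspaces.com:443
  AWS_ACCESS_KEY_ID: REPLACE_AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY: REPLACE_AWS_SECRET_ACCESS_KEY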

I will try using an AWS bucket instead of Digital Ocean Spaces to rule things out. However, I feel it would be very helpful if there were some documentation on the subject.

I just read about mounting a secret with AWS/GCP credentials. One can define this in the CRD configuration file:

# additional_secret_mount: "some-secret-name"
# additional_secret_mount_path: "/some/dir"

What should be the content of the secret? Also, the mount path doesn't make a lot of sense to me. Is there any documentation on this? My assumption would be that additional_secret_mount points to a Kubernetes secret containing the following data:
aws_access_key_id: 'key_id_here'
aws_secret_access_key: 'access_token_here'
and that additional_secret_mount_path would be:
~/.aws
Then, based on this assumption, the operator should generate a file named credentials looking like this:

[default]
aws_access_key_id = key_id_here
aws_secret_access_key = access_token_here
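
If that assumption holds, the secret itself would presumably look something like this (only a sketch; the secret name and the idea that the operator wires it into ~/.aws are guesses on my part):

apiVersion: v1
kind: Secret
metadata:
  name: some-secret-name
  namespace: default
type: Opaque
stringData:
  credentials: |
    [default]
    aws_access_key_id = key_id_here
    aws_secret_access_key = access_token_here

with additional_secret_mount: "some-secret-name" and additional_secret_mount_path pointing at a directory like /root/.aws in the operator configuration.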
@FxKu
Member

FxKu commented Jul 24, 2019

The configuration parameters are explained in the admin docs.

Afaik, logical backups only support the S3 API for now. I haven't configured WAL-E backups myself yet, so somebody else might be able to help. So no backup was taken? I see no explicit error in the log snippet.

@Bram-Zijp
Author

Bram-Zijp commented Jul 24, 2019

I do not see any files being uploaded to the bucket. This appears lower down in the logs every x seconds:

WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (3840 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.

Edit:
I think this line:
wal_e.operator.backup WARNING MSG: blocking on sending WAL segments
is an indication that the connection to the bucket is not working properly.
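
To dig further, I plan to run WAL-E by hand inside the Spilo pod to surface the real S3 error. Roughly what I have in mind (the pod name and the envdir path /run/etc/wal-e.d/env are assumptions based on how Spilo appears to set things up):

# show the archive_command Postgres is actually using
kubectl exec -it project-postgres-cluster-0 -- psql -U postgres -c "SHOW archive_command;"
# run WAL-E with the same environment the archiver uses
kubectl exec -it project-postgres-cluster-0 -- envdir /run/etc/wal-e.d/env wal-e backup-list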

@Bram-Zijp
Author

Bram-Zijp commented Jul 24, 2019

I created a repo containing my setup: https://gitlab.com/bramzijp/postgres-operator-example

@Bram-Zijp
Author

The logical backups bring up this error:

botocore.exceptions.NoCredentialsError: Unable to locate credentials
2019-07-26 12:07:47,995 - ThreadPoolExecutor-0_0 - s3transfer.utils - DEBUG - Releasing acquire 0/None
upload failed: - to s3://project-name/spilo/project-name-postgres-cluster/366c9d4d-af9d-11e9-8fc3-f64252eaa693/logical_backups/1564142867.sql.gz Unable to locate credentials

This can probably be solved by creating a custom Docker image which has the credentials baked in.

@ggstalder

This PR seems to simplify / clarify the process of setting up the S3 logical backup: #628

Side note: as I'm playing around with the operator, I agree that the current documentation is unclear regarding how and what to set up to configure S3 for the logical backups.

@Bram-Zijp
Author

@ggstalder I ended up creating my own Docker container for logical backups. It wasn't too hard to make using the Dockerfile for logical backups supplied by Zalando. However, this is not possible for WAL-E backups.

@Bram-Zijp
Author

Bram-Zijp commented Aug 28, 2019

Here's how I did it (note that it deletes backups older than 30 days):

Dockerfile
FROM ubuntu:18.04
LABEL maintainer="Team ACID @ Zalando <[email protected]>"

SHELL ["/bin/bash", "-o", "pipefail", "-c"]
RUN apt-get update     \
  && apt-get install --no-install-recommends -y \
  apt-utils \
  ca-certificates \
  lsb-release \
  pigz \
  python3-pip \
  python3-setuptools \
  curl \
  jq \
  gnupg \
  && pip3 install --no-cache-dir awscli --upgrade \
  && echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list \
  && cat /etc/apt/sources.list.d/pgdg.list \
  && curl --silent https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - \
  && apt-get update \
  && apt-get install --no-install-recommends -y  \
  postgresql-client-11  \
  postgresql-client-10  \
  postgresql-client-9.6 \
  postgresql-client-9.5 \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

COPY dump.sh ./

COPY credentials credentials
RUN mkdir $HOME/.aws
RUN mv credentials $HOME/.aws/credentials

ENV PG_DIR=/usr/lib/postgresql/

ENTRYPOINT ["/dump.sh"]
credentials
[default]
aws_access_key_id = REPLACE_AWS_ACCESS_KEY_ID
aws_secret_access_key = REPLACE_AWS_SECRET_ACCESS_KEY
dump.sh
#! /usr/bin/env bash
# enable unofficial bash strict mode
set -o errexit
set -o nounset
set -o pipefail
IFS=$'\n\t'

# make script trace visible via `kubectl logs`
set -o xtrace

ALL_DB_SIZE_QUERY="select sum(pg_database_size(datname)::numeric) from pg_database;"
PG_BIN=$PG_DIR/$PG_VERSION/bin
DUMP_SIZE_COEFF=5

TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
K8S_API_URL=https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1
CERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

function estimate_size {
    "$PG_BIN"/psql -tqAc "${ALL_DB_SIZE_QUERY}"
}

function dump {
    # settings are taken from the environment
    "$PG_BIN"/pg_dumpall
}

function compress {
    pigz
}

function aws_upload {
    declare -r EXPECTED_SIZE="$1"

    # mimic bucket setup from Spilo
    # to keep logical backups at the same path as WAL
    # NB: $LOGICAL_BACKUP_S3_BUCKET_SCOPE_SUFFIX already contains the leading "/" when set by the Postgres Operator
    PATH_TO_BACKUP=s3://$LOGICAL_BACKUP_S3_BUCKET"/spilo/"$SCOPE"/logical_backups/"$(date +%s).sql.gz

    if [ -z "$EXPECTED_SIZE" ]; then
        aws s3 cp - "$PATH_TO_BACKUP" --debug --endpoint-url="https://fra1.digitaloceanspaces.com"
    else
        aws s3 cp - "$PATH_TO_BACKUP" --debug --expected-size "$EXPECTED_SIZE" --endpoint-url="https://fra1.digitaloceanspaces.com"
    fi;
    echo $PATH_TO_BACKUP;
}

function get_pods {
    declare -r SELECTOR="$1"

    curl "${K8S_API_URL}/namespaces/${POD_NAMESPACE}/pods?$SELECTOR"        \
        --cacert $CERT                          \
        -H "Authorization: Bearer ${TOKEN}" | jq .items[].status.podIP -r
}

function get_current_pod {
    curl "${K8S_API_URL}/namespaces/${POD_NAMESPACE}/pods?fieldSelector=metadata.name%3D${HOSTNAME}" \
        --cacert $CERT   \
        -H "Authorization: Bearer ${TOKEN}"
}

declare -a search_strategy=(
    list_all_replica_pods_current_node
    list_all_replica_pods_any_node
    get_master_pod
)

function list_all_replica_pods_current_node {
    get_pods "labelSelector=version%3D${SCOPE},spilo-role%3Dreplica&fieldSelector=spec.nodeName%3D${CURRENT_NODENAME}" | head -n 1
}

function list_all_replica_pods_any_node {
    get_pods "labelSelector=version%3D${SCOPE},spilo-role%3Dreplica" | head -n 1
}

function get_master_pod {
    get_pods "labelSelector=version%3D${SCOPE},spilo-role%3Dmaster" | head -n 1
}

CURRENT_NODENAME=$(get_current_pod | jq .items[].spec.nodeName --raw-output)
export CURRENT_NODENAME

for search in "${search_strategy[@]}"; do

    PGHOST=$(eval "$search")
    export PGHOST

    if [ -n "$PGHOST" ]; then
        break
    fi

done


function remove_old_backups {
    before_date=$(date +%Y-%m-%d -d "-30 days")
    
    records=$(aws s3api list-objects --bucket "bucketname" --prefix "spilo/bucketname-postgres-cluster/logical_backups/" --query "Contents[?LastModified<='${before_date}'][].{Key: Key}" --endpoint-url https://fra1.digitaloceanspaces.com)

    for row in $(echo "${records}" | jq -r '.[] | @base64'); do
        _jq() {
            echo "${row}" | base64 --decode | jq -r "${1}"
        }
        
        filepath=$(_jq '.Key')

        aws s3 rm "s3://bucketname/${filepath}" --endpoint-url https://fra1.digitaloceanspaces.com
    done
}

remove_old_backups

dump | compress | aws_upload $(($(estimate_size) / DUMP_SIZE_COEFF))
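
To make the operator's backup cron job use this image, I then pointed the logical backup options in the operator configuration at it. Roughly (the registry path is a placeholder; the key names are the ConfigMap-style ones from the operator docs):

logical_backup_docker_image: "registry.example.com/acid/logical-backup:custom"
logical_backup_schedule: "30 00 * * *"
logical_backup_s3_bucket: "project"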

@ggstalder

ggstalder commented Aug 28, 2019

Nice, thanks for sharing.

As you mentioned in your first comment @Bram-Zijp, I still don't understand the meaning and usage of the following parameters:
additional_secret_mount: "some-secret-name"
additional_secret_mount_path: "/some/dir"

I tried setting additional_secret_mount to target a secret that I created, but didn't notice any change on any pod (or on the cron job). Probably needs some examples / more documentation?
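
In case someone wants to reproduce, checking for the mount can be done along these lines (the statefulset name is from the default example cluster and the secret name mirrors the one above, so adjust accordingly):

# check whether the operator added a volume/volumeMount for the secret
kubectl get statefulset acid-minimal-cluster -o yaml | grep -B 2 -A 6 "some-secret-name"
kubectl get cronjob -o yaml | grep -B 2 -A 6 "some-secret-name"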

@thepotatocannon

@Bram-Zijp Did you also manage to restore a cluster from a backup?

@Bram-Zijp
Author

Bram-Zijp commented Sep 14, 2019

@thepotatocannon According to this issue, it has to be done manually. I guess one could create a CI/CD pipeline to automate this. For WAL backups, it seems possible to use point-in-time recovery through the manifest. I will share my solution once I have created one.
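
For the record, the clone section of the postgresql manifest is what I intend to try for point-in-time recovery; something along these lines (the uid is taken from my cluster's logs above, the timestamp is a placeholder, and I haven't verified this against Digital Ocean Spaces yet):

spec:
  clone:
    cluster: "project-postgres-cluster"
    uid: "366c9d4d-af9d-11e9-8fc3-f64252eaa693"
    timestamp: "2019-09-01T12:00:00+00:00"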

@fourstepper

This is quite a bit outside the scope of this question, I think, but I noticed that wal-g recently started supporting SSH (SFTP) as a target alongside S3 and other cloud offerings.

I was wondering:

  1. if this should be tracked in another issue
  2. if postgres-operator could take advantage of this relatively easily
  3. if I am missing something completely and it could be used right now

@FxKu
Member

FxKu commented Nov 5, 2021

Will close this issue, since the docs were improved in 1.6.1.
@fourstepper might be a topic for Spilo.

@FxKu FxKu closed this as completed Nov 5, 2021