Etcd backup operator seem to miss schedule if operator pod/container is restarted #54

JustasStankunas · 2020-01-27T12:56:47Z

Environment:

K8s is running within Azure.
We have set up a 3-node etcd cluster and configured 3 periodic backups (hourly, daily, weekly) that write directly to Azure Blob Storage.

What is observed:
Looking at the backup history in Azure, there are gaps in the backup cycle. These gaps are most visible with the longer backup cycles.

Looking at the etcd-backup-operator pod logs, there are multiple restart events within the timeframe of the missing backups. If I understood correctly, the restarts were caused by etcd leader election or something similar.

To validate my suspicion, I wrote the following script, which kills the backup operator pod (and, in a later variant, only the container), and scheduled it via cron to run every 10 minutes. I set the backup interval to 20 minutes. As a result, no backup has been made since 04:39 UTC, when I started the experiment. After 6 restarts the pod went into the Error state. I will continue with a less aggressive restart schedule to see whether that has an impact.

Expected result:

Backups happen according to the schedule regardless of container restarts. The schedule timer should not be tied to the container lifetime, since a container may die at any time. Or is this intended behavior due to the way Kubernetes works?
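One way to decouple the schedule from container lifetime, as a minimal sketch: derive the wait time from a persisted timestamp of the last successful backup instead of from the moment the process started. The helper name `seconds_until_next_backup` and its arguments are hypothetical and are not taken from the operator's code; all times are epoch seconds.

```shell
#!/bin/bash
# Hypothetical helper, not part of the operator.
# Usage: seconds_until_next_backup LAST_SUCCESS_EPOCH INTERVAL_SECONDS NOW_EPOCH
# Prints how long a freshly started process should wait before the next backup.
seconds_until_next_backup() {
  local last_success=$1 interval=$2 now=$3
  local remaining=$(( last_success + interval - now ))
  # A deadline that has already passed means the backup is due immediately.
  if (( remaining < 0 )); then
    remaining=0
  fi
  echo "$remaining"
}

# Restart 300 s after the last backup with a 1200 s interval: 900 s remain.
seconds_until_next_backup 1000 1200 1300   # prints 900
# Restart long after the deadline: the backup is already overdue.
seconds_until_next_backup 1000 1200 9000   # prints 0
```

With this approach a restart only shortens or keeps the remaining wait, so repeated restarts cannot postpone the backup indefinitely.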

Script:

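#!/bin/bash
# Repro script run from cron: sends signal 5 (SIGTRAP) to PID 1 in the operator container.

cd /root || exit 1
# Append both stdout and stderr of each command to the log.
date +"%Y %m %d - %H:%M" >> kill-operator.log 2>&1
# Look up the operator pod by label, then kill PID 1 inside its container.
/usr/local/bin/kubectl -n tep-k8s-test-01 exec -c etcd-backup-operator "$(/usr/local/bin/kubectl -n tep-k8s-test-01 get po -l name=etcd-backup-operator -o name)" -- /bin/kill -5 1 >> kill-operator.log 2>&1
echo "----" >> kill-operator.log 2>&1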

Edited backup schedule:

root@atl-cj1-m-ducx:~# kubectl  -n tep-k8s-test-01 describe  EtcdBackup etcd-cluster-backup-weekly
Name:         etcd-cluster-backup-weekly
Namespace:    tep-k8s-test-01
Labels:       <none>
Annotations:  <none>
API Version:  etcd.database.coreos.com/v1beta2
Kind:         EtcdBackup
Metadata:
  Creation Timestamp:  2020-01-15T07:54:50Z
  Finalizers:
    backup-operator-periodic
  Generation:        145
  Resource Version:  81580419
  Self Link:         /apis/etcd.database.coreos.com/v1beta2/namespaces/tep-k8s-test-01/etcdbackups/etcd-cluster-backup-weekly
  UID:               7dd4c2a7-e1e0-4fe1-ae04-100be7ff6d65
Spec:
  Abs:
    Abs Secret:  storage-account-credentials-weekly
    Path:        tep-k8s-test-01/etcd.backup
  Backup Policy:
    Backup Interval In Second:  1200
  Etcd Endpoints:
    http://etcd-cluster-client:2379
  Storage Type:  ABS
Status:
  Etcd Revision:      1098811
  Etcd Version:       3.4.3
  Last Success Date:  2020-01-27T04:39:09Z
  Succeeded:          true
Events:               <none>
root@atl-cj1-m-ducx:~# date
Mon Jan 27 09:05:37 UTC 2020
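
To put a number on the gap shown above, the following sketch counts how many full backup intervals have elapsed between the Last Success Date and the current time. The function name `elapsed_intervals` is hypothetical, and the script assumes GNU date (for the `-d` option); the timestamps and the 1200-second interval are the values from the output above.

```shell
#!/bin/bash
# Hypothetical helper; assumes GNU date, which accepts ISO 8601 input via -d.
# Usage: elapsed_intervals LAST_SUCCESS_TIMESTAMP INTERVAL_SECONDS NOW_TIMESTAMP
# Prints the number of full intervals elapsed since the last successful backup.
elapsed_intervals() {
  local last now
  last=$(date -u -d "$1" +%s) || return 1
  now=$(date -u -d "$3" +%s) || return 1
  echo $(( (now - last) / $2 ))
}

# 04:39:09 to 09:05:37 is 15988 s, which covers 13 full 1200 s intervals.
elapsed_intervals "2020-01-27T04:39:09Z" 1200 "2020-01-27T09:05:37Z"   # prints 13
```

Any result of 1 or more means at least one scheduled backup did not run.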

I am re-posting my colleague's issue from the original repo: etcd-operator #2152


tvainutis mentioned this issue Jan 29, 2020

dynamic periodic etcd backup timer feature #56

Merged

JustasStankunas closed this as completed Jan 30, 2020
