Fix backup restore #2161

Merged
merged 7 commits into from
Jan 23, 2020

@TeddyAndrieux TeddyAndrieux commented Dec 20, 2019

Component:

'backup', 'restore'

Context:

Summary:

  • Add ability to provide the apiserver IP from the pillar, to configure the apiserver proxy and admin kubeconfig
  • Change backup archive layout
  • Add some metadata + integrity check for backup archive
  • Add apiserver_ip arg in restore script

Acceptance criteria:

Working backup restore procedure

TODO:

  • Update backup restore documentation

Fixes: #2141
Fixes: #2142
Fixes: #2143
Fixes: #2157

@TeddyAndrieux TeddyAndrieux requested a review from a team December 20, 2019 17:17

bert-e commented Dec 20, 2019

Hello teddyandrieux,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

bert-e commented Dec 20, 2019

Conflict

A conflict has been raised during the creation of
integration branch w/2.5/bugfix/fix-backup-restore with contents from bugfix/fix-backup-restore
and development/2.5.

I have not created the integration branch.

Here are the steps to resolve this conflict:

 $ git fetch
 $ git checkout -B w/2.5/bugfix/fix-backup-restore origin/development/2.5
 $ git merge origin/bugfix/fix-backup-restore
 $ # <intense conflict resolution>
 $ git commit
 $ git push -u origin w/2.5/bugfix/fix-backup-restore

@TeddyAndrieux changed the title from "Bugfix/fix backup restore" to "Fix backup restore" Dec 20, 2019
@TeddyAndrieux TeddyAndrieux force-pushed the bugfix/fix-backup-restore branch 2 times, most recently from ceb28f7 to 7e0e729 Compare December 23, 2019 17:45
TeddyAndrieux (Collaborator, Author) commented

/help

bert-e commented Dec 23, 2019

Help page

The following options and commands are available at this time.

Options

name description privileged authored
➡️ create_pull_requests Allow the creation of integration pull requests.
➡️ bypass_author_approval Bypass the pull request author's approval
➡️ bypass_commit_size Bypass the check on the size of the changeset TBA
➡️ unanimity Change review acceptance criteria from one reviewer at least to all reviewers
➡️ bypass_incompatible_branch Bypass the check on the source branch prefix
➡️ bypass_build_status Bypass the build and test status
➡️ approve Instruct Bert-E that the author has approved the pull request. ✍️
➡️ after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
➡️ wait Instruct Bert-E not to run until further notice.
➡️ bypass_jira_check Bypass the Jira issue check
➡️ no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
➡️ bypass_leader_approval Bypass the pull request leaders' approval
➡️ bypass_peer_approval Bypass the pull request peers' approval

Commands

name description privileged
➡️ reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.
➡️ help Print Bert-E's manual in the pull request.
➡️ build Re-start a fresh build TBA
➡️ force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
➡️ retry Re-start a fresh build TBA
➡️ status Print Bert-E's current status in the pull request TBA
➡️ clear Remove all comments from Bert-E from the history TBA

bert-e commented Dec 23, 2019

Conflict

A conflict has been raised during the creation of
integration branch w/2.5/bugfix/fix-backup-restore with contents from bugfix/fix-backup-restore
and development/2.5.

I have not created the integration branch.

Here are the steps to resolve this conflict:

 $ git fetch
 $ git checkout -B w/2.5/bugfix/fix-backup-restore origin/development/2.5
 $ git merge origin/bugfix/fix-backup-restore
 $ # <intense conflict resolution>
 $ git commit
 $ git push -u origin w/2.5/bugfix/fix-backup-restore

@TeddyAndrieux TeddyAndrieux marked this pull request as ready for review December 23, 2019 17:50

bert-e commented Dec 23, 2019

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

TeddyAndrieux (Collaborator, Author) commented

/after_pull_request=2147

bert-e commented Dec 23, 2019

Waiting for other pull request(s)

The current pull request is locked by the after_pull_request option.

In order for me to merge this pull request, run the following actions first:

➡️ Merge the OPEN pull request:

Alternatively, delete all the after_pull_request comments from this pull request.

The following options are set: after_pull_request

TeddyAndrieux (Collaborator, Author) commented

/approve

bert-e commented Dec 23, 2019

Waiting for other pull request(s)

The current pull request is locked by the after_pull_request option.

In order for me to merge this pull request, run the following actions first:

➡️ Merge the OPEN pull request:

Alternatively, delete all the after_pull_request comments from this pull request.

The following options are set: approve, after_pull_request


gdemonet (Contributor) left a comment

Most of it looks good to me, will test :)

Resolved review threads (outdated): scripts/restore.sh.in, docs/operation/bootstrap_backup_restore.rst

bert-e commented Dec 31, 2019

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

The following options are set: approve, after_pull_request

alexandre-allard commented Jan 6, 2020

I don't know if it's a PEBKAC issue (or even a flaky one), but I had to add the new etcd member by hand to complete the bootstrap node restoration.
What I did (on a 2.4.2-dev cluster, 2 nodes + bootstrap):

  • Deleted the bootstrap host
  • Removed it from etcd members
crictl exec -i "f4ea35c86078e" sh -c "ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key --cacert /etc/kubernetes/pki/etcd/ca.crt member remove 77170215bd1dd1d4"
  • Spawned a new bootstrap host
  • Launched restore.sh script (script failed on metalk8s.orchestrate.deploy_node, not able to get an answer from the API server):
/var/tmp/metalk8s/restore.sh -b /tmp/backup_20200106_105508.tar.gz -i 192.168.1.5 -v

Here is the output of the script:

[...]
[ERROR   ] {u'return': {u'outputter': u'highstate', u'data': {u'metalk8s-tznqr-ootstrap.novalocal_master': [u'Pillar failed to render with the following messages:', u"Failed to load ext_pillar metalk8s_nodes: HTTPSConnectionPool(host='192.168.1.11', port=6443): Max retries exceeded with url: /api/v1/nodes (Caused by ProtocolError('Connection aborted.', error(111, 'Connection refused')))"]}, u'retcode': 1}}
metalk8s-tznqr-bootstrap.novalocal_master:
  Name: Set grains - Function: salt.state - Result: Clean Started: - 09:30:10.137568 Duration: 4374.543 ms
  Name: mine.update - Function: salt.function - Result: Changed Started: - 09:30:14.512459 Duration: 768.186 ms
  Name: metalk8s-tznqr-bootstrap.novalocal - Function: metalk8s_cordon.node_cordoned - Result: Changed Started: - 09:30:15.282085 Duration: 90.562 ms
  Name: metalk8s-tznqr-bootstrap.novalocal - Function: metalk8s_drain.node_drained - Result: Changed Started: - 09:30:15.374478 Duration: 4431.238 ms
  Name: saltutil.sync_all - Function: salt.function - Result: Changed Started: - 09:30:19.805975 Duration: 927.459 ms
  Name: metalk8s.check_pillar_keys - Function: salt.function - Result: Changed Started: - 09:30:20.734152 Duration: 30772.793 ms
  Name: Reconfigure salt-minion - Function: salt.state - Result: Changed Started: - 09:30:51.545540 Duration: 12682.587 ms
  Name: metalk8s_saltutil.wait_minions - Function: salt.runner - Result: Changed Started: - 09:31:04.228859 Duration: 2894.404 ms
  Name: Run the highstate - Function: salt.state - Result: Changed Started: - 09:31:07.124583 Duration: 42952.273 ms
----------
          ID: Wait for API server to be available
    Function: http.wait_for_successful_query
        Name: https://127.0.0.1:7443/healthz
      Result: False
     Comment: An exception occurred in this state: Traceback (most recent call last):
                File "/usr/lib/python2.7/site-packages/salt/state.py", line 1919, in call
                  **cdata['kwargs'])
                File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1918, in wrapper
                  return f(*args, **kwargs)
                File "/usr/lib/python2.7/site-packages/salt/states/http.py", line 163, in wait_for_successful_query
                  raise caught_exception  # pylint: disable=E0702
              SSLEOFError: EOF occurred in violation of protocol (_ssl.c:618)
     Started: 09:31:50.079001
    Duration: 300002.151 ms
     Changes:
----------
          ID: Uncordon the node
    Function: metalk8s_cordon.node_uncordoned
        Name: metalk8s-tznqr-bootstrap.novalocal
      Result: False
     Comment: One or more requisite failed: metalk8s.orchestrate.deploy_node.Wait for API server to be available
     Started: 09:36:50.082547
    Duration: 0.019 ms
     Changes:
  Name: ps.pkill - Function: salt.function - Result: Changed Started: - 09:36:50.082811 Duration: 376.399 ms
----------
          ID: Register the node into etcd cluster
    Function: salt.runner
        Name: state.orchestrate
      Result: False
     Comment: Runner function 'state.orchestrate' failed.
     Started: 09:36:50.459635
    Duration: 9033.168 ms
     Changes:
              ----------
              return:
                  ----------
                  data:
                      ----------
                      metalk8s-tznqr-bootstrap.novalocal_master:
                          - Pillar failed to render with the following messages:
                          - Failed to load ext_pillar metalk8s_nodes: HTTPSConnectionPool(host='192.168.1.11', port=6443): Max retries exceeded with url: /api/v1/nodes (Caused by ProtocolError('Connection aborted.', error(111, 'Connection refused')))
                  outputter:
                      highstate
                  retcode:
                      1

Summary for metalk8s-tznqr-bootstrap.novalocal_master
-------------
Succeeded: 10 (changed=8)
Failed:     3
-------------
Total states run:     13
Total run time:  409.306 s

In etcd container logs, I have:

etcdmain: error validating peerURLs {ClusterID:e8b3e49ae5233ad1 Members:[&{ID:df96615de5709ef8 RaftAttributes:{PeerURLs:[https://192.168.1.32:2380]} Attributes:{Name:metalk8s-tznqr-node-2.novalocal ClientURLs:[https://192.168.1.32:2379]}} &{ID:e9f15dbabe9226d5 RaftAttributes:{PeerURLs:[https://192.168.1.5:2380]} Attributes:{Name:metalk8s-tznqr-node-1.novalocal ClientURLs:[https://192.168.1.5:2379]}}] RemovedMemberIDs:[]}: member count is unequal
  • Tried to relaunch, still failing, even earlier.
  • Added the etcd member by hand (on a healthy node)
crictl exec -i "f4ea35c86078e" sh -c "ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key --cacert /etc/kubernetes/pki/etcd/ca.crt member add metalk8s-tznqr-bootstrap.novalocal --peer-urls=https://192.168.1.11:2380"
  • Relaunched the restore script (everything went fine this time and the node is correctly restored).

Am I missing a step or something?
Maybe it's not related to this PR, but rather an issue during etcd expansion or something like that.
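
Since adding the member by hand unblocked the restore here, a quick pre-check before relaunching the script could be to look for the new node's name in the `etcdctl member list` output. The helper below is only a sketch: `etcd_has_member` is a hypothetical name, and it assumes the default v3 output format (`ID, status, name, peer URLs, client URLs`), which may differ between etcd versions.

```shell
# Succeeds (exit 0) if a member with the given name appears in the
# `etcdctl member list` output read from stdin, fails (exit 1) otherwise.
# Assumption: default "simple" output, comma-space separated, name in column 3.
etcd_has_member() {
    awk -F', ' -v name="$1" '$3 == name { found = 1 } END { exit !found }'
}

# Sample input built from the member IDs and URLs in the log above
# (the "started" status column is illustrative).
sample_members() {
    cat <<'EOF'
df96615de5709ef8, started, metalk8s-tznqr-node-2.novalocal, https://192.168.1.32:2380, https://192.168.1.32:2379
e9f15dbabe9226d5, started, metalk8s-tznqr-node-1.novalocal, https://192.168.1.5:2380, https://192.168.1.5:2379
EOF
}
```

On a healthy node, the real `member list` output would be piped in instead of `sample_members`, for example `... member list | etcd_has_member metalk8s-tznqr-bootstrap.novalocal`.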

@gdemonet gdemonet force-pushed the bugfix/fix-backup-restore branch from 7e0e729 to f68a05b Compare January 17, 2020 10:01

bert-e commented Jan 17, 2020

History mismatch

Merge commit #bc1e12933d05b44283ebcf6fff18face66d09be6 on the integration branch
w/2.5/bugfix/fix-backup-restore is merging a branch which is neither the current
branch bugfix/fix-backup-restore nor the development branch
development/2.5.

It is likely due to a rebase of the branch bugfix/fix-backup-restore and the
merge is not possible until all related w/* branches are deleted or updated.

Please use the reset command to have me reinitialize these branches.

The following options are set: approve, after_pull_request

@TeddyAndrieux TeddyAndrieux force-pushed the bugfix/fix-backup-restore branch from f68a05b to 7bb37bc Compare January 20, 2020 17:41
This reverts commit 3f610a5.
This change is not needed, as we already back up the whole `/etc/metalk8s`
directory, which contains all the "MetalK8s CAs" in `/etc/metalk8s/pki`.

Fixes: #2141

Move the Kubernetes PKI from `./pki` to `./kubernetes/pki` in the backup
archive to reflect the machine layout.

Fixes: #2142

Add some metadata information to the backup archive, along with a
`sha256sum` of all the files, and check the integrity of the backup archive
at the beginning of the restore script.

Fixes: #2143

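
To illustrate the integrity check, here is a minimal sketch of how a checksum manifest could be produced at backup time and verified at restore time. It is not the code from this PR: the manifest name `checksums.sha256` and both function names are hypothetical, and it relies on GNU `sha256sum`, `sort -z` and `xargs -0 -r`.

```shell
# Write a manifest with the SHA-256 of every regular file under directory $1.
# "checksums.sha256" is a hypothetical manifest name, not taken from the PR.
write_manifest() {
    (
        cd "$1" || exit 1
        # Exclude the manifest itself, sort for a stable order.
        find . -type f ! -name checksums.sha256 -print0 \
            | sort -z \
            | xargs -0 -r sha256sum > checksums.sha256
    )
}

# Verify the manifest in directory $1.
# Returns non-zero if a listed file is missing or its content has changed.
verify_manifest() {
    (
        cd "$1" || exit 1
        sha256sum --quiet -c checksums.sha256
    )
}
```

A restore script could call `verify_manifest` on the extracted archive first and abort before touching the node if it fails.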
During a restore, we may want to specify an external apiserver IP for the
apiserver proxy and the admin kubeconfig.

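
As an illustration of the new argument, the invocation quoted earlier in this thread (`restore.sh -b <archive> -i <IP> -v`) could be parsed along these lines. This is a sketch, not the actual `restore.sh.in`: the function and variable names, the reading of `-v` as "verbose", and treating both `-b` and `-i` as mandatory are all assumptions.

```shell
# Parse the options seen in the quoted invocation:
#   restore.sh -b <backup archive> -i <apiserver IP> -v
# Sets BACKUP_ARCHIVE, APISERVER_IP and VERBOSE (hypothetical names).
parse_restore_args() {
    OPTIND=1
    BACKUP_ARCHIVE=''
    APISERVER_IP=''
    VERBOSE=0
    # Leading ':' enables silent error handling, so we report errors ourselves.
    while getopts ':b:i:v' opt; do
        case $opt in
            b) BACKUP_ARCHIVE=$OPTARG ;;
            i) APISERVER_IP=$OPTARG ;;
            v) VERBOSE=1 ;;
            :) echo "Option -$OPTARG requires an argument" >&2; return 2 ;;
            *) echo "Unknown option -$OPTARG" >&2; return 2 ;;
        esac
    done
    # Assumption: both the archive and the apiserver IP are required.
    if [ -z "$BACKUP_ARCHIVE" ] || [ -z "$APISERVER_IP" ]; then
        echo "Usage: restore.sh -b <backup archive> -i <apiserver IP> [-v]" >&2
        return 2
    fi
}
```

Calling `parse_restore_args "$@"` at the top of a script would leave the parsed values in the three variables, or return 2 with a usage message.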
To properly configure the new bootstrap node, we need mine information
from the other nodes (such as control plane and workload plane IPs). To be
able to update the mine, all minions must be ready and have a correct
pillar.

Review thread (outdated, resolved) on docs/operation/bootstrap_backup_restore.rst:

You cannot use the restore script if you do not have High Availability
apiserver because some information required to reconfigure the others
nodes are stored in the apiserver.

.. warning::

In case of a 3-node etcd cluster (2 nodes + unreachable old bootstrap node)

Contributor:

Good point, maybe it should be an optional step in the clear, with the required command(s) and expected output. Too easy to go over it and have weird errors in the script's output.

TeddyAndrieux (Collaborator, Author):

Done, but the commands are quite ugly. We could do something smarter if we add a Salt module to remove an etcd member, but that is not part of this PR IMO.

Contributor:

Indeed, would be nice to provide a user-friendly wrapper to etcdctl, maybe as a Salt module or in a kubectl plugin, not sure. For now, having the "ugly" command for copy-pasting is good enough!
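
A rough idea of what such a wrapper could look like as a plain shell function, until a Salt module or kubectl plugin exists. The function name and the `ETCDCTL_DRY_RUN` switch are made up for this sketch; the endpoint and certificate paths are the ones from the commands quoted earlier in this PR.

```shell
# Hypothetical wrapper: run an etcdctl (v3 API) sub-command inside the etcd
# container, with the endpoint and TLS flags used elsewhere in this thread.
#   $1     container ID (as shown by crictl)
#   $2...  etcdctl sub-command and its arguments
# Set ETCDCTL_DRY_RUN=1 to print the command instead of running it.
etcdctl_in_container() {
    container=$1
    shift
    # Rebuild the positional parameters as the full crictl command line.
    # The inner `sh -c '... "$@"' sh` forwards arguments without re-quoting.
    set -- crictl exec -i "$container" \
        sh -c 'ETCDCTL_API=3 exec etcdctl "$@"' sh \
        --endpoints https://127.0.0.1:2379 \
        --cert /etc/kubernetes/pki/etcd/server.crt \
        --key /etc/kubernetes/pki/etcd/server.key \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        "$@"
    if [ "${ETCDCTL_DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

With this, the removal step would read `etcdctl_in_container f4ea35c86078e member remove 77170215bd1dd1d4`, which is easier to copy-paste than the full command.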

@TeddyAndrieux TeddyAndrieux force-pushed the bugfix/fix-backup-restore branch from 7bb37bc to 09939e4 Compare January 21, 2020 13:22
TeddyAndrieux (Collaborator, Author) commented

/reset

bert-e commented Jan 22, 2020

Reset complete

I have successfully deleted this pull request's integration branches.

The following options are set: approve, after_pull_request

bert-e commented Jan 22, 2020

Conflict

A conflict has been raised during the creation of
integration branch w/2.5/bugfix/fix-backup-restore with contents from bugfix/fix-backup-restore
and development/2.5.

I have not created the integration branch.

Here are the steps to resolve this conflict:

 $ git fetch
 $ git checkout -B w/2.5/bugfix/fix-backup-restore origin/development/2.5
 $ git merge origin/bugfix/fix-backup-restore
 $ # <intense conflict resolution>
 $ git commit
 $ git push -u origin w/2.5/bugfix/fix-backup-restore

The following options are set: approve, after_pull_request

bert-e commented Jan 22, 2020

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

The following options are set: approve, after_pull_request

gdemonet (Contributor) left a comment

Just some docs changes, otherwise LGTM

Resolved review threads (outdated): docs/operation/bootstrap_backup_restore.rst (2 threads)

bert-e commented Jan 22, 2020

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

The following reviewers are expecting changes from the author, or must review again:

The following options are set: approve, after_pull_request

@TeddyAndrieux TeddyAndrieux force-pushed the bugfix/fix-backup-restore branch from bb83bb2 to 22d4188 Compare January 22, 2020 10:54

bert-e commented Jan 22, 2020

History mismatch

Merge commit #09939e4df6c156c48413af0f463e514a8dac89ee on the integration branch
w/2.5/bugfix/fix-backup-restore is merging a branch which is neither the current
branch bugfix/fix-backup-restore nor the development branch
development/2.5.

It is likely due to a rebase of the branch bugfix/fix-backup-restore and the
merge is not possible until all related w/* branches are deleted or updated.

Please use the reset command to have me reinitialize these branches.

The following options are set: approve, after_pull_request

Since #2103 we no longer use a VIP for the apiserver, so we need the IP of
one apiserver to be able to register the new bootstrap node to the current
Kubernetes cluster.

Fixes: #2157

During restore, we need to run sync_auth to have a working salt-api. We
also need to reconfigure the control plane nginx ingress to set the new
external IP used to access the UI.

TeddyAndrieux (Collaborator, Author) commented

/reset

bert-e commented Jan 22, 2020

Reset complete

I have successfully deleted this pull request's integration branches.

The following options are set: approve, after_pull_request

bert-e commented Jan 22, 2020

Conflict

A conflict has been raised during the creation of
integration branch w/2.5/bugfix/fix-backup-restore with contents from bugfix/fix-backup-restore
and development/2.5.

I have not created the integration branch.

Here are the steps to resolve this conflict:

 $ git fetch
 $ git checkout -B w/2.5/bugfix/fix-backup-restore origin/development/2.5
 $ git merge origin/bugfix/fix-backup-restore
 $ # <intense conflict resolution>
 $ git commit
 $ git push -u origin w/2.5/bugfix/fix-backup-restore

The following options are set: approve, after_pull_request

bert-e commented Jan 22, 2020

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

The following reviewers are expecting changes from the author, or must review again:

The following options are set: approve, after_pull_request

bert-e commented Jan 22, 2020

Build failed

The build for commit did not succeed in branch w/2.5/bugfix/fix-backup-restore.

The following options are set: approve, after_pull_request

SALT_MASTER_CALL=(crictl exec -i "$(get_salt_container)")

"${SALT_MASTER_CALL[@]}" salt-run --state-output=mixed state.orchestrate \
metalk8s.addons.nginx-ingress-control-plane.deployed \

Contributor:

Maybe we could simply run metalk8s.deployed here? Being idempotent and only executed on the master, I'm thinking it's some kind of "K8s highstate"... If you think so, we could do with a debt ticket to fix in 2.5 :)

TeddyAndrieux (Collaborator, Author):

Since, for the moment, we basically replace every object, I think it's not a good idea: we would lose all the tuning done by the user on k8s objects when restoring, which is unexpected IMO.

Contributor:

Good point, let's wait for #1996 then

bert-e commented Jan 22, 2020

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.4

  • ✔️ development/2.5

The following branches will NOT be impacted:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve, after_pull_request

bert-e commented Jan 23, 2020

I have successfully merged the changeset of this pull request
into targeted development branches:

  • ✔️ development/2.4

  • ✔️ development/2.5

The following branches have NOT changed:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3

Please check the status of the associated issue None.

Goodbye teddyandrieux.

@bert-e bert-e merged commit 22d4188 into development/2.4 Jan 23, 2020
@bert-e bert-e deleted the bugfix/fix-backup-restore branch January 23, 2020 08:46