Add health checks to upgrade playbook #4372
Conversation
See https://github.com/juanvallejo/openshift-ansible/tree/a4728f1335dc4f5d90ff3c5db976e83a41df571d/playbooks/common/openshift-cluster/upgrades/pre; some precondition checks already exist there.
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "upgrade"
Presumably, we need to handle this in openshift-ansible/roles/openshift_health_checker/callback_plugins/zz_failure_summary.py (line 98 in 2d4709b):
if context in ['pre-install', 'health']:
Force-pushed a4728f1 to 4b4e609 (compare)
Does
I bet it doesn't, I grepped for
@@ -95,7 +95,7 @@ def _print_check_failure_summary(self, failed_checks, context):
     'Variables can be set in the inventory or passed on the\n'
     'command line using the -e flag to ansible-playbook.\n'
     ).format(playbook=self._playbook_file, checks=checks)
-    if context in ['pre-install', 'health']:
+    if context in ['pre-install', 'pre-upgrade', 'health']:
@sosiouxme when you designed this, do you remember in which "contexts" the summary message below should not be printed?
@rhcarvalho @juanvallejo I guess a comment would have helped. So what prompted this was that if you're running a playbook like byo/config.yml, these checks may come as a surprise, and that needs a little more explanation. If you're running a playbook that's specifically for running checks, it's not about "failing early" since the checks are the whole point, and so the default preamble above is a little out of place. I could imagine other cases in which we'd like to customize the message or even the outcome based on the user intent/expectations. But for now, there are really just two cases: you're trying to run checks, or you're trying to get something else done and the checks are "in the way".
So a "pre-upgrade" context would mean you're running a playbook specifically to see if your upgrade is gonna work out. An "upgrade" context would mean you're actually trying to run the upgrade.
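The two-case distinction could be sketched roughly like this; the context names come from the discussion, but the messages and function names here are illustrative, not the actual zz_failure_summary.py implementation:

```python
# Hypothetical sketch: a failure-summary callback picks its message
# based on the playbook context. Message text is illustrative only.

CHECK_ONLY_CONTEXTS = ('pre-install', 'health')

def failure_summary(context):
    """Return a failure summary suited to the user's intent."""
    if context in CHECK_ONLY_CONTEXTS:
        # Running checks was the whole point; no preamble needed.
        return 'The execution of checks found failures.'
    # Checks ran as a gate inside another playbook (install/upgrade),
    # so explain why they appeared at all.
    return ('The execution stopped early because preflight checks '
            'found failures.')
```

The only decision point is whether the playbook's purpose was the checks themselves; everything else shares one code path.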
verify_memory_and_diskspace.yml should be called from an include: at the bottom of common/openshift-cluster/upgrades/init.yml. This will call the verification checks after the groups and facts have been established.
@@ -0,0 +1,12 @@
- name: Verify Host Requirements
  hosts: OSEv3
This should be oo_all_hosts. OSEv3 should not be called in common playbooks.
Force-pushed 4b4e609 to 0621520 (compare)
@mtnbikenc @rhcarvalho thanks for the feedback, review comments addressed
Force-pushed 0621520 to 1278b26 (compare)
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "pre-upgrade"
So if this is being included in an actual upgrade playbook above, I would say that should be just "upgrade" context. If you want to reuse the playbook it should probably not specify a context, instead relying on including playbooks to set context.
Also I don't like calling this verify_memory_and_diskspace.yml -- I assume we're going to add more checks to the same play in later iterations.
Thanks, will update
Force-pushed a11a0e5 to 73e0a2e (compare)
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "pre-upgrade"
I believe we can drop the quotes in this case.
args:
  checks:
  - disk_availability
  - memory_availability
Missing newline character.
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "upgrade"
We could safely drop the quotes here, just like in all other places in this YAML file.
@@ -23,3 +23,5 @@
   set_fact:
     os_firewall_use_firewalld: false
   when: "'Active: active' in service_iptables_status.stdout"

- include: ./pre/verify_memory_and_diskspace.yml
This reference is outdated.
- action: openshift_health_check
  args:
    checks:
    - disk_availability
The memory requirements in an upgrade should match the requirements of the newer version... but for disk requirements, we probably have at least two scenarios:
- In-place upgrade: the free space requirement should likely be different than for an install; after all, there is already data in the cluster (e.g., Docker images and installed RPM packages). I couldn't find any specific guidance about free disk space in the docs.
- Blue-green: I believe the free space requirement for the new hosts should be the same as for a fresh install.
We may need a way to tell when an upgrade is in-place and when it is blue-green.
Maybe we could propagate a context to our checks?
What context do you have in mind? We do have r_openshift_health_checker_playbook_context, but that's manually set from playbooks.
If you're just configuring how the checks run per playbook, you can just add another variable (like r_openshift_health_checker_playbook_context) on the playbook and read that from the check. The only thing that's special about that context var is that it gets recorded in results so that the callback plugin can see it and adjust output.
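A check reading such a playbook-level variable is very simple; this is a minimal sketch, where get_var is a simplified stand-in for the project's helper of the same name:

```python
# Sketch: a check reads a variable set on the playbook via task_vars.
# get_var here is a simplified stand-in, not the project's real helper.

def get_var(task_vars, name, default=None):
    """Fetch a variable the playbook set, falling back to a default."""
    return task_vars.get(name, default)

def playbook_context(task_vars):
    """Return the context string the playbook declared, if any."""
    return get_var(task_vars,
                   'r_openshift_health_checker_playbook_context',
                   default='')
```

A check can then branch on the returned string without the callback plugin being involved at all.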
@sosiouxme Thanks, although I don't suppose there is a straightforward way of determining the type of upgrade that's taking place? All I have found was https://docs.openshift.org/latest/install_config/upgrading/automated_upgrades.html#install-config-upgrading-automated-upgrades
@mtnbikenc wondering if you have any thoughts on this as well?
@juanvallejo Does a different type of upgrade use a different playbook, or just different parameters? Either way you have something to key off of.
If I understand https://docs.openshift.org/latest/install_config/upgrading/blue_green_deployments.html#blue-green-deployments-preparing-for-upgrade correctly, it looks like in the case of a blue-green upgrade, an in-place upgrade is still run on existing masters and etcd hosts, but not on existing nodes. Instead, existing nodes are labeled with color=blue (or whatever value a user decides on). Then, entirely new nodes are created with the new version of OpenShift using a playbook. Once green nodes are ready, pods are evacuated from blue nodes to green.
Therefore, I think that if the upgrade context is set, and the host is in the nodes group, it is safe to assume that it is going through an in-place upgrade (since a blue-green upgrade would be creating entirely new nodes instead). masters and etcd hosts can be treated as in-place for both cases.
Sounds right.
So essentially there are just two cases, install or upgrade.
Will push an updated disk_availability check that takes the upgrade context into account, halving the required disk space in that case.
Force-pushed 73e0a2e to c56f15d (compare)
@sosiouxme @rhcarvalho @mtnbikenc wondering your thoughts on 7bd04ce
# in use by the existing OpenShift deployment.
context = get_var(task_vars, "r_openshift_health_checker_playbook_context", default="")
if context == "upgrade":
    min_free_bytes /= 2
Seems... kinda arbitrary. I think in this context, if we run the check at all, it should just be checking that they have enough to run the upgrade. Not sure what that should be, maybe like 1GB? Or, this could be re-purposed as a "health" check when it's not in a fresh install context.
Not able to find a definite requirement in the docs for disk size during an upgrade.
Or, this could be re-purposed as a "health" check when it's not in a fresh install context.
That could also work. We could fail the check in this context if disk usage is something like > 90%
Percentage is not good. 90% usage of a 1TB disk means 100GB free -- more than enough for an install/upgrade. 90% usage of 20GB means 2GB free, risky.
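The arithmetic behind that objection is easy to verify; this tiny helper just restates the two examples from the comment:

```python
# Why a percentage threshold misleads: identical usage percentages
# leave very different absolute headroom depending on disk size.

def free_gb(total_gb, used_percent):
    """Free space left, in GB, given disk size and usage percentage."""
    return total_gb * (100 - used_percent) / 100

big = free_gb(1000, 90)   # 90% used on a 1 TB disk: 100 GB still free
small = free_gb(20, 90)   # 90% used on a 20 GB disk: only 2 GB free
```

An absolute minimum in bytes captures the real constraint; a percentage does not.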
Still unclear on a minimum disk availability value under the "upgrade" context. Updated the check to require at least 5.0 GB for in-place upgrades, for now.
@@ -95,7 +95,7 @@ def _print_check_failure_summary(self, failed_checks, context):
     'Variables can be set in the inventory or passed on the\n'
     'command line using the -e flag to ansible-playbook.\n'
     ).format(playbook=self._playbook_file, checks=checks)
-    if context in ['pre-install', 'health']:
+    if context in ['pre-install', 'upgrade', 'health']:
This is the summary for playbooks where the user is trying to run checks. I wouldn't include "upgrade" here so they can have the previous msg with a little more explanation. If there were a "pre-upgrade" playbook just for running checks then I would put it here.
Sounds good, will remove
Force-pushed 0ca870e to 44fef09 (compare)
'failed': True,
'msg': (
    'Available disk space ({:.1f} GB) for the volume containing '
    '"/var" is below minimum recommended space for an upgrade ({:.1f} GB).'
This introduces some duplication with lines coming right below.
Not sure what you mean. The return statement inside this block prevents the text below from being displayed if this one is.
# in use by the existing OpenShift deployment.
context = get_var(task_vars, "r_openshift_health_checker_playbook_context", default="")
if context == "upgrade":
    upgrade_min_required_diskspace = 5.0 * 10**9
We need to keep in mind that both this and #4436 change the disk availability check. In an upgrade, this change of minimum requirements, whatever it ends up being, will possibly only apply to /var, while /usr/local/bin and /tmp requirements would be left unchanged, agree?
I'm fine with waiting for #4436 to merge, then adding the decreased upgrade requirement on top of that.
@@ -0,0 +1,13 @@
---
- name: Verify Host Requirements
  hosts: oo_all_hosts
@sosiouxme shall we limit this also to run only on masters and nodes, or is it enough to filter it on the action plugin level?
@rhcarvalho sorry I missed this. No, it looks like oo_all_hosts comes from g_all_hosts, which is masters/nodes/etcd only, so this should be fine. All our playbooks basically need to work the same as byo/config.yml with respect to groups.
Force-pushed eeac970 to 0b4d699 (compare)
@sosiouxme @rhcarvalho rebased with changes introduced to disk_availability.py in #4436
Force-pushed 0b4d699 to e840183 (compare)
aos-ci-test
We may still be able to simplify a bit, but otherwise looking good.
# if not able to resolve a recommended value, default to non-upgrade recommendation
min_upgrade_rec = max(recommendation.get(name, 0) for name in group_names)

if free_bytes < min_upgrade_rec:
I feel like this duplicates logic unnecessarily. Couldn't we simply use the upgrade values when it's an upgrade and then use the same code paths to compare free / required?
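The suggested simplification amounts to selecting the threshold from data and keeping a single comparison; a rough sketch, where all numbers, names, and the table layout are illustrative placeholders rather than the project's actual requirements:

```python
# Sketch: one code path, with the free-space threshold chosen from a
# table keyed by context. Values below are placeholders, not real
# OpenShift requirements.

GB = 10**9

RECOMMENDED_FREE_VAR = {
    'default': {'masters': 40 * GB, 'nodes': 15 * GB, 'etcd': 20 * GB},
    'upgrade': {'masters': 10 * GB, 'nodes': 5 * GB, 'etcd': 5 * GB},
}

def required_free_bytes(context, group_names):
    """Strictest requirement among the host's groups for this context."""
    per_group = RECOMMENDED_FREE_VAR.get(context, RECOMMENDED_FREE_VAR['default'])
    return max(per_group.get(name, 0) for name in group_names)

def check_passes(context, group_names, free_bytes):
    # The comparison is identical for install and upgrade; only the
    # data behind required_free_bytes differs.
    return free_bytes >= required_free_bytes(context, group_names)
```

Changing the upgrade requirements then means editing the table, not touching any branching logic.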
tempfile.gettempdir(): {
    'masters': 0.5 * 10**9,
    'nodes': 0.5 * 10**9,
    'etcd': 0.5 * 10**9,
Granted, the 1 GB for /usr/local/bin and temp were both originally a rough guess; why do we make this different here?
Will update to just reduce space requirements for /var
        ["20.0 GB", "below minimum"],
    ),
])
def test_min_required_space_decreases_with_upgrade_context(group_names, context, ansible_mounts, failed, extra_words):
Here and other places we say "decrease" or otherwise give suggestions as to what the values are compared to "non-upgrade". I'd avoid doing that type of comment or naming, because what governs the relationship of the numbers is actual data. Change the data and all the names become misleading.
Likewise, on an extreme, it is considered bad practice to name variables like two = 2 only to see two = 3 some day in the future...
Force-pushed 071069a to 577fdc0 (compare)
We will be refactoring init.yml for upgrades but the playbook flow looks good for now.
thanks @mtnbikenc
aos-ci-test
@rhcarvalho ok to merge?
Force-pushed 577fdc0 to 8af1839 (compare)
@juanvallejo yes, we've got plenty of approval :-) time for the CI dance
aos-ci-test
@rhcarvalho ci tests are green, will tag for [merge]
[test]ing while waiting on the merge queue
Evaluated for openshift ansible test up to 8af1839
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/370/) (Base Commit: cf80c7e) (PR Branch Commit: 8af1839)
flaked on openshift/origin#10162 re[merge]
Flake openshift/origin#15356, perhaps fixed in openshift/origin#15482. And apparently something new, openshift/origin#15522. [merge] one last time.
Evaluated for openshift ansible merge up to 8af1839
continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/754/) (Base Commit: 4a442d8) (PR Branch Commit: 8af1839)
Flake openshift/origin#15522 again, proceeding to manual merge as per https://github.com/openshift/openshift-ansible/blob/master/docs/pull_requests.md#manual-merges
@juanvallejo please review https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/754/ and open a new PR.
I made a mistake and merged when the error was actually related to the changes introduced by this PR. Subsequently reverted the changes.
begin adding memory and disk-space verification to upgrade process
Begin addressing https://trello.com/c/K93Wzz4u
cc @sosiouxme @brenton @mtnbikenc @rhcarvalho